i'm finally done with my little project to write a better binary encoder for events
it has one distinctive functional feature: tags that contain hex data are recognised and encoded as compact binary, which is a fairly hefty saving in storage size for any event with e (event), p (pubkey) or a (naddr) tags, such events coming out at just over half as much data compared to leaving the hex as strings, with very little cost in processing
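roughly, the idea looks like this, as a sketch only (the marker bytes and the writeTagValue name are made up for illustration, not my actual code):

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
)

// writeTagValue sketches the hex recognition: a 64-character hex value (event
// ids, pubkeys) gets stored as 32 raw bytes behind a marker byte, anything
// else is stored as-is with a varint length prefix.
func writeTagValue(buf []byte, val string) []byte {
	if len(val) == 64 {
		if raw, err := hex.DecodeString(val); err == nil {
			buf = append(buf, 1) // illustrative marker: binary hex value
			return append(buf, raw...)
		}
	}
	buf = append(buf, 0) // illustrative marker: plain string
	buf = binary.AppendUvarint(buf, uint64(len(val)))
	return append(buf, val...)
}
```

the 64 hex characters of an id or pubkey collapse to 32 raw bytes plus a marker, which is where the size saving comes from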
it's late evening, so now that i have it working i want to test and benchmark it, but that will be a task for tomorrow
thanks to nostr:npub180cvv07tjdrrgpa0j7j7tmnyl2yr6yr7l8j4s3evf6u64th6gkwsyjh6w6 for alerting me to the fact that my use of the Gob encoder incurs a big cost in decoding; this decoder will be fast, maybe not *quite* as fast as the simple one he wrote, but it likely saves more than 30 bytes on every event, often 60, without any compression
it also does not copy memory for string fields, it simply "unsafe" snips them out of the buffer, meaning that once the event has been re-coded to JSON the garbage collector will be able to free it; as it is, memory utilisation is typically around 50Mb, which is a massive improvement over what it was before i found the resource leak bugs it had
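the zero-copy part is basically the stdlib's unsafe.String (go 1.20+), something like this sketch:

```go
package main

import "unsafe"

// bytesToString reinterprets a slice of the read buffer as a string without
// copying it. the string shares memory with the buffer, so the buffer stays
// alive as long as any of these strings are referenced, and the whole thing
// is collected together once the decoded event goes out of scope.
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(&b[0], len(b))
}
```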
ah yes, it also uses varints everywhere for field lengths, for tags and for content fields; this means that if a field is under 128 characters, only 1 byte is used. varint encoding basically uses the high bit of each byte as a continuation flag, leaving 7 bits of payload, so a 2 byte prefix covers up to about 16kb of data, and given the 500kb event size limitation a prefix will never be more than 3 bytes long (3 bytes would permit about 2Mb)
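the stdlib's encoding/binary implements exactly this kind of varint, so the prefix sizes are easy to check:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// 7 usable bits per byte, the high bit means "more bytes follow"
	for _, n := range []uint64{100, 127, 128, 16383, 16384, 500_000} {
		fmt.Printf("length %6d -> %d byte prefix\n", n, len(binary.AppendUvarint(nil, n)))
	}
	// and reading one back: 300 encodes as two bytes
	length, read := binary.Uvarint([]byte{0xac, 0x02})
	fmt.Println(length, read) // 300 2
}
```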
i have filed an issue about the question of how many tags should be permitted in events, and i honestly don't see how it would make any sense to make events with more than 256 tags, or tags with more than 256 fields, so the binary encoder only uses one byte to signify the number of tags and the number of fields in each tag, and i have suggested this be specified in the protocol to keep a sane cap on this field...
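so the tag section ends up with a shape roughly like this (a sketch only, hex compaction left out, and appendTags is an illustrative name rather than the real function):

```go
package main

import "encoding/binary"

// appendTags sketches the layout: a single byte for the number of tags, a
// single byte per tag for its field count, then each field as a varint length
// followed by the raw bytes.
func appendTags(buf []byte, tags [][]string) []byte {
	buf = append(buf, byte(len(tags))) // one-byte tag count
	for _, tag := range tags {
		buf = append(buf, byte(len(tag))) // one-byte field count
		for _, field := range tag {
			buf = binary.AppendUvarint(buf, uint64(len(field)))
			buf = append(buf, field...)
		}
	}
	return buf
}
```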
i guess if it ever proves to be a problem later i can change the tag count prefixes to varints, it wouldn't be hard to add varint encoding to those fields, but i am highly skeptical the limit will ever be breached
tomorrow this gets deployed and benchmarked, i'm very keen to do the comparison, i will use an external repo so i can easily pull in fiatjaf's code and compare them side by side
what i wrote is stringently modular: each field has its own reader and writer function, there is a read buffer and a write buffer, and in almost all cases the encoder copies directly from the source into the destination with no intermediate memory; when decoding, strings are not copied at all, except for those shorter binary encoded tags. i doubt it can be made any faster, except by tweaking things with unsafe copy-free techniques beyond what i already have written, bespoke hexadecimal encoding that drops safety checks thanks to the fixed format, and suchlike
i've not actually done much in the way of benchmarking my binary encoding work in the past, so it should be interesting
anyhow, that's me, GN y'all, the wild mleku now is disappeared
#deardiary #devstr #gn
of course, as i lay trying to sleep, some more thoughts come to me
specifically, in most cases the binary encoding wants to go directly to json, and the way it's architected it has to go through an intermediate, in-memory form... i'm probably going to look at how that logic works and see if the intermediate step can be shortcut, though it may not be possible, it depends on how the filters are written... for sure when a match is on the event ID, for example, there is no reason it needs an intermediate form, but other matches may need to filter over multiple parts of the data and thus be basically impossible to handle this way
walking the structure without decoding it can be done too, this is another possibility for avoiding memory copies... it could be interesting to add a progressive scanning field accessor, and then maybe the direct binary to json thing could be workable, since filtering by field could be relatively fast without needing memory copies
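a progressive scanning accessor over a varint-prefixed layout could look something like this sketch (simplified: it pretends every field is just a varint length followed by bytes, glossing over the fixed-size fields and the tag structure):

```go
package main

import "encoding/binary"

// fieldAt walks a buffer of varint-length-prefixed fields and returns the
// n-th field as a sub-slice of the buffer, without decoding or copying
// anything along the way.
func fieldAt(buf []byte, n int) []byte {
	for i := 0; ; i++ {
		length, read := binary.Uvarint(buf)
		if read <= 0 {
			return nil // truncated or malformed
		}
		buf = buf[read:]
		if uint64(len(buf)) < length {
			return nil // length prefix runs past the end of the buffer
		}
		if i == n {
			return buf[:length]
		}
		buf = buf[length:] // skip over this field without touching its contents
	}
}
```

with something like that, a filter that only matches on the event ID could compare the 32 bytes in place and go straight from binary to json on a hit, which is the shortcut from the previous note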
haha, well, hell, i'm gonna do it though - field accessor scanning and direct json codec, after i get the benchmark done tomorrow
this kind of dramatic reduction in memory allocation requirements can have dramatic effects on processing capacity; less overhead under load means more throughput and more total capacity, and it opens up the possibility of high traffic gossip event propagation with low latency... the less time and memory it takes to filter through data, the more feasible things like interactive collaborative document editing become, for example