mleku @mleku - 1y
ohshit... so, i'm up to the last piece of the puzzle on how to make garbage collection work on the badger event store, and i'm reminded of an inconvenient fact about the layout of the indexes that nostr:npub180cvv07tjdrrgpa0j7j7tmnyl2yr6yr7l8j4s3evf6u64th6gkwsyjh6w6 designed for eventstore/badger: yes, the event serials are suffixes, and the efficient iteration algorithms for streaming parallel iterations require prefixes! woo hoo. now, prepare to break everything omg lol #devstr #database #databass #wompwomp #deardiary #replicatr
actually, no, i can't do it that way either, because the whole point of the indexes is that they use those prefixes to find things, like the ID search, which needs the prefix to be a specific byte followed by the first 8 bytes of the event ID. it needs to be done as a loop over each index prefix, with a scan that yanks the matches out of the last bytes. well, it is what it is
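to illustrate what i mean (a rough sketch only; the prefix byte and key layout here are made up for the example, not the actual eventstore/badger constants): each index key carries the event serial as its last 8 bytes, so the only way to collect serials is to walk each index prefix and slice them off the tail of every matching key

```go
package main

import (
	"encoding/binary"

	badger "github.com/dgraph-io/badger/v4"
)

// illustrative layout, not the real eventstore/badger constants:
// [1 byte index-type prefix][8 bytes of event ID][8 bytes event serial]
const idIndexPrefix byte = 1

// collectSerials walks one index prefix and pulls the event serial
// off the last 8 bytes of every matching key.
func collectSerials(db *badger.DB, prefix []byte) (serials []uint64, err error) {
	err = db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false // keys only, values are never touched
		it := txn.NewIterator(opts)
		defer it.Close()
		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			k := it.Item().Key()
			if len(k) < 8 {
				continue
			}
			// the serial is a suffix, so it can't be seeked on directly;
			// it has to be sliced off the end of each key.
			serials = append(serials, binary.BigEndian.Uint64(k[len(k)-8:]))
		}
		return nil
	})
	return
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/events").WithLogger(nil))
	if err != nil {
		panic(err)
	}
	defer db.Close()
	_, _ = collectSerials(db, []byte{idIndexPrefix})
}
```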
been trying to avoid using maps unnecessarily, as linear iterations and sort functions are substantially more efficient, but in this case i think i now need a couple of maps for the unpruned and pruned counts (with the last access timestamp), so that each pass over the different filter index types can quickly locate which record needs its size value incremented. it's only the pruned events that need this, so that means i only need one map... more memory, but whatever, has to be done
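roughly what that one map looks like (names are mine for the sketch, not the real replicatr types): keyed on the uint64 form of the event serial, one small record per pruned event, and every index pass just bumps the size on the entry it hits

```go
package main

import "fmt"

// illustrative sketch of the single pruned-event map; the names are
// mine, not the actual replicatr types.
type prunedInfo struct {
	size       uint64 // size counted for this event so far
	lastAccess int64  // unix timestamp of the last access to the event
}

// pruned is keyed by the uint64 form of the event's database serial;
// unpruned events simply never get an entry.
var pruned = make(map[uint64]*prunedInfo)

// addIndexSize is called for each matching index key during a pass over
// one of the filter index types, bumping the size on the right record.
func addIndexSize(serial uint64, n uint64) {
	if p, ok := pruned[serial]; ok {
		p.size += n
	}
}

func main() {
	pruned[42] = &prunedInfo{lastAccess: 1700000000}
	addIndexSize(42, 56)
	fmt.Println(pruned[42].size) // 56
}
```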
making a fast transform from the database serials of events to a cheap-to-compute uint64 value was essential in all this. it made everything a lot neater, and right now it makes the map iteration a lot more efficient, because having this map only means about 2N extra data to deal with... i'm not sure if there is any way to thin down how much data has to be generated in the count stage, but anyhow i think it's still way under 500mb for this process, pretty sure it will work out to about 128mb at most for this task, and it can possibly be made into a Stop The World GC pass with a mutex later, if the GC blows the memory utilisation past my 500mb target when it runs
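the transform itself is nothing fancy, assuming the serial is a plain 8-byte big-endian counter (my guess at the layout for this sketch, not lifted from the source) it's just a byte-order decode, and that's also why the map stays cheap

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// serialToUint64 turns an 8-byte, big-endian database serial into a
// plain uint64 that can be used directly as a map key.
// (assumes the serial really is 8 big-endian bytes; that's my guess at
// the layout for this sketch, not taken from the replicatr source)
func serialToUint64(serial []byte) uint64 {
	return binary.BigEndian.Uint64(serial)
}

// uint64ToSerial goes back the other way when a key has to be rebuilt.
func uint64ToSerial(v uint64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, v)
	return b
}

func main() {
	s := uint64ToSerial(123456)
	fmt.Println(serialToUint64(s)) // 123456
	// each map entry costs on the order of a few tens of bytes all-in
	// (8-byte key + pointer + 16-byte struct + map overhead), so the
	// per-pruned-event cost of the count stage is small and bounded.
}
```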
the purpose of all this is to keep as much search intelligence as possible in the relay event store, to speed up locating the results of a request filter, while managing storage usage for both the raw events and the pruned events that exist on the L2 (etc). anyway, it's gonna be the best nostr relay database implementation by far: it manages its own storage, and doesn't need a #nukening or some unimplemented complex mechanism to decide how to delete old stuff or move it to archive
8 seconds to determine the size of all events stored in a 14GB database that also includes a filter index? that's bad? you just don't know how to use it, nor how to write binary encoders properly
for comparison, it takes longer than that to run `du` normally on a 512GB filesystem, and a filesystem stores its size tables independently
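for the curious, a rough sketch of how a sizing pass like that can be done quickly in badger (not the actual replicatr code, and the event-record prefix byte here is hypothetical): iterate with value prefetch turned off and sum the key/value sizes straight from the keys and metadata, so the values themselves never get read off disk

```go
package main

import (
	"fmt"
	"time"

	badger "github.com/dgraph-io/badger/v4"
)

// totalSize sums approximate key+value sizes for every key under prefix
// without ever loading a value from disk.
func totalSize(db *badger.DB, prefix []byte) (n int64, err error) {
	err = db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false // sizes come from keys/metadata only
		it := txn.NewIterator(opts)
		defer it.Close()
		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			item := it.Item()
			n += item.KeySize() + item.ValueSize()
		}
		return nil
	})
	return
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/events").WithLogger(nil))
	if err != nil {
		panic(err)
	}
	defer db.Close()
	start := time.Now()
	n, err := totalSize(db, []byte{0}) // hypothetical event-record prefix byte
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes in %v\n", n, time.Since(start))
}
```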