#realy #devstr #progressreport
exploiting the benefits of my custom streaming varint codec, i have now changed the timestamps in the database indexes to these compact varints. this has shrunk the indexes from 64MB down to 59MB, for a stash of events that weighs 204MB as minified wire-encoded JSON
the current unix timestamps are still under 4 bytes of value, and will remain at that size until 2038, so for now they only take 4 bytes, but the encoding expands smoothly up to standard 64 bit signed integer timestamps, which cover well past 2140.
in other development notes, i have reorganised the documentation of the index type enums so that they are logically grouped and ordered. the metadata keys - pubkey, kind and created_at timestamp - form a group of 6: three individual indexes, two pair combinations (kind/created_at and pubkey/created_at), and the three-way combination kind/pubkey/created_at
these 6 indexes cover all of the primary metadata searches on events themselves. then there is a series of tag indexes:
- a-tags, which combine kind/pubkey/identifier
- e-tags and p-tags (the p-tag index tolerates hashtags being stored in it, as appears in some clients' broken follow-list implementations - why they don't use t-tags idk, but it is what it is)
- other standard single-letter tags (which include things like the mimetype and other categories)
- the d-tag, which is the identifier used on addressable (parameterized replaceable) events
- nonstandard tags, which also get indexed so they will actually be searchable, though not via the standard #-letter style filters; that will require an extension in the HTTP API
lastly there are some cache-management, GC-type indexes that store first seen, last accessed and access count, which will be usable for deciding whether to prune events out of the store, for an archive/cache two-layer architecture. a rough sketch of how the whole grouping might look is below.
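to make the grouping concrete, here is a hypothetical sketch in go of how such grouped index prefixes could be laid out; every name and value here is illustrative only, not the actual identifiers or bytes used in realy:

```go
// hypothetical sketch of grouped index prefixes; names and values are
// illustrative only, not the actual identifiers or bytes used in realy.
package main

import "fmt"

type prefix byte

const (
	// event metadata indexes
	pubkeyIdx prefix = iota
	kindIdx
	createdAtIdx
	kindCreatedAtIdx
	pubkeyCreatedAtIdx
	kindPubkeyCreatedAtIdx

	// tag indexes
	aTagIdx // kind/pubkey/identifier
	eTagIdx
	pTagIdx
	letterTagIdx // other standard single-letter tags
	dTagIdx
	nonstandardTagIdx

	// cache-management / GC indexes
	firstSeenIdx
	lastAccessedIdx
	accessCountIdx
)

func main() {
	fmt.Println(kindPubkeyCreatedAtIdx, accessCountIdx)
}
```

the only point being made is that related prefixes sit next to each other, in the same order the documentation now lists them.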
i've also been thinking about how to do query forwarding. for this, http API clients will open one subscription using SSE, which will deliver any out-of-order events or subscription filter matches. once query forwarding is implemented, standard filter queries will also return a relay-generated filter identifier at the top of the list of event IDs, so that when a forwarded query returns, that identifier is prefixed to the results coming back from the forwarded query, enabling the relay to recognise which client made the query and send those results to it over its SSE subscription connection.
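purely as a hypothetical sketch of that bookkeeping (none of these types or names are an actual protocol or the realy API), the relay-generated filter identifier travels with the event IDs and is used to route forwarded results onto the right SSE stream:

```go
// hypothetical sketch of the forwarded-query bookkeeping; none of these
// types or names are the actual realy API or a real protocol.
package main

import "fmt"

// a standard HTTP filter query could return the relay-generated filter
// identifier at the top, followed by the matching event IDs.
type queryResult struct {
	FilterID string
	EventIDs []string
}

// a forwarded query's results come back carrying the same identifier, so
// the relay knows which client's SSE subscription to push them to.
type forwardedResult struct {
	FilterID string
	EventIDs []string
}

func main() {
	// the original query response hands out the relay-generated identifier
	orig := queryResult{FilterID: "f-123", EventIDs: []string{"ev-001"}}

	// filter identifier -> the SSE stream of the client that made the query
	subs := map[string]chan string{orig.FilterID: make(chan string, 8)}

	// later, results arrive from the relay the query was forwarded to
	fwd := forwardedResult{FilterID: "f-123", EventIDs: []string{"ev-abc"}}
	if ch, ok := subs[fwd.FilterID]; ok {
		for _, id := range fwd.EventIDs {
			ch <- id // deliver over the client's SSE subscription
		}
	}
	fmt.Println(<-subs[orig.FilterID])
}
```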
a second distributed relay architecture feature i am designing will allow relays to subscribe to other relays' latest events (usually just an empty query that continues to forward all new events to the client/cache relay). it will entail building a subscription queue management system and an event ack message, so that the archive or source relay whose incoming events are being forwarded tolerates the subscription connection dropping: it recovers the connection, sends the events that arrived after the subscription dropped, and continues. this is out-of-band state that enables the implementation of a full relay distribution strategy
the last thing that would also be required is auth proxying. for this, clients need to be able to understand that they are authing to a different address than the relay they send a query to. this would enable a cache-type relay to serve a client exclusively, eliminating the replication of traffic that the current architecture forces and that currently causes big problems with huge bandwidth usage. with this, you will be able to deploy a cluster of relays with a few archives and many caches; clients connect to only one of them at a time, and the cache relay stores all the query results it gets via the forwarded query protocol, which requires this auth proxy protocol. additionally, cache relays in a cluster would hold subscriptions to each other, so that when forwarded queries bring in new events, they not only return the result to the client but also propagate the events horizontally - meaning that from that point, any other client on the cluster will quickly be able to see events that were never published to the specific cache relay that triggered the forwarded query.
with all of these new strategies in place, both geographical distribution and consortium-style collaboration become possible, further decreasing the attack surface available for suppressing the transit of events on nostr.
#realy #devstr #progressreport
so i changed my VPS to a new one. i unfortunately paid ahead quite a bit for the one in bulgaria so i've begged them to refund me my balance. no idea if they will, probably not. meh.
the thing is, the new one seems to be running the exact same management system (identical UI anyway), just based in Belgrade now instead of Sofia. i've also enabled namecheap's PremiumDNS service, and am using it as the DNS for my wireguard tunnel that now goes through that VPS, and i'm not seeing the same problem as before
expected behavior, based on what was happening with my local test instance:
- refresh page
- post or broadcast a note
- server sends back CLOSED auth-required
- client auths
- client resends event
- happiness
actual behaviour on the old VPS:
all of the above, except that for some reason the test realy would auth instantly, and of course then any event publish works; but on the one running on the VPS it wouldn't auth at first, then randomly would sometimes auth and start accepting events until that socket closed and the merry-go-round started again
on the new VPS:
same as test version, no problems, not my bug, fuck you vps.bg
#realy #devstr #progressreport
so, i wasn't satisfied at all with the algorithm i designed for sorting results by relevance. it seemed way too haphazard to be effective, so i'm revising it to work a different way:
- first, it scans the groups of results matching a single event, and creates a map that contains the distinct matching terms, and a count of how many times each match appears
- then it eliminates the results that don't have at least 1 of each term in the search query, because why the fuck should it return anything that doesn't have a complete set of the search terms? this is a baseline IMO, so using the sets and counts of terms above lets me eliminate everything that lacks the complete set
- next, it sorts the groups of matches for each event by their sequence in the search terms, so that they are in the same order that they appear in the match
- then it repeatedly scans forward through the matched terms in their now sorted order, and gathers every possible group that has a complete set, not necessarily in the same order, but complete
- each of these sets is then evaluated by the distance between the first and last term of the (possibly several) instances of the complete set, stepping forward from the previous step's starting point; the minimum distance between first and last of a complete set is then stored in each event's result group
- the minimum distance of each event's smallest complete set then becomes the criterion for sorting all of the groups, and the top results will be the ones that come closest to matching the query sequence. they can be in a different order, since order matters less than completeness for relevance, but if the terms appear in the exact order their distance will be the minimum of the result set and thus have the most relevance (see the sketch of this minimum-window idea just below)
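to make the "smallest complete set" step concrete, here is a minimal sketch of how that minimum window could be computed, assuming each match carries its search-term index and its word position in the event text; the types and names are illustrative, not the actual realy code:

```go
// minimal sketch of the "smallest complete set" scoring described above,
// under assumed types: each match is a (term index, word position) pair.
package main

import (
	"fmt"
	"sort"
)

type match struct {
	term int // index of the search term that matched
	pos  int // word position of the match in the event text
}

// minWindow returns the smallest distance between the first and last match
// of a window that contains every one of nTerms distinct terms, or -1 if
// the matches do not contain a complete set.
func minWindow(matches []match, nTerms int) int {
	// sort matches by their position in the event text
	sort.Slice(matches, func(i, j int) bool { return matches[i].pos < matches[j].pos })
	counts := make(map[int]int)
	have, best, left := 0, -1, 0
	for right := range matches {
		if counts[matches[right].term] == 0 {
			have++
		}
		counts[matches[right].term]++
		// shrink from the left while the window still holds a complete set
		for have == nTerms {
			d := matches[right].pos - matches[left].pos
			if best < 0 || d < best {
				best = d
			}
			counts[matches[left].term]--
			if counts[matches[left].term] == 0 {
				have--
			}
			left++
		}
	}
	return best
}

func main() {
	// event where the 3 query terms appear at these word positions
	m := []match{{0, 2}, {1, 9}, {2, 10}, {0, 11}, {1, 40}}
	fmt.Println(minWindow(m, 3)) // smallest complete set spans positions 9..11 -> 2
}
```

the result is the per-event number the list above describes: the tightest span of text that still contains every query term at least once.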
most full text search algorithms i have seen in action include results that only match one term, but that requires them to maintain giant dictionaries for each language being searched, containing all of the low-relevance terms like "the" and "a" and "an" and "of" and so forth... i want to make a search that doesn't need this expensive and time-consuming list for every language it supports. instead, it will support all languages by simply treating each term in the search like a distinct number and looking for the shortest set of matches of these terms, since those are the most relevant matches. there is zero point in keeping anything without the complete set: if the results that don't match only contain the words that are not distinctive and are very common across all documents, they are useless results, and eliminating them saves a lot of time while producing a more useful result
the user can then see if they have put too many minimal-relevance terms in their search query and maybe trim out those common words. but the common words can be important, because very often a pair or even a phrase in the search term should appear as a unit - articles and nouns usually appear together, and in some languages, like the nordic and slavic ones, the article and noun combine into one word, or the adjective and noun fuse into a composite word (eg "bezplatno", "schtatite", meaning free-as-in-gratis and "the state") - whereas in english the same idea can be more muddled and spread across several words, as in "free of charge" and "the state", or "sim pagamento", or "gratis", or "o stados".
#realy #devstr #progressreport
the full text search is theoretically now implemented
it iterates the new fulltext index searching for word matches
then it checks that the rest of the filter criteria match, eliminating candidates that fail any of the given criteria, i.e. (a sketch of this check follows the list):
- eliminates results outside of since<->until,
- that aren't one of the kinds requested,
- that aren't published by one of the specified pubkeys,
- that don't have the requested language tag, and
- that don't have one of the tags in the filter
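as a rough illustration of that elimination step, with hypothetical event and filter shapes (the real realy types differ; this uses the standard slices package from Go 1.21+):

```go
// sketch of the candidate-elimination step with hypothetical types.
package main

import (
	"fmt"
	"slices"
	"time"
)

type event struct {
	Pubkey    string
	Kind      int
	CreatedAt time.Time
	Language  string
	Tags      map[string][]string // tag letter -> values
}

type filter struct {
	Since, Until time.Time
	Kinds        []int
	Authors      []string
	Language     string
	Tags         map[string][]string
}

// matches reports whether a candidate survives every criterion that is set
// in the filter; empty filter fields impose no constraint.
func matches(ev event, f filter) bool {
	if !f.Since.IsZero() && ev.CreatedAt.Before(f.Since) {
		return false
	}
	if !f.Until.IsZero() && ev.CreatedAt.After(f.Until) {
		return false
	}
	if len(f.Kinds) > 0 && !slices.Contains(f.Kinds, ev.Kind) {
		return false
	}
	if len(f.Authors) > 0 && !slices.Contains(f.Authors, ev.Pubkey) {
		return false
	}
	if f.Language != "" && ev.Language != f.Language {
		return false
	}
	for letter, wanted := range f.Tags {
		ok := false
		for _, w := range wanted {
			if slices.Contains(ev.Tags[letter], w) {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	ev := event{Kind: 1, CreatedAt: time.Now(), Pubkey: "abc"}
	f := filter{Kinds: []int{1}, Authors: []string{"abc"}}
	fmt.Println(matches(ev, f)) // true
}
```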
then it groups the resulting fulltext index entries belonging to the same event into a unit and creates a map of them
then it converts that randomly-iterating map into a flat array
it iterates the array and calculates the distance between the first and last word match in the sequence of the text of the event
then it segments the results by the number of words that match in each event, putting the ones with the same number of word matches together
then it calculates the distance between the first and last word matches in the text (the terms can appear multiple times but are sorted by their appearance, so this is just an approximation)
then it gets a list of all the terms in the event by their sequence number in the original search, and with this array it counts how many of those matches are in ascending order matching the search terms
with these two metrics, we can calculate relevance, as the higher the number of items in matching sequence the more relevant, and the closer they are together, the more likely they are to match the search text
then we sort the groups of same-count of words from the search text by distance AND sequence, meaning the top results have the lowest distance and the highest sequence
then with the segments individually sorted, we zip them back together into a list, extract their event id, pubkey and timestamp and return that to the caller (this same result is used by the filter in the HTTP API so it can filter out pubkeys and sort the results by timestamp descending or ascending according to the query parameters).
and finally it trims the list down to the number of results requested in the limit *or* the configured max limit (generally 512, but the http endpoint will allow 1000 for unauthed users and 10000 for authed users, since it only returns the event id, so the client can paginate on their side)
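as a rough sketch of that segment-and-zip ordering (illustrative types only, not the actual realy code), the same effect can be had with a single comparator: more matched words first, then smaller distance, then higher in-order sequence count:

```go
package main

import (
	"fmt"
	"sort"
)

// hypothetical per-event result of the fulltext scan; field names are
// illustrative, not the actual realy types.
type result struct {
	EventID  string
	Matches  int // how many distinct query words matched
	Distance int // span between first and last match in the text
	Sequence int // how many matches appear in query order
}

// rank sorts results so that events matching more words come first, and
// within the same match count, smaller distance and higher sequence win -
// effectively the "segment then zip" ordering described above.
func rank(rs []result) {
	sort.SliceStable(rs, func(i, j int) bool {
		if rs[i].Matches != rs[j].Matches {
			return rs[i].Matches > rs[j].Matches
		}
		if rs[i].Distance != rs[j].Distance {
			return rs[i].Distance < rs[j].Distance
		}
		return rs[i].Sequence > rs[j].Sequence
	})
}

func main() {
	rs := []result{
		{"ev1", 2, 3, 2},
		{"ev2", 3, 10, 1},
		{"ev3", 3, 4, 3},
	}
	rank(rs)
	for _, r := range rs {
		fmt.Println(r.EventID) // ev3, ev2, ev1
	}
}
```

sorting with the match count as the primary key collapses the separate "segment, sort, zip" steps into one stable sort.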
i know that all sounds complicated but it's now all written and this will enable a fairly decent relevance sorted full text search for nostr text events.
the hard part now is going to be testing it. i will probably make two endpoints for this; one will be disabled later, but it will return the full events in the result as an array, so i can see them in the api, squash any bugs or logic errors in the search code... and also see how long it takes to produce the results.
#realy #devstr #progressreport
i'm getting to the last steps of writing the fulltext query filter handling, and i've written the handler for p tags - that was easy, there's an index for that
but there isn't an index for e tags, which is what the "extrafilter" is about in fiatjaf's query function in the badger database
i'm looking this up and down and i'm like, uh, there's no index for e tags, which are commonly needed to search for replies!
i think i have to go to the indexes now and create a new index for e-tags to make this work, because i really don't see how i can filter on them with the fulltext search with a full filter
and if i'm gonna make a new e tag index, then i can speed up normal filter searches a lot too
hilarious, but kinda annoying, i was all homed in on the fulltext search function and now i have to go sideways and add a new index and that will mean needing to rescan the indexes when i'm done building that, before i can expect this fulltext search to work
i'm kinda glad i'm digging deep into this key-value store database indexing design stuff... i'm sure it's going to be valuable. i have already learned enough to build a reasonable set of indexes for my fiat mine data analysis stuff, but this is a step further advanced; after i've built this i can say i have built a fully capable database engine for nostr that is optimized for searching events... with that e-tag index added it should slash the search time by maybe as much as 50% for regular kind 1 or general threaded queries.
#realy #devstr #progressreport
the responsibilities at the fiat mine have been pretty low these last few days... this afternoon i probably need to spend a few hours adding a text-embeddings similarity score based on public forum/chat messages, but i am mostly ahead of my front end and back end peers with regard to meeting the release target functions, which is funny because during the building process i was hounded randomly, about once a month, by the CTO for being "too slow", and actually i'm faster than the rest of them. lol!
anyway, i don't mind, because it means i can add this feature - it's been on my todos for the last few weeks and i'd almost forgotten about it - and it will make #realy a compelling option for people wanting to deploy relays, because it will have a full text search. as far as i know, maybe one proprietary relay implementation, being built by an esteemed fellow black sheep, is able to do this properly as well. of course i'm probably gonna get competitive about this at some point and do something loony like build a little benchmarker to see how fast it is... meh, probably not, but i'm going to make mine as fast as possible
it seems i forgot another todo on the fiat mine task tho, gonna have to put this realy work aside for the morning, continue the fulltext search algorithm later
#realy #devstr #progressreport
so, now that i am up to the stage of actually writing a search function, i realise that the fulltext indexes need a lot more data in them
the fulltext index will have all of this:
[ 15 ][ word ][ 32 bytes eventid.T ][ 32 bytes pubkey ][ 8 bytes timestamp.T ][ 2 bytes kind ][ 4 bytes sequence number of word in text ][ 8 bytes Serial ]
with this as the result from the word query, i can filter by timestamp, pubkey and kind immediately upon getting them, and with the sequence number in the text i can quickly determine sequence similarity, first by the words being in query order and then by comparing how many of them are in the same sequence
with all of these extra fields in the keys, almost all of the necessary filtering required for a fulltext search can be done effectively, and the amount of data in the index is not so great but the main thing is that the word matches can be found immediately, filtered by their timestamp and kind, grouped by their common presence in an event, and then sorted in their sequence order within the event.
it is a lot of data, but all of these fields are needed for evaluation, and without them being in one index i would need a lot more queries and decoding to find them; this trades some extra size in the database for speed at finding, filtering and sorting matches
gonna need to make a specific data structure for it - well, probably a set of functions that pick out each field and decode it, which would be more efficient since i can then jump straight to the field that needs to be filtered according to the other, non-fulltext part of the search, eg the kinds, the pubkey, the timestamp range... not really sure which order the process should check those in... probably pubkey, kind, timestamp, in that order, depending on which of them are present in the filter (a sketch of such field accessors is below)
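since the word is variable length, the fixed-size fields can all be read from the end of the key. a minimal sketch of such accessors over the layout above, assuming big-endian integer fields (illustrative only, not the actual realy code):

```go
// field accessors over a fulltext index key laid out as
// [prefix][word][32 eventid][32 pubkey][8 timestamp][2 kind][4 seq][8 serial];
// the fixed-size fields are located relative to the end of the key.
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	idLen     = 32
	pubLen    = 32
	tsLen     = 8
	kindLen   = 2
	seqLen    = 4
	serialLen = 8
	tailLen   = idLen + pubLen + tsLen + kindLen + seqLen + serialLen
)

func word(k []byte) []byte    { return k[1 : len(k)-tailLen] }
func eventID(k []byte) []byte { return k[len(k)-tailLen : len(k)-tailLen+idLen] }
func pubkey(k []byte) []byte  { return k[len(k)-tailLen+idLen : len(k)-tailLen+idLen+pubLen] }

func timestamp(k []byte) uint64 {
	off := len(k) - serialLen - seqLen - kindLen - tsLen
	return binary.BigEndian.Uint64(k[off : off+tsLen])
}

func kind(k []byte) uint16 {
	off := len(k) - serialLen - seqLen - kindLen
	return binary.BigEndian.Uint16(k[off : off+kindLen])
}

func seq(k []byte) uint32 {
	off := len(k) - serialLen - seqLen
	return binary.BigEndian.Uint32(k[off : off+seqLen])
}

func serial(k []byte) uint64 { return binary.BigEndian.Uint64(k[len(k)-serialLen:]) }

func main() {
	// build a fake key: prefix 15, the word "nostr", then zeroed fixed fields
	key := append([]byte{15}, []byte("nostr")...)
	fixed := make([]byte, tailLen)
	binary.BigEndian.PutUint16(fixed[idLen+pubLen+tsLen:], 1) // kind = 1
	key = append(key, fixed...)

	fmt.Println(string(word(key)), kind(key), timestamp(key), seq(key), serial(key))
	fmt.Println(len(eventID(key)), len(pubkey(key)))
}
```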
this may take longer than i hoped but i honestly didn't really have any idea what would be involved... at day 3 working on this now, and i only have the index writing done, but with this scheme designed i should progress quickly to a working search function
#realy #devstr #progressreport
so, i had really written the full scanning thing wrong - it didn't need to be keeping track of the events that were already indexed...
events now get indexed when they are saved, and because this disrupts the import function, the import now buffers the upload in a temp file and then lets the http client finish and disconnect, since it was stalling due to the extra delay from the time it takes to build the indexes
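the shape of that fix is roughly the following sketch: spool the body to a temp file, reply to the client, and do the indexing afterwards. the handler and processImport names here are hypothetical, not the realy code:

```go
// sketch of "buffer the upload, reply early, index later".
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func importHandler(w http.ResponseWriter, r *http.Request) {
	// spool the whole upload to a temp file so the client isn't held open
	// while indexes are being built
	tmp, err := os.CreateTemp("", "import-*.jsonl")
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	if _, err := io.Copy(tmp, r.Body); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	tmp.Close()
	// the client can finish and disconnect now; indexing happens afterwards
	w.WriteHeader(http.StatusAccepted)
	go func(path string) {
		defer os.Remove(path)
		if err := processImport(path); err != nil {
			log.Printf("import failed: %v", err)
		}
	}(tmp.Name())
}

// processImport is a placeholder for reading the spooled events and writing
// them plus their indexes to the database.
func processImport(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// ... decode events and save them with their indexes ...
	return nil
}

func main() {
	http.HandleFunc("/import", importHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```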
further, the rescan admin function now also regenerates these indexes as well, so, just gotta eat me some lunch, then i need to revise the delete/replace code to remove the indexes when events are removed, and finally get to the last step of the task, implementing the search
in other news, my double vision/convergence problem at close distance is definitely reducing, slowly. i can now actually read small text; it's just a bit doubled if the text is less than about 50cm from my eyes. the glasses help me read more easily, but i feel like it's just a matter of continuing the potassium/calcium/magnesium and probably continuing to dial the alcohol consumption back to basically none. i think my kidneys are still weak and the alcohol is slowing the recovery. and incidentally, i'm getting a lot more work done.
#realy #devstr #progressreport
so, i finally have some kind of indexing happening
it's quite incredible how big these full text indexes are, just on the face of the size of the database directory it looks like it takes more space for the index than the data
i probably need to put some kind of further limit on how many symbols get an inverted index - i mean, some of those records must list practically the whole event store, because virtually every note has the word "the" in it - but how do i do this correctly without ending up maintaining giant lists of prepositions for every language...
well, i'll see i guess
i mean, it's kinda worth it if i can implement fast full text searching of notes, even if the storage space is greatly increased... it just is a lot of space
anyhow
in the process of doing this i ran into a nasty thing in badger where it just kills the process completely, without any feedback. i had to use a go workspace, clone the badger repo into it, and then i was able to insert a panic message and actually get a traceback; eventually i figured out i was using iterators for something that just needed a simple Get call to see if the literal exact key existed.
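for reference, the fix boils down to something like this sketch: an exact-key existence check with a plain Get inside a read transaction, no iterator needed (the key and database path are placeholders):

```go
// checking whether an exact key exists in badger without an iterator.
package main

import (
	"errors"
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func keyExists(db *badger.DB, key []byte) (bool, error) {
	var found bool
	err := db.View(func(txn *badger.Txn) error {
		_, err := txn.Get(key)
		if errors.Is(err, badger.ErrKeyNotFound) {
			return nil // not found is not an error for this check
		}
		if err != nil {
			return err
		}
		found = true
		return nil
	})
	return found, err
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/example-badger"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	ok, err := keyExists(db, []byte("some-exact-key"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("exists:", ok)
}
```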
#realy #bunk #devstr #progressreport
so, fyne is a bit buggy in places; you sometimes have to force it to refresh when you do things in a sequence it doesn't expect - for example, dragging the window to another screen sometimes resizes the window so the widgets overflow, and you have to resize it to get them to fit the window again
i think i've got an inkling of how to avoid these glitches by adding some more calls to the function that repaints the window, so given time i will make it behave like a good little GUI app.
i've now learned how to make it render text according to the set theme (which defaults to whatever your desktop manager says it is, which is nice), and so far this is what it does:
- opens a start screen with an nsec input and one relay input, where you paste in your nsec and type in the relay you want to use
- then you can click the "next" button and it flips to the main display, which shows the text of the bunker URL nicely wrapped to stay a decent size, plus a button that puts the bunker URL into the clipboard when clicked (a little sketch of this flow is below)
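a minimal sketch of that two-screen flow with fyne v2 - the bunker URL construction is stubbed out, since the actual signer logic isn't shown here:

```go
// sketch of the two-screen flow; the bunker URL derivation is a placeholder.
package main

import (
	"fyne.io/fyne/v2"
	"fyne.io/fyne/v2/app"
	"fyne.io/fyne/v2/container"
	"fyne.io/fyne/v2/widget"
)

func main() {
	a := app.New()
	w := a.NewWindow("bunk")

	nsec := widget.NewPasswordEntry()
	nsec.SetPlaceHolder("nsec...")
	relay := widget.NewEntry()
	relay.SetPlaceHolder("wss://relay.example.com")

	next := widget.NewButton("next", func() {
		// placeholder: the real app derives the bunker URL from the nsec
		// and relay entered on the start screen
		bunkerURL := "bunker://<pubkey>?relay=" + relay.Text

		label := widget.NewLabel(bunkerURL)
		label.Wrapping = fyne.TextWrapBreak // keep the long URL a decent size
		copyBtn := widget.NewButton("copy", func() {
			w.Clipboard().SetContent(bunkerURL)
		})
		// flip to the main display
		w.SetContent(container.NewVBox(label, copyBtn))
	})

	w.SetContent(container.NewVBox(nsec, relay, next))
	w.ShowAndRun()
}
```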
really basic steps so far, but as i get more familiar with how to work with it and the quirks i need to handle, i think in a few days i will have a working signer bunker app, at least for linux. i think i can actually build apps to push to android and ios as well, and probably windows and mac versions too, which will cover all the bases.
idk about properly covering ios and mac platforms but i can test it on windows at least as well
it should be usable soon, once i learn all the tricks for making it behave itself. fortunately the demo app has a lot of examples, which got me this far, and i'm sure they will help me break through to the MVP.
and yes, i'm calling it "bunk" because i have an ironic sense of humor - why else would i have called my relay "realy", a typo i often make... because i know i'm a noob at a lot of things, and when i am, i make fun of myself. proof of non-psycho.
#realy #devstr #progressreport
yesterday i spent a lot of time writing a novel (i think) variable length integer encoder - it's the reverse of regular varint encodings in that it uses the 8th bit as a terminal marker on the last byte, instead of a "there's more" bit on every byte - to produce a binary encoding for use in the database, written against the #golang io.Writer/io.Reader interfaces (so it can be used to stream the binary data over the wire or to disk)
i figure the way the varint encoder works it's just a little faster, because it only sets the 8th bit on the last byte, so that saves potentially as much as 1/7th of that operation - not huge, but ok; in any case a new one was required for the streaming read/decode anyway.
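this is my reading of the scheme, as a minimal sketch: 7 payload bits per byte, with the high bit set only on the final byte as the terminal marker; the real realy codec may differ in details like byte order or packing:

```go
// sketch of a terminal-marker varint: the high bit is set only on the last
// byte, the reverse of the usual continuation-bit scheme.
package main

import (
	"bytes"
	"fmt"
	"io"
)

// writeVarint emits v in 7-bit groups, low bits first, setting the high bit
// only on the final byte.
func writeVarint(w io.Writer, v uint64) error {
	var buf [10]byte
	n := 0
	for {
		b := byte(v & 0x7f)
		v >>= 7
		if v == 0 {
			buf[n] = b | 0x80 // terminal marker on the final byte
			n++
			break
		}
		buf[n] = b
		n++
	}
	_, err := w.Write(buf[:n])
	return err
}

// readVarint consumes bytes until it sees one with the high bit set.
func readVarint(r io.ByteReader) (uint64, error) {
	var v uint64
	var shift uint
	for {
		b, err := r.ReadByte()
		if err != nil {
			return 0, err
		}
		v |= uint64(b&0x7f) << shift
		if b&0x80 != 0 { // terminal byte reached
			return v, nil
		}
		shift += 7
	}
}

func main() {
	var buf bytes.Buffer
	ts := uint64(1717171717) // a unix timestamp
	if err := writeVarint(&buf, ts); err != nil {
		panic(err)
	}
	fmt.Println("encoded bytes:", buf.Len()) // length depends on the value's magnitude
	got, _ := readVarint(&buf)
	fmt.Println(got == ts)
}
```

the reader knows it has reached the end the moment it sees a byte with the high bit set, which is what makes the streaming decode straightforward.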
it's not on by default - you have to set BINARY in the environment to make it do this, and there is no migration handling: if BINARY is false it tries to decode json, and if true, binary. so to change over, you need to export the events, nuke or just delete the database files (~/.local/share/realy), flip it to true, and then import the events back (and if you deleted the database, you'll want to re-set your configuration - you really should keep a copy of the configuration json that gets stored in the database, for exactly reasons like this).
it's definitely noticeably faster. i'm quite sure the binary encoder will prove to be the fastest nostr database binary encoder, and it's also extremely simple; the result is about a 30% reduction in event storage size, and maybe 1 microsecond to perform the encoding.
#realy #devstr #progressreport
i've got a new trick now for when i make an app that only rarely does very memory/disk intensive operations - like importing events, or my fiat mine job doing (N-1)N operations comparing all of a collection of JSON documents to each other (this blew up the memory on a heroku instance to just over 1gb, and their server kills the process and calls it crashed).
there is a manual trigger function in the runtime/debug library, debug.FreeOSMemory(), which forces a garbage collection and returns as much memory as possible to the OS
so i just run that frequently through import and export, and voila, i can disable the swap on my VPS and realy doesn't blow up the memory.
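the trick is just to sprinkle that call through the heavy loop, something like this sketch (the batch size is arbitrary):

```go
// calling debug.FreeOSMemory() periodically during a memory-heavy loop so
// the process hands freed pages back to the OS instead of holding them.
package main

import (
	"runtime/debug"
)

func importEvents(events [][]byte, save func([]byte) error) error {
	for i, ev := range events {
		if err := save(ev); err != nil {
			return err
		}
		if i%1000 == 999 {
			// force a GC and return freed memory to the OS
			debug.FreeOSMemory()
		}
	}
	debug.FreeOSMemory()
	return nil
}

func main() {
	_ = importEvents(nil, func([]byte) error { return nil })
}
```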
in slightly unrelated matters, i learned about zigzag matrix traversal, and just now found this:
https://algocademy.com/blog/matrix-traversal-mastering-spiral-diagonal-and-zigzag-patterns/
spiral, diagonal and zigzag are three common methods that are used to optimize iterations on specific types of operation
in my fiat mine comparison matrix, the optimization is about reducing the number of decode operations on the JSON of the entries at each point in the matrix. the optimization that helps here is choosing a path such that you can cache some number of pre-decoded values from each of the elements of the array (which is being compared exhaustively)
spiral wouldn't do the trick, because it only saves you decoding one again for each circuit
diagonal would be no different from left-right-top-bottom path (scanline pattern)
zigzag is like diagonal except that instead of starting again from the same side, you move to the next non-visited node and then take a reverse diagonal
with zigzag iteration i will be able to tune the number of decoded elements in each comparison to reduce the number of times the data has to be marshaled again to be compared
i will likely make it a divide-and-conquer, multi-level zigzag too, depending on the number of entries that need this comparison: the outer level will traverse in a scanline pattern, because that means one of the segments of the grid i cut the matrix into stays hot the whole time it passes left-right-top-bottom in simple iteration
i am aiming to have a comparison algorithm that can efficiently and quickly render the results of such a comparison of all elements with each other in an array, to 100k and beyond, without needing tens of gigabytes of memory to do the job
it will also use a sync.Pool for this purpose, that keeps a limited size cache of the decoded elements, and when the cache reaches that size it will discard some percentage of the least recently used elements
i can probably run it concurrently too: zigzag over sub-matrices left to right, so if i can use 4 cores i could cut the grid into 4 sections, zigzag each section in parallel, then move down to the next row, and so on until the computation is complete
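as a rough sketch of the basic idea (single level, no sub-grids or concurrency yet), here is a zigzag walk over the anti-diagonals of the pairwise comparison matrix with a small, bounded cache of decoded entries; the decode, compare and eviction policies are placeholders for what i actually need:

```go
// zigzag (alternating anti-diagonal) traversal of the pairwise comparison
// matrix with a bounded cache of decoded JSON entries.
package main

import (
	"encoding/json"
	"fmt"
)

type decoded map[string]any

// decodeCached returns the decoded form of raw[i], keeping at most maxCache
// entries decoded at once (eviction here is "drop an arbitrary entry",
// standing in for the LRU-ish behaviour described above).
func decodeCached(cache map[int]decoded, raw [][]byte, i, maxCache int) decoded {
	if d, ok := cache[i]; ok {
		return d
	}
	if len(cache) >= maxCache {
		for k := range cache {
			delete(cache, k)
			break
		}
	}
	var d decoded
	_ = json.Unmarshal(raw[i], &d)
	cache[i] = d
	return d
}

// zigzagCompare walks the upper triangle of the n x n comparison matrix
// along anti-diagonals, alternating direction on each diagonal.
func zigzagCompare(raw [][]byte, maxCache int, compare func(a, b decoded)) {
	n := len(raw)
	cache := make(map[int]decoded, maxCache)
	for d := 0; d < 2*n-1; d++ { // d = i + j, the anti-diagonal index
		lo, hi := 0, d
		if d >= n {
			lo, hi = d-n+1, n-1
		}
		for k := 0; k <= hi-lo; k++ {
			i := lo + k
			if d%2 == 1 { // reverse direction on odd diagonals
				i = hi - k
			}
			j := d - i
			if i >= j { // only the upper triangle, skip self-comparison
				continue
			}
			compare(decodeCached(cache, raw, i, maxCache), decodeCached(cache, raw, j, maxCache))
		}
	}
}

func main() {
	raw := [][]byte{[]byte(`{"a":1}`), []byte(`{"a":2}`), []byte(`{"a":3}`)}
	count := 0
	zigzagCompare(raw, 2, func(a, b decoded) { count++ })
	fmt.Println("comparisons:", count) // 3 pairs for 3 elements
}
```

adjacent diagonals touch almost the same set of row and column indices, so with a cache roughly the size of a diagonal most lookups hit already-decoded entries instead of re-unmarshaling the JSON.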