Querying for a random record
I was wondering if there is a standard way to query for a random record.
Some ways I have thought about achieving this:
- Counting the number of records and generating a number between 0 and n (falls over on deleted records and relies on a lengthy count call)
- Creating indexes that reorder the records, and then querying the indexes.
18 Replies
hi! the standard high-performance way in a database is: (1) generate a random float in `[0...1)` and insert it into every record in the table; (2) at query time, generate a random float in the same range and return the first record `gte` it. if you get no records, return the first record `lt` it. if you still get no records, your table is empty 😉
(this actually would be a good convex component when we release those... and could probably be written in about an hour 🤔)

Thanks for the response, that’s a pretty neat trick to generate the random number during insertion. Will give it a shot
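To make step (1) concrete, here's a minimal sketch in Convex TypeScript (the table name `items`, the field `rand`, and the index name `by_rand` are placeholder assumptions, not names from the thread):

```ts
// convex/schema.ts
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  items: defineTable({
    title: v.string(), // stand-in for your real fields
    rand: v.number(),  // step (1): a uniform float in [0, 1), stamped at insert
  }).index("by_rand", ["rand"]),
});
```

```ts
// convex/items.ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";

export const add = mutation({
  args: { title: v.string() },
  handler: async (ctx, { title }) => {
    // every record carries its own random float
    await ctx.db.insert("items", { title, rand: Math.random() });
  },
});
```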
Interesting @jamwt, we've spent quite a lot of time fiddling with different ways to query data at random in Convex efficiently.
The approach you describe seems very simple and nice, but not very performant: for querying just one document at random, we must fetch the entire table and, on average, iterate over 50% of the elements. In the worst case I guess it's even O(n), right?
if you have an index on the random number, you have to look at a single row, not the whole table
yep, you use an index
then it's basically log n time (the btree traversal)
Oh, I don't think I fully understand the index implementation then!! Can you traverse an index in an ordered manner? I mean, it's very unlikely that the random float generated at run-time corresponds exactly to a value already in the database, so an index entry for that specific value probably wouldn't exist?
yes! indexes are not hash tables, they're btrees
aka the keys are ordered
so once you do log n lookup to find entry X, finding the next record < or > X is O(1)
this is what actually powers efficient `sort` and things like that
(or `ORDER BY` in SQL, which can either take forever copying the entire table if there is no index, or be nearly instant if the ordered field is indexed)

How about that maplit crate huh ❤️
ah in rust?
Haha yeah I'm just messing around
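(To make the ordered-traversal point concrete: in Convex, a read through an index comes back already sorted by the indexed field, so taking the first N sorted rows is just a prefix of the btree. A sketch against the assumed `by_rand` index from above:)

```ts
import { query } from "./_generated/server";

export const lowestTen = query({
  args: {},
  handler: async (ctx) => {
    // the btree hands rows back already ordered by `rand`:
    // no table scan, no in-memory sort
    return await ctx.db
      .query("items")
      .withIndex("by_rand")
      .order("asc")
      .take(10);
  },
});
```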
Ohh what the heck!! Cool!! I just found this clearly described in the docs as well: https://docs.convex.dev/database/indexes/#sorting-with-indexes
Is this a recent addition (past few months) to the system/docs, or have I just been blind?! That's super nice, will be very useful for us.
Could your approach be extended to the case where we want to fetch N documents at random? I suppose we could just generate N random floats instead and do the procedure for each of them, adding the results to a hashset to rule out duplicate results (we don't want duplicates in our use case)
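And the query side, sketched with the same assumed names (`gte` probe first, then wrap around with `lt`):

```ts
// convex/items.ts
import { query } from "./_generated/server";

export const getRandom = query({
  args: {},
  handler: async (ctx) => {
    // step (2): a random probe value in the same [0, 1) range
    const r = Math.random();
    // first record with rand >= r: one indexed point lookup
    const hit = await ctx.db
      .query("items")
      .withIndex("by_rand", (q) => q.gte("rand", r))
      .first();
    if (hit !== null) return hit;
    // nothing above r, so wrap around to the first record below it;
    // if this is also null, the table is empty
    return await ctx.db
      .query("items")
      .withIndex("by_rand", (q) => q.lt("rand", r))
      .first();
  },
});
```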
^ haven't typechecked the above, but that's the idea
first comparison is `gte` in case you get insanely lucky and you actually generate an exact float that matches
> Is this a recent addition (past few months) to the system/docs, or have I just been blind?! That's super nice, will be very useful for us.

Nope! been in there since we first shipped indexes, which was pre-1.0
Cool man thanks, it makes sense, I assumed that indexes were hash tables and somehow missed that docs section! Will definitely have some use cases for this.
What's the downside of storing indexes as btrees instead of hash tables? I guess, in practice, that insertion and lookup are slower? But the obvious upside is this sorted nature, I guess. Were those your concerns when deciding which structure to use?
we didn't strongly consider hashtables -- but mostly b/c we copied heavily from convention here. almost all databases use btrees for indexes (like postgres, mysql, sql server, oracle, etc)
so it's the natural default
otherwise you can't power things like ordering
and algorithms like this one
also, databases are often managing really large amounts of data
so hashtable resizes would be deadly
and the structures are mapped on disk
btrees evolve/balance with smaller maintenance operations than hash tables
so in general they're just a better structure for large-dataset, disk-persistent indexing
only real tradeoff is O(1) vs. O(log n) for simple queries
but as long as you keep your trees not too deep it's worth it
Thanks @jamwt that's interesting to read, good to learn about practical considerations for large-scale systems!!
What would be your suggestion for querying N random documents from a table (preferably without duplication, but I understand that this gets harder as N grows relative to table size)?
Back then, we tried maintaining an insertion order field on the documents and generating random unique numbers in that range, but maintenance of the insertion order grew to be super lame, not least since we actually wanted to be able to categorize our documents (videos) in several ways: by category, by creator, etc. So we ended up having sort of compound insertion orders which just became stupid haha.
I guess that your random float-approach should also work for such subsets of tables, since the floats should be uniformly distributed such that the full table and all of its subsets should have the same distribution of random floats, or is my math wrong?
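(The math holds: independent uniform floats restricted to any subset are still uniform over that subset. A sketch of scoping the trick to one category via a compound index with the float last; the `category` field and the `by_category_rand` index name are assumptions:)

```ts
import { query } from "./_generated/server";
import { v } from "convex/values";

// assumed in convex/schema.ts:
//   .index("by_category_rand", ["category", "rand"])
export const getRandomInCategory = query({
  args: { category: v.string() },
  handler: async (ctx, { category }) => {
    const r = Math.random();
    const hit = await ctx.db
      .query("items")
      .withIndex("by_category_rand", (q) =>
        q.eq("category", category).gte("rand", r)
      )
      .first();
    if (hit !== null) return hit;
    // wrap around within the same category
    return await ctx.db
      .query("items")
      .withIndex("by_category_rand", (q) =>
        q.eq("category", category).lt("rand", r)
      )
      .first();
  },
});
```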
just repeat the above process N times
you could be tempted to `take(5)` instead of doing `first()`
but that's not truly random, since when your generated values end up close to each other across different invocations asking for a random set
you'll tend to return the same "random values"
b/c they happen to be adjacent in the index
so doing N point lookups using the above method is the more accurate one, vs. turning it into a range read

note: doing the above process N times can also be skewed non-random. if e.g. there are 2 documents and their stored floats are 0.9 and 1.0, then the first document is 9 times more likely to be picked when sampled repeatedly
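Putting the N-point-lookup suggestion together with dedup, a sketch (same assumed names; note it inherits the per-document skew described just above):

```ts
import { query, QueryCtx } from "./_generated/server";
import { v } from "convex/values";
import { Doc } from "./_generated/dataModel";

// one random point lookup via the assumed `by_rand` index: gte probe, lt fallback
async function randomDoc(ctx: QueryCtx): Promise<Doc<"items"> | null> {
  const r = Math.random();
  const hit = await ctx.db
    .query("items")
    .withIndex("by_rand", (q) => q.gte("rand", r))
    .first();
  if (hit !== null) return hit;
  return await ctx.db
    .query("items")
    .withIndex("by_rand", (q) => q.lt("rand", r))
    .first();
}

export const getRandomN = query({
  args: { n: v.number() },
  handler: async (ctx, { n }) => {
    const picked = new Map<string, Doc<"items">>();
    // retry cap so a table smaller than n can't loop forever
    for (let i = 0; i < n * 10 && picked.size < n; i++) {
      const doc = await randomDoc(ctx);
      if (doc === null) break; // empty table
      picked.set(doc._id, doc); // Map keys dedupe repeat hits
    }
    return [...picked.values()];
  },
});
```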
The way to get uniform randomness with repeated sampling is to do what you mention in the original question: count records, and keep a rank between 0 and n for each document. Doing this efficiently isn't super easy, as you discovered, so we're about to release a solution as a component on top of convex. stay tuned 🎶
Yeah for very small subsets I guess that approach totally collapses, since the document floats won't be uniformly distributed in the range. But... if the subsets are small enough for that to be an issue, they may be small enough to warrant just fetching the entire thing and shuffling the array! 😄
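(That fallback is just a fetch plus an in-memory Fisher–Yates shuffle, e.g.:)

```ts
// shuffle a small fetched subset uniformly at random (Fisher–Yates)
function shuffle<T>(items: T[]): T[] {
  const a = [...items];
  for (let i = a.length - 1; i > 0; i--) {
    // pick j uniformly from [0, i] and swap
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}
```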
Sick @lee, can't wait for that!! You guys rock!!
@jamwt We implemented this random float field, and it works great! Is there a way to build this into a paginated query such that we can automatically repeat the process with `loadMore`? Otherwise, I guess we'll have to use the Convex client to imperatively call the function again upon a client event, but we want to keep our client as lean as possible