Querying for a random record
I was wondering if there is a standard way to query for a random record.
Some ways I have thought about achieving this:
- Counting the number of records and generating a number between 0 and n (falls over on deleted records and relies on a lengthy count call)
- Creating indexes that reorder the records, and then querying the indexes.
18 Replies
hi! the standard high-performance way in a database is: (1) generate a random float in `[0...1)` and insert it into every record in the table; (2) at query time, generate a random float in the same range and return the first record `gte` it. if you get no records, return the first record `lt` it. if you still get no records, your table is empty 😉
(this actually would be a good convex component when we release those... and could probably be written in about an hour 🤔)

Thanks for the response, that’s a pretty neat trick to generate the random number during insertion. Will give it a shot
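To make step (1) concrete, here's a minimal sketch in Convex TypeScript (the table name `items`, the field `rand`, and the index name `by_rand` are placeholder assumptions, not names from the thread):

```ts
// convex/schema.ts
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  items: defineTable({
    title: v.string(), // stand-in for your real fields
    rand: v.number(),  // step (1): a uniform float in [0, 1), stamped at insert
  }).index("by_rand", ["rand"]),
});
```

```ts
// convex/items.ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";

export const add = mutation({
  args: { title: v.string() },
  handler: async (ctx, { title }) => {
    // every record carries its own random float
    await ctx.db.insert("items", { title, rand: Math.random() });
  },
});
```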
Interesting @jamwt, we've spent quite a lot of time fiddling with different ways to query data at random in Convex efficiently.
The approach you describe seems very simple and nice, but not very performant: for querying just one document at random, we must fetch the entire table and, on average, iterate over 50% of the elements. In the worst case I guess it's even O(n), right?
if you have an index on the random number, you have to look at a single row, not the whole table
yep, you use an index
then it's basically log n time (the btree traversal)
Oh, I don't think I fully understand the index implementation then!! Can you traverse an index in an ordered manner? I mean, it's very unlikely that the random float generated at run-time corresponds exactly to a value already in the database, so an index entry for that specific value probably wouldn't exist?
yes! indexes are not hash tables, they're btrees
aka the keys are ordered
so once you do log n lookup to find entry X, finding the next record < or > X is O(1)
this is what actually powers efficient `sort` and things like that
(or `ORDER BY` in SQL, which can either take forever copying the entire table if there is no index, or be nearly instant if the ordered field is indexed)

How about that maplit crate huh ❤️
ah in rust?
Haha yeah I'm just messing around
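(To make the ordered-traversal point concrete: in Convex, a read through an index comes back already sorted by the indexed field, so taking the first N sorted rows is just a prefix of the btree. A sketch against the assumed `by_rand` index from above:)

```ts
import { query } from "./_generated/server";

export const lowestTen = query({
  args: {},
  handler: async (ctx) => {
    // the btree hands rows back already ordered by `rand`:
    // no table scan, no in-memory sort
    return await ctx.db
      .query("items")
      .withIndex("by_rand")
      .order("asc")
      .take(10);
  },
});
```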
Ohh what the heck!! Cool!! I just found this clearly described in the docs as well: https://docs.convex.dev/database/indexes/#sorting-with-indexes
Is this a recent addition (past few months) to the system/docs, or have I just been blind?! That's super nice, will be very useful for us.
Could your approach be extended to the case where we want to fetch N documents at random? I suppose we could just generate N random floats instead and do the procedure for each of them, adding the results to a hashset to rule out duplicate results (we don't want duplicates in our use case)
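And the query side, sketched with the same assumed names (`gte` probe first, then wrap around with `lt`):

```ts
// convex/items.ts
import { query } from "./_generated/server";

export const getRandom = query({
  args: {},
  handler: async (ctx) => {
    // step (2): a random probe value in the same [0, 1) range
    const r = Math.random();
    // first record with rand >= r: one indexed point lookup
    const hit = await ctx.db
      .query("items")
      .withIndex("by_rand", (q) => q.gte("rand", r))
      .first();
    if (hit !== null) return hit;
    // nothing above r, so wrap around to the first record below it;
    // if this is also null, the table is empty
    return await ctx.db
      .query("items")
      .withIndex("by_rand", (q) => q.lt("rand", r))
      .first();
  },
});
```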
^ haven't typechecked the above, but that's the idea
first comparison is `gte` in case you get insanely lucky and you actually generate an exact float that matches
> Is this a recent addition (past few months) to the system/docs, or have I just been blind?! That's super nice, will be very useful for us.

Nope! been in there since we first shipped indexes, which was pre-1.0
Cool man thanks, it makes sense, I assumed that indexes were hash tables and somehow missed that docs section! Will definitely have some use cases for this.
What's the downside of storing indexes as btrees instead of hash tables? I guess, in practice, that insertion and lookup are slower? But the obvious upside is this sorted nature, I guess. Were those your concerns when deciding which structure to use?
we didn't strongly consider hashtables -- but mostly b/c we copied heavily from convention here. almost all databases use btrees for indexes (like postgres, mysql, sql server, oracle, etc)
so it's the natural default
otherwise you can't power things like ordering
and algorithms like this one
also, databases are often managing really large amounts of data
so hashtable resizes would be deadly
and the structures are mapped on disk
btrees evolve/balance with smaller maintenance operations than hash tables
so in general they're just a better structure for large-dataset, disk-persistent indexing
only real tradeoff is O(1) vs. O(log n) for simple queries
but as long as you keep your trees not too deep it's worth it
Thanks @jamwt that's interesting to read, good to learn about practical considerations for large-scale systems!!
What would be your suggestion for querying N random documents from a table (preferably without duplication, but I understand that this gets harder as N grows relative to table size)?
Back then, we tried maintaining an insertion order field on the documents and generating random unique numbers in that range, but maintenance of the insertion order grew to be super lame, not least since we actually wanted to be able to categorize our documents (videos) in several ways: by category, by creator, etc. So we ended up having sort of compound insertion orders which just became stupid haha.
I guess that your random float-approach should also work for such subsets of tables, since the floats should be uniformly distributed such that the full table and all of its subsets should have the same distribution of random floats, or is my math wrong?
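(The math holds: independent uniform floats restricted to any subset are still uniform over that subset. A sketch of scoping the trick to one category via a compound index with the float last; the `category` field and the `by_category_rand` index name are assumptions:)

```ts
import { query } from "./_generated/server";
import { v } from "convex/values";

// assumed in convex/schema.ts:
//   .index("by_category_rand", ["category", "rand"])
export const getRandomInCategory = query({
  args: { category: v.string() },
  handler: async (ctx, { category }) => {
    const r = Math.random();
    const hit = await ctx.db
      .query("items")
      .withIndex("by_category_rand", (q) =>
        q.eq("category", category).gte("rand", r)
      )
      .first();
    if (hit !== null) return hit;
    // wrap around within the same category
    return await ctx.db
      .query("items")
      .withIndex("by_category_rand", (q) =>
        q.eq("category", category).lt("rand", r)
      )
      .first();
  },
});
```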
just repeat the above process N times
you could be tempted to `take(5)` instead of doing `first()`
but that's not truly random, since when your generated values end up close to each other across different invocations asking for a random set
you'll tend to return the same "random values"
b/c they happen to be adjacent in the index
so doing N point lookups using the above method is the more accurate one, vs. turning it into a range read

note: doing the above process N times can also be skewed non-random. if e.g. there are 2 documents and their stored floats are 0.9 and 1.0, then the first document is 9 times more likely to be picked when sampled repeatedly
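Putting the N-point-lookup suggestion together with dedup, a sketch (same assumed names; note it inherits the per-document skew described just above):

```ts
import { query, QueryCtx } from "./_generated/server";
import { v } from "convex/values";
import { Doc } from "./_generated/dataModel";

// one random point lookup via the assumed `by_rand` index: gte probe, lt fallback
async function randomDoc(ctx: QueryCtx): Promise<Doc<"items"> | null> {
  const r = Math.random();
  const hit = await ctx.db
    .query("items")
    .withIndex("by_rand", (q) => q.gte("rand", r))
    .first();
  if (hit !== null) return hit;
  return await ctx.db
    .query("items")
    .withIndex("by_rand", (q) => q.lt("rand", r))
    .first();
}

export const getRandomN = query({
  args: { n: v.number() },
  handler: async (ctx, { n }) => {
    const picked = new Map<string, Doc<"items">>();
    // retry cap so a table smaller than n can't loop forever
    for (let i = 0; i < n * 10 && picked.size < n; i++) {
      const doc = await randomDoc(ctx);
      if (doc === null) break; // empty table
      picked.set(doc._id, doc); // Map keys dedupe repeat hits
    }
    return [...picked.values()];
  },
});
```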
The way to get uniform randomness with repeated sampling is to do what you mention in the original question: count records, and keep a rank between 0 and n for each document. Doing this efficiently isn't super easy, as you discovered, so we're about to release a solution as a component on top of convex. stay tuned 🎶
Yeah for very small subsets I guess that approach totally collapses, since the document floats won't be uniformly distributed in the range. But... if the subsets are small enough for that to be an issue, they may be small enough to warrant just fetching the entire thing and shuffling the array! 😄
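(That fallback is just a fetch plus an in-memory Fisher–Yates shuffle, e.g.:)

```ts
// shuffle a small fetched subset uniformly at random (Fisher–Yates)
function shuffle<T>(items: T[]): T[] {
  const a = [...items];
  for (let i = a.length - 1; i > 0; i--) {
    // pick j uniformly from [0, i] and swap
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}
```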
Sick @lee, can't wait for that!! You guys rock!!
@jamwt We implemented this random float field, and it works great! Is there a way to build this into a paginated query such that we can automatically repeat the process with `loadMore`? Otherwise, I guess we'll have to use the Convex client to imperatively call the function again upon a client event, but we want to keep our client as lean as possible