Implications of per query document read limits
I discovered this today:
Does this mean that any kind of aggregate query using `AsyncIterable` will fail when performed on a table that has more than 16384 documents?
Hi RJ! If you are scanning through the whole table, yes. This is because Convex is optimized for transactions on small amounts of data. Though other databases allow this, they end up making a different implicit tradeoff, since full scans are slow and can potentially slow down other operations in unforeseen ways.
Generally there are a few ways to work around this in Convex:
1. Calculate your aggregates incrementally (sketched below)
2. Use scheduled functions to iterate through the table to stay within the transaction limits
3. Use our streaming export mechanism through Airbyte to do large analytical queries in a different system
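A minimal sketch of option 1, assuming illustrative `dogs` and `aggregates` tables with an index on `name` (none of these names are from the thread):
```ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";

// Update the running count in the same transaction as the insert, so
// reading the aggregate never requires scanning the whole table.
export const addDog = mutation({
  args: { name: v.string(), age: v.number(), ownerId: v.id("owners") },
  handler: async (ctx, args) => {
    await ctx.db.insert("dogs", args);
    const counter = await ctx.db
      .query("aggregates")
      .withIndex("by_name", (q) => q.eq("name", "dogCount"))
      .unique();
    if (counter === null) {
      await ctx.db.insert("aggregates", { name: "dogCount", value: 1 });
    } else {
      await ctx.db.patch(counter._id, { value: counter.value + 1 });
    }
  },
});
```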
This may not be the most satisfying answer, so do let us know your use case so we can find ways to improve this.
We have strongly considered providing our own "OLAP" (https://www.snowflake.com/data-cloud-glossary/olap/) offering that's outside of the core transactional path, for folks building dashboards or running other analytical queries like aggregates.
We had a similar conversation recently you can look at Ian's suggestions here: https://discord.com/channels/1019350475847499849/1094081206280274040/1095872973111558275
I see! That's good to know in general. I expected performance to degrade gradually; it's good to know that there's a hard (and not very high) limit.
I have to wrap some stuff up and then I'll explain in more detail how I encountered this limit
I have an example of batching using pagination in my recent post on writing migration functions: https://stack.convex.dev/migrating-data-with-mutations
you can loop in an action, client-side, or recursively with the scheduler (sketched below), though a pattern like a daily rollup cron would, I think, be sufficient for a lot of analytics-style things when you don't have an OLAP database linked up yet
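A sketch of the recursive-scheduler pattern, with illustrative names (assuming this lives in convex/aggregates.ts):
```ts
import { internalMutation } from "./_generated/server";
import { internal } from "./_generated/api";
import { v } from "convex/values";

// Aggregate one page per transaction, then schedule the next batch,
// staying under the per-transaction read limit.
export const sumDogAges = internalMutation({
  args: { cursor: v.union(v.string(), v.null()), runningTotal: v.number() },
  handler: async (ctx, { cursor, runningTotal }) => {
    const page = await ctx.db.query("dogs").paginate({ cursor, numItems: 1000 });
    const total =
      runningTotal + page.page.reduce((sum, dog) => sum + dog.age, 0);
    if (page.isDone) {
      console.log("total dog age:", total);
    } else {
      await ctx.scheduler.runAfter(0, internal.aggregates.sumDogAges, {
        cursor: page.continueCursor,
        runningTotal: total,
      });
    }
  },
});
```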
So the issue I encountered was this.
Say you have a schema that looks like so:
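(The schema attachment isn't preserved in this log; a plausible reconstruction, given the fields referenced below, `dogs.ownerId` and `owners.age`, with assumed index names:)
```ts
// convex/schema.ts
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  owners: defineTable({
    name: v.string(),
    age: v.number(),
  }).index("by_age", ["age"]),
  dogs: defineTable({
    name: v.string(),
    age: v.number(),
    ownerId: v.id("owners"),
  })
    .index("by_ownerId", ["ownerId"])
    .index("by_age", ["age"]),
});
```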
And say you want to get (a reasonably small subset of) the list of all dogs whose owner's age is greater than `n`. If the `dogs` table has > 16384 records in it, it seems like it would be pretty challenging to write this query, because you can't express something like `SELECT * FROM dogs WHERE dogs.ownerId IN (SELECT owners.id FROM owners WHERE owners.age > 18)` in the Convex query language (if I'm remembering correctly how one might accomplish this in SQL), and need instead to compare in-memory (whether all at once or using `AsyncIterable`) each `dogs.ownerId` with the `owners._id`s whose `age` > n.
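(For concreteness, a sketch of that in-memory approach, assuming the schema reconstruction above and string document IDs:)
```ts
// Inside a query handler. Collect adult owners, then scan every dog:
// the full scan of "dogs" is what trips the per-query read limit.
const adults = await db
  .query("owners")
  .withIndex("by_age", (q) => q.gt("age", 18))
  .collect();
const adultIds = new Set(adults.map((owner) => owner._id));
const allDogs = await db.query("dogs").collect(); // fails past ~16k documents
return allDogs.filter((dog) => adultIds.has(dog.ownerId));
```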
I guess what I find surprising is that this seems like a very simple query, but it seems quite tricky to express using the current Convex API. Assuming I'm right that you do need some of these more involved strategies to execute this query, I wonder how I can better understand where you all see the line between OLTP and OLAP functionality/features?
This is helpful, thank you!
I should amend this to say "seems quite tricky to express in a way that is not vulnerable to individual document read limits"
Ah, for that query I'd check out this post: https://stack.convex.dev/functional-relationships-helpers
I think the optimized query would be something like:
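(The original snippet isn't preserved in this log; using the `mapAsync` and `getManyFrom` helpers from the linked post, a reconstruction might look like this, with index names assumed from the schema sketch above:)
```ts
// Inside a query handler: fetch adult owners via the age index, then
// batch-load each owner's dogs via the ownerId index. No "dogs" table scan.
const adults = await db
  .query("owners")
  .withIndex("by_age", (q) => q.gt("age", 18))
  .collect();
const dogsByOwner = await mapAsync(adults, (owner) =>
  getManyFrom(db, "dogs", "by_ownerId", owner._id)
);
return dogsByOwner.flat();
```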
Or without the `mapAsync` and `getManyFrom` sugar:
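(Again a reconstruction, since the original snippet isn't preserved:)
```ts
const adults = await db
  .query("owners")
  .withIndex("by_age", (q) => q.gt("age", 18))
  .collect();
const dogsByOwner = await Promise.all(
  adults.map((owner) =>
    db
      .query("dogs")
      .withIndex("by_ownerId", (q) => q.eq("ownerId", owner._id))
      .collect()
  )
);
return dogsByOwner.flat();
```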
Assuming you are ok iterating over adults, and that a batch of adults has <16k dogs. A side effect of the age filter is that you'll be paginating over the adults in age order. If there aren't many adults, you could grab all of them and then fetch a different ordering.
The denormalized / Firestore / DynamoDB way of solving this would be to duplicate the owner's age into the dogs model, and you'd be in charge of keeping them consistent (which is easier in Convex b/c it's always transactional, but it's still possible to have developer errors)
This gets around the document read count limitations, but in exchange you're loading a lot more data into memory, it seems like. It's especially tricky if you wanted instead a sorted (sub)list of dogs (whose owners are older than a certain age), right? Say, the oldest 10 dogs owned by owners older than 18.
Still, it's helpful to see these alternative ways of going about things!
And it's helpful to know that's how the document db folks do things, my DB background is really in SQL
My guess is SQL would also be loading all the owners in your example, though it may just hold onto the IDs for comparison, and the query planner would probably then do a lookup like this, or a huge chained OR of `ownerId === adults[0] OR ownerId === adults[1] OR ...`
Yeah, for sure. I'm just wary of writing queries that might one day break suddenly and completely. I'd rather have a query like that end up kinda slow at a certain scale, with the option to optimize it if/when that happens, than have to either go out of my way to write the query in a more complex way, or risk sudden production breakage if I write it the easier way
The SQL query planner's unpredictable behavior in queries like this is one of the reasons we designed ours the way we did: at scale you could end up with a query locking up the whole dogs table b/c it's doing a table scan. The equivalent is still possible with Convex, but it's more explicitly your job. E.g.:
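(The example isn't preserved in this log; presumably something like an explicit full scan inside a query handler:)
```ts
// An explicit, visible table scan: .collect() reads every document, so the
// cost (and the read limit) is your responsibility, not a planner's.
const allDogs = await db.query("dogs").collect();
const oldDogs = allDogs.filter((dog) => dog.age > 10);
```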
So I guess that's really my main concern. Perhaps it would be nice to be notified if my Convex queries are running at or near a resource limit like this one, so that I could plan to address it before it becomes a production problem?
Yes, that makes a lot of sense and we're thinking about this. What would the right avenues for notifying you be? In the console logs, in function logs, alerts, emails? The latter may be further out, just curious
Yeah that makes sense and is helpful context for me, thanks!
I think my preference would be a push notification of some sort. So I guess a generic webhook notification option would be great, or an email
Yeah that seems right. We'll probably be a ways away from push and email, and start with logging of some sort, but good to validate the ideal case.
Logging would be more useful to me if it were scopeable per-function. I get nervous that in a high-volume production environment, it would be really hard/impossible to see the important notifications in the "everything" log stream. (That's not a feature now, is it?)
For your specific use case, I'd recommend denormalizing the owner's age, or having a boolean like "ownedByAdult" that you can index on. Then you could have something like:
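(That snippet also isn't preserved; a reconstruction with assumed field and index names:)
```ts
// With a denormalized, indexed boolean, the query is a cheap index scan.
const adultOwnedDogs = await db
  .query("dogs")
  .withIndex("by_ownedByAdult", (q) => q.eq("ownedByAdult", true))
  .take(10);
```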
I'm personally so loath to denormalize at this point in my application's life! Denormalization is purely a performance optimization, and I really don't want to be optimizing for performance over correctness right now.
Have you all considered supporting something like materialized views?
I will denormalize if it ends up being the easiest thing to do with Convex, which perhaps is what y'all are going for, but I don't like that tradeoff right now, and it kinda feels like it chafes against the aspects of Convex the database that are particularly easy to use/get started with. Just my two cents!
Totally get it RJ! How important is it that this query be perfectly up to date and part of your normal usage of the app? Depending on how you're using this, we could give you a way to query this outside of normal Convex queries (say, a separate database that can query large data sets but is a delayed snapshot).
Yup, makes sense. I’m hearing a few things:
1. The SQL behavior of just getting slower for big queries (where at scale it'd eventually fall over) is a gentler path when there is a lot of data but not much usage yet.
2. Denormalization feels like a performance optimization, and a burden for early development.
3. Materialized views (in some respects, automated denormalization) are more acceptable.
4. It’s not obvious how to make efficient queries over large amounts of data and the record limit comes as a surprise. Logs and alerts on this would help.
Does that sound right? I do think for your query, this one should work unless the first 16k old dogs are all owned by youngsters, and you could paginate over the dogs’ age (rather than using the async iterator) if so: https://discord.com/channels/1019350475847499849/1107750770935333045/1107837719004061846
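(A hedged sketch of that pagination approach, reusing the assumed indexes from the schema sketch above:)
```ts
// Page through dogs from oldest to youngest (cursor comes from the
// client's pagination options), keeping only dogs with adult owners.
const page = await db
  .query("dogs")
  .withIndex("by_age")
  .order("desc")
  .paginate({ cursor, numItems: 100 });
const oldDogsWithAdultOwners = [];
for (const dog of page.page) {
  const owner = await db.get(dog.ownerId);
  if (owner !== null && owner.age > 18) {
    oldDogsWithAdultOwners.push(dog);
    if (oldDogsWithAdultOwners.length === 10) break;
  }
}
```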
I actually have need of this in a few places, and in some it's not critical that it be real-time. In others I would like it to be, and denormalization seems more justified/reasonable to me in those cases. I also think that from a DX perspective, if you all offered a first-class OLAP querying path as well, that would be really incredible.
I think that's a great summary! And good point that materialized views are basically automated denormalization. I'm really just wary of opening the gates to correctness errors/the need for manually-maintained invariants across tables, particularly this early in my project's life.
That query looks good to me too, but I decided to go with denormalizing the field. In the main case I was concerned with, I think it'll be easy to maintain the integrity of the denormalized values (or rather, unlikely to mess them up), and I think the simplified querying experience is therefore worth the tradeoff. But I will keep that strategy in mind in the future, too!
Some other thoughts related to denormalization:
- Something I've often been tempted to do in the past (even before Convex) is build a way to label particular fields as "cached" (denormalized). I've often wanted to know when looking at a DB or table schema which fields are cached and which are "original" values.
- In the past I've sometimes written scripts to run on a regular schedule to check for constraints which aren't encodable by the database's schema definition language—a script checking that denormalized values are correct is an example. The idea is that, although you might not be able to ensure the bad data goes in, at least you can know if it's there! And this way you have also documented, through the presence of these scripts and their contents, what you expect the shape of the data in your database to look like.
- I guess you could probably build database triggers on top of the Convex TS client as middleware, right? If you coupled that with a way to label particular tables as "views" (or "cached" or whatever, per my first bullet), then maybe you could build your own materialized view features with the primitives already present?
Yeah this also got me thinking the same thing - similar to Lee's RLS abstraction, you could write your triggers based on the rows you modify.
Or have functions to modify objects that you, as a convention, use rather than direct DB access, and those do the bookkeeping
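(A sketch of that convention, with illustrative helper and field names:)
```ts
import { DatabaseWriter } from "./_generated/server";
import { Id } from "./_generated/dataModel";

// Conventional write path: always change an owner's age through this
// helper so the denormalized flag on their dogs stays consistent.
async function updateOwnerAge(
  db: DatabaseWriter,
  ownerId: Id<"owners">,
  age: number
) {
  await db.patch(ownerId, { age });
  const dogs = await db
    .query("dogs")
    .withIndex("by_ownerId", (q) => q.eq("ownerId", ownerId))
    .collect();
  for (const dog of dogs) {
    await db.patch(dog._id, { ownedByAdult: age >= 18 });
  }
}
```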
And yes to background consistency checkers, as safeguards against developer errors - maybe on a cron that paginates over the data overnight
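(A sketch of such a checker, with assumed file and function names; the continuation works like the batching sketch earlier:)
```ts
// convex/crons.ts: kick off the nightly check.
import { cronJobs } from "convex/server";
import { internal } from "./_generated/api";

const crons = cronJobs();
crons.daily(
  "verify denormalized dog flags",
  { hourUTC: 3, minuteUTC: 0 },
  internal.checks.verifyDogFlags,
  { cursor: null }
);
export default crons;
```
```ts
// convex/checks.ts: verify one page, then schedule the next.
import { internalMutation } from "./_generated/server";
import { internal } from "./_generated/api";
import { v } from "convex/values";

export const verifyDogFlags = internalMutation({
  args: { cursor: v.union(v.string(), v.null()) },
  handler: async (ctx, { cursor }) => {
    const page = await ctx.db.query("dogs").paginate({ cursor, numItems: 500 });
    for (const dog of page.page) {
      const owner = await ctx.db.get(dog.ownerId);
      const expected = owner !== null && owner.age >= 18;
      if (dog.ownedByAdult !== expected) {
        console.warn(`stale ownedByAdult on dog ${dog._id}`);
      }
    }
    if (!page.isDone) {
      await ctx.scheduler.runAfter(0, internal.checks.verifyDogFlags, {
        cursor: page.continueCursor,
      });
    }
  },
});
```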
Triggers (or similar concepts) are something we've definitely talked about! That would make building denormalized fields easier.
@ian I really like the idea of using the RLS pattern for doing triggers in user space! That might be another great stack article for us to write. Let's toss that in our pile of interesting user space patterns to share.