How to count a large set of results from a query?
Hi, suppose I have a query where the full result set would exceed the 16k document limit in Convex. What is the best way to count this result set, or are there any suggested techniques?
@Adam Harris hi! if you need it to be high performance, generally you should denormalize and keep the counts updated on insertion
any kind of orm-type table wrapper could help you do this without so much code duplication. we might provide some packages soon that wrap these patterns up to make it easier
Hi @jamwt, thanks for your suggestions! It doesn’t need to be high performance, and ideally we need to be able to do this kind of thing ad hoc for reporting, so keeping counts on insertion won’t really work for us here, unfortunately. We’re really missing SQL-style ad-hoc aggregations at the moment. Can you please expand a bit on the orm-type table wrapper that you mentioned?
so, a couple of options then. no database will truly do high performance counts of anything without denormalization. however, you can often use something like an OLAP engine to pretty quickly do ad-hoc queries of counts and things like that.
option 1: manually denormalize counts into running aggregates upon mutation. very very fast, but if you want to do it for arbitrary queries, agreed, this option doesn't work.
option 2: rely on some sort of OLAP-style columnar query system that will do a pretty good job running arbitrary aggregates at high speed
for option 1, I'll try to find some threads where people have solved this before.
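in the meantime, here's a rough sketch of the option 1 pattern in a Convex mutation. the `counters` table and its `by_name` index are just for illustration, you'd define them in your schema:

```ts
// convex/messages.ts — a sketch, assuming a `counters` table with a `by_name` index
import { mutation } from "./_generated/server";
import { v } from "convex/values";

export const send = mutation({
  args: { body: v.string() },
  handler: async (ctx, { body }) => {
    await ctx.db.insert("messages", { body });
    // mutations run transactionally, so the counter can't drift from the table
    const counter = await ctx.db
      .query("counters")
      .withIndex("by_name", (q) => q.eq("name", "messages"))
      .unique();
    if (counter === null) {
      await ctx.db.insert("counters", { name: "messages", count: 1 });
    } else {
      await ctx.db.patch(counter._id, { count: counter.count + 1 });
    }
  },
});
```

reading the count is then a single document fetch. the "orm-type wrapper" idea is basically a generic helper around insert/delete that bumps the right counter for you, so you don't repeat this in every mutation.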
for option 2, we plan on shipping some kind of SQL-based inline OLAP system later this year for arbitrary read-only queries at high speed (slightly delayed). we don't have a great option yet, unless you want to use the fivetran connector and sync some of your tables into e.g. BigQuery. then you can hammer away at aggregates to your heart's content. but it's more work to set up! and the replication delay can be several minutes etc. depending on how much money you want to spend
sounds like you all want option 2, and you're not alone. it's a really common request from our customers right now, and I'm bummed we don't have it solved yet. it's on the roadmap to ship something soon to address this without all the integration headaches, but for now, you really do have to stream to some other system that runs arbitrary aggregates in a better way than convex does. that's what many other customers are doing
more info on the streaming support people are using currently: https://docs.convex.dev/production/integrations/streaming-import-export#streaming-export
fivetran to clickhouse cloud or motherduck are probably pretty good lightweight options
our future OLAP offering is likely to be built on an embedded version of duckdb
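to give a flavor of what option 2 looks like day-to-day: once fivetran is syncing a table into clickhouse, an ad-hoc count is just a SQL query away. a quick sketch using the official @clickhouse/client for node, with the table and column names made up for illustration:

```ts
// a sketch, assuming fivetran has landed a synced `messages` table in clickhouse
import { createClient } from "@clickhouse/client";

const client = createClient({
  url: process.env.CLICKHOUSE_URL,
  password: process.env.CLICKHOUSE_PASSWORD,
});

async function countMessages(channel: string): Promise<void> {
  const result = await client.query({
    // parameterized query, so the channel value is safely escaped
    query: "SELECT count(*) AS total FROM messages WHERE channel = {channel:String}",
    query_params: { channel },
    format: "JSONEachRow",
  });
  // clickhouse serializes UInt64 as a string in JSON, e.g. [{ total: "1234567" }]
  console.log(await result.json());
  await client.close();
}

countMessages("general").catch(console.error);
```

the count here is only as fresh as the last sync, which is the replication-delay tradeoff mentioned above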
Thanks very much for the detailed explanation. Yes, it’s option 2 that we’re trying to figure out, and from reading other posts on the Discord we were thinking we might need to push the data somewhere like ClickHouse, but it’s great to have the confirmation here. Thanks again!
no problem. we'll take a note to reach back out when we ship the embedded analytics engine. I think it's going to be a cleaner solution to a lot of use cases like this