How to aggregate documents (more than 16K)?

Hi, I know aggregations, joins, etc. may not be the most performant queries, but sometimes the developer has no other option. How can I count all the documents of a table? Also, I need to count the documents by month, for example: all the invoices per year, per month, per week, etc. I would not like to export to Elasticsearch because it would make me lose the realtime capabilities.
Gary, el Pingüino Artefacto
Also, denormalizing isn't an option because we also have tags, clients, departments, etc. for the invoices, and we need to filter by those and count them as well.
Web Dev Cody · 4mo ago
how many records? is it possible to just do a db.query then count the array length?
jamalsoueidan · 4mo ago
You have the option to stream the data out of Convex and do the advanced aggregation somewhere else.
conradkoh · 4mo ago
Another option is to maintain the counter on write rather than on read: keep a counter in a separate table and increment it when a new record is added. This is safe to do in Convex because of optimistic locking, I believe. Broadly:
1. Add the code that starts incrementing the counter from a hard partition (e.g. if current time > 12pm on 11 Sep, then increment a column called doc_count).
2. Wait for the time to pass until the deadline set in step 1.
3. Count all records that were created before the deadline and put that number in a column called doc_count_historical.
Once you are satisfied with the historical count, you can just increment doc_count by that number and delete the doc_count_historical column. (A sketch of the write-path increment follows after the example records below.)
As for aggregation at different granularities, you could probably create something like a doc_count table. You could have something like this:
Doc Count Records:
//weekly
- Aggregation Key = 2024-week50-docs
- Aggregation Type = weekly
- Count = 0

//monthly
- Aggregation Key = 2024-01-docs
- Aggregation Type = monthly
- Count = 0

//yearly
- Aggregation Key = 2024-docs
- Aggregation Type = yearly
- Count = 0
To control the complexity, you may want to just roll the monthly counts up into the yearly ones, especially since that only means reading 12 records.
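For concreteness, here is a minimal sketch of that write-path increment in Convex. The docCounts table, its by_key index, and all other names are illustrative assumptions, not something from this thread:

import { mutation } from "./_generated/server";
import { v } from "convex/values";

export const createInvoice = mutation({
  args: { amount: v.number() },
  handler: async (ctx, args) => {
    const id = await ctx.db.insert("invoices", { amount: args.amount });
    // Increment the monthly counter in the same transaction as the insert.
    const key = new Date().toISOString().slice(0, 7) + "-docs"; // e.g. "2024-01-docs"
    const counter = await ctx.db
      .query("docCounts")
      .withIndex("by_key", (q) => q.eq("aggregationKey", key))
      .unique();
    if (counter) {
      await ctx.db.patch(counter._id, { count: counter.count + 1 });
    } else {
      await ctx.db.insert("docCounts", {
        aggregationKey: key,
        aggregationType: "monthly",
        count: 1,
      });
    }
    return id;
  },
});

Because each Convex mutation runs as a serializable transaction and is retried on conflict, concurrent inserts won't drop an increment.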
Gary, el Pingüino Artefacto
I fucked up the title, it was meant to be more than 16K (fixed).
My current best option is to export the data, maybe to Elasticsearch, and then create endpoints to fetch it. I may also configure a 5s poller to keep the realtime capabilities, although that isn't truly realtime.
Our dashboard is very dynamic: users can add or remove conditions, and we keep adding more over time, so this idea becomes impractical for our use case.
jamalsoueidan · 4mo ago
That sounds good, that's the way to go, since Convex doesn't allow advanced aggregation as far as I can see in their documentation.
Web Dev Cody · 4mo ago
Another approach might be to add a bucket index on your data; then, if you need to count a lot of records, you could query buckets 1, 2, 3, 4, ... so that each bucket returns fewer than 16K documents, then sum them up. I'd assume it would run decently fast, but I guess you'd need a lot of buckets depending on your dataset size. It would also fetch a lot of data each time, which means you'll pay for it, but you keep the realtime behavior you wouldn't get with Elasticsearch. If you're able to determine at least a single index or group of indexes, you may be able to do something like:
.index("tag_bucket_index", ["tagId", "bucketId"])
.index("client_bucket_index", ["clientId", "bucketId"])
.index("department_bucket_index", ["departmentId", "bucketId"])
.index("tag_bucket_index", ["tagId", "bucketId"])
.index("client_bucket_index", ["clientId", "bucketId"])
.index("department_bucket_index", ["departmentId", "bucketId"])
then if you wanted to count stuff for a client, do
const results = await Promise.all(
  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].map((bucketId) =>
    ctx.db
      .query("items")
      .withIndex("client_bucket_index", (q) =>
        q.eq("clientId", clientId).eq("bucketId", bucketId))
      .collect()));
// Each bucket stays under the 16K limit; combine and count in JS.
const count = results.flat().length;
This would overfetch, so you'd need to run filters in your JavaScript code to reduce the result set and get the accurate count. Just assign random bucketIds to entries when you add them, to help with equal shard distribution (see the sketch below). But it's hard to say, because you didn't say how many total records you have in the table. I think adding in Elasticsearch with a stream will just increase project complexity, but maybe it'll be needed due to your requirements @Gary, el Pingüino Artefacto
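A minimal sketch of that write-side bucket assignment; NUM_BUCKETS and every name below are assumptions for illustration:

import { mutation } from "./_generated/server";
import { v } from "convex/values";

// Assumed constant: use enough buckets that each stays under the 16K read limit.
const NUM_BUCKETS = 10;

export const addItem = mutation({
  args: { clientId: v.id("clients"), tagId: v.id("tags") },
  handler: async (ctx, args) => {
    await ctx.db.insert("items", {
      ...args,
      // A uniformly random bucket keeps the shards roughly evenly sized.
      bucketId: Math.floor(Math.random() * NUM_BUCKETS) + 1,
    });
  },
});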
nipunn · 4mo ago
If you just need the length of the full table (no filters/indexes), check this out: https://github.com/get-convex/convex-backend/issues/10. There's an undocumented .count() API on the query builder. It sounds like it won't be sufficient for your use case, but I thought I'd mention it.
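Usage would presumably look like the following; since it's undocumented, treat it as subject to change:

// Counts every document in the table, with no .withIndex() or .filter() narrowing.
const total = await ctx.db.query("invoices").count();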
Gary, el Pingüino Artefacto
The idea of buckets is still under discussion. We planned on using a computed bucket based on the created timestamp of the invoices, only computing the aggregation for the N nearest buckets, and then doing the sum on the client, where N can be any range the user provides. It also means that N Convex queries will be triggered and the results summed on the client to avoid the 16K limit (see the sketch below).
We need to do a cost chart, because that implementation may be more expensive than running an Elasticsearch cluster and polling to retrieve the data almost in real time.
This would be an option, but I saw you can't chain it after a filter, so sadly it wouldn't work for us.
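A rough sketch of that N-queries-plus-client-side-sum idea using useQueries from convex/react; api.invoices.countBucket is a hypothetical query that counts a single bucket, not something from this thread:

import { useQueries } from "convex/react";
import { api } from "../convex/_generated/api";

// Sum per-bucket counts from N parallel reactive queries.
function useInvoiceCount(bucketKeys: string[]): number | undefined {
  const results = useQueries(
    Object.fromEntries(
      bucketKeys.map((key) => [
        key,
        { query: api.invoices.countBucket, args: { bucketKey: key } },
      ])
    )
  );
  const counts = Object.values(results);
  // Undefined while any query is still loading (or errored).
  if (counts.some((c) => c === undefined || c instanceof Error)) return undefined;
  return counts.reduce((sum, c) => sum + (c as number), 0);
}

Each per-bucket query stays reactive, so the summed count still updates in real time as invoices are written.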
Son · 2mo ago
will something like this be possible soon?
ctx.db
  .query("eventMoment")
  .withIndex("by_creatorId", (q) => q.eq("creatorId", partnerId))
  .filter((q) => q.eq(q.field("hasPartnerSeen"), false))
  .count()
or
ctx.db
  .query("eventMoment")
  .withIndex("by_creatorId_hasPartnerSeen", (q) =>
    q.eq("creatorId", partnerId).eq("hasPartnerSeen", false))
  .count()
Clever Tagline · 2mo ago
Now that the aggregate component is available, that's probably the first place I'd start.
Aggregate: Keep track of sums and counts in a denormalized and scalable way.
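For reference, a rough sketch of what a count looks like with the component, following the @convex-dev/aggregate README at the time of writing; the exact generics and setup vary by version, so treat the details as assumptions:

// convex/invoices.ts
import { TableAggregate } from "@convex-dev/aggregate";
import { components } from "./_generated/api";
import { DataModel } from "./_generated/dataModel";
import { query } from "./_generated/server";

// Sorted by creation time so counts can later be bounded to a date range.
const invoicesByTime = new TableAggregate<{
  Key: number;
  DataModel: DataModel;
  TableName: "invoices";
}>(components.aggregate, {
  sortKey: (doc) => doc._creationTime,
});

export const countInvoices = query({
  handler: async (ctx) => {
    // O(log n) count maintained on writes, instead of reading every document.
    return await invoicesByTime.count(ctx);
  },
});

Note that writes to the invoices table also have to update the aggregate (directly in your mutations, or via a trigger helper) for the counts to stay correct.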
