Group by without hitting database limits
Hi team! I'm trying to group by a specific parameter on my table and return the top occurrences to the client. But I'm getting limited by the database, which says
Server Error
Uncaught Error: Too many bytes read in a single function execution (limit: 8388608 bytes). Consider using smaller limits in your queries, paginating your queries, or using indexed queries with a selective index range expressions.
at async handler
Is there a way around this?
You've got several options here. Are you familiar with pagination, indexes, or denormalization? The right approach depends on your use case. Could you say more about "group by a specific parameter on my table and return to the client the top occurrences", like why you want to do that, how often it changes, how the records in your table change, or what this table stores?
This related Stack post went out today: https://stack.convex.dev/select-distinct
My use case is that I scrape a website every 15 minutes, and I want to group the results by a specific field. I thought I could make the db group that for me, but I need to do that aggregation in memory. I'm afraid of using pagination and missing some records.
The table is big, holding thousands of records
I would use an index to organize your data. This is a good mental model of how databases are structured and what they can do "for you": https://stack.convex.dev/databases-are-spreadsheets
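In Convex an index is just declared on the table in your schema. A minimal sketch of what that looks like (the table and field names here are made up, swap in your own):

```ts
// convex/schema.ts -- minimal sketch; "items" and "category" are placeholder names
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  items: defineTable({
    category: v.string(),
    // ...whatever else you store
  })
    // lets queries read only the rows for one category, already in index order
    .index("by_category", ["category"]),
});
```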
You're trying to find the group with the most elements? Or the entry in each group that has some maximal value? Figuring out which records you need to read in order to get the result would help us point you in the right direction.
I have a document like this:
And I want to group by creativeId and order by the most occurrences. I will then expose this on my client side. I took a look at indexes, but those seem like a performant way to query data, not a way to do what I want.
My fear with pagination is missing records in the aggregation. Say one page comes back like:
Then I would miss the creativeId = 1 from page 2 if I load only the first page.
I thought about doing another table where I would store the reference: when inserting into table 1 I would also insert into table 2 with this data already processed, like:
Then I could create a query on table 2, sorting by the count and grabbing the info I want from table 2 according to the results.
I wonder if there's another way I can do that with Convex.
I tried to do this:
And it didn't do what I wanted; it left the creativeId with the most occurrences out of the query results.
An index on creativeId seems useful. You could combine that with your Table2 denormalization, but don't track the table 1 document IDs, just keep a count. Use Table2 to find the most popular, then an adsByCreative(creativeId) paginated query could return all the ads for a given creative.
Yeah, that sounds like the most scalable approach: some denormalization into tables optimized for the sorting you want. Eventually I want to make a little component that makes it easy to track aggregates like this, but it's pretty straightforward: just a table with creativeId and count, indexed on both fields so you can update the count when you add/remove an ad, and indexed on count so you can see the highest ranking.
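Roughly, that could look like the sketch below. Table and function names (ads, creativeCounts, addAd, topCreatives, adsByCreative) are just placeholders, not the eventual component:

```ts
// convex/ads.ts -- sketch of the counts-table approach described above.
// Assumes convex/schema.ts defines the ads table with
//   .index("by_creativeId", ["creativeId"])
// and a creativeCounts table:
//   creativeCounts: defineTable({ creativeId: v.string(), count: v.number() })
//     .index("by_creativeId", ["creativeId"])
//     .index("by_count", ["count"]),
import { mutation, query } from "./_generated/server";
import { v } from "convex/values";
import { paginationOptsValidator } from "convex/server";

// Insert a scraped ad and bump the per-creative count in the same mutation.
export const addAd = mutation({
  args: { creativeId: v.string() },
  handler: async (ctx, { creativeId }) => {
    await ctx.db.insert("ads", { creativeId });
    const existing = await ctx.db
      .query("creativeCounts")
      .withIndex("by_creativeId", (q) => q.eq("creativeId", creativeId))
      .unique();
    if (existing) {
      await ctx.db.patch(existing._id, { count: existing.count + 1 });
    } else {
      await ctx.db.insert("creativeCounts", { creativeId, count: 1 });
    }
  },
});

// Top creatives, highest count first; reads only a few small rows.
export const topCreatives = query({
  args: { limit: v.number() },
  handler: async (ctx, { limit }) => {
    return await ctx.db
      .query("creativeCounts")
      .withIndex("by_count")
      .order("desc")
      .take(limit);
  },
});

// Paginated list of ads for one creative (the adsByCreative idea).
export const adsByCreative = query({
  args: { creativeId: v.string(), paginationOpts: paginationOptsValidator },
  handler: async (ctx, { creativeId, paginationOpts }) => {
    return await ctx.db
      .query("ads")
      .withIndex("by_creativeId", (q) => q.eq("creativeId", creativeId))
      .paginate(paginationOpts);
  },
});
```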
If this is only for one-off analytics, you could also collect a result from an action that occasionally scans them all via pagination, or stream it out to a BA platform for analytics, but for an in-product optimized feed you can't do better than this imo
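For the one-off analytics route, a rough sketch of scanning everything from an action (pageOfAds and countByCreative are hypothetical names, reusing the ads table from above):

```ts
// convex/ads.ts (adding to the sketch above): one page per call, so each
// function execution stays under the read limit.
import { internalQuery } from "./_generated/server";
import { paginationOptsValidator } from "convex/server";

export const pageOfAds = internalQuery({
  args: { paginationOpts: paginationOptsValidator },
  handler: async (ctx, { paginationOpts }) => {
    return await ctx.db.query("ads").paginate(paginationOpts);
  },
});

// convex/analytics.ts (sketch): walk every page from an action and
// aggregate in memory. Fine for occasional/one-off analytics.
import { action } from "./_generated/server";
import { internal } from "./_generated/api";

export const countByCreative = action({
  args: {},
  handler: async (ctx) => {
    const counts = new Map<string, number>();
    let cursor: string | null = null;
    do {
      const result = await ctx.runQuery(internal.ads.pageOfAds, {
        paginationOpts: { numItems: 500, cursor },
      });
      for (const ad of result.page) {
        counts.set(ad.creativeId, (counts.get(ad.creativeId) ?? 0) + 1);
      }
      cursor = result.isDone ? null : result.continueCursor;
    } while (cursor !== null);
    // Sort in memory and return the top 10 creatives by count.
    return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 10);
  },
});
```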