Group by without hitting database limits
Hi team! I'm trying to group by a specific parameter on my table and return the top occurrences to the client. But I'm getting limited by the database, which says
Server Error
Uncaught Error: Too many bytes read in a single function execution (limit: 8388608 bytes). Consider using smaller limits in your queries, paginating your queries, or using indexed queries with a selective index range expressions.
at async handler
Is there a way around this?
You've got several options here. Are you familiar with pagination, indexes, or denormalization? The right approach depends on your use case. Could you say more about "group by a specific parameter on my table and return to the client the top occurrences", like why you want to do that, how often it changes, how the records in your table change, or what this table stores?
This related Stack post went out today: https://stack.convex.dev/select-distinct
My use case is that I scrape a website every 15 minutes, and I want to group the results by a specific field. I thought I could make the db group that for me, but I need to do that aggregation in memory. I'm afraid of using pagination and missing some records.
The table is big, holding thousands of records
I would use an index to organize your data. This is a good mental model of how databases are structured and what they can do "for you": https://stack.convex.dev/databases-are-spreadsheets
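In Convex an index is just declared on the table in your schema. A minimal sketch of what that looks like (the table and field names here are made up, swap in your own):

```ts
// convex/schema.ts -- minimal sketch; "items" and "category" are placeholder names
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  items: defineTable({
    category: v.string(),
    // ...whatever else you store
  })
    // lets queries read only the rows for one category, already in index order
    .index("by_category", ["category"]),
});
```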
You're trying to find the group with the most elements? Or the entry in each group that has some maximal value? Figuring out which records you need to read in order to get the result would help us point you in the right direction.
I have a document like this:
And I want to group by creativeId and order by the most occurrences. I will then expose this on my client side. I took a look at indexes, but those seem like a performant way to query data, not a way to do what I want.
My fear with pagination is missing records in the aggregation. Say one page comes back like:
Then I would miss the creativeId = 1 from page 2 if I load only the first page.
I thought about doing another table where I would store the reference: when inserting into table 1 I would also insert into table 2 with this data already processed, like:
Then I could create a query on table 2, sorting by the count and grabbing the info I want from table 2 according to the results.
I wonder if there's another way I can do that with Convex.
I tried to do this:
And it didn't do what I wanted; it left the creativeId with the most occurrences out of the query results.
An index on creativeId seems useful. You could combine that with your Table2 denormalization, but don't track the table 1 document IDs, just keep a count. Use Table2 to find the most popular, then an adsByCreative(creativeId) paginated query could return all the ads for a given creative.
Yeah, that sounds like the most scalable approach: some denormalization into tables optimized for the sorting you want. Eventually I want to make a little component that makes it easy to track aggregates like this, but it's pretty straightforward: just a table with creativeId and count, indexed on both fields so you can update the count when you add/remove an ad, and indexed on count so you can see the highest ranking.
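Roughly, that could look like the sketch below. Table and function names (ads, creativeCounts, addAd, topCreatives, adsByCreative) are just placeholders, not the eventual component:

```ts
// convex/ads.ts -- sketch of the counts-table approach described above.
// Assumes convex/schema.ts defines the ads table with
//   .index("by_creativeId", ["creativeId"])
// and a creativeCounts table:
//   creativeCounts: defineTable({ creativeId: v.string(), count: v.number() })
//     .index("by_creativeId", ["creativeId"])
//     .index("by_count", ["count"]),
import { mutation, query } from "./_generated/server";
import { v } from "convex/values";
import { paginationOptsValidator } from "convex/server";

// Insert a scraped ad and bump the per-creative count in the same mutation.
export const addAd = mutation({
  args: { creativeId: v.string() },
  handler: async (ctx, { creativeId }) => {
    await ctx.db.insert("ads", { creativeId });
    const existing = await ctx.db
      .query("creativeCounts")
      .withIndex("by_creativeId", (q) => q.eq("creativeId", creativeId))
      .unique();
    if (existing) {
      await ctx.db.patch(existing._id, { count: existing.count + 1 });
    } else {
      await ctx.db.insert("creativeCounts", { creativeId, count: 1 });
    }
  },
});

// Top creatives, highest count first; reads only a few small rows.
export const topCreatives = query({
  args: { limit: v.number() },
  handler: async (ctx, { limit }) => {
    return await ctx.db
      .query("creativeCounts")
      .withIndex("by_count")
      .order("desc")
      .take(limit);
  },
});

// Paginated list of ads for one creative (the adsByCreative idea).
export const adsByCreative = query({
  args: { creativeId: v.string(), paginationOpts: paginationOptsValidator },
  handler: async (ctx, { creativeId, paginationOpts }) => {
    return await ctx.db
      .query("ads")
      .withIndex("by_creativeId", (q) => q.eq("creativeId", creativeId))
      .paginate(paginationOpts);
  },
});
```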
If this is only for one-off analytics, you could also collect a result from an action that occasionally scans them all via pagination, or stream it out to a BA platform for analytics, but for an in-product optimized feed you can't do better than this imo
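For the one-off analytics route, a rough sketch of scanning everything from an action (pageOfAds and countByCreative are hypothetical names, reusing the ads table from above):

```ts
// convex/ads.ts (adding to the sketch above): one page per call, so each
// function execution stays under the read limit.
import { internalQuery } from "./_generated/server";
import { paginationOptsValidator } from "convex/server";

export const pageOfAds = internalQuery({
  args: { paginationOpts: paginationOptsValidator },
  handler: async (ctx, { paginationOpts }) => {
    return await ctx.db.query("ads").paginate(paginationOpts);
  },
});

// convex/analytics.ts (sketch): walk every page from an action and
// aggregate in memory. Fine for occasional/one-off analytics.
import { action } from "./_generated/server";
import { internal } from "./_generated/api";

export const countByCreative = action({
  args: {},
  handler: async (ctx) => {
    const counts = new Map<string, number>();
    let cursor: string | null = null;
    do {
      const result = await ctx.runQuery(internal.ads.pageOfAds, {
        paginationOpts: { numItems: 500, cursor },
      });
      for (const ad of result.page) {
        counts.set(ad.creativeId, (counts.get(ad.creativeId) ?? 0) + 1);
      }
      cursor = result.isDone ? null : result.continueCursor;
    } while (cursor !== null);
    // Sort in memory and return the top 10 creatives by count.
    return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 10);
  },
});
```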