vors
vors•2y ago

Limit on number of documents read in a query

Hi friends! We were doing some pretty innocent queries and hit a 16384-documents-read-in-a-query limit... I created a standalone repro for your convenience: https://github.com/vors/convex-load-test. Please help us understand: is that a fundamental limitation of the platform, or am I just stupid with how I write my queries? The meat is that we do this:
await db
.query("messages")
.filter((q) => q.eq(q.field("_id"), id))
.collect();
Which is obviously a stupid way of doing db.get, but it will do for illustration purposes, I think. Imagine that it's some other, non-id field. UPD: a non-id field with an index on it.
32 Replies
ari
ari•2y ago
Hey Sergei! Yes, we do have a limit under the hood here for how many documents a single query can read. My best suggestion would be to switch to a paginated query (https://docs.convex.dev/using/pagination) for use cases with really large tables (in this case, over that ~16k limit).
ari
ari•2y ago
Even if you don't want to paginate on the client side, you could loop calls to loadMore until you exhaust the data. I'm happy to help code up the refactor in your test repository if it would be helpful!
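To sketch what that looks like inside the query function itself (loadMore being the client-side half of the loop): a minimal sketch, assuming the "messages" table from this thread and an illustrative page size of 1000. Each call reads only one page, and the caller keeps calling, passing continueCursor back in until isDone is true; the exact argument style may differ by Convex version.
import { query } from "./_generated/server";

// Hypothetical paginated query: each invocation reads one page,
// so a single execution stays well under the document-read limit.
export default query(async ({ db }, cursor) => {
  const { page, isDone, continueCursor } = await db
    .query("messages")
    .paginate({ numItems: 1000, cursor });
  return { page, isDone, continueCursor };
});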
vors
vorsOP•2y ago
Thank you for the lightning-fast response, Arnold. Sorry, I may be somewhat of a DB noob here, but shouldn't it, roughly speaking, translate into:
SELECT * FROM my_table WHERE ID=my_id
In the sense that if this were Postgres with an index on the ID column, it would handle it with no problem even at 10M rows... (again, ID is a bad example name). Real-world speaking, I think we plumb the ids down in most places to avoid relying on indexes; however, for our BI we would need aggregated queries over the whole database. (I was kind of hoping to build something quickly on Convex instead of exporting into something like BigQuery.)
ari
ari•2y ago
Yeah! No worries, I'm a database noob myself -- I just make things pretty 🙂 Conceptually, your Convex query would be pretty close to that SQL query; the biggest difference is that Convex doesn't do query planning (intentionally), so you'd have to query indexes with .withIndex (I believe the index by_id is built in by default). I can't speak to this concretely, as I mostly work on the UI, but the ~16k limit is a hardcoded value on our end as a safety precaution. Someone who works on the backend here can give you a better answer about why the limit is there and about support for scaling. In general, though, paginating a query where you expect millions of results would likely be important for both database and client-side perf.
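For reference, an index lookup with .withIndex looks roughly like this; a sketch assuming the built-in by_id index, which makes it effectively a roundabout db.get(id):
const message = await db
  .query("messages")
  .withIndex("by_id", (q) => q.eq("_id", id))
  .unique();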
vors
vorsOP•2y ago
Just to double check: my current alarmism is not about the case where I want to return 20,000 messages to the client -- that makes perfect sense to paginate. It's about the case where I have a single table with 20,000 "Name" entries and I need to find "Boris" in this table (and there are only 10 people like that). This seems like something that should not require pagination... I should look more into pagination, I guess, but ergonomics-wise I'm concerned 🙂
ballingt
ballingt•2y ago
@vors are you familiar with Indexes in Convex?
vors
vorsOP•2y ago
Probably not! 😆
ballingt
ballingt•2y ago
If we're imagining this isn't an ID, then in Postgres I'd use an index, and the same goes for Convex
vors
vorsOP•2y ago
ok, let me try to change the example to use an index
ballingt
ballingt•2y ago
Indexes | Convex Developer Hub
Indexes are a data structure that allow you to speed up your document queries by telling Convex how to organize your documents.
ballingt
ballingt•2y ago
But re: the philosophy of no query planning that Arnold mentioned: this is something we'd love to discuss; it's definitely an opinionated part of Convex! An opinion formed by the DB folks here over a long time (some of it at Dropbox on infra teams), and one that's worth discussing.
james
james•2y ago
Yep, generally you don't want to be doing SELECT * (edit: I mean a full table scan) on a large table in the live request path on any database, Convex included. It seems indexes will solve your problem here, but happy to chat more if you have a use case that falls outside of this!
vors
vorsOP•2y ago
@ari I think ~50k documents in a single table make the dashboard halt
vors
vorsOP•2y ago
hmmm
[screenshot attached]
vors
vorsOP•2y ago
ok, those are all nits, sorry. Working on an example with indexes (hopefully it just works!). Ok, I think that real-time part explains it 🙂 I should just keep in mind the limitations of what the database can and cannot do. I've kind of been treating Convex like a magic tool that solves everything 🙂
james
james•2y ago
Almost everything 😉 But yeah, Convex is explicitly an OLTP database, meaning it's designed for high performance on the kind of efficient queries you want to have in user-facing code. If there's ever a need for a non-indexed scan of a full database table (usually only for analytics), that's the domain of an OLAP database, and for now we'd recommend streaming Convex into Snowflake etc. via Airbyte for these analytics-style queries.
vors
vorsOP•2y ago
Makes sense. Ok, so for the indexes part: I pushed changes to https://github.com/vors/convex-load-test. I'm using this:
export default query(async ({ db }) => {
const messages = await db
.query("messages")
.filter((q) => q.eq(q.field("randomNumber"), 1))
.collect();
return messages.length;
});
And my schema is
import { defineSchema, defineTable, s } from "convex/schema";

export default defineSchema({
messages: defineTable({
body: s.string(),
author: s.string(),
randomNumber: s.number(),
}).index("by_randomNumber", ["randomNumber"]),
});
Hitting the same problem... Am I doing something wrong? Ohhh, there is withIndex, sorry, trying now
ballingt
ballingt•2y ago
ah yeah, this is part of the "no query planning": you know whether you are (or aren't) using an index from the query itself
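Concretely, for the schema above, swapping filter for withIndex is what tells Convex to use the by_randomNumber index; a sketch adapted from the snippets earlier in this thread:
export default query(async ({ db }) => {
  // Index range lookup: only documents with randomNumber === 1 are read.
  const messages = await db
    .query("messages")
    .withIndex("by_randomNumber", (q) => q.eq("randomNumber", 1))
    .collect();
  return messages.length;
});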
vors
vorsOP•2y ago
got it, sorry. Cool, it works! Minor suggestion: maybe you could surface withIndex in the error message on the Convex side, something like "please RTFM about indexes"
Indy
Indy•2y ago
Hahah. Yep good idea!
vors
vorsOP•2y ago
and the UI limits of a few thousand entries in a table would be good to improve 🙂 we are already over them. Yeah, just checked: one of our prod tables is already failing to open in the UI
Indy
Indy•2y ago
Curious how you're using the dashboard UI. The UI wasn't really designed to be an important part of a long-term workflow on a "big" project; we always thought it'd be far more useful early in a project. We definitely shouldn't break, though 😉
vors
vorsOP•2y ago
yeah, I didn't notice that it was unresponsive until I tried it out today. We do plan to build some BI; I was hoping Convex could just do it, but based on James' comments it seems a separate BI tool is the way to go from the very beginning, which is a little disappointing to be honest, because I feel like a lightweight BI solution would be nice to have 🙂
Indy
Indy•2y ago
Yeah, Convex wasn't really designed for big BI queries. This is our escape hatch: https://docs.convex.dev/using/integrations/airbyte
Using Convex with Airbyte | Convex Developer Hub
Analyze your Convex data by exporting via Airbyte.
vors
vorsOP•2y ago
even if it works against prod to begin with, I guess. The wish-list item is to have small-scale BI integrated into Convex. Maybe I should just get over it and buy Snowflake, but our stack is already like 5 different vendors
Indy
Indy•2y ago
Yeah, thanks for the feedback on expectations here. The question really becomes what "lightweight BI" really means, because in my experience working with data eng folks, we'd very quickly drop into custom queries, where nothing "lightweight" would truly work.
vors
vorsOP•2y ago
true. And now for something completely different (though I think it may be related to me trying to open the big table): all other queries to prod are stuck 😅 it took my normal prod page with "projects" minutes to load just now
Indy
Indy•2y ago
this is in the dashboard?
vors
vorsOP•2y ago
no, on prod, the web. Ok, maybe it's unrelated; let me ask my teammates to test
jamwt
jamwt•2y ago
My recommendation for simple BI is hosted ClickHouse. Airbyte can hook up to it pretty easily.
vors
vorsOP•2y ago
I think since we are on GCP we would probably go with BigQuery
jamwt
jamwt•2y ago
That works too. More complex, but definitely capable
