Kenni · 3mo ago

Is Convex Fundamentally Limited for Large-Scale Enterprise Data? (1M+ Records per Table)

Hi Convex community! I'm evaluating Convex for enterprise applications and hitting what seem like fundamental scalability walls. I need honest feedback on whether I'm missing something or whether Convex isn't suitable for large-scale data.

Real-World Scale:
- 50,000+ customers
- 1,000,000+ invoices
- 500,000+ products
- Users need to search across ALL of this data

The Fundamental Problem: With Convex's 16,384-document scan limit, array size limit, and other limits, it seems impossible to:
1. Search invoices by customer name when popular customers have 10,000+ invoices
2. Search products by category when categories contain 50,000+ items
3. Find invoices in date ranges when busy months have 100,000+ invoices
4. Run any text search that might match more than 16k documents

Critical Questions:
1. Is Convex enterprise-ready? Can it handle the million-record datasets that enterprises routinely work with?
2. Search at scale: how do you implement search functionality when result sets can unpredictably exceed scan limits?

The Real Question: Is Convex positioned as a "small-to-medium scale" solution, or am I fundamentally misunderstanding how to architect large-scale applications on the platform?

Examples I'm struggling with:
- Searching 1M invoices by customer name
- Filtering 500k products by multiple criteria
- Generating reports across 100k+ records
- Paginating through a large unfiltered dataset

I need to know: should I be looking at traditional databases for the heavy lifting, or is there a "Convex way" to handle enterprise-scale data that I'm missing? This is make-or-break for platform adoption. Any honest guidance about Convex's intended scale and architectural patterns would be incredibly helpful.
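For context on the first example, the usual Convex pattern is to narrow the read set with an index and paginate, so each query execution scans only one page of results rather than the whole table. A minimal sketch, assuming a hypothetical invoices schema; the table, field names, and by_customer index are illustrative, not from this thread:

```ts
// convex/schema.ts - hypothetical schema; the index is what makes
// "search invoices by customer name" cheap: the query below never
// scans rows outside the requested customer + page.
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  invoices: defineTable({
    customerName: v.string(),
    amount: v.number(),
    issuedAt: v.number(),
  }).index("by_customer", ["customerName"]),
});
```

```ts
// convex/invoices.ts - indexed + paginated query; each execution reads
// a single page, staying under the scan limit even with 1M+ invoices.
import { query } from "./_generated/server";
import { paginationOptsValidator } from "convex/server";
import { v } from "convex/values";

export const byCustomer = query({
  args: { customerName: v.string(), paginationOpts: paginationOptsValidator },
  handler: async (ctx, args) => {
    return await ctx.db
      .query("invoices")
      .withIndex("by_customer", (q) => q.eq("customerName", args.customerName))
      .paginate(args.paginationOpts);
  },
});
```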
20 Replies
zjcurtis · 3mo ago
Hey, this doesn't fully answer your question, but I did see this Discord thread that talks about some of it. I'm sure there's a more nuanced answer today. https://discord.com/channels/1019350475847499849/1107750770935333045 It seems like the answer, at least then, was to somewhat build around the scan limit; it was seen as a trade-off to hard-limit it vs. having gradual performance degradation as scan size increases.
Kenni (OP) · 3mo ago
We decided to go with Supabase. With Convex, we constantly ran into limitations—it often felt like we were building around constraints rather than with the platform. It’s a shame, because I still think Convex is an awesome tool.
jamwt · 2mo ago
Sorry to hear it. Good luck with the project! We'll have better support for OLAP soon, so hopefully we'll feel more viable in the future.
Hal · 2mo ago
hi Jamie, can you share more about this OLAP support? I am also in the same position of evaluating Convex for an enterprise app. We would need to generate reports based on parameters given by the user, and despite using indexes and all, a report can easily return 20k+ rows that need to be massaged into a suitable format. Are there nuances to Convex's limit, or is it a hard limit per query?
jamwt · 2mo ago
hi. it's a hard limit, because currently queries are part of a transactional + reactive scope. we're going to ship support for "stale" queries that can run for much longer and scan much larger record sets. in addition, a bit further down the road, we'll embed a duckdb engine into convex to be able to run arbitrary ad-hoc analytics.
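In the meantime, a common community workaround for large reports (a sketch under stated assumptions, not something prescribed in this thread) is to assemble them in an action: each ctx.runQuery call is its own query execution with its own scan limit, so batching reads through a paginated internal query lets the report cover far more rows than a single query could. The reports.ts names and the amount field are hypothetical:

```ts
// convex/reports.ts - build a report in an action by paging through an
// internal query; no single query execution exceeds the scan limit.
import { action, internalQuery } from "./_generated/server";
import { internal } from "./_generated/api";
import { paginationOptsValidator } from "convex/server";

export const invoicePage = internalQuery({
  args: { paginationOpts: paginationOptsValidator },
  handler: async (ctx, args) => {
    return await ctx.db.query("invoices").paginate(args.paginationOpts);
  },
});

export const buildReport = action({
  args: {},
  handler: async (ctx) => {
    let cursor: string | null = null;
    let total = 0;
    do {
      const { page, isDone, continueCursor } = await ctx.runQuery(
        internal.reports.invoicePage,
        { paginationOpts: { numItems: 4096, cursor } },
      );
      for (const invoice of page) total += invoice.amount; // massage rows here
      cursor = isDone ? null : continueCursor;
    } while (cursor !== null);
    return total;
  },
});
```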
Hal · 2mo ago
cool! I've heard good things about duckdb (no experience with it though). any estimate of when this will launch?
jamwt · 2mo ago
staleQuery (the long-running JS query): maybe within a month or two; embedded duckdb: probably 3-6 months
nizar · 2mo ago
hey jamie, i'm running a statistics website for the game aoe2:de and have recently rebuilt the site using convex (big fan - been using it for a bunch of projects lately). I am currently evaluating and testing whether it is actually feasible to use convex for the whole statistics pipeline:
1. importing parquet files (about 20mb MAX)
2. processing these and importing key columns into convex
3. using the aggregate component for the actual stats

This seems to work fine for smaller samples (although I am still figuring out the best way to do the parquet processing, as the files can have up to 500k rows - maybe you have a recommendation for this as well?), but the real issue is that the "facts" table will probably end up with about 64 million rows in total. Is convex able to handle datasets this large using the aggregate component efficiently? Or would it be best to go another route for OLAP?

And sorry, another thing: I also display live matches, which works great with convex. But say I'm displaying 500 matches and one finishes, leaving only 499 - does convex send down the diff to all the clients, or does it resend all 499 matches? This would make a massive difference for the bandwidth used.
jamwt · 2mo ago
it might work! but I'd probably suggest a dedicated OLAP engine for this. we really really want to ship embedded duckdb into convex soon so you can go crazy on fast analytics on your convex tables. But for now, the right solution is probably to use an outside columnar engine like clickhouse for this
nizar · 2mo ago
Alright, thank you. I think I will go with an external service for now, but I'd definitely switch to convex once it's available! What about the other question? Should I use convex for this (just a query to list the live matches, which is very nice DX-wise), or continue to use my own websockets to send down diffs to the clients?
Gary, el Pingüino Artefacto
duckdb in convex will blow up the internet
Sebastian Hindhede
DuckDB can already run in-browser, ingesting fact tables from Convex.
Sebastian Hindhede
Convex and DuckDB already work together amazingly well for most user facing analytics cases 😀
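A minimal sketch of that browser-side pattern: fetch the fact rows from Convex, load them into duckdb-wasm, and run analytical SQL locally. The api.facts.list query, table columns, and SQL are hypothetical; the duckdb-wasm bootstrapping follows the library's documented jsDelivr bundle flow:

```ts
import * as duckdb from "@duckdb/duckdb-wasm";
import { ConvexHttpClient } from "convex/browser";
import { api } from "../convex/_generated/api";

// 1. Pull the fact rows out of Convex (one-shot; assumes the result
//    fits in a single query execution).
const convex = new ConvexHttpClient("https://your-deployment.convex.cloud");
const rows = await convex.query(api.facts.list, {});

// 2. Boot duckdb-wasm in a web worker.
const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
const worker = new Worker(
  URL.createObjectURL(
    new Blob([`importScripts("${bundle.mainWorker!}");`], {
      type: "text/javascript",
    }),
  ),
);
const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

// 3. Register the rows as a JSON file and load them into a DuckDB table.
await db.registerFileText("facts.json", JSON.stringify(rows));
const conn = await db.connect();
await conn.insertJSONFromPath("facts.json", { name: "facts" });

// 4. Arbitrary analytics, entirely off the Convex query path.
const result = await conn.query(
  "SELECT civilization, count(*) AS games FROM facts GROUP BY 1 ORDER BY 2 DESC",
);
console.table(result.toArray().map((row) => row.toJSON()));
```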
jamwt · 4w ago
Woah....cool!
Sebastian Hindhede
Yeah, this is super lightweight and breaks once you reach 8192 data points for a single fact table, which is not uncommon on a dashboard for any serious business. But there are a number of ways to remedy this, such as but not limited to:
1. Background polling across the entire range using some kind of cursor (a rough sketch follows below)
2. #1 plus OPFS or some other browser storage and staleness-eviction logic
3. Convex actions running DuckDB (either built in by you guys, or in a Node.js environment using the WASM edition), writing to a parquet file in S3, which we load in the browser
4. Fully managed DuckDB. This will take a lot of elastic compute and routing dedicated to spinning up ducks on your end. Motherduck and a bunch of others do this.

I'd love to know your thoughts on this - OLAP definitely feels like the missing piece for Convex.
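For concreteness, option #1 might replace the one-shot fetch in the earlier browser sketch with a cursor walk over a paginated Convex query, so no single query execution hits the limit. api.facts.page is a hypothetical query that calls .paginate() with paginationOptsValidator-shaped args, and the 4096 page size is arbitrary:

```ts
// Option #1: background-poll the entire range page by page, then hand
// the accumulated rows to duckdb-wasm exactly as in the sketch above.
const rows: unknown[] = [];
let cursor: string | null = null;
do {
  const { page, isDone, continueCursor } = await convex.query(api.facts.page, {
    paginationOpts: { numItems: 4096, cursor },
  });
  rows.push(...page);
  cursor = isDone ? null : continueCursor;
} while (cursor !== null);
```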
jamwt · 4w ago
yeah, our intention is to actually build duckdb in at a pretty fundamental layer of the backend. there will be a replica of your tables streamed into e.g. parquet or something, and you'll have an interface in actions that can run arbitrary SQL - ctx.runSQL or ctx.runOLAP or something, where ctx = ActionCtx. it won't be inhibited by the current query context b/c it's not part of the transactional layer (meaning, the 8k row reads or whatever) - you can map it over all your data at will
Sebastian Hindhede
Brilliant. Have you considered an Arrow Flight interface? That would allow for zero-copy transport of the parquet data.
jamwt · 4w ago
yeah. we're a rust shop, so we've looked a lot at that sort of datafusion/arrow world. the decision about how much we "just use duckdb" or "just use clickhouse" vs. perhaps something a bit more tailored using arrow directly or datafusion... we'll let the eng team sort that out once we get into the details. it will probably be all about effort vs. "good enough" for this particular system. I don't want us to try to split the atom or something, take a long time to ship it, and end up with something really bespoke we have to maintain. we try to reserve the really clever engineering for the core differentiator -- the reactive OLTP database itself
Sebastian Hindhede
Yeah I hear ya. I am sure the tech decisions are in good hands over there, can’t wait to find out what you go with 😃
