Is Convex Fundamentally Limited for Large-Scale Enterprise Data? (1M+ Records per Table)
Hi Convex community!
I'm evaluating Convex for enterprise applications and hitting what seem like fundamental scalability walls. Need honest feedback on whether I'm missing something or if Convex isn't suitable for large-scale data.
Real-World Scale:
- 50,000+ customers
- 1,000,000+ invoices
- 500,000+ products
- Users need to search across ALL this data
The Fundamental Problem:
With Convex's 16,384-document scan limit, array size limit, and other limits, it seems impossible to:
1. Search invoices by customer name when popular customers have 10,000+ invoices
2. Search products by category when categories contain 50,000+ items
3. Find invoices in date ranges when busy months have 100,000+ invoices
4. Any text search that might match more than 16k documents
Critical Questions:
1. Is Convex enterprise-ready? Can it handle million-record datasets that enterprises routinely work with?
2. Search at scale: How do you implement search functionality when result sets can unpredictably exceed scan limits?
The Real Question:
Is Convex positioned as a "small-to-medium scale" solution, or am I fundamentally misunderstanding how to architect large-scale applications on the platform?
Examples I'm struggling with:
- Search 1M invoices by customer name
- Filter 500k products by multiple criteria
- Generate reports across 100k+ records
- Paginate through a large unfiltered dataset
I need to know: Should I be looking at traditional databases for the heavy lifting, or is there a "Convex way" to handle enterprise-scale data that I'm missing? This is make-or-break for platform adoption. Any honest guidance about Convex's intended scale and architectural patterns would be incredibly helpful.
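For concreteness, here's roughly what I'm trying to write today (a simplified sketch - the invoices table, the search_customer index, and the field names are just placeholders for my schema):
```ts
// convex/schema.ts -- simplified placeholder schema
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  invoices: defineTable({
    customerName: v.string(),
    issuedAt: v.number(),
    total: v.number(),
  }).searchIndex("search_customer", { searchField: "customerName" }),
});
```
And the kind of query I'd want on top of it:
```ts
// convex/invoices.ts -- paginated full-text search by customer name
import { query } from "./_generated/server";
import { v } from "convex/values";
import { paginationOptsValidator } from "convex/server";

export const searchByCustomer = query({
  args: { name: v.string(), paginationOpts: paginationOptsValidator },
  handler: async (ctx, args) => {
    return await ctx.db
      .query("invoices")
      .withSearchIndex("search_customer", (q) =>
        q.search("customerName", args.name)
      )
      .paginate(args.paginationOpts);
  },
});
```
Pagination keeps each page under the limits, but it's the report-style reads over everything that I can't see how to express.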
Hey, this doesn't fully answer your question, but I did see this Discord thread that talks about some of it. I'm sure there's a more nuanced answer today. https://discord.com/channels/1019350475847499849/1107750770935333045 It seems like the answer, at least back then, was to somewhat build around the scan limit; the hard limit was seen as a trade-off versus gradual performance degradation as scan size increases.
We decided to go with Supabase.
With Convex, we constantly ran into limitations—it often felt like we were building around constraints rather than with the platform.
It’s a shame, because I still think Convex is an awesome tool.
Sorry to hear it. Good luck with the project! We'll have better support for OLAP soon, so hopefully we'll feel more viable in the future.
hi Jamie, can you share more about this OLAP support? I am also in the same position of evaluating Convex for an enterprise app. we would need to generate reports based on parameters given by the user. despite using indexes and all, a query can easily return 20k+ rows that need to be massaged into the suitable report format. Are there nuances to convex's limit, or is it a hard limit per query?
hi. it's a hard limit because currently queries are part of a transactional + reactive scope. we're going to ship support for "stale" queries that can run for much longer and scan much larger record sets. in addition, a bit further down the road, we'll embed a duckdb engine into convex to be able to run arbitrary ad-hoc analytics
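in the meantime, one workaround for report-style reads is to drive them from an action that pages through an index in chunks, so no single query hits the per-query read limits. rough sketch (the invoices table, by_issued_at index, and fields are hypothetical):
```ts
// convex/reports.ts -- sketch: page through an index from an action in chunks.
// table ("invoices"), index ("by_issued_at"), and fields are hypothetical.
import { action, internalQuery } from "./_generated/server";
import { internal } from "./_generated/api";
import { v } from "convex/values";
import { paginationOptsValidator } from "convex/server";

export const invoicePage = internalQuery({
  args: {
    from: v.number(),
    to: v.number(),
    paginationOpts: paginationOptsValidator,
  },
  handler: async (ctx, args) => {
    return await ctx.db
      .query("invoices")
      .withIndex("by_issued_at", (q) =>
        q.gte("issuedAt", args.from).lt("issuedAt", args.to)
      )
      .paginate(args.paginationOpts);
  },
});

export const buildReport = action({
  args: { from: v.number(), to: v.number() },
  handler: async (ctx, args): Promise<number> => {
    let cursor: string | null = null;
    let total = 0;
    do {
      // each page is its own small query, so it stays within the limits
      const page = await ctx.runQuery(internal.reports.invoicePage, {
        from: args.from,
        to: args.to,
        paginationOpts: { numItems: 1000, cursor },
      });
      for (const invoice of page.page) {
        total += invoice.total;
      }
      cursor = page.isDone ? null : page.continueCursor;
    } while (cursor !== null);
    return total;
  },
});
```
caveat: each page is a separate transaction, so the result isn't one consistent snapshot -- that's part of what the stale query work is meant to make unnecessary.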
cool! I've heard good things about duckdb (no exp with it though). any estimate of when this will launch?
staleQuery (the long-running JS query), maybe within a month or two; embedded duckdb, probably 3-6 months
hey jamie, i'm running a statistics website for the game aoe2:de and have recently rebuilt the site using convex (big fan - been using it for a bunch of projects lately)
I am currently evaluating and testing whether it is actually feasible to use convex for the whole statistics pipeline:
1. importing parquet files (about 20mb MAX)
2. processing these and importing key columns into convex
3. use the aggregate component for the actual stats
this seems to work fine for smaller samples (although i am still kind of figuring out the best way to do the parquet processing as they can have up to 500k rows, maybe you have a recommendation for this as well?), but the real issue is that in total the "facts" table will probably have about 64 million rows
Is convex able to handle datasets this large using the aggregate component efficiently? Or would it be best to go another route for olap
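for reference, the aggregate setup i'm testing looks roughly like this (a rough sketch of my own modelling with the @convex-dev/aggregate component - the facts table and the elo sort key are just my assumptions):
```ts
// convex/stats.ts -- sketch; assumes the aggregate component is installed
// in convex.config.ts via app.use(aggregate)
import { TableAggregate } from "@convex-dev/aggregate";
import { components } from "./_generated/api";
import { DataModel } from "./_generated/dataModel";
import { mutation, query } from "./_generated/server";
import { v } from "convex/values";

// one aggregate over the "facts" table, keyed by player elo
const factsByElo = new TableAggregate<{
  Key: number;
  DataModel: DataModel;
  TableName: "facts";
}>(components.aggregate, {
  sortKey: (doc) => doc.elo,
});

export const insertFact = mutation({
  args: { elo: v.number(), civ: v.string(), won: v.boolean() },
  handler: async (ctx, args) => {
    const id = await ctx.db.insert("facts", args);
    const doc = await ctx.db.get(id);
    // keep the aggregate in sync with the table on every write
    await factsByElo.insert(ctx, doc!);
  },
});

export const totalFacts = query({
  args: {},
  handler: async (ctx) => {
    // counts without scanning the whole table
    return await factsByElo.count(ctx);
  },
});
```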
And sorry, another thing:
- i also display live matches, which works great with convex, but say i'm displaying 500 matches currently
- if one then finishes and only 499 are left, does convex send down the diff to all the clients or resend all 499 matches? this would make a massive difference for the bandwidth used
it might work! but I'd probably suggest a dedicated OLAP engine for this. we really really want to ship embedded duckdb into convex soon so you can go crazy on fast analytics on your convex tables. But for now, the right solution is probably to use an outside columnar engine like clickhouse for this
Alright thank you, i think i will go for an external service for now, but i'd definitely switch to convex once it's available!
What about the other question? Should i use convex for this (just a query to list these live matches which is very nice dx wise) or continue to use my own websockets to send down diffs to the clients
duckdb in convex will blow the internet
DuckDB can already run in-browser, ingesting fact tables from Convex.
Convex and DuckDB already work together amazingly well for most user facing analytics cases 😀
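The rough shape of it on our side (a sketch only - assumes @duckdb/duckdb-wasm, with rows already fetched from a Convex query; the civ/elo columns are just our model):
```ts
// browser side: spin up DuckDB-WASM and run SQL over rows fetched from Convex
import * as duckdb from "@duckdb/duckdb-wasm";

export async function queryFacts(rows: unknown[]) {
  // bootstrap DuckDB-WASM from the jsDelivr-hosted bundles
  const bundles = duckdb.getJsDelivrBundles();
  const bundle = await duckdb.selectBundle(bundles);
  const workerUrl = URL.createObjectURL(
    new Blob([`importScripts("${bundle.mainWorker!}");`], { type: "text/javascript" })
  );
  const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), new Worker(workerUrl));
  await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

  // load the Convex rows as a DuckDB table and aggregate with SQL
  await db.registerFileText("facts.json", JSON.stringify(rows));
  const conn = await db.connect();
  await conn.insertJSONFromPath("facts.json", { name: "facts" });
  const result = await conn.query("SELECT civ, avg(elo) AS avg_elo FROM facts GROUP BY civ");
  await conn.close();
  return result.toArray();
}
```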
Woah....cool!
Yeah this is super lightweight and breaks once you reach 8192 data points for a single fact table, which is not uncommon on a dashboard for any serious business. But there are a number of ways to remedy this, such as but not limited to:
1. Background polling across the entire range using some kind of cursor
2. #1 with OPFS or some other browser storage and staleness eviction logic.
3. Convex actions running DuckDB (either built in by you guys, or in the Node.js environment using the WASM edition), writing to a parquet file in S3, which we load in the browser.
4. Fully managed DuckDB. Will take a lot of elastic compute and routing dedicated to spinning up ducks on your end. Motherduck and a bunch of others do this.
I'd love to know your thoughts on this - OLAP definitely feels like the missing piece for Convex.
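Option 1, for example, is basically just Convex pagination driven from the client. Rough sketch (the facts table and query name are made up):
```ts
// convex/facts.ts -- hypothetical paginated query over the fact table
import { query } from "./_generated/server";
import { paginationOptsValidator } from "convex/server";

export const listFacts = query({
  args: { paginationOpts: paginationOptsValidator },
  handler: async (ctx, args) => {
    return await ctx.db.query("facts").order("asc").paginate(args.paginationOpts);
  },
});
```
And the background poll in the browser:
```ts
// browser: walk the whole range one page at a time using the continue cursor
import { ConvexHttpClient } from "convex/browser";
import { api } from "../convex/_generated/api";

export async function pollAllFacts(convexUrl: string) {
  const client = new ConvexHttpClient(convexUrl);
  const rows: unknown[] = [];
  let cursor: string | null = null;
  do {
    const page = await client.query(api.facts.listFacts, {
      paginationOpts: { numItems: 1000, cursor },
    });
    rows.push(...page.page);
    cursor = page.isDone ? null : page.continueCursor;
  } while (cursor !== null);
  return rows; // hand these to DuckDB-WASM; persist in OPFS for option 2
}
```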
yeah, our intention is to actually build in duckdb at a pretty fundamental layer into the backend
so there will be a replica set of your tables streamed into e.g. parquet or something
and you'll have an interface in actions that can run arbitrary SQL
ctx.runSQL
or ctx.runOLAP
or something where ctx = ActionCtx
it won't be inhibited by the current query context b/c it's not part of the transactional layer
(meaning, 8k row reads or whatever)
you can map it over all your data at will
Brilliant. Have you considered an Arrow Flight interface? This would allow for zero-copy transport of the parquet data.
yeah. we're a rust shop, so we've looked a lot at that sort of datafusion/arrow world. the decision about how much we "just use duckdb" or "just use clickhouse" vs. perhaps something a bit more tailored using arrow directly or datafusion... we'll let the eng team sort that out once we get into the details. it will probably be all about how much effort vs. "good enough" for this particular system. I don't want us to try to split the atom or something and take a long time to ship it, and end up with something really bespoke we have to maintain
we try to reserve the really clever engineering for the core differentiator -- the reactive OLTP database itself
Yeah I hear ya. I am sure the tech decisions are in good hands over there, can’t wait to find out what you go with 😃