Processing large datasets with Convex, and advanced use cases
The more I work with it, the more I feel like Convex is not meant to process large datasets. My opinion is based on:
- I migrated my most demanding real scenario to convex
- There are no examples for advanced use cases
- There are no examples for data processing
When I ask in #general I get told that there are large code bases that use Convex, but when I ask for examples or open-source solutions, nobody responds.
When I ask clear questions in the support community I get no answers (probably because not many people have advanced use cases, or they don't know how to execute them in Convex).
---
I just ported over a simple piece of the process that loops through all the data in batches, and at this point it has been running for over 25 minutes, while the whole process as a MySQL query takes me 3 minutes.
I am also open to feedback and examples, but so far nobody has responded.
That was my vent, amen!
Thanks so much for this honest feedback — it helps.
You're right that we need more public examples showcasing Convex in advanced and large-scale use cases; they exist, and we'll explore ways to highlight Convex better for large data workloads and complex applications.
That said, we also have customers using Convex at scale, serving millions of users and handling large, complex data. Those implementations aren't public (yet), but we hear you: the need for more examples, answers, and transparency is clear.
We're a growing community, and we appreciate community members answering support questions when they have time.
We appreciate the nudge on docs and advanced use cases. Thank you for completing the feedback form. Keep the feedback coming—we're listening.
I see a lot of different answers. You are from the Convex team; what is your reaction to these answers?
Hi, 10 million rows is a lot. What are you cooking? I see you're using Claude, which may not have all the latest recommendations for Convex. I highly recommend using "Ask AI" in the Convex docs for questions about Convex: https://docs.convex.dev/home
To answer your question on migrating a 10+ million row dataset from MySQL to Convex:
Note that Convex isn't built like MySQL; it's optimized for real-time OLTP workloads, not big analytical queries that scan millions of rows.
That's why large full-table scans can be slow unless you're using proper indexes. For big datasets, we recommend:
- Using withIndex() for efficient queries
- Denormalizing data for things like counts
- Paginating through large results
- Offloading heavy analytics to an OLAP database (Convex integrates with Fivetran for this)

The team's also working on embedded analytics with DuckDB and a SQL-style inline query system. Convex can handle complex apps and large data, but it needs a different approach than SQL. See also these Stack posts: https://stack.convex.dev/translate-sql-into-convex-queries and https://stack.convex.dev/merging-streams-of-convex-data
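The batch loop behind the "paginate through large results" advice can be sketched in plain TypeScript. This is an illustrative stand-in, not Convex's actual API: `paginate` here fakes cursor-based paging (in the spirit of `ctx.db.query(...).paginate()`) over an in-memory array, and all names are made up for the example.

```typescript
// Sketch of cursor-based batch processing. Hypothetical in-memory "table"
// stands in for the database; the shape loosely mirrors a paginated query.

type Page<T> = { page: T[]; isDone: boolean; continueCursor: number };

// Fake pagination: return one slice of rows plus a cursor for the next call.
function paginate<T>(rows: T[], cursor: number, numItems: number): Page<T> {
  const page = rows.slice(cursor, cursor + numItems);
  const continueCursor = cursor + page.length;
  return { page, isDone: continueCursor >= rows.length, continueCursor };
}

// Process a large dataset in fixed-size batches instead of one full scan.
function processInBatches(rows: number[], batchSize: number): number {
  let total = 0;
  let cursor = 0;
  while (true) {
    const { page, isDone, continueCursor } = paginate(rows, cursor, batchSize);
    for (const value of page) total += value; // per-batch work goes here
    if (isDone) break;
    cursor = continueCursor;
  }
  return total;
}

console.log(processInBatches([...Array(1000).keys()], 100)); // → 499500
```

The point is that each iteration only holds a small, bounded amount of data, so no single call has to scan the whole table at once.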
These were reactions in #general, not my text
https://discord.com/channels/1019350475847499849/1019350478817079338/1360607247801253898
Sorry, I was looking at the Claude questions.
yeah, sorry, we've been overwhelmed here
but Convex is opinionated. that means it does OLTP stuff in ways really, really scaled for OLTP, and OLAP stuff in specialized ways for OLAP, as others have said
meaning: it does not like batch work that much in general; OLTP databases are happier with lots of little interspersed mutations that do not hold locks on very many rows
for doing batch passes on stuff, it's possible today, but the ergonomics aren't great
what teams do is just use the Fivetran connector to something like BigQuery or ClickHouse
running large queries over your OLTP database is actually an antipattern, so it's better to do these in OLAP engines and then just merge the aggregation back in with a mutation
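To make the "lots of little interspersed mutations" idea concrete, here's a minimal sketch: each step touches a small slice of rows and schedules the next step, rather than running one long transaction over the whole table. The `runAfter` queue is a hypothetical stand-in for a real scheduler (in the spirit of Convex's `ctx.scheduler.runAfter`); everything here is illustrative, not an actual API.

```typescript
// Sketch of chunked batch work: many short steps instead of one long scan.
// The scheduler below is a fake, in-memory stand-in for illustration only.

type Step = () => void;
const queue: Step[] = [];
const runAfter = (_ms: number, step: Step) => { queue.push(step); };

// Hypothetical table to migrate: fill in `doubled` for every row.
const table = Array.from({ length: 25 }, (_, i) => ({ value: i, doubled: 0 }));
const BATCH = 10;

// Each invocation touches at most BATCH rows, then hands off to the next.
function migrateBatch(start: number): void {
  const slice = table.slice(start, start + BATCH);
  for (const row of slice) row.doubled = row.value * 2; // small, short-lived "mutation"
  if (start + BATCH < table.length) {
    runAfter(0, () => migrateBatch(start + BATCH)); // schedule the next chunk
  }
}

migrateBatch(0);
while (queue.length > 0) queue.shift()!(); // drain the fake scheduler

console.log(table[24]); // { value: 24, doubled: 48 }
```

Because each step is small, other reads and writes can interleave between chunks instead of waiting behind one giant transaction.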
the more ergonomic way we want to solve this, after we get through some more chef stuff, perhaps later this summer
is by just embedding an OLAP engine
probably DuckDB
so in an action, you'd be able to execute a really high performance aggregate (even faster than MySQL!) across all your table data
(a slightly delayed replica, maybe a few seconds behind)
and then write back out any expensive calculations with a mutation into whatever aggregate collection you want
this is how bigger companies actually manage OLTP vs. OLAP workloads -- convex would just give you nice ways to do it out of the box
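A toy sketch of that OLTP/OLAP split: the heavy scan runs against a (possibly slightly stale) analytics copy of the data, and only the small precomputed result gets merged back into an aggregate collection. Both "databases" here are in-memory stand-ins with made-up names; no Convex or DuckDB API is being shown.

```typescript
// Sketch: aggregate in the OLAP side, merge the small result back via a
// tiny write. All data structures are hypothetical in-memory stand-ins.

// Pretend this lives in the analytics replica (a few seconds stale is fine).
const warehouseOrders = [
  { customer: "a", amount: 40 },
  { customer: "b", amount: 25 },
  { customer: "a", amount: 35 },
];

// The heavy scan happens here, in the OLAP engine, not in the app database.
function computeRevenueByCustomer(): Map<string, number> {
  const totals = new Map<string, number>();
  for (const o of warehouseOrders) {
    totals.set(o.customer, (totals.get(o.customer) ?? 0) + o.amount);
  }
  return totals;
}

// The app database only receives the tiny, precomputed result.
const aggregates = new Map<string, number>();
function mergeAggregates(totals: Map<string, number>): void {
  for (const [customer, total] of totals) aggregates.set(customer, total);
}

mergeAggregates(computeRevenueByCustomer());
console.log(aggregates.get("a")); // → 75
```

The OLTP side never scans the millions of rows; it only stores and serves the finished aggregates.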
I understand the continued suspicion that maybe no one is actually doing anything big on convex 🙂
unfortunately, those code bases are not open source
and very seldom are
I totally acknowledge that Convex is confusing in this regard right now, I apologize. the teams that know how to do this well right now have learned through prior experience with production systems and/or chatting with the Convex team
there are two big things we owe everyone as soon as we can
1. a book called "Real World Convex" that lays out project organization, solving common problems in practice, etc., and basically documents what bigger teams, together with the Convex team, have discovered are the patterns for success when scale gets real
2. an in-product OLAP capability, because it's a pain wiring up a data warehouse just to run simple aggregates every now and then
all these things are possible right now, but not very "discoverable"
@puchesjr actually did a great job answering some of these questions for us, thanks!
wow a “Real World Convex” would be fantastic !