Eternal Mori · 2w ago

What should I use for processing a large dataset?

I have a dataset containing 5+ million rows, and I need to do a lot of calculations on those rows. What should I use: a mutation, an action, a scheduled job, or a workflow? I am self-hosting and the function takes more than 10 minutes, which is why I think I need a workflow. Is this right?
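For reference, the common Convex pattern for work that outlives a single function call is a batch that reschedules itself: each run processes one page of results and schedules the next run, so no single transaction hits the execution time limit. A minimal sketch, assuming a hypothetical `rows` table with a `processed` flag:

```ts
// convex/processRows.ts
import { internalMutation } from "./_generated/server";
import { internal } from "./_generated/api";
import { v } from "convex/values";

export const processBatch = internalMutation({
  args: { cursor: v.union(v.string(), v.null()) },
  handler: async (ctx, { cursor }) => {
    // Page through the table 500 rows at a time to stay well
    // within per-transaction read/write limits.
    const page = await ctx.db
      .query("rows")
      .paginate({ cursor, numItems: 500 });

    for (const row of page.page) {
      // ...do the per-row calculation and persist the result...
      await ctx.db.patch(row._id, { processed: true });
    }

    if (!page.isDone) {
      // Reschedule immediately. Each batch is its own transaction,
      // so the overall job is not bound by the function time limit.
      await ctx.scheduler.runAfter(0, internal.processRows.processBatch, {
        cursor: page.continueCursor,
      });
    }
  },
});
```

Kicking it off once with `{ cursor: null }` (e.g. from an action or the dashboard) walks the whole table from the beginning.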
bobjoneswins · 2w ago
I am interested in the answer to this also, as I have a similar dataset. I would like to know if there are limits or constraints on how much data you can work with.
Eternal Mori (OP) · 2w ago
The more I work with it, the more I feel like Convex is not meant to process large datasets. My opinion is based on:
- I migrated my most demanding real scenario to Convex
- There are no examples for advanced use cases
- There are no examples for data processing

When I ask in #general I get told that there are large code bases that use Convex, but when I ask for examples or open-source solutions, nobody responds. When I ask clear questions in #support-community I get no answers, probably because not many people have advanced use cases or know how to execute them in Convex.

I just ported over a simple piece of the process that loops through all the data in batches, and at this point it has been running for over 25 minutes, while the whole process as a MySQL query takes me 3 minutes. I am open to feedback and examples, but so far nobody has responded.

That was my vent, amen!
jamwt · 7d ago
Hi! Sorry for the delay. The short answer is that these workloads are best run in a cloud data warehouse: you stream your Convex tables out with the Fivetran adapter. If you're looking for something lightweight, I'd recommend ClickHouse Cloud. In the longer run, we'll embed DuckDB right into the product for really high-performance OLAP!
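For anyone landing here later, a minimal sketch of that setup once Fivetran has synced a Convex table into ClickHouse, using the `@clickhouse/client` Node package. The endpoint, the `convex_rows` table, and the `category`/`amount` columns are made up for illustration:

```ts
import { createClient } from "@clickhouse/client";

async function main() {
  const client = createClient({
    url: "https://<your-instance>.clickhouse.cloud:8443", // hypothetical endpoint
    password: process.env.CLICKHOUSE_PASSWORD,
  });

  // Run the heavy aggregation where it is cheap: over the synced
  // copy of the Convex table, not inside a Convex function.
  const result = await client.query({
    query: `
      SELECT category, sum(amount) AS total
      FROM convex_rows -- hypothetical Fivetran-synced table
      GROUP BY category
    `,
    format: "JSONEachRow",
  });
  console.log(await result.json());

  await client.close();
}

main();
```

The point of the design is that Convex stays the transactional source of truth while the analytical scan over millions of rows runs on a columnar engine built for it.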
bobjoneswins · 7d ago
Thank you for sharing your experience. I can see that their Chef requires people a lot smarter and more skilled than I am to build apps with their product. It's a shame; I was excited for a quick minute about the possibilities...
