Workflow Scalability
I’m building a product where our customers create little “AI programs” (basically chains of LLM calls + tool execution). Depending on the complexity of the program, a run can last up to about 20 minutes (sometimes longer), and users can fire them manually or schedule them (e.g., “every day at 9 AM”).
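For concreteness, here’s roughly how a program run is modeled in my prototype: one durable workflow per run, with a step per LLM call or tool execution. This is a stripped-down sketch, and the internal functions and table it references (`internal.llm.callModel`, `internal.tools.executeTool`, `internal.runs.recordResult`, a `programs` table) are stand-ins for our real ones:

```ts
// convex/programs.ts: simplified sketch of a program run as a workflow.
// The step functions and the "programs" table are placeholders for our real ones.
import { WorkflowManager } from "@convex-dev/workflow";
import { v } from "convex/values";
import { components, internal } from "./_generated/api";

export const workflow = new WorkflowManager(components.workflow);

export const runProgram = workflow.define({
  args: { programId: v.id("programs") },
  handler: async (step, { programId }) => {
    // Each step is journaled, so a long run can survive restarts and be retried
    // without redoing work that already completed.
    const plan = await step.runAction(internal.llm.callModel, { programId });
    const toolResult = await step.runAction(internal.tools.executeTool, { plan });
    await step.runMutation(internal.runs.recordResult, { programId, toolResult });
  },
});
```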
That’s the pattern I’ve been prototyping to manage the execution of a program run, but I’d love a sanity-check before we go too far down that path. A few things I’m chewing on:
1. Is Workflow the right choice?
There are two main ways a user runs a program:
- On a schedule — basically a cron job of sorts. For this scenario we need durability, retries, and background processing, so workflows seem perfect.
- In the app — a user can run a program directly from the UI. Right now we’re still planning to use workflows here too, but I could see an argument for reaching for something else. (Sketch of both entry points below.)
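If I’ve got the API right, both entry points boil down to the same `workflow.start` call; the scheduled path is a cron-driven internal mutation that kicks off whatever is due. Again a rough sketch, with made-up table, index, and function names:

```ts
// convex/trigger.ts: sketch of the two entry points. The "programs" table,
// the by_nextRunTime index, and the function names are all placeholders.
import { v } from "convex/values";
import { internalMutation, mutation } from "./_generated/server";
import { internal } from "./_generated/api";
import { workflow } from "./programs";

// In-app path: the user clicks "Run" in the UI.
export const runNow = mutation({
  args: { programId: v.id("programs") },
  handler: async (ctx, { programId }) => {
    await workflow.start(ctx, internal.programs.runProgram, { programId });
  },
});

// Scheduled path: an internal mutation starts every program that is due.
export const runDuePrograms = internalMutation({
  args: {},
  handler: async (ctx) => {
    const due = await ctx.db
      .query("programs")
      .withIndex("by_nextRunTime", (q) => q.lte("nextRunTime", Date.now()))
      .collect();
    for (const program of due) {
      await workflow.start(ctx, internal.programs.runProgram, {
        programId: program._id,
      });
    }
  },
});

// In convex/crons.ts (separate file), a cron polls for due runs, e.g.:
//   crons.interval("run due programs", { minutes: 1 }, internal.trigger.runDuePrograms, {});
```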
2. Parallelism & caps
The docs suggest keeping the sum of maxParallelism ≲ 50. Is that a hard cap or just a best practice? My concern is that many of our programs are long-running, so we could hit this cap pretty easily. What are my options here? Is there an easy way to scale beyond it?
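And just so we’re talking about the same knob: this is where I’m setting it, if I’m reading the component docs right (the value 20 is arbitrary, just for illustration):

```ts
// Same manager as in the sketch above, now with an explicit cap.
// workpoolOptions.maxParallelism is my reading of the docs; 20 is arbitrary.
import { WorkflowManager } from "@convex-dev/workflow";
import { components } from "./_generated/api";

export const workflow = new WorkflowManager(components.workflow, {
  workpoolOptions: { maxParallelism: 20 },
});
```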
3. General scalability
If a few thousand customers all schedule 9 AM runs, we’ll spike concurrency for an hour and really hammer the workflow system. Obviously there’s built-in queuing to spread out the load, but we also need runs to start reasonably close to their scheduled time, so too long a queue starts to affect the customer experience. Is Convex + workflows designed for this kind of challenge, or should we be reaching for something else?
I've looked into other solutions (like Temporal), but I'd love to keep everything on Convex. The DX of the platform has been great and it's been a joy to work with so far.
Let me know if more details are helpful, thanks!
