Function execution timed out (120s)
One of our larger actions that imports a ton of data from an external source takes quite a while to execute and we're hitting the timeout.
Are there any suggested patterns we should employ?
the flow is as follows:
1. fetch all listingIds, then fetch and store all reservationIds
2. for each reservationId, fetch and store all conversationIds
3. for each conversationId, fetch and store all messages
Should I also be scheduling this action instead of calling it from the client outright?
If I schedule the action, the client is free to do anything else, correct?
I'd suggest chunking up this work and recording your progress in the DB. Each chunk should ideally run for less than a minute; when it gets to a stopping point, record your progress in the database and schedule that function to run again. This way, even if there's a failure, you can resume since you've been recording progress in the DB. You can write a query that reports progress too, for a client-side progress bar.
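To make that concrete, here is a minimal sketch of the chunk-and-reschedule pattern. Everything in it is hypothetical rather than from this thread: the importJobs table (with reservationIds and cursor fields), the file convex/importJobs.ts, and the chunk size of 25 are all placeholders to adapt to your own schema.

```ts
// convex/importJobs.ts — hypothetical names throughout.
import { v } from "convex/values";
import {
  internalAction,
  internalMutation,
  internalQuery,
  query,
} from "./_generated/server";
import { internal } from "./_generated/api";

export const getJob = internalQuery({
  args: { jobId: v.id("importJobs") },
  handler: (ctx, { jobId }) => ctx.db.get(jobId),
});

export const markProgress = internalMutation({
  args: { jobId: v.id("importJobs"), processedCount: v.number() },
  handler: async (ctx, { jobId, processedCount }) => {
    const job = await ctx.db.get(jobId);
    if (!job) return;
    // Advance the cursor so a retry resumes where the last chunk stopped.
    await ctx.db.patch(jobId, { cursor: job.cursor + processedCount });
  },
});

// Processes one chunk, records progress in the DB, then reschedules itself.
export const processChunk = internalAction({
  args: { jobId: v.id("importJobs") },
  handler: async (ctx, { jobId }) => {
    const job = await ctx.runQuery(internal.importJobs.getJob, { jobId });
    if (!job) return;

    // Keep each chunk small enough to finish well under the timeout.
    const batch = job.reservationIds.slice(job.cursor, job.cursor + 25);
    for (const reservationId of batch) {
      // ...fetch and store conversationIds and messages for this reservationId...
    }

    await ctx.runMutation(internal.importJobs.markProgress, {
      jobId,
      processedCount: batch.length,
    });

    // More work left? Schedule the next chunk to run immediately.
    if (job.cursor + batch.length < job.reservationIds.length) {
      await ctx.scheduler.runAfter(0, internal.importJobs.processChunk, { jobId });
    }
  },
});

// Public query the client can subscribe to for a progress bar.
export const progress = query({
  args: { jobId: v.id("importJobs") },
  handler: async (ctx, { jobId }) => {
    const job = await ctx.db.get(jobId);
    return job ? { done: job.cursor, total: job.reservationIds.length } : null;
  },
});
```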
"If I schedule the action, the client is free to do anything else, correct?"
I don't quite follow this, could you say more? Whether you schedule the action or call it directly from the client, the client is free to do whatever it wants. Mutations run one by one in Convex, but actions don't block anything; see https://blog.convex.dev/announcing-convex-0-14-0/#breaking-actions-are-now-parallelized for more.
You might consider something like calling a function (mutation or action) that schedules an action for every reservationId. That action does steps 2 and 3 for its reservationId, then marks that reservationId as done in a table. This Stack post has some tips on patterns for tracking job state: https://stack.convex.dev/background-job-management
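A rough sketch of that fan-out variant, again with made-up names (the reservationJobs table, convex/reservations.ts, and the function names are all hypothetical):

```ts
// convex/reservations.ts — hypothetical file, table, and function names.
import { v } from "convex/values";
import { internalAction, internalMutation, mutation } from "./_generated/server";
import { internal } from "./_generated/api";

// Schedules one action per reservationId and tracks each in a "reservationJobs" row.
export const fanOut = mutation({
  args: { reservationIds: v.array(v.string()) },
  handler: async (ctx, { reservationIds }) => {
    for (const reservationId of reservationIds) {
      const jobId = await ctx.db.insert("reservationJobs", {
        reservationId,
        status: "pending",
      });
      await ctx.scheduler.runAfter(0, internal.reservations.syncOne, {
        jobId,
        reservationId,
      });
    }
  },
});

// Does steps 2 and 3 for a single reservation, then records completion.
export const syncOne = internalAction({
  args: { jobId: v.id("reservationJobs"), reservationId: v.string() },
  handler: async (ctx, { jobId, reservationId }) => {
    // ...fetch and store conversationIds and messages for this reservationId...
    await ctx.runMutation(internal.reservations.markDone, { jobId });
  },
});

export const markDone = internalMutation({
  args: { jobId: v.id("reservationJobs") },
  handler: (ctx, { jobId }) => ctx.db.patch(jobId, { status: "done" }),
});
```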
Ahh, that makes sense. Thanks a lot for the explanations and examples. I'll try a few different approaches. I'm a bit new to writing production code with lots of data, so I haven't come across these issues very often.
I'm getting a lot of TokenExpired errors after calling the action from the client, although the same actions work fine on the dashboard. It seems like for actions that take a while to execute (20+ seconds), the token for the websocket connection is lost?
Ah gotcha, we'll look into that. As a workaround for now and as a generally more resilient approach (since it doesn't require connectivity through the duration of the action), scheduling these actions to run immediately is a nice way to do things.
We will look into fixing auth so it's only checked at the beginning of the action. Another workaround in the meantime is to make the Clerk token expiry longer than 1 minute; we recommend 1h as our default setting.
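A sketch of the schedule-immediately workaround, with hypothetical names (convex/pmsData.ts, startImport, and the function reference are illustrative, not how the real action is necessarily exported):

```ts
// convex/pmsData.ts — hypothetical wrapper mutation.
import { mutation } from "./_generated/server";
import { internal } from "./_generated/api";

// The client calls this short mutation instead of the long action: auth is
// checked here, the import action is scheduled to run in the background, and
// a token expiring mid-import can no longer interrupt it.
export const startImport = mutation({
  args: {},
  handler: async (ctx) => {
    await ctx.scheduler.runAfter(
      0,
      internal.actions.pmsData.fetchAndStoreReservations,
      {} // placeholder args
    );
  },
});
```

On the client this would just be a normal mutation call (e.g. via useMutation), which returns as soon as the scheduling is recorded rather than waiting for the import to finish.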
{"code":"Overloaded","message":"InternalServerError: Your request couldn't be completed. Try again later."}
"Would this be due to scheduling with too little delay between each action?"
Yes, this is likely. I think this has resulted in many concurrent actions that all queued up queries/mutations at the same time.
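One mitigation for that kind of pressure (again just a sketch, reusing the hypothetical names from the fan-out example above) is to stagger the delays so the actions don't all start, and enqueue their mutations, at the same moment:

```ts
// Same hypothetical file as the fan-out sketch above (convex/reservations.ts).
import { v } from "convex/values";
import { mutation } from "./_generated/server";
import { internal } from "./_generated/api";

export const fanOutStaggered = mutation({
  args: { reservationIds: v.array(v.string()) },
  handler: async (ctx, { reservationIds }) => {
    let delayMs = 0;
    for (const reservationId of reservationIds) {
      const jobId = await ctx.db.insert("reservationJobs", {
        reservationId,
        status: "pending",
      });
      // Roughly one new action every 500 ms instead of hundreds at t = 0.
      await ctx.scheduler.runAfter(delayMs, internal.reservations.syncOne, {
        jobId,
        reservationId,
      });
      delayMs += 500;
    }
  },
});
```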
What is your instance name?
https://knowing-emu-505.convex.cloud
we might've overloaded it since a lot of our scheduled actions are hanging
this is p0 for us rn since it's blocking prod activity. Is there a way to clear all scheduled actions?
You can do that from the dashboard
Go to the functions tab and find the action you have scheduled => Scheduled Runs => "cancel all"
Hmm, there don't seem to be any scheduled fns
Hmm... I also don't see any issues with knowing-emu. What is hanging exactly?
The scheduler doesn't seem to run the action after the delay. Line 124 in
actions/pmsData:fetchAndStoreReservations
thought it was auto-blocked due to congestion
it runs fine on the dev instance https://mellow-elephant-424.convex.cloud
Hmm... I don't see any backlog or any errors for knowing-emu-505. Are you positive the scheduling is happening? If it is, it must show in the dashboard as pending (if scheduled in the future) or as executed in the logs (if already executed).
the same logic is deployed on dev and prod but the prod scheduler doesn't fire the action after 1000ms
Yes, I don't see any scheduled executions, but also don't see any errors. How are you scheduling the functions? Is it from a cron or mutation trigger?
You have to await runAfter
It is an async operation.
Ah okay, let me try again. It had been working okay in development, but that might be it.
Yeah, it is a race. Since it runs in Node.js, we can't guarantee we wait for all futures to complete.
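For reference, the difference is just the missing await on the scheduling call. This is a sketch, not the actual pmsData code; the wrapper action name and the empty args object are placeholders:

```ts
import { internalAction } from "./_generated/server";
import { internal } from "./_generated/api";

export const kickOff = internalAction({
  handler: async (ctx) => {
    // Dangling promise: in a Node.js action the handler can return before this
    // completes, so the scheduled run may never be recorded.
    //   ctx.scheduler.runAfter(1000, internal.actions.pmsData.fetchAndStoreReservations, {});

    // Awaited: the scheduled run is registered before the handler returns.
    await ctx.scheduler.runAfter(
      1000,
      internal.actions.pmsData.fetchAndStoreReservations,
      {} // placeholder args
    );
  },
});
```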
sweet thank you totally missed that. appreciate the help!
working now
No worries. I see why having to await the scheduling might be confusing. This is a simple case of dangling promises that we could likely throw an error for, so it fails more loudly. We wouldn't be able to do that if you had other nested promises, but we can detect the most basic/common case.
Gotcha that makes sense. Only other issue we're having is the overloaded error
Where do you see those errors? Is it in the action logs or when you call it from the browser? Is that the dev or prod instance?
Yeah, I saw the error on our side. The transactions are failing since they execute concurrently and conflict with each other. Is it possible the transactions conflict (they read/modify the same rows)? Do you see OptimisticConcurrencyControlFailure in the dashboard logs? The error message should have more details.
A stopgap solution might be to add some delay between the mutations. The proper fix is to make sure the mutations don't read the entire table; a common solution is to use an index.
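A sketch of that index fix, assuming a hypothetical conversations table keyed by reservationId (the table, field, and index names are made up):

```ts
// convex/schema.ts (excerpt) — hypothetical table and index.
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  conversations: defineTable({
    reservationId: v.string(),
    externalId: v.string(),
  }).index("by_reservation", ["reservationId"]),
});
```

```ts
// convex/conversations.ts — hypothetical query showing both access patterns.
import { v } from "convex/values";
import { query } from "./_generated/server";

export const byReservation = query({
  args: { reservationId: v.string() },
  handler: async (ctx, { reservationId }) => {
    // Full table scan: the read set is the whole table, so concurrent
    // mutations are far more likely to conflict under OCC.
    // const all = await ctx.db.query("conversations").collect();
    // return all.filter((c) => c.reservationId === reservationId);

    // Indexed read: only the rows for this reservation are read, shrinking
    // the read set and the chance of OptimisticConcurrencyControlFailure.
    return await ctx.db
      .query("conversations")
      .withIndex("by_reservation", (q) => q.eq("reservationId", reservationId))
      .collect();
  },
});
```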
action logs, and in the ErrorMessage portion of the scheduled job
I think they read the same rows but don't modify them
And the error message is just this
Got it. I'm using indexes for most queries now, so I'll keep you updated.