punn
punn2y ago

Function execution timed out (120s)

One of our larger actions that imports a ton of data from an external source takes quite a while to execute and we're hitting the timeout. Are there any suggested patterns we should employ? The flow is as follows: 1. fetch all listingIds, then fetch and store all reservationIds; 2. for each reservationId, fetch and store all conversationIds; 3. for each conversationId, fetch and store all messages.
26 Replies
punn
punnOP2y ago
Should I also be scheduling this action instead of calling it from the client outright? If I schedule the action, the client is free to do anything else correct?
ballingt
ballingt2y ago
I'd suggest chunking up this work and recording your progress in the DB. Each chunk should ideally run for less than a minute; when it gets to a stopping point, record your progress in the database and schedule that function to run again. This way, even if there's a failure, you can resume since you've been recording progress in the DB. You can write a query that reports progress too, for a client-side progress bar.
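(For illustration, a minimal sketch of this chunk-and-reschedule pattern, using the string-based function paths shown elsewhere in this thread; the external URL, the "importProgress:storeChunk" mutation, and "actions/pmsData:importChunk" are hypothetical names, and the exact ctx shape depends on your Convex version.)

import { action } from "../_generated/server";

// Sketch: process one chunk per invocation, record progress, then reschedule.
export const importChunk = action(async ({ runMutation, scheduler }, { cursor }) => {
  // Fetch one page of external data starting at `cursor` (hypothetical API).
  const res = await fetch(`https://pms.example.com/reservations?cursor=${cursor ?? ""}`);
  const { items, nextCursor } = await res.json();

  // Store the chunk and record progress in one mutation, so a failure can
  // resume from the last recorded cursor and a query can report progress.
  await runMutation("importProgress:storeChunk", { items, cursor: nextCursor });

  // More to do? Schedule this same action to continue immediately.
  if (nextCursor) {
    await scheduler.runAfter(0, "actions/pmsData:importChunk", { cursor: nextCursor });
  }
});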
ballingt
ballingt2y ago
If I schedule the action, the client is free to do anything else correct?
I don't quite follow this, could you say more? Whether you schedule the action or call it directly from the client, the client is free to do whatever it wants. Mutations run one by one in Convex but actions don't block anything, see https://blog.convex.dev/announcing-convex-0-14-0/#breaking-actions-are-now-parallelized for more
Convex News
Announcing Convex 0.14.0
Meet the brand new Convex Rust client! We’ve also open sourced the JS client, added schema validation, and more. 0.14.0 is the best version of Convex yet!
ian
ian2y ago
You might consider something like calling a function (mutation or action) that schedules an action for every reservationId. that action does steps 2 & 3 for the reservationId, then marks that reservationId as done in a table. This Stack post has some tips on patterns for tracking job state: https://stack.convex.dev/background-job-management
Background Job Management
Implement asynchronous job patterns using a table to track progress. Fire-and-forget, cancelation, timeouts, and more.
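(A minimal sketch of this fan-out pattern, for illustration; the table and function names are hypothetical, and the string-based paths match the style used later in this thread.)

import { mutation } from "./_generated/server";

// Kick off one scheduled action per reservationId, tracking each in a table.
export const startReservationImport = mutation(async ({ db, scheduler }, { reservationIds }) => {
  for (const reservationId of reservationIds) {
    // One row per unit of work so progress, retries, and failures are visible.
    const jobId = await db.insert("importJobs", { reservationId, status: "pending" });
    // The scheduled action does steps 2 & 3 for this reservation, then calls a
    // mutation that marks the importJobs row as "done".
    await scheduler.runAfter(0, "actions/pmsData:processReservation", { reservationId, jobId });
  }
});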
punn
punnOP2y ago
Ahh, that makes sense. Thanks a lot for the explanations + examples; I'll try a few different approaches. I'm a bit new to writing production code with lots of data, so I haven't come across these issues very often. I'm getting a lot of TokenExpired errors after calling the action from the client, although the same actions work fine on the dashboard. Seems like for actions that take a while to execute (20+ secs), the token for the websocket connection is lost?
ballingt
ballingt2y ago
Ah gotcha, we'll look into that. As a workaround for now and as a generally more resilient approach (since it doesn't require connectivity through the duration of the action), scheduling these actions to run immediately is a nice way to do things.
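(Sketch of that workaround: the client calls a quick mutation that schedules the long-running action to run right away, so the client connection only needs to stay up for the mutation; function names here are hypothetical.)

import { mutation } from "./_generated/server";

export const startImport = mutation(async ({ scheduler }, args) => {
  // Returns as soon as the action is enqueued; the import runs server-side.
  await scheduler.runAfter(0, "actions/pmsData:runImport", args);
});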
presley
presley2y ago
We will look into fixing auth so it's only checked at the beginning of the action. Another workaround in the meantime is to make the Clerk token expiry longer than 1 minute; we recommend 1h as a default setting.
punn
punnOP2y ago
{"code":"Overloaded","message":"InternalServerError: Your request couldn't be completed. Try again later."} Would this be due to scheduling with too little delay between each action?
presley
presley2y ago
Yes, this is likely. I think this has resulted in many concurrent actions that all queued up queries/mutations at the same time? What is your instance name?
punn
punnOP2y ago
https://knowing-emu-505.convex.cloud. We might've overloaded it, since a lot of our scheduled actions are hanging. This is P0 for us right now since it's blocking prod activity. Is there a way to clear all scheduled actions?
presley
presley2y ago
You can do that from the dashboard: go to the functions tab and find the action you have scheduled => Scheduled Runs => "cancel all".
punn
punnOP2y ago
Hmm there doesn't seem to be any scheduled fns
presley
presley2y ago
Hmm... I also don't see any issues with knowing-emu. What is hanging exactly?
punn
punnOP2y ago
The scheduler doesn't seem to run the action after the delay (line 124 in actions/pmsData:fetchAndStoreReservations). I thought it was auto-blocked due to congestion, but it runs fine on the dev instance https://mellow-elephant-424.convex.cloud
presley
presley2y ago
Hmm... I don't see any backlog or any errors for knowing-emu-505. Are you positive scheduling is happening? If it is, it must show in the dashboard as pending (if scheduled in the future) or as executed in the logs (if already executed).
punn
punnOP2y ago
The same logic is deployed on dev and prod, but the prod scheduler doesn't fire the action after 1000ms.
presley
presley2y ago
Yes, I don't see any scheduled executions, but also don't see any errors. How are you scheduling the functions? Is it from a cron or mutation trigger?
punn
punnOP2y ago
let delay = 1000;
for (const listingId of listingIds) {
  const jobId: Id<"scheduledJobs"> = await runMutation(
    "scheduledJobs:addJob",
    {
      userId: user?._id as Id<"users">,
      pmsPlatform,
      type: "listing_import",
      status: "pending",
      startedAt: Date.now(),
      details: listingId,
    }
  );

  scheduler.runAfter(delay, "actions/pmsData:fetchFlowForListing", {
    listingId,
    pmsPlatform,
    jobId,
    userId: user?._id as Id<"users">,
  });

  jobIds.push(jobId);
}
presley
presley2y ago
You have to await runAfter. It is an async operation.
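(That is, the scheduling call in the snippet above needs an await so the action doesn't finish before the schedule is enqueued:)

await scheduler.runAfter(delay, "actions/pmsData:fetchFlowForListing", {
  listingId,
  pmsPlatform,
  jobId,
  userId: user?._id as Id<"users">,
});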
punn
punnOP2y ago
Ah okay, let me try again. It had been working okay in development, but that might be it.
presley
presley2y ago
Yeah, it is a race. Since it runs in Node.js, we can't guarantee we wait for all futures to complete.
punn
punnOP2y ago
Sweet, thank you, totally missed that. Appreciate the help! Working now.
presley
presley2y ago
No worries. I see why having to await the scheduling might be confusing. This is a simple case of dangling promises that we could likely throw an error for, so it fails more loudly. We wouldn't be able to do that if you had other nested promises, but we can detect the most basic/common case.
punn
punnOP2y ago
Gotcha, that makes sense. The only other issue we're having is the overloaded error.
presley
presley2y ago
Where do you see those errors? Is it in the action logs or when you call it from the browser? Is that the dev or prod instance? Yeah, I saw the error on our side. The transactions are failing since they execute concurrently and conflict with each other. Is it possible the transactions conflict (they read/modify the same rows)? Do you see OptimisticConcurrencyControlFailure in the Dashboard logs? The error message should have more details. A stopgap solution might be to add some delay between the mutations. A proper fix is to make sure the mutations don't read the entire table, for which a common solution is to use an index.
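(Sketch of the index fix: instead of a mutation scanning the whole table, which makes concurrent mutations read overlapping data and conflict, read only the rows you need via an index. The index name and field here are hypothetical, and the index must also be declared in the schema.)

// Inside a mutation handler, assuming the scheduledJobs table declares
// .index("by_user", ["userId"]) in the schema:
const jobs = await db
  .query("scheduledJobs")
  .withIndex("by_user", (q) => q.eq("userId", userId))
  .collect();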
punn
punnOP2y ago
In the action logs and in the ErrorMessage portion of the scheduled job. I think they read the same rows but don't modify them, and the error message is just this. Got it, I'm using indexes for most queries now, so I'll keep you updated.
