punn · 2y ago

Transient Error

I get occasional transient errors when I make a change to a Convex function without reloading the web page. However, it now seems to happen even when there aren't any changes. Is there something I should look out for to fix this issue?
9 Replies
ian · 2y ago
What output do you see for these errors? Is it client-side or coming from a Convex function call? The files re-generated by npx convex dev sometimes cause errors in my client code, which makes the browser flash because of Vite's hot reloading. I also once had an issue where I had configured Vite to do the module reloading over HTTPS but was hitting it on localhost (HTTP), which caused client-side issues. If it's server-side errors, then if you DM me your project slug we can look for errors on our side.
punn (OP) · 2y ago
I think it's server side
(screenshot attached)
punn (OP) · 2y ago
And it also happens on the production deployment.
ian · 2y ago
Gotcha. If you send along your deployment ID, I can check for errors in Sentry. It's in the deployment URL: two words and a number. Unlike mutations, actions will not automatically be retried on transient failure, and during deploys there may be a slightly elevated likelihood of them. You can add your own retry logic on the client, which will hopefully handle most cases. We're working to make transient failures as unlikely as possible, but with unpredictable networks and such, it'll never be 100%.
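For example, a minimal client-side retry could look something like the sketch below. The helper name and the 500 ms pause are illustrative choices, not a Convex API, and it only makes sense for actions that are safe to run twice:

```ts
// Hedged sketch of one client-side retry around an action call.
// `callAction` stands in for however you invoke the action (e.g. the function
// returned by useAction); only do this for actions that are safe to run twice.
async function callWithOneRetry<T>(callAction: () => Promise<T>): Promise<T> {
  try {
    return await callAction();
  } catch {
    // Brief pause before the single retry.
    await new Promise((resolve) => setTimeout(resolve, 500));
    return await callAction();
  }
}
```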
RJ · 2y ago
@ian How would you recommend protecting against these kinds of failures for actions which are triggered by HTTP functions? And are HTTP functions themselves also vulnerable to these sorts of failures? (Sorry to hijack your thread @punn, I can also create a separate one to not clutter!)
ian · 2y ago
If the action is being executed via runAction, you could implement a basic retry with a try/catch (since the failed action will raise an exception). Some general guidelines I use when implementing retries:
1. Try to distinguish, by the exception raised, whether it's a transient/temporary error or a "permanent" failure. If you retry something that's bound to fail again, you've essentially doubled your failed traffic with no benefit. Some status codes that imply transience to me: 408, 409, 425, 429, 502, 503, 504. There's some risk in retrying the ones that already indicate the server is overloaded, like 504, but ideally the server does a good job of shedding load to allow a later request to succeed. 500s are controversial. I find that 500s worth retrying ought to be caught in user space and returned as a response that explicitly encourages a retry. E.g. if a third-party service returns a transient error, either retry on the server or return a response that gives the client enough context to decide whether to retry.
2. Wait a bit before retrying. If you're retrying more than once, wait longer each time (exponential backoff); otherwise you can have a "thundering herd" problem. My default is 100ms, then 500ms, then 1s. If you want to wait 10s+, you could use the scheduler to try again in the future.
3. Ideally wait a random amount of time before retrying (jitter), in case a bunch of requests are competing in a burst, to spread the second wave of the burst out.
4. Log whether it succeeded on the 1st try, the 2nd, ..., or never. If you rarely see something succeed on a second try, you might be retrying a permanent failure (e.g. a 400).
5. Be careful of nesting retry layers. If the server is doing 3 retries and the client does 3 retries, then your traffic can 9x if you push code that starts failing, or if there's a network hiccup / server restart.
It could probably be a good Stack post at some point, but hopefully that's helpful for now.
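Putting guidelines 1–3 together, a retry helper with exponential backoff and jitter might look roughly like this sketch. The helper name, delay schedule, and status-code set are illustrative, not a Convex API:

```ts
// Status codes that typically indicate a transient failure (see guideline 1).
export const TRANSIENT_STATUSES = new Set([408, 409, 425, 429, 502, 503, 504]);

type RetryOptions = {
  retries?: number;                        // retries after the first attempt
  delaysMs?: number[];                     // backoff schedule; the last value repeats
  isTransient?: (err: unknown) => boolean; // classify errors; permanent failures aren't retried
};

export async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const { retries = 3, delaysMs = [100, 500, 1000], isTransient = () => true } = opts;
  for (let attempt = 0; ; attempt++) {
    try {
      const result = await fn();
      // Guideline 4: log which attempt succeeded.
      if (attempt > 0) console.info(`Succeeded on attempt ${attempt + 1}`);
      return result;
    } catch (err) {
      if (attempt >= retries || !isTransient(err)) throw err;
      const base = delaysMs[Math.min(attempt, delaysMs.length - 1)];
      const jitter = Math.random() * base; // guideline 3: spread a burst out
      await new Promise((resolve) => setTimeout(resolve, base + jitter));
    }
  }
}
```

An `isTransient` predicate could, for instance, check a response's status code against `TRANSIENT_STATUSES` and throw immediately for anything else (like a 400), per guideline 1.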
ian · 2y ago
Just chatted with some folks about retries: for httpEndpoints, errors thrown in your code are returned as 500s, but we may change that so you can assume 500s are Convex's responsibility. And to reiterate for others reading this why you'd need to do this yourself vs. just having us auto-retry actions / httpEndpoints (and as a reminder, we do retry queries & mutations for you): we can't guarantee that things with side effects are safe to execute more than once, so we push the responsibility to developers to decide which to retry. Some of this is in the post https://stack.convex.dev/background-job-management but it's a bit high-level / hand-wavy. If I write a post on it, some other things I could cover:
1. Implementing idempotency keys so transactional things can be retried safely (a rough sketch follows below)
2. Checkpointing actions via database writes, to provide more granular info on where the failed invocation made it to
3. Code snippets for exponential backoff retry helpers
Background Job Management
Implement asynchronous job patterns using a table to track progress. Fire-and-forget, cancelation, timeouts, and more.
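As a rough illustration of point 1 above: a mutation can check an idempotency key before writing, so a retried action or HTTP request that re-runs the mutation doesn't duplicate the write. This is a sketch under assumptions, not an official pattern: the "payments" table, its "by_key" index on "idempotencyKey", and the field names are placeholders that would need a matching schema definition.

```ts
// convex/payments.ts - hedged sketch of an idempotency-key check inside a mutation,
// using the current Convex function syntax.
import { mutation } from "./_generated/server";
import { v } from "convex/values";

export const recordPayment = mutation({
  args: { idempotencyKey: v.string(), amount: v.number() },
  handler: async (ctx, { idempotencyKey, amount }) => {
    // Look for a previous write with the same key (assumes a "by_key" index).
    const existing = await ctx.db
      .query("payments")
      .withIndex("by_key", (q) => q.eq("idempotencyKey", idempotencyKey))
      .unique();
    if (existing !== null) {
      // Already recorded: a retry of the same logical request becomes a no-op.
      return existing._id;
    }
    return await ctx.db.insert("payments", { idempotencyKey, amount });
  },
});
```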
martin · 2y ago
We've been encountering the same issue as @punn for the last week, and it's entirely inconsistent. Sometimes the transient error happens all the time, preventing our app from working at all. At other times, it only happens when switching Clerk workspaces (which causes a reauth). And still other times, like this afternoon, the issue doesn't occur in any of our environments. It's very puzzling.
ballingt · 2y ago
Thanks for reporting, all. Looking into it more, these transient errors are caused by WebSocket reconnects (due to auth changes, network errors, or other errors) while actions are running. As long as the WebSocket was connected when the action was called on the client, the action is going to run; check the dashboard logs to confirm this for a particular action run. However, we're not doing the extra work to prove to the client that these actions completed if there's a WebSocket disconnection, and their return values won't be sent back to the client after a reconnect. Since the client doesn't know whether they succeeded, it treats this as a failure and throws this "transient error." We're doing several things to improve this, some of which will be in the next client release. In the meantime, you can handle these in dev by adding error handlers to the actions. In prod these already shouldn't be firing loudly.
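One way to add such an error handler on the client is to catch the rejected promise from the action call rather than letting it surface as an unhandled error. This is a sketch assuming a React client with useAction; the action reference, argument shape, and import path are placeholders:

```ts
// Hedged sketch of a dev-side error handler around an action call from a React client.
import { useAction } from "convex/react";
import { api } from "../convex/_generated/api";

// Call inside a React component.
function useSendWithHandler() {
  const send = useAction(api.messages.send); // hypothetical action reference
  return (body: string) =>
    send({ body }).catch((err: unknown) => {
      // The action may still have completed server-side; check the dashboard logs.
      console.warn("Action call rejected on the client (possible WebSocket reconnect):", err);
    });
}
```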
