atadan
atadan22h ago

r2/actionRetrier cleanupExpiredRuns cron job stuck at loop

I'm noticing this message repeatedly appearing in the logs, and it seems to be originating from r2/actionRetrier. @erquhart
20 Replies
erquhart
erquhart22h ago
Hmm, that's kind of what it's supposed to do: the errors look like the system was down, so it's retrying. How often are you seeing this?
atadan
atadanOP22h ago
I first noticed it looping today, even though I'm not currently trying to upload anything.
erquhart
erquhart21h ago
Every 24 hours it runs a cron to clean up expired runs, but it should only loop like this if the action fails for some reason. So seeing this once is normal.
atadan
atadanOP21h ago
So there's no retry limit? And does it stop after 24 hours? Because it's still trying.
erquhart
erquhart21h ago
Hmm that's a system error that should have just been intermittent. I also don't see that the action retrier is actually written to retry this action at all, so I'm honestly not sure what's happening here. cc/ @Ian in case you have any insights
ian
ian21h ago
How often is the cron running, and is there data in there? It looks like there's some retryable failure. The scheduler will continue retrying mutations if they hit OCC (optimistic concurrency control) conflicts, for example. Maybe it's incorrectly classifying the failure, or the runs are all competing for writes.
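To illustrate the "retry on a retryable failure, with backoff" behavior described above, here is a minimal sketch. It is not Convex's scheduler code: the `Outcome` type, the function names, and the 1 s base / 60 s cap are all assumptions chosen for illustration.

```typescript
// Hypothetical classification of one attempt: only "conflict" is retryable.
type Outcome = "ok" | "conflict" | "error";

interface RetryResult {
  attempts: number;  // how many times the function was invoked
  delays: number[];  // backoff delays (ms) scheduled between attempts
  final: Outcome;    // outcome of the last attempt
}

// Exponential backoff: base * 2^attempt, capped at maxMs (assumed values).
function backoffMs(attempt: number, baseMs: number = 1000, maxMs: number = 60000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Calls attemptFn until it stops reporting a conflict or maxAttempts is reached.
// Delays are recorded rather than slept so the sketch stays synchronous.
function retryWithBackoff(
  attemptFn: (attempt: number) => Outcome,
  maxAttempts: number
): RetryResult {
  const delays: number[] = [];
  let final: Outcome = "error";
  let attempts = 0;
  while (attempts < maxAttempts) {
    final = attemptFn(attempts);
    attempts++;
    if (final !== "conflict") break; // success or non-retryable failure
    if (attempts < maxAttempts) delays.push(backoffMs(attempts - 1));
  }
  return { attempts, delays, final };
}
```

If a failure is wrongly classified as a conflict, or every attempt really does conflict, a loop like this keeps rescheduling until its attempt limit, which would look like the repeated log lines in this thread.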
atadan
atadanOP21h ago
Totally, I don't see it either. Thanks for the help. Normally it should run every 24 hours, but right now it retries every second with some backoff.
erquhart
erquhart21h ago
The r2 component sets up one instance of the action retrier so I assume this is just one function rerunning and not multiple
ian
ian21h ago
I can't find this function or usage of the action retrier in r2, so I'm looking in the wrong spot. @erquhart i'll leave it to you for now
atadan
atadanOP21h ago
I found this—one of the runs was skipped. Maybe it will help.
erquhart
erquhart21h ago
Can you check how many documents are in the runs table of the retrier?
atadan
atadanOP21h ago
There are 49 documents but none of them has numFailures
erquhart
erquhart21h ago
Yeah, that should be a very fast mutation. This isn't actually retrier stuff; it's just a simple mutation that deletes all completed records, and it's not actually retried by the retrier: https://github.com/get-convex/action-retrier/blob/c4c74363b80ff8503e74fd30819e724956007964/src/component/run.ts#L312-L330 Will see if someone can take a look.
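As a rough picture of what "a simple mutation that deletes all completed records" could look like, here is an in-memory sketch. It is not the code at the link above: the `Run` shape, the `status` and `completedAt` fields, and the `maxAgeMs` expiry parameter are hypothetical stand-ins.

```typescript
// Hypothetical shape of a row in the retrier's runs table.
interface Run {
  id: string;
  status: "inProgress" | "completed";
  completedAt?: number; // ms since epoch; set only once the run has completed
}

// Splits runs into those to keep and the ids of completed runs older than maxAgeMs.
// In-progress runs are never deleted, whatever their age.
function cleanupExpiredRunsSketch(
  runs: Run[],
  now: number,
  maxAgeMs: number
): { kept: Run[]; deleted: string[] } {
  const kept: Run[] = [];
  const deleted: string[] = [];
  for (const run of runs) {
    const expired =
      run.status === "completed" &&
      run.completedAt !== undefined &&
      now - run.completedAt >= maxAgeMs;
    if (expired) {
      deleted.push(run.id);
    } else {
      kept.push(run);
    }
  }
  return { kept, deleted };
}
```

A single pass like this over a few dozen rows is cheap, which is why a table of 49 documents would not be expected to make the cleanup slow.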
atadan
atadanOP20h ago
The only reason it should be stuck is if the number of documents exceeds 1024, but I only have 49. I tried deleting the records manually, but that didn't help, and it's still running.
erquhart
erquhart20h ago
If you leave it be, it might help whoever looks into it if it remains in a failure state - assuming this isn't having any known impact on your project
atadan
atadanOP20h ago
It's fine, I'll leave it as is.
erquhart
erquhart20h ago
I'd expect that to cause failure 100% of the time, but you have successful runs in the logs
atadan
atadanOP19h ago
You’re right—if no arguments are being passed to the function, it should cause a failure every time. It’s curious that there are still successful runs in the logs. Still investigating
sonandmjy
sonandmjy8h ago
Actually, now that I see this... I am also getting it, and my fail rate is at 100%. Any ideas? 🤔
