r2/actionRetrier cleanupExpiredRuns cron job stuck in a loop
I'm noticing this message repeatedly appearing in the logs, and it seems to be originating from r2/actionRetrier. @erquhart

20 Replies
Hmm, that's kind of what it's supposed to do. The errors look like the system was down, so it's retrying. How often are you seeing this?
I first noticed it looping today, even though I'm not currently trying to upload anything.
Every 24 hours it runs a cron to clean up expired runs, but it should only loop like this if the action fails for some reason.
So this is normal to see once like this.
So there's no retry limit? And does it stop after 24 hours?
Cause it's still trying
Hmm that's a system error that should have just been intermittent. I also don't see that the action retrier is actually written to retry this action at all, so I'm honestly not sure what's happening here.
cc/ @Ian in case you have any insights
How often is the cron running? and is there data in there?
It looks like there's some retryable failure. The scheduler will continue retrying mutations if they fail with an OCC (optimistic concurrency control) conflict, for example.
Maybe it's incorrectly classifying it, or they're all competing for writes..
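For anyone unfamiliar with OCC retries, here is a minimal in-memory sketch of the idea: a write only commits if the document version is unchanged since it was read, otherwise the whole transaction is re-run. The `Store` class, `runWithOccRetry`, and the `beforeCommit` hook are all hypothetical names made up for illustration; this is not how the Convex scheduler is implemented.

```typescript
// Hypothetical model of optimistic concurrency control (OCC).
// None of these names come from Convex; they only illustrate the idea.
type Versioned<T> = { value: T; version: number };

class Store<T> {
  private doc: Versioned<T>;

  constructor(initial: T) {
    this.doc = { value: initial, version: 0 };
  }

  // Return a copy so callers cannot mutate the stored document directly.
  read(): Versioned<T> {
    return { value: this.doc.value, version: this.doc.version };
  }

  // Commit succeeds only if nobody wrote since `readVersion` was observed.
  commit(readVersion: number, next: T): boolean {
    if (readVersion !== this.doc.version) {
      return false; // conflict: the caller is expected to retry
    }
    this.doc = { value: next, version: readVersion + 1 };
    return true;
  }
}

// Re-run the whole read-compute-commit cycle on conflict, up to `maxAttempts` times.
// Returns the number of attempts it took to commit.
function runWithOccRetry<T>(
  store: Store<T>,
  update: (current: T) => T,
  maxAttempts: number,
  beforeCommit?: (attempt: number) => void, // hook used to simulate a competing writer
): number {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const snapshot = store.read();
    const next = update(snapshot.value);
    if (beforeCommit) beforeCommit(attempt);
    if (store.commit(snapshot.version, next)) {
      return attempt;
    }
  }
  throw new Error("gave up after " + maxAttempts + " conflicting attempts");
}
```

If a competing writer touches the same document on every attempt, the loop never commits, which is the "all competing for writes" scenario guessed at above.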
Totally, I don't see it either. Thanks for the help.
Normally it should run every 24 hours, but right now it retries every second
with some backoff
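To make "retries with some backoff" concrete, here is a rough sketch of a capped exponential backoff schedule. The function name and the default values (1 second initial delay, doubling, 60 second cap) are assumptions for illustration only; the scheduler's actual parameters aren't stated in this thread.

```typescript
// Hypothetical backoff calculation; the defaults are placeholders,
// not values taken from Convex or the action retrier.
function backoffDelayMs(
  attempt: number, // 0 for the first retry, 1 for the second, and so on
  initialMs: number = 1000,
  base: number = 2,
  maxMs: number = 60000,
): number {
  if (!Number.isInteger(attempt) || attempt < 0) {
    throw new RangeError("attempt must be a non-negative integer");
  }
  // Delay grows geometrically with each attempt and is clamped at maxMs.
  return Math.min(maxMs, initialMs * Math.pow(base, attempt));
}
```

With these placeholder defaults the delays start at one second and stop growing once they hit the cap, so a job that keeps failing would still show up in the logs at a steady rate.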
The r2 component sets up one instance of the action retrier so I assume this is just one function rerunning and not multiple
I can't find this function or any usage of the action retrier in r2, so I must be looking in the wrong spot. @erquhart I'll leave it to you for now
I found this—one of the runs was skipped. Maybe it will help.

Can you check how many documents are in the runs table of the retrier?
There are 49 documents
but none of them has numFailures
Yeah, that should be a very fast mutation. This isn't actually retrier logic; it's just a simple mutation that deletes all completed records, and it isn't actually retried by the retrier: https://github.com/get-convex/action-retrier/blob/c4c74363b80ff8503e74fd30819e724956007964/src/component/run.ts#L312-L330
Will see if someone can take a look.
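For readers following along, a cleanup pass like the one described above can be sketched roughly as follows. This is an in-memory approximation, not the linked source: the `Run` shape, the `completedAt` field, the age cutoff, and the batch size are all assumptions made for illustration.

```typescript
// Hypothetical row shape for the retrier's runs table (field names assumed).
type Run = { id: string; completedAt?: number };

type CleanupResult = {
  remaining: Run[]; // rows left in the table after this pass
  deleted: string[]; // ids removed in this pass
  hasMore: boolean; // true if expired rows remain and another pass is needed
};

// Delete up to `batchSize` runs that completed more than `maxAgeMs` ago.
// Assumes run ids are unique.
function cleanupExpiredRuns(
  runs: Run[],
  now: number,
  maxAgeMs: number,
  batchSize: number,
): CleanupResult {
  const cutoff = now - maxAgeMs;
  // Runs that never completed have no completedAt and are never expired.
  const expired = runs.filter(
    (r) => r.completedAt !== undefined && r.completedAt < cutoff,
  );
  const deleted = expired.slice(0, batchSize).map((r) => r.id);
  return {
    remaining: runs.filter((r) => deleted.indexOf(r.id) === -1),
    deleted: deleted,
    hasMore: expired.length > deleted.length,
  };
}
```

A pass like this touches only a bounded number of rows, so with a few dozen documents in the table it would be expected to finish in a single quick run.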
The only reason it should be stuck is if the number of documents exceeds 1024, but I only have 49. I tried deleting the records manually, but that didn’t help.
and it's still running
If you leave it be, it might help whoever looks into it if it remains in a failure state - assuming this isn't having any known impact on your project
This might be the reason: no arguments are being passed to the function.
https://github.com/get-convex/action-retrier/commit/22c5d01feaa2681f76c0745b11fdb4d74d0d9032
It's fine, I'll leave it as is.
I'd expect that to cause failure 100% of the time, but you have successful runs in the logs
You’re right—if no arguments are being passed to the function, it should cause a failure every time. It’s curious that there are still successful runs in the logs. Still investigating
Actually, now that I see this... I'm also getting it, and my fail rate is at 100%. Any ideas? 🤔

