atadan
atadan4w ago

r2/actionRetrier cleanupExpiredRuns cron job stuck at loop

I'm noticing this message repeatedly appearing in the logs, and it seems to be originating from r2/actionRetrier. @erquhart
40 Replies
erquhart
erquhart4w ago
Hmm, that's kind of what it's supposed to do: the errors look like the system was down, so it's retrying. How often are you seeing this?
atadan
atadanOP4w ago
I first noticed it looping today, even though I'm not currently trying to upload anything.
erquhart
erquhart4w ago
Every 24 hours it runs a cron to clean up expired runs, but it should only loop like this if the action fails for some reason. So it's normal to see this once.
atadan
atadanOP4w ago
So there's no retry limit? Does it stop after 24 hours? Because it's still trying.
erquhart
erquhart4w ago
Hmm that's a system error that should have just been intermittent. I also don't see that the action retrier is actually written to retry this action at all, so I'm honestly not sure what's happening here. cc/ @Ian in case you have any insights
ian
ian4w ago
How often is the cron running? And is there data in there? It looks like there's some retry-able failure. For example, the scheduler will continue retrying mutations if they hit OCC conflicts. Maybe it's incorrectly classifying it, or they're all competing for writes.
atadan
atadanOP4w ago
I don't see it either, thanks for the help. Normally it should run every 24 hours, but right now it retries every second with some backoff.
erquhart
erquhart4w ago
The r2 component sets up one instance of the action retrier so I assume this is just one function rerunning and not multiple
ian
ian4w ago
I can't find this function or usage of the action retrier in r2, so I must be looking in the wrong spot. @erquhart I'll leave it to you for now.
atadan
atadanOP4w ago
I found this—one of the runs was skipped. Maybe it will help.
erquhart
erquhart4w ago
Can you check how many documents are in the runs table of the retrier?
atadan
atadanOP4w ago
There are 49 documents but none of them has numFailures
erquhart
erquhart4w ago
Yeah, that should be a very fast mutation. This isn't actually retrier stuff, it's just a simple mutation that deletes all completed records. It's a mutation, and it's not actually retried by the retrier: https://github.com/get-convex/action-retrier/blob/c4c74363b80ff8503e74fd30819e724956007964/src/component/run.ts#L312-L330 Will see if someone can take a look.
atadan
atadanOP4w ago
The only reason it should be stuck is if the number of documents exceeds 1024, but I only have 49. I tried deleting the records manually, but that didn't help, and it's still running.
erquhart
erquhart4w ago
If you leave it be, it might help whoever looks into it if it remains in a failure state - assuming this isn't having any known impact on your project
atadan
atadanOP4w ago
It's fine, I'll leave it as is.
erquhart
erquhart4w ago
I'd expect that to cause failure 100% of the time, but you have successful runs in the logs
atadan
atadanOP4w ago
You’re right—if no arguments are being passed to the function, it should cause a failure every time. It’s curious that there are still successful runs in the logs. Still investigating
sonandmjy
sonandmjy3w ago
Actually, now that I see this... I am also getting it, and my fail rate is at 100%. Any ideas? 🤔
atadan
atadanOP3w ago
I'm still trying to figure it out—facing the same problem. https://discord.com/channels/1019350475847499849/1357528982425436271
ampp
ampp3w ago
Yeah, I'm also getting this now. I just switched from local dev to cloud because I blew something up locally. It took searching here to find out that it was the action retrier, since I was checking my logs and not seeing anything (only viewing app). Normally I run npx convex dev --tail-logs disable, so I wouldn't have seen it.
jamwt
jamwt3w ago
Can you all DM me your deployment names so we can dig into logs on our side? Sorry about that, no idea what's up here, and obviously you're not getting a very clear error message. If you have a pro account and make a ticket from your dashboard, that will do it automatically. If not, DMing me is fine.
atadan
atadanOP3w ago
Reached out, thanks for the help. Adding @Zeroday too.
jamwt
jamwt3w ago
we've identified the issue. working on a fix here
atadan
atadanOP3w ago
Can I ask what the issue was? I love the technical side of things.
jamwt
jamwt3w ago
We'd made a change this week to Convex system tables that affected cron jobs inside components, so components which use cron jobs (like action retrier) were affected. @nipunn is working on rolling a fix out now.
nipunn
nipunn3w ago
https://github.com/get-convex/convex-backend/commit/486f13405114c85525ff1935fac24262d2d9410e if you are curious. It's very behind-the-scenes. Rolling it out now.
atadan
atadanOP3w ago
I'll look into this, thanks for the quick fix and explanations
nipunn
nipunn3w ago
just rolled it out - are you seeing improvement on your side?
ampp
ampp3w ago
Yeah, it's no longer showing the error as of 10 minutes ago.
nipunn
nipunn3w ago
excellent :phew:
ampp
ampp3w ago
While on the topic of seeing log entries 🙂 I noticed that cleanupExpiredStreams runs every minute. It's part of the new persistent-text-streaming component, which I installed (and haven't used yet). I did a quick search and didn't see anything in the docs about whether it needs to loop that aggressively, or why it runs so often.
nipunn
nipunn3w ago
https://github.com/get-convex/persistent-text-streaming/blob/main/src/component/crons.ts does look like it! @Jamie wrote the component and might have some wisdom. Probably doesn't need to run every minute, but also 🤷 doesn't hurt. It's not going to be unreasonably expensive. Seems fine. What about the aggressive running is bothersome? The log entries? There is a dropdown to filter log entries which could be useful.
atadan
atadanOP3w ago
Yeah, there is no issue on my end either.
jamwt
jamwt3w ago
We could make it configurable if you want to ramp it down to every N minutes or whatever. I did the math and it was like 4 cents of function calls a month, so it didn't seem like a big deal. Yeah, $0.043 per month for 1440 * 30 function calls.
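For anyone who wants to check the arithmetic, here is a small sketch that works only from the two figures quoted above (one run per minute for 30 days, and $0.043 per month). The per-million rate it prints is simply what those two numbers imply; it is not taken from a pricing page.

```typescript
// One cron run per minute: 24 hours * 60 minutes.
const runsPerDay: number = 24 * 60;

// The "1440 * 30" from the message above.
const runsPerMonth: number = runsPerDay * 30;

// Monthly cost as quoted in the thread (not looked up independently).
const quotedMonthlyCostUsd: number = 0.043;

// Rate implied by the two quoted figures, in USD per million calls.
const impliedUsdPerMillionCalls: number =
  (quotedMonthlyCostUsd / runsPerMonth) * 1_000_000;

console.log(`runs per month: ${runsPerMonth}`);
console.log(
  `implied rate: $${impliedUsdPerMillionCalls.toFixed(3)} per million calls`
);
```

Running the cron every N minutes instead would divide both the call count and the cost by N.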
erquhart
erquhart3w ago
I wonder if a good pattern would be having the cron run a query that only schedules a mutation when necessary, as the query would probably be cached most of the time when there are no rows to clean up. That will help when caching is treated differently in pricing/usage limits.
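A rough sketch of that "read first, write only when needed" idea, in plain TypeScript with an in-memory array standing in for the table. Everything here (the Row type, hasExpired, cleanupExpired, cronTick) is hypothetical and made up for illustration; it does not use Convex APIs, and how it would map onto Convex queries, mutations, and the scheduler is not shown.

```typescript
// Hypothetical row shape; an in-memory stand-in for a table, not Convex API.
type Row = { id: number; expiresAt: number };

// Cheap read-only check: is there anything to clean up at all?
function hasExpired(rows: Row[], now: number): boolean {
  return rows.some((r) => r.expiresAt <= now);
}

// The write path: drop expired rows and report how many were removed.
function cleanupExpired(
  rows: Row[],
  now: number
): { kept: Row[]; removed: number } {
  const kept = rows.filter((r) => r.expiresAt > now);
  return { kept, removed: rows.length - kept.length };
}

// What each cron tick would do: run the read, and only write if it found work.
function cronTick(rows: Row[], now: number): { rows: Row[]; wrote: boolean } {
  if (!hasExpired(rows, now)) {
    return { rows, wrote: false }; // nothing expired, skip the write entirely
  }
  return { rows: cleanupExpired(rows, now).kept, wrote: true };
}

const table: Row[] = [
  { id: 1, expiresAt: 100 },
  { id: 2, expiresAt: 200 },
];

const early = cronTick(table, 50); // nothing has expired yet
const later = cronTick(table, 150); // row 1 has expired
console.log(early.wrote, later.wrote, later.rows.length);
```

The point of the split is that most ticks would take only the read path, which is the part that could benefit from caching when the table is not changing.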
Zeroday
Zeroday3w ago
No issues on my end anymore either. I had a question though. I was getting that error repeatedly for about a day or two. Will this add to my bill?
jamwt
jamwt3w ago
@rebecca maybe you can help -- would that count as a billed function call?
rebecca
rebecca3w ago
if a function failed due to an internal system error i don't believe that would count, but please don't hesitate to reach out if your bill doesn't look right!