jamwt•3w ago

Slow deployments

Making a thread to gather info about slow deployments 🧵

38 Replies

jamwtOP•3w ago

if you're having issues, I'd love the output of npx convex network-test, and feel free to dm me your deployment slug (e.g. happy-rabbit-123). it's possible one particular database shard is having issues and our telemetry isn't catching it. we're still in the process of finishing our database migration

Gravitynomad•3w ago

Thanks for looking into this Jamie. DM sent with network test results and deployment slug. Good morning @Jamie 🌞 Everything seems to be better today.. and figured why not use the time to also optimize everything I can following the best practices guide... so I did. Dashboard loading still hangs from time to time but I'm relieved that the critical parts are working and users are able to use the site. Again, thanks for looking into this and if you have any extra updates, please let me know if you get a minute 🙂 Have an awesome week! Right now it's really really failing. Dashboards are not loading and client side queries are not working at all. 😭

jamwtOP•3w ago

@Gravitynomad hi! an update here is our team spent a while looking on sunday, and your deployment isn't having any issue. all the resources it's on are happy. do you have any other deployments that are not exhibiting this behavior? also, do you have a VPN just to try a separate route to our gateway? but yes, at least in the backend, no signs at all of overload, backpressure, or slowness, so I'm suspecting some kind of network or routing issue from you to our servers. your network test definitely shows poor connectivity to our servers, but the infra itself seems completely fine, based on a bit of digging/telemetry we dug up yesterday

Gravitynomad•3w ago

I tried VPN , yes...

jamwtOP•3w ago

what's the outcome of the network-test on that deployment (when you're on the VPN)? also, do you have other projects that are also having this issue for you, or is it isolated to one deployment

Gravitynomad•3w ago

I have another deployment but it's just a simple one page site with one query. nothing like my current project. It just totally fails.

jamwtOP•3w ago

and when you do the network-test with the simple site with one query, same thing? can you paste that one for comparison's sake?

Gravitynomad•3w ago

I'll try on the other deployment. One sec..

Gravitynomad•3w ago

yea.. this one fails too... it does get further though

jamwtOP•3w ago

yeah. that's a real bummer. but yeah, it makes me wonder if there's some kind of crazy network anomaly happening to you. one sec, some more diagnostic commands. are you on a pro plan, out of curiosity?

Gravitynomad•3w ago

that would suck 😄 I mean.. I dont have any connectivity issues with anything else really

jamwtOP•3w ago

also, if you dont mind sharing, where are you in the world?

Gravitynomad•3w ago

no.. not yet.. I'm moving my stuff out of supabase to see it working before I do

jamwtOP•3w ago

got it. the reason I was asking is custom domains use a completely different traffic layer -- cloudflare, which provides much better global POPs so we're not reliant on routing systems across the globe

Gravitynomad•3w ago

If it makes any difference, I'll be happy to be on the pro plan. I understand.

jamwtOP•3w ago

we've seen issues before when people are doing home runs to our gateway on us-east from lots of different places

Gravitynomad•3w ago

and maybe me being in SE Asia doesn't help that much either 😄

jamwtOP•3w ago

ah yes. so that has 100% been where we've seen this issue. I have no idea why, but some providers there give us crazy routes to our gateway. we're going to work on getting better traffic infrastructure for normal happy-animal-123 instance names too, but it hasn't been done yet. custom domains have, because our bigger customers in production needed this solved now, as you can imagine

Gravitynomad•3w ago

ok, understand. Well thanks Jamie, really appreciate your help, I understand that there might be an issue with this so I'll just call it a day for today, this really burned everything I had today and tomorrow I'll get into the pro sub and try again 🙂 So when I move into Pro, my deployments get migrated automatically?

jamwtOP•3w ago

unfortunately -- it is only implemented for custom domains right now, which you can set on your prod instance, but not dev. so this will make your production project work great, but doesn't fix your dev workflow (push, dashboard, etc). we're actually just starting a partnership with cloudflare to get all our traffic using their global POP infrastructure so we don't have this issue. the reputation of their IPs/network is obviously really solid everywhere in the world

Gravitynomad•3w ago

ok. thanks, I understand... as long as my users get to be ok, I'm fine with it and I can live with some hiccups on the dev side. No worries. Will do some tests tomorrow and see how it goes. Thanks again Jamie.

jamwtOP•3w ago

sorry for the issue. I should have asked right away where you were, because we've seen this once or twice before.

Gravitynomad•3w ago

It's ok. No worries. Thanks Jamie. 🙂

conradkoh•3w ago

hey Jamie, I hope it's ok that I join the thread here. I've encountered this quite a number of times, and I'm here in Singapore too - SE Asia definitely. I wanted to ask if you think it is a gateway / network issue, because most of the connectivity tests pass, with the exception of the websockets one. also wondering regarding the SSE tests that always seem to pass - whether it would be a consideration to use SSE instead of websockets for the real time functionality so that it is more reliable.

conradkoh•3w ago

I took a screen recording of the test when I encountered the issue. Often, the first 2 stages experience extremely slow speeds and pass, but at higher sizes they fail. Like in the video, the 1MB test will eventually fail but the lower 2 tests do pass after an extremely long duration.

Ranga•3w ago

Wondering if it’s only for you or your users as well? Are the users able to use your app @conradkoh ? I want to start bootstrapping my app with convex.

conradkoh•3w ago

Quite sure it was for everyone. I had a friend reaching out saying that his deployment was down as well. I tried on multiple computers and multiple networks and all of them were stuck in a loading state. To my knowledge this is the 3rd time that this has happened and the root cause is not known. I do know that during the last incident, I connected to a VPN and things worked fine. @Ranga here is a previous thread - https://discord.com/channels/1019350475847499849/1414556005030690858/1414556005030690858

jamwtOP•2w ago

@conradkoh hi! are you using custom domains?

conradkoh•2w ago

hey @jamwt, nope i’m not! i’m on a pro plan but not custom domains. i read the thread above - would you say that the network infra is significantly different for the http endpoints vs the websocket one though?

jamwtOP•2w ago

no, but I think some routes in the region don't reliably allow long-lived WS connections. but if the backhaul is running over cloudflare's network, then everything is fine, since they obviously will allow it

conradkoh•2w ago

oo i see. that makes sense. do you think there would be a case for using SSE as a fallback in the future? it feels like a real risk for users because when these things happen it’s almost entirely out of both our hands.

Ranga•2w ago

Interesting. I have seen couple of people complaining about this. Mostly around Asia. Have you tried self hosting this and seeing the same issue?

conradkoh•2w ago

nope i haven’t. but i don’t believe i’m more well versed than the convex team, so if they’re seeing issues i don’t think i’ll try to solve it on my own 😂 i’m going with my gut to say that it is one of the inherent problems of websockets over a large number of hops from the client to the server while trying to maintain a persistent connection - more expensive for all parties involved. so we in asia are more affected because more hops to the US based servers aka more chances of hitting a server that disrupts the connection..? so i’m watching for 1. regional deployments 2. SSE 3. self-hosted (last resort - i’d much prefer a pay per use model). i’m also thinking that a majority of users are not benefitting from the websockets that much because of convex’s pricing model where you are billed for each of those. so i feel that most people would be minimizing high frequency writes - which makes SSE make more sense..? i think. SSE is also more compatible with general web infra for HTTP from what i can tell. but I’m sure there are good reasons for why they chose this and I don’t think I have as much expertise as the Convex team on the matter, so I’d defer to them to weigh in on the trade offs and the rationale.

jamwtOP•2w ago

SSE won't happen for sure, and websockets are used pretty pervasively for lots of applications. but most bigger infra companies control their backhaul, thus our need to do it. when the connections change hands through like 14 intermediaries or whatever on long routes, there's a lot that can go wrong

conradkoh•2w ago

thanks for the reply @jamwt appreciate it. i gave it a bit more thought and i recall that smaller packets were going through from the convex tester. so this would imply that it actually is not likely to be a connection level issue. so maybe it is a bad deployment + sticky session with a specific issue on larger payloads..? can’t be sure though. anyways thanks for the help 🙌🏼. the system is up and i guess we’ll find out more the next time it happens!
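
The reasoning in this message (small payloads get through, larger ones fail, so it is probably not a pure connection-level problem) can be sketched as a small classifier. This is only an illustration: the `ProbeResult` shape and the `diagnose` function are hypothetical and are not the real output format or logic of `npx convex network-test`, and the labels are a rough heuristic rather than a root-cause analysis.

```typescript
// Hypothetical result shape; NOT the real output of `npx convex network-test`.
interface ProbeResult {
  bytes: number; // payload size that was sent
  ok: boolean;   // whether the round trip succeeded
}

type Diagnosis = "healthy" | "connection-level" | "size-dependent" | "inconclusive";

// Heuristic only: separates "nothing gets through" from
// "small payloads pass, larger ones fail".
function diagnose(results: ProbeResult[]): Diagnosis {
  if (results.length === 0) return "inconclusive";
  if (results.every((r) => r.ok)) return "healthy";
  if (results.every((r) => !r.ok)) return "connection-level";

  // Mixed outcome: both lists below are non-empty.
  const largestOk = Math.max(...results.filter((r) => r.ok).map((r) => r.bytes));
  const smallestFail = Math.min(...results.filter((r) => !r.ok).map((r) => r.bytes));

  // Size-dependent only if every success is smaller than every failure.
  return largestOk < smallestFail ? "size-dependent" : "inconclusive";
}
```

For the run described earlier in the thread (two smaller stages pass, the 1MB stage fails), this sketch would return "size-dependent", which matches the guess above without confirming it.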

Gravitynomad•2w ago

Wanted to confirm that using Custom Domains does, in fact, fix this issue. I really believe this is not a small problem and there should be a warning posted somewhere stating that websocket connections fail in more than one country in SE Asia for all deployments not using custom domains. (In the case of Vietnam, I can confirm it's the whole country, regardless of the internet connection and city.) This affects both admins and users, as well as server and Convex dashboard connectivity. This is 100% resolved by moving to the Pro plan and configuring custom domains. Thanks Jamie for the help, and for pointing me in the right direction. I'm happy it's resolved. 🙂
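
On the client side, the fix described here comes down to pointing the production app at the custom domain instead of the default deployment hostname. A minimal sketch, under stated assumptions: `resolveConvexUrl` is a hypothetical helper, `api.example.com` is a placeholder domain, and the default URL is assumed to have the form `https://<slug>.convex.cloud`.

```typescript
// Hypothetical helper: prefer a custom domain when one is configured,
// otherwise fall back to the default deployment hostname.
// Assumption: default deployment URLs look like https://<slug>.convex.cloud.
function resolveConvexUrl(slug: string, customDomain?: string): string {
  const trimmed = customDomain?.trim();
  if (trimmed) {
    // Accept either a bare hostname or a full https:// URL.
    return trimmed.startsWith("https://") ? trimmed : `https://${trimmed}`;
  }
  return `https://${slug}.convex.cloud`;
}

// Example: production uses the custom domain, dev keeps the default hostname.
const prodUrl = resolveConvexUrl("happy-rabbit-123", "api.example.com");
const devUrl = resolveConvexUrl("happy-rabbit-123");
```

As noted earlier in the thread, custom domains only apply to the prod instance, so a dev deployment would still use the default hostname.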

jamwtOP•2w ago

glad to hear it! yes, we'll move all traffic over to the cloudflare edge network soon

sqroot•2w ago

What is the definition of a "slow deployment"? We have a relatively new, small app and currently just making minor changes to code can take between 10 and 30 seconds: ----- 08:10:27 Convex functions ready! (26.95s) ----- We are based in South Africa. This was before today's network issues. Before we make the switch to a local dev environment, can someone let me know how long we should expect for code to be pushed to the dev environments each time we change a file?
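
To put numbers on a question like this, one option is to collect the "Convex functions ready!" lines from the dev terminal and summarize the push times. A small sketch, assuming the log lines keep the format quoted above; `parseReadySeconds` and `summarizePushTimes` are hypothetical helper names, not part of the Convex CLI.

```typescript
// Hypothetical helper: extracts the duration from lines such as
// "08:10:27 Convex functions ready! (26.95s)". Returns null for other lines.
function parseReadySeconds(line: string): number | null {
  const match = /Convex functions ready!\s*\((\d+(?:\.\d+)?)s\)/.exec(line);
  return match ? Number(match[1]) : null;
}

interface PushSummary {
  count: number;
  min: number;
  max: number;
  mean: number;
}

// Summarizes every matching line; returns null if no line matched.
function summarizePushTimes(lines: string[]): PushSummary | null {
  const secs = lines
    .map(parseReadySeconds)
    .filter((s): s is number => s !== null);
  if (secs.length === 0) return null;
  const total = secs.reduce((a, b) => a + b, 0);
  return {
    count: secs.length,
    min: Math.min(...secs),
    max: Math.max(...secs),
    mean: total / secs.length,
  };
}
```

Sharing a summary like this alongside the region and the network-test output gives a more concrete baseline than a single timing.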
