jamwt
jamwt3w ago

Slow deployments

Making a thread to gather info about slow deployments 🧵
jamwt
jamwtOP3w ago
if you're having issues, I'd love the output of npx convex network-test, and feel free to dm me your deployment slug (e.g. happy-rabbit-123). it's possible one particular database shard is having issues and our telemetry isn't catching it. we're still in the process of finishing our database migration
Gravitynomad
Gravitynomad3w ago
Thanks for looking into this, Jamie. DM sent with network test results and deployment slug.
Good morning @Jamie 🌞 Everything seems to be better today.. and figured why not use the time to also optimize everything I can following the best practices guide... so I did. Dashboard loading still hangs from time to time but I'm relieved that the critical parts are working and users are able to use the site. Again, thanks for looking into this and if you have any extra updates, please let me know if you get a minute 🙂 Have an awesome week!
Right now it's really really failing. Dashboards not loading and client side queries are not working at all. 😭
jamwt
jamwtOP3w ago
@Gravitynomad hi! an update here: our team spent a while looking on sunday, and your deployment isn't having any issue. all the resources it's on are happy. do you have any other deployments that are not exhibiting this behavior? also, do you have a VPN, just to try a separate route to our gateway? but yes, at least in the backend, no signs at all of overload or backpressure or slowness, so I'm suspecting some kind of network or routing issue from you to our servers. your network test definitely shows poor connectivity to our servers, but the infra itself seems completely fine, based on a bit of digging/telemetry we dug up yesterday
Gravitynomad
Gravitynomad3w ago
I tried VPN , yes...
jamwt
jamwtOP3w ago
what's the outcome of the network-test on that deployment (when you're on the VPN)? also, do you have other projects that are also having this issue for you, or is it isolated to one deployment
Gravitynomad
Gravitynomad3w ago
I have another deployment but it's just a simple one-page site with one query. nothing like my current project. It just totally fails.
jamwt
jamwtOP3w ago
and when you do the network-test with the simple site with one query, same thing? can you paste that one for comparison's sake?
Gravitynomad
Gravitynomad3w ago
I'll try on the other deployment. One sec..
Gravitynomad
Gravitynomad3w ago
yea.. this one fails too... it does get further though
jamwt
jamwtOP3w ago
yeah. that's a real bummer. but yeah, it makes me wonder if there's some kind of crazy network anomaly happening to you. one sec, some more diagnostic commands. are you on a pro plan, out of curiosity?
Gravitynomad
Gravitynomad3w ago
that would suck 😄 I mean.. I dont have any connectivity issues with anything else really
jamwt
jamwtOP3w ago
also, if you dont mind sharing, where are you in the world?
Gravitynomad
Gravitynomad3w ago
no.. not yet.. I'm moving my stuff out of supabase to see it working before I did
jamwt
jamwtOP3w ago
got it. the reason I was asking is custom domains use a completely different traffic layer -- cloudflare, which provides much better global POPs so we're not reliant on routing systems across the globe
Gravitynomad
Gravitynomad3w ago
If it makes any difference I'll be happy to be on the pro plan. understand
jamwt
jamwtOP3w ago
we've seen issues before when people are doing home runs to our gateway on us-east from lots of different places
Gravitynomad
Gravitynomad3w ago
and maybe me being in SE Asia doesnt help that much either 😄
jamwt
jamwtOP3w ago
ah yes. so that has 100% been where we've seen this issue. I have no idea why, but some providers there give us crazy routes to our gateway. we're going to work on getting better traffic infrastructure for normal happy-animal-123 instance names too, but it hasn't been done yet. custom domains have, because our bigger customers in production needed this solved now, as you can imagine
Gravitynomad
Gravitynomad3w ago
ok, understand. Well thanks Jamie, really appreciate your help, I understand that there might be an issue with this so I'll just call it a day for today, this really burned everything I had today and tomorrow I'll get into the pro sub and try again 🙂 So when I move into Pro, my deployments get migrated automatically?
jamwt
jamwtOP3w ago
unfortunately -- it is only implemented for custom domains right now, which you can set on your prod instance, but not dev. so this will make your production project work great, but doesn't fix your dev workflow (push, dashboard, etc). we're actually just starting a partnership with cloudflare to get all our traffic using their global POP infrastructure so we don't have this issue. the reputation of their IPs/network is obviously really solid everywhere in the world
Gravitynomad
Gravitynomad3w ago
ok. thanks, I understand... as long as my users get to be ok, I'm fine with it and I can live with some hiccups on the dev side. No worries. Will do some tests tomorrow and see how it goes. Thanks again Jamie.
jamwt
jamwtOP3w ago
sorry for the issue. I should have asked right away where you were, because we've seen this once or twice before.
Gravitynomad
Gravitynomad3w ago
It's ok. No worries. Thanks Jamie. 🙂
conradkoh
conradkoh3w ago
hey Jamie, I hope it's ok that I join the thread here. I've encountered this quite a number of times, and I'm here in Singapore too - SE Asia definitely. I wanted to ask if you think it is a gateway / network issue, because most of the connectivity tests pass, with the exception of the websockets one. also wondering regarding the SSE tests that always seem to pass - whether it would be a consideration to use SSE instead of websockets for the real time functionality so that it is more reliable.
conradkoh
conradkoh3w ago
I took a screen recording of the test when I encountered the issue. Often, the first 2 stages experience extremely slow speeds and pass, but at higher sizes they fail. Like in the video, the 1MB test will eventually fail but the lower 2 tests do pass after an extremely long duration.
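One possible reading of the pattern described above (small stages passing slowly, the 1MB stage failing) is that the link has very low throughput and each stage has a fixed time limit. The sketch below only illustrates that arithmetic. The throughput figure, the timeout, and the two smaller payload sizes are made-up assumptions, and nothing here reflects how npx convex network-test is actually implemented.

```python
def transfer_seconds(payload_bytes: int, throughput_bytes_per_s: float) -> float:
    """Time needed to move a payload at a constant throughput."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return payload_bytes / throughput_bytes_per_s


def stage_passes(payload_bytes: int, throughput_bytes_per_s: float, timeout_s: float) -> bool:
    """A stage passes only if the whole payload arrives before the timeout."""
    return transfer_seconds(payload_bytes, throughput_bytes_per_s) <= timeout_s


if __name__ == "__main__":
    # All numbers below are hypothetical, chosen only to show the shape of the problem.
    throughput = 8_000.0   # bytes per second on a badly routed link (assumed)
    timeout = 30.0         # seconds allowed per stage (assumed)
    for label, size in (("small", 10_000), ("medium", 100_000), ("1MB", 1_000_000)):
        seconds = transfer_seconds(size, throughput)
        verdict = "pass" if stage_passes(size, throughput, timeout) else "fail"
        print(f"{label}: {seconds:.1f}s -> {verdict}")
```

Under these assumed numbers, the smaller stages finish slowly but within the limit, while the largest one cannot, so a size-dependent failure does not by itself rule out a route-level problem.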
Ranga
Ranga3w ago
Wondering if it’s only for you or your users as well? Are the users able to use your app @conradkoh ? I want to start bootstrapping my app with convex.
conradkoh
conradkoh3w ago
Quite sure it was for everyone. I had a friend reaching out saying that his deployment was down as well. I tried on multiple computers and multiple networks and all of them were stuck in a loading state. To my knowledge this is the 3rd time that this has happened and the root cause is not known. I do know that during the last incident, I connected to a VPN and things worked fine. @Ranga here is a previous thread - https://discord.com/channels/1019350475847499849/1414556005030690858/1414556005030690858
jamwt
jamwtOP2w ago
@conradkoh hi! are you using custom domains?
conradkoh
conradkoh2w ago
hey @jamwt, nope i’m not! i’m on a pro plan but not custom domains. i read the thread above - would you say that the network infra is significantly different for the http endpoints vs the websocket one though?
jamwt
jamwtOP2w ago
no, but I think some routes don't allow long-lived WS in the region reliably. but if the backhaul is running over cloudflare's network, then everything is fine, since they obviously will allow it
conradkoh
conradkoh2w ago
oo i see. that makes sense. do you think there would be a case for using SSE as a fallback in the future? it feels like a real risk for users because when these things happen it’s almost entirely out of both our hands.
Ranga
Ranga2w ago
Interesting. I have seen a couple of people complaining about this. Mostly around Asia. Have you tried self hosting this and seeing the same issue?
conradkoh
conradkoh2w ago
nope i haven’t. but i don’t believe i’m more well versed than the convex team, so if they’re seeing issues i don’t think i’ll try to solve it on my own 😂
i’m going with my gut to say that it is one of the inherent problems of websockets over a large number of hops from the client to the server while trying to maintain a persistent connection - more expensive for all parties involved. so we in asia are more affected because more hops to the US based servers aka more chances of hitting a server that disrupts the connection..?
so i’m watching for
1. regional deployments
2. SSE
3. self-hosted (last resort - i’d much prefer a pay per use model)
i’m also thinking that a majority of users are not benefitting from the websockets that much because of convex’s pricing model where you are billed for each of those. so i feel that most people would be minimizing high frequency writes - which makes SSE make more sense..? i think. SSE is also more compatible with general web infra for HTTP from what i can tell.
but I’m sure there are good reasons for why they chose this and I don’t think I have as much expertise as the Convex team on the matter, so I’d defer to them to weigh in on the trade offs and the rationale.
jamwt
jamwtOP2w ago
SSE won't happen for sure, and websockets are used pretty pervasively for lots of applications. but most bigger infra companies control their backhaul, thus our need to do it. when the connections change hands through like 14 intermediaries or whatever on long routes, there's a lot that can go wrong
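To put a rough number on the "many intermediaries" point, here is a minimal sketch that assumes every hop independently has the same small chance of dropping a long-lived connection during some time window. The independence assumption and the 2% figure are purely illustrative and are not measurements from Convex or from this thread.

```python
def survival_probability(hops: int, per_hop_drop: float) -> float:
    """Chance a long-lived connection survives the window, assuming each of
    `hops` intermediaries independently drops it with probability `per_hop_drop`."""
    if hops < 0:
        raise ValueError("hops must be non-negative")
    if not 0.0 <= per_hop_drop <= 1.0:
        raise ValueError("per_hop_drop must be between 0 and 1")
    return (1.0 - per_hop_drop) ** hops


if __name__ == "__main__":
    # 0.02 is a made-up per-hop drop chance; 14 echoes the "like 14 intermediaries" remark.
    for hops in (4, 14):
        p = survival_probability(hops, per_hop_drop=0.02)
        print(f"{hops} hops: {p:.1%} chance the connection survives")
```

Under this simple model the survival chance shrinks geometrically with hop count, which fits the idea that fewer, better-controlled hops (for example an edge network close to the user) help long-lived websockets.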
conradkoh
conradkoh2w ago
thanks for the reply @jamwt appreciate it. i gave it a bit more thought and i recall that smaller packets were going through from the convex tester. so this would imply that it actually is not likely to be a connection level issue. so maybe it is a bad deployment + sticky session with a specific issue on larger payloads..? can’t be sure though. anyways thanks for the help 🙌🏼. the system is up and i guess we’ll find out more the next time it happens!
Gravitynomad
Gravitynomad2w ago
Wanted to confirm that using Custom Domains does, in fact, fix this issue. I really believe this is not a small problem and there should be a warning posted somewhere stating that websocket connections fail in more than one country in SE Asia for all deployments not using custom domains. (In the case of Vietnam, I can confirm it's the whole country regardless of the internet connection and city). This affects both admins and users, as well as server and also Convex dashboard connectivity. This is 100% resolved by moving to the Pro plan and configuring custom domains. Thanks Jamie for the help, and pointing me in the right direction. I'm happy it's resolved. 🙂
jamwt
jamwtOP2w ago
glad to hear it! yes, we'll move all traffic over to the cloudflare edge network soon
sqroot
sqroot2w ago
What is the definition of a "slow deployment"? We have a relatively new, small app and currently just making minor changes to code can take between 10 and 30 seconds:
-----
08:10:27 Convex functions ready! (26.95s)
-----
We are based in South Africa. This was before today's network issues. Before we make the switch to a local dev environment, can someone let me know how long we should expect for code to be pushed to the dev environments each time we change a file?
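For anyone who wants to report push times with more than one data point, here is a small sketch that pulls the durations out of saved terminal output. It assumes every relevant line has the same shape as the one quoted above, with the duration in seconds inside parentheses. Other formats the CLI might print are not handled, and the function name is hypothetical.

```python
import re
from statistics import median

# Matches lines shaped like the one quoted in the thread:
#   08:10:27 Convex functions ready! (26.95s)
# Assumption: the duration is always printed in seconds with an "s" suffix.
READY_LINE = re.compile(r"Convex functions ready! \((\d+(?:\.\d+)?)s\)")


def push_durations(log_text: str) -> list[float]:
    """Return every push duration (in seconds) found in the given log text."""
    return [float(match.group(1)) for match in READY_LINE.finditer(log_text)]


if __name__ == "__main__":
    sample = "08:10:27 Convex functions ready! (26.95s)\n"
    durations = push_durations(sample)
    if durations:
        print(f"pushes: {len(durations)}, median: {median(durations):.2f}s, max: {max(durations):.2f}s")
```

Sharing a median and a maximum over a day of pushes, along with your region, gives a clearer picture than a single timing.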
