magicseth · 16mo ago

Debugging occasional slowness

I'm seeing occasional lag when a child asks our system a question. In this screenshot, append_to_conversation inserts into a list that is being queried, and newHistory indicates that we received the update. That happened at 11:18:40. The server says the mutation took just a few ms, but it didn't log until 11:18:44:
9/25/2023, 11:18:44 PM · 19ms · success · Mutation · queries/ai_conversations/append_to_conversation · log 'childSpoke'
Usually it's nearly instant, but here it took 4 seconds to arrive at the server. Is there any instrumentation I can add to identify the lag? It happens rarely, but sometimes I see lags bigger than 20 seconds. I'm trying to track down the core issue.
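A minimal sketch of client-side timing that could narrow down when the request leaves the client, assuming the Convex React client (the wrapper and its logging are illustrative, not existing instrumentation):

```ts
import { useMutation } from "convex/react";
import { api } from "../convex/_generated/api";

// Hypothetical wrapper: timestamp the call on the client so the send time
// can be compared against the server-side execution log.
export function useTimedAppend() {
  const append = useMutation(api.queries.ai_conversations.append_to_conversation);
  return async (args: Parameters<typeof append>[0]) => {
    const sentAt = Date.now();
    console.log("append_to_conversation called at", new Date(sentAt).toISOString());
    await append(args);
    // If this round trip is seconds long while the server logged ~19ms of
    // execution, the time was spent in the network / socket queue.
    console.log("append_to_conversation round trip ms:", Date.now() - sentAt);
  };
}
```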
magicseth (OP) · 16mo ago
One potential issue: I am streaming LLM responses to my table, and my query is subscribed to it. It seems like on a slow connection we saturate the pipe and have to wait for EVERY intermediate state to be transferred to the client, because it is subscribed. That means we can't get other updates and are wasting a lot of time...
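A minimal sketch of that write pattern, assuming a Convex mutation called once per streamed chunk (the table and field names are illustrative, not the actual app code):

```ts
import { internalMutation } from "./_generated/server";
import { v } from "convex/values";

// Called once per streamed token/chunk. Every patch creates a new version of
// the document, and each subscribed client has to be sent (or catch up past)
// all of these intermediate states.
export const appendChunk = internalMutation({
  args: { messageId: v.id("messages"), chunk: v.string() },
  handler: async (ctx, { messageId, chunk }) => {
    const message = await ctx.db.get(messageId);
    if (message === null) return;
    await ctx.db.patch(messageId, { body: message.body + chunk });
  },
});
```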
presley · 16mo ago
Yes, your hypothesis makes a lot of sense. There are a lot of intermediate results (we don't necessarily have an update for every change), and on a slow connection that can lead to some lag while the data is being sent over the WebSocket. We recently made improvements that limit the size of the buffer in order to cap server memory usage, which also caps the maximum lag; previously it could grow unbounded. But in your case you probably don't hit the max buffer size, so some lag still exists. I will add metrics and look for a fix on our side.
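A possible application-level mitigation, sketched here as an assumption rather than a prescribed fix: batch streamed chunks and flush on an interval, so subscribers see a handful of document versions instead of one per token (the helper and interval are illustrative):

```ts
// Illustrative helper: accumulate streamed chunks and flush at most every
// FLUSH_MS via the caller-supplied flush function (e.g. a mutation that
// appends the batched text to the conversation document).
const FLUSH_MS = 250;

export async function streamWithBatching(
  stream: AsyncIterable<string>,
  flush: (text: string) => Promise<void>
): Promise<void> {
  let buffer = "";
  let lastFlush = Date.now();
  for await (const chunk of stream) {
    buffer += chunk;
    if (Date.now() - lastFlush >= FLUSH_MS) {
      await flush(buffer);
      buffer = "";
      lastFlush = Date.now();
    }
  }
  if (buffer !== "") await flush(buffer); // flush the final partial batch
}
```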
magicseth (OP) · 16mo ago
It doesn't make a ton of sense to me, though, why a request would take 4 seconds to hit your server. I can't be clogging the outbound queue. Thanks for explaining some of the server-side piece.
presley · 16mo ago
Ah, sorry, you are right, the intermediate response issue makes sense. I'm not sure why the server wouldn't receive a request for 4 to 20 seconds, since that's going in the other direction. Something is unclear from the original description:
and newHistory indicates that we received it
What is newHistory, and where was it received? How have you observed the 20-second lag? Have you verified that there is a 20-second difference between the client and server timestamps like above, or does the server response take 20 seconds? A few hypotheses I am trying to narrow down:
* The server receives the request and executes it quickly, but the response (the visible effect to the client) is delayed by the buffering / slow connection. This contradicts the timestamp delta you pasted above, although a couple of seconds might be clock skew; 20 seconds is not.
* The server receives the request but is busy executing a query (the query takes too long), which is why it does not process the mutation immediately.
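One way to rule clock skew in or out, sketched as an illustration rather than existing tooling: expose a trivial query that returns the server's clock and compare it to the client's (the function name and file are assumptions):

```ts
import { query } from "./_generated/server";

// Illustrative skew probe: returns the server's current wall-clock time.
export const serverNow = query({
  args: {},
  handler: async () => Date.now(),
});

// On the client (e.g. with a ConvexHttpClient), the offset can be estimated:
//   const t0 = Date.now();
//   const serverTime = await client.query(api.debug.serverNow, {});
//   const t1 = Date.now();
//   const skewMs = serverTime - (t0 + t1) / 2; // assumes symmetric latency
```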
magicseth (OP) · 16mo ago
newHistory is printed out when my callback receives the update on the frontend from the query subscription. I think the 20 seconds could be caused by the pipeline being clogged. I suspect the 4-second delay is real... I'm not sure how to find out when the client socket actually sends out the request...
presley · 16mo ago
As next steps, my plan is to add some metrics to measure the WebSocket buffer messages. Then I can look at them for your specific instances and try reducing the buffer, which I think will significantly reduce the max delay on slow connections. Please send me the related instance name(s).
magicseth (OP) · 16mo ago
Awesome @presley, thank you. opulent-goose-254. I can repro with you when you have logging in place.
presley · 15mo ago
Hi @magicseth, sorry about the delay here. I have added metrics so I can see the delay between when responses are generated and when they are sent. If you reproduce the slow behavior, I should be able to tell whether this metric spikes, to confirm our hypothesis above. If we confirm this is the root cause, I think I can lower the buffer size to make it go away. I tentatively plan to do that for everyone, but it would be great to confirm this was the issue you were seeing before I do.
magicseth (OP) · 15mo ago
Thanks @presley, I'll take a look. Do you want timestamps of when it happens?
presley · 15mo ago
I can look at the graph to see how much latency spikes, so I don't need exact timestamps. A rough hour/minute might be useful but isn't required.
magicseth (OP) · 15mo ago
@presley Maybe I just caused some backup?
presley · 15mo ago
Yeah, I see about a 2-second backup.
magicseth (OP) · 15mo ago
Awesome, I may have piled up a few of those, because I had a lot of delay.
presley · 15mo ago
Hmm... actually it seems some of your Node actions are just slow? I see some latencies between 2 and 8 seconds. If you see a ton of delay, could the slowness be from OpenAI responses?
magicseth (OP) · 15mo ago
I'm sending a transcription of what the kid says to Convex, and then I don't display it until you send it back to me. That round trip is taking 20+ seconds. There are a lot of concurrent inserts and updates for things on the "conversation list" from OpenAI updating every word, but it's delaying the kid's transcript round-tripping back to me.
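One way to avoid blocking the UI on that round trip is an optimistic update in the Convex React client, so the child's transcript shows locally as soon as it is sent and the subscription reconciles when the server write lands. A sketch, where the query name, argument shape, and history entry shape are assumptions about the app, not its real code:

```ts
import { useMutation } from "convex/react";
import { api } from "../convex/_generated/api";

// Illustrative hook: echo the child's line into the locally cached query
// result immediately; Convex replaces it with the server's version later.
export function useAppendWithLocalEcho() {
  return useMutation(
    api.queries.ai_conversations.append_to_conversation
  ).withOptimisticUpdate((localStore, args) => {
    const current = localStore.getQuery(api.queries.ai_conversations.get_history, {
      conversationId: args.conversationId,
    });
    if (current !== undefined) {
      localStore.setQuery(
        api.queries.ai_conversations.get_history,
        { conversationId: args.conversationId },
        [...current, { speaker: "child", text: args.text }]
      );
    }
  });
}
```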
presley · 15mo ago
I don't see the delay in the metric I expected, so there must be some other buffering happening.
