Debugging occasional slowness
I'm seeing occasional lag when a child asks our system a question.
In this screenshot, append_to_conversation inserts into a list that is being queried, and newHistory indicates that the client received the update.
That happened at 11:18:40
The server says the request took just a few ms, but it didn't log it until 11:18:44.
Usually it's nearly instant, but here it took 4 seconds for the request to arrive at the server. Is there any instrumentation I can add to identify the source of the lag?
It happens rarely, but sometimes I see lags bigger than 20 seconds.
Trying to track down the core issue.
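One thing I could try: tag each request with a client-side timestamp so the server can log the transport delay on receipt (clocks can be skewed, so the delta is only approximate). A minimal sketch; all names here are hypothetical, not Convex APIs:

```typescript
// Hypothetical sketch: attach a client timestamp to each outgoing request so
// the server can log the approximate transport delay when it arrives.
type TimedRequest = { payload: string; clientSentAt: number };

function tagRequest(payload: string, now: number = Date.now()): TimedRequest {
  return { payload, clientSentAt: now };
}

// Server side: compare the client's timestamp with the receipt time. The
// result includes clock skew, so treat small values as noise.
function transportLagMs(req: TimedRequest, serverReceivedAt: number): number {
  return serverReceivedAt - req.clientSentAt;
}
```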
One potential issue: I am streaming LLM responses to my table, and my query is subscribed to it. It seems like on a slow connection we saturate the pipe and have to wait for EVERY intermediate state to be transferred to the client, because it is subscribed. This means we can't get other updates and are wasting a lot of time...
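If that's the cause, one mitigation on my side might be to coalesce the streamed tokens so the table sees at most one write per interval instead of one per token. A rough sketch (hypothetical helper, not Convex's API):

```typescript
// Hypothetical sketch: coalesce streaming token updates so subscribers see at
// most one write per interval instead of one per token.
class CoalescedWriter {
  private pending: string | null = null;
  private lastFlush = -Infinity;

  constructor(
    private flush: (text: string) => void,
    private intervalMs: number,
  ) {}

  // Called on every streamed token; only flushes if the interval has elapsed.
  write(text: string, now: number): void {
    this.pending = text;
    if (now - this.lastFlush >= this.intervalMs) {
      this.flush(this.pending);
      this.pending = null;
      this.lastFlush = now;
    }
  }

  // Call at stream end so the final state always reaches the table.
  end(now: number): void {
    if (this.pending !== null) {
      this.flush(this.pending);
      this.pending = null;
      this.lastFlush = now;
    }
  }
}
```

With a 100ms interval, a ten-token burst becomes a couple of writes instead of ten, which should cut down the intermediate states queued on the socket.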
Yes, your hypothesis makes a lot of sense. There are a lot of intermediate results (we don't necessarily send an update for every change), and on a slow connection that can lead to some lag while the data is being sent over the web socket. We recently made improvements to limit the size of the buffer in order to limit server memory usage, which also caps the max lag. Previously it could grow unbounded; now it is limited, so there is a bound on the max lag.
But in your case, you probably don't hit the max buffer size, so some lag still exists. I will add metrics and look for a fix on our side.
It doesn't make a ton of sense to me, though, why a request would take 4 seconds to hit your server
I can't be clogging the outbound queue
Thanks for explaining some of the server side piece
Ah, sorry, you are right, the intermediate response issue makes sense. But I'm not sure why the server wouldn't receive a request for 4 to 20 seconds, since requests travel in the other direction.
Something is unclear from the original description: what is newHistory, and where was it received?
How have you observed the 20-second lag? Have you verified that there is a 20-second difference between the client and server timestamps like above, or that the server response takes 20 seconds? A few hypotheses I am trying to narrow down:
* The server receives the request and executes it pretty quickly (this contradicts the timestamp delta you pasted above, but a couple of seconds might be clock skew; 20 seconds likely is not), but then the response (the effect visible to the client) is delayed due to buffering on the slow connection.
* The server receives the request, but is busy executing a query (the query takes too long), which is why it does not process the request immediately.
newHistory is printed out when my callback receives the update on the frontend from the query subscription
I think the 20 seconds could be caused by the pipeline being clogged
I suspect the 4-second delay is real... I'm not sure how to find out when the client socket actually sends out the request...
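To see when the request actually leaves the client, one option is to wrap the socket's send() and also record bufferedAmount, which shows bytes queued locally but not yet on the wire. A sketch under the assumption that I can get at the underlying socket (the names are made up):

```typescript
// Hypothetical sketch: wrap send() to log when the client hands a message to
// the socket, plus how many bytes were already queued (a backpressure signal).
interface SendableSocket {
  send(data: string): void;
  bufferedAmount?: number;
}

function instrumentSend(
  socket: SendableSocket,
  log: (entry: { sentAt: number; buffered: number }) => void,
): SendableSocket {
  const originalSend = socket.send.bind(socket);
  return {
    ...socket,
    send(data: string) {
      // buffered > 0 here means earlier messages are still waiting locally
      log({ sentAt: Date.now(), buffered: socket.bufferedAmount ?? 0 });
      originalSend(data);
    },
  };
}
```

If the log shows a large buffered value right before the slow request, the 4 seconds would be spent on the client side, not in transit.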
As next steps here, my plan is to add some metrics measuring the messages buffered on the web socket. Then I can look at them for your specific instance(s), and try reducing the buffer size, which I think will significantly reduce the max delay on slow connections. Please send me the relevant instance name(s).
awesome @presley thank you.
opulent-goose-254
I can repro with you when you have logging in place.
Hi @magicseth, sorry about the delay here. I have added metrics so I can see the delay between when responses are generated and when they are sent. If you reproduce the slowness, I should be able to tell whether this metric rises, to confirm our hypothesis above.
If we confirm this is the root cause, I think I can lower the buffer size to make it go away. I tentatively plan to do that for everyone, but it would be great to confirm this was the issue you were seeing before I do.
Thanks @presley I'll take a look. Do you want timestamps of when it happens?
I can look at the graph to see how much latency spikes, so I don't need exact timestamps. A rough hour/minute might be useful but isn't required.
@presley Maybe I just caused some backup?
Yeah, I see about a 2-second backup.
awesome, I may have piled up a few of those, because I had a lot of delay
Hmm... actually, it seems some of your Node actions are just slow? I see latencies between 2 and 8 seconds. If you see a ton of delay, could the slowness be from OpenAI responses?
I'm sending a transcription of what the kid says to convex
then I don't display it until you send it back to me
that is taking 20+ seconds
there are a lot of concurrent inserts and updates
for things on the "conversation list"
from OpenAI updating every word
but it's delaying the kid's transcript from round-tripping back to me
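One workaround I'm considering: show the kid's transcript optimistically from local state instead of waiting for the round trip, and reconcile when the subscription catches up. A toy sketch (mergeTranscript is a made-up helper, not anything in my codebase yet; it assumes transcript lines are unique strings for simplicity):

```typescript
// Hypothetical sketch: merge the server-confirmed history with lines the
// client has sent but not yet seen echoed back, so the UI never waits on the
// round trip through the busy pipeline.
function mergeTranscript(
  serverHistory: string[],
  pendingLocal: string[],
): string[] {
  const unconfirmed = pendingLocal.filter(
    (line) => !serverHistory.includes(line),
  );
  return [...serverHistory, ...unconfirmed];
}
```

That way the 20+ second lag would only affect when the server-confirmed copy lands, not what the kid sees.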
I don't see the delay in the metric I expected, so there must be some other buffering happening.