gRPC server rarely returns UNAVAILABLE on seemingly successful requests #7602
We probably need more information to be able to help here.
What is your telemetry? You say it's otel -- is that the stats handler contributed to the otel repo, or the OpenTelemetry instrumentation that grpc-go itself provides?
Do you know what status the server is reporting for the RPC in this case? Is it a success? We'd really like to see some client-side debug logs here if you are able to get them. Possibly the client starts the RPC, then the connection is lost, but the server doesn't notice that until after it processes the RPC?
This is the stats handler implementation we are using: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/google.golang.org/grpc/otelgrpc/stats_handler.go
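For reference, this is roughly how it is wired up on our side — a minimal sketch assuming the otelgrpc NewServerHandler/NewClientHandler API; the target address is illustrative:

```go
package main

import (
	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Server side: the stats handler observes every RPC, including the final
	// status the gRPC internals record, not just what the application handler returns.
	srv := grpc.NewServer(
		grpc.StatsHandler(otelgrpc.NewServerHandler()),
	)
	defer srv.Stop()

	// Client side: the equivalent handler records the status the client observes,
	// which can differ from the server's view if the connection is lost mid-RPC.
	conn, err := grpc.NewClient("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```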
The server reports UNAVAILABLE, but the server-side middleware does not see any errors. So yeah, it is possible that by the time the server tries to write the response, the connection is gone. It's just a bit annoying for us since we alert on a few status codes (INTERNAL, UNAVAILABLE) and this alert fires off once a day. I just confirmed that the client side is seeing the error as well. We are running with full logging enabled on the server side; we can do the same on the client side if the logs are not helpful the next time this issue occurs.
@Sovietaced, based on the fact that the nginx ingress reports an HTTP 499, can you confirm whether the client is closing the transport (and not just cancelling the request context) before the server sends back a response?
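To make the distinction concrete, here is a minimal sketch of the two client behaviors in question (the target, method name, and messages are illustrative, not taken from your setup):

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/protobuf/types/known/emptypb"
)

func main() {
	conn, err := grpc.NewClient("backend:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}

	// Case A: cancel only the RPC's context. The underlying HTTP/2 connection stays
	// up; the server sees the cancellation for that single stream.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	err = conn.Invoke(ctx, "/example.Service/Method", &emptypb.Empty{}, &emptypb.Empty{})
	cancel()
	_ = err // typically codes.DeadlineExceeded or codes.Canceled on the client

	// Case B: close the whole ClientConn. The transport is torn down, so a status
	// the server writes afterwards never reaches the client.
	conn.Close()
}
```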
We provide one, as I described earlier, which we recommend using instead. I hope you understand that we cannot support the one contributed to the otel repo.
Server-side status codes mostly only make sense if the server actually responds to the RPC. If our library is converting client cancellation into UNAVAILABLE errors, we'd need some help determining where that's happening (ideally in the form of some minimal reproduction code). There are also always potential races where the server responds with a status code but the client cancels the RPC before it receives that response.
I have the same error, but in my case the frontend is a grpc-java client that calls some endpoint.
I understand. However, if you look at the stats_handler in the OTel repo, it seems to merely observe status codes as they are passed to it, so I don't think the stats_handler is the issue.
Yeah, given that this seems to happen only once a day, I assume this is some weird race. Coincidentally, now that I have enabled the additional logging environment variables, it hasn't reproduced yet :)
Alright. The issue just happened twice and I have server-side logs.
It seems that this log line is unique to the issue, from what I can see.
The client in this case is a server making a call to an auth service (prior to handling the incoming RPC request). Our error-logging middleware logs the following.
The client stats_handler reports status code 1 (CANCELLED), which is strange. I'll have to look into that behavior more.
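For reference, this is roughly how I am mapping the client-side error to a code — a small sketch; the report helper is illustrative, but the numeric values are gRPC's standard code mapping:

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// report prints the gRPC code extracted from an RPC error. For reference, the
// standard numeric mapping includes: OK=0, Canceled=1, Unknown=2, Unavailable=14.
func report(err error) {
	c := status.Code(err) // codes.OK for nil, codes.Unknown for non-status errors
	fmt.Printf("client observed code %d (%s)\n", uint32(c), c)
}

func main() {
	report(status.Error(codes.Unavailable, "connection lost before response"))
}
```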
From your logs it sounds like the server attempted to write a status of OK, but because the connection was lost before it could be written, the client sees UNAVAILABLE instead. If that's what is happening, then that's to be expected, and there's not really anything else that we could do in these situations.
The server is reporting UNAVAILABLE, not OK.
Thanks for the correction. In that case we'd want to see some logs on the client to determine what's going on. It's possible something else happened, like the headers were received but the data was cut off due to the connection loss, and then the codec errored while decoding it (just a guess). Since we observed a connection loss, it's not surprising that there are differences between the client and server, but there could always be a bug in the code we're returning.
Alright. I will update all of our clients with the increased logging to see what we can find.
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
I don't believe there's enough to go on, still.
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
What version of gRPC are you using?
1.64.1
What version of Go are you using (`go version`)?
1.22.1
What operating system (Linux, Windows, …) and version?
Linux
What is the issue?
In our production environment, our telemetry indicates that one of our gRPC servers returns code UNAVAILABLE about once per day. What is particularly interesting is that all evidence indicates the request completes processing through the middleware and the application layer, but it somehow fails in what seems to be the gRPC internals, outside our control.
I say this because we have middleware which will log if any error is returned from the application layer, and we see no error logs. Interestingly, the metrics/tracing is done via OpenTelemetry, registered as a stats handler, which does see the error. We front the gRPC server with an nginx ingress, and it reports an HTTP 499, which appears to be reported when a client gives up on a request. The middleware is sketched below.
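For context, the error-logging middleware is shaped roughly like the following simplified sketch (not our exact code; the logger call is illustrative). It can only observe errors returned by the application handler, which is why a failure that happens later, when gRPC writes the response, would not show up in these logs:

```go
package middleware

import (
	"context"
	"log"

	"google.golang.org/grpc"
)

// errorLoggingUnaryInterceptor logs any error returned by the application handler.
// Failures that happen after this returns (for example, gRPC failing to write the
// response because the client connection is gone) never pass through here.
func errorLoggingUnaryInterceptor(
	ctx context.Context,
	req any,
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (any, error) {
	resp, err := handler(ctx, req)
	if err != nil {
		log.Printf("rpc %s failed: %v", info.FullMethod, err)
	}
	return resp, err
}

// Registered on the server alongside the OpenTelemetry stats handler, e.g.:
//
//	grpc.NewServer(
//		grpc.ChainUnaryInterceptor(errorLoggingUnaryInterceptor),
//		grpc.StatsHandler(otelgrpc.NewServerHandler()),
//	)
```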
Interestingly, the trace does seem to indicate that the client gives up on the request, since the span is 14ms on the client side but 340ms on the server side.
I will follow up with verbose logging output if I find anything relevant.