gRPC server rarely returns UNAVAILABLE on seemingly successful requests #7602
We probably need more information to be able to help here.
What is your telemetry? You say it's otel -- is that the stats handler contributed to the otel repo, or the OpenTelemetry instrumentation that grpc-go itself provides?
Do you know what status the server is reporting for the RPC in this case? Is it a success? We'd really like to see some client-side debug logs here if you are able to get them. Possibly the client starts the RPC, then the connection is lost, but the server doesn't notice that until after it processes the RPC?
This is the stats handler implementation we are using: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/google.golang.org/grpc/otelgrpc/stats_handler.go
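For reference, this is roughly how it is wired up on our side — a minimal sketch assuming the otelgrpc NewServerHandler/NewClientHandler API; the target address is illustrative:

```go
package main

import (
	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Server side: the stats handler observes every RPC, including the final
	// status the gRPC internals record, not just what the application handler returns.
	srv := grpc.NewServer(
		grpc.StatsHandler(otelgrpc.NewServerHandler()),
	)
	defer srv.Stop()

	// Client side: the equivalent handler records the status the client observes,
	// which can differ from the server's view if the connection is lost mid-RPC.
	conn, err := grpc.NewClient("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```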
The server reports UNAVAILABLE, but the server-side middleware does not see any errors. So yeah, it is possible that by the time the server tries to write the response, the connection is gone. It's just a bit annoying for us since we alert on a few status codes (INTERNAL, UNAVAILABLE) and this alert fires off once a day. I just confirmed that the client side is seeing the error as well. We are running with full logging enabled on the server side; we can do the same on the client side if the logs are not helpful the next time this issue occurs.
@Sovietaced, based on the fact that the nginx ingress reports an HTTP 499, can you confirm whether the client is closing the transport (and not just cancelling the request context) before the server sends back a response?
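To make the distinction concrete, here is a minimal sketch of the two client behaviors in question (the target, method name, and messages are illustrative, not taken from your setup):

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/protobuf/types/known/emptypb"
)

func main() {
	conn, err := grpc.NewClient("backend:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}

	// Case A: cancel only the RPC's context. The underlying HTTP/2 connection stays
	// up; the server sees the cancellation for that single stream.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	err = conn.Invoke(ctx, "/example.Service/Method", &emptypb.Empty{}, &emptypb.Empty{})
	cancel()
	_ = err // typically codes.DeadlineExceeded or codes.Canceled on the client

	// Case B: close the whole ClientConn. The transport is torn down, so a status
	// the server writes afterwards never reaches the client.
	conn.Close()
}
```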
We provide one, as I described earlier, which we recommend using instead. I hope you understand that we cannot support the one contributed to the otel repo.
Server-side status codes mostly only make sense if the server actually responds to the RPC. If our library is converting client cancellation into UNAVAILABLE errors, we'd need some help determining where that's happening (ideally in the form of some minimal reproduction code). There are also always potential races where the server responds with a status code but the client cancels the RPC before it receives that response.
I have the same error, but in my case the frontend is a grpc-java client that calls some endpoint.
I understand. However, if you look at the stats_handler in the OTel repo, it seems to merely observe status codes as they are passed to it, so I don't think the stats_handler is the issue.
Yeah, given that this seems to happen only once a day, I assume this is some weird race. Coincidentally, now that I have enabled the additional logging environment variables, it hasn't reproduced yet :)
Alright. The issue just happened twice and I have server-side logs.
It seems that this log line is unique to the issue, from what I can see.
The client in this case is a server making a call to an auth service (prior to handling the incoming RPC request). Our error-logging middleware logs the following.
The client stats_handler reports status code 1 (CANCELLED), which is strange. I'll have to look into that behavior more.
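For reference, this is roughly how I am mapping the client-side error to a code — a small sketch; the report helper is illustrative, but the numeric values are gRPC's standard code mapping:

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// report prints the gRPC code extracted from an RPC error. For reference, the
// standard numeric mapping includes: OK=0, Canceled=1, Unknown=2, Unavailable=14.
func report(err error) {
	c := status.Code(err) // codes.OK for nil, codes.Unknown for non-status errors
	fmt.Printf("client observed code %d (%s)\n", uint32(c), c)
}

func main() {
	report(status.Error(codes.Unavailable, "connection lost before response"))
}
```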
From your logs it sounds like the server attempted to write a status of OK, but because the connection was lost before it could be written, the client sees UNAVAILABLE instead. If that's what is happening, then that's to be expected, and there's not really anything else that we could do in these situations.
The server is reporting UNAVAILABLE, not OK.
Thanks for the correction. In that case we'd want to see some logs on the client to determine what's going on. It's possible something else happened, like the headers were received but the data was cut off due to the connection loss, and then the codec errored while decoding it (just a guess). Since we observed a connection loss, it's not surprising that there are differences between the client and server, but there could always be a bug in the code we're returning.
Alright. I will update all of our clients with the increased logging to see what we can find.
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
I don't believe there's enough to go on, still.
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
What version of gRPC are you using?
1.64.1
What version of Go are you using (`go version`)?
1.22.1
What operating system (Linux, Windows, …) and version?
Linux
What is the issue?
In our production environment, our telemetry indicates that one of our gRPC servers returns code UNAVAILABLE about once per day. What is particularly interesting is that all evidence indicates the request completes processing through the middleware and the application layer, but it somehow fails in what seems to be the gRPC internals, outside our control.
I say this because we have middleware which will log if any error is returned from the application layer, and we see no error logs. Interestingly, the metrics/tracing is done via OpenTelemetry, registered as a stats handler, which does see the error. We front the gRPC server with an nginx ingress, and it reports an HTTP 499, which appears to be reported when a client gives up on a request. The middleware is sketched below.
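For context, the error-logging middleware is shaped roughly like the following simplified sketch (not our exact code; the logger call is illustrative). It can only observe errors returned by the application handler, which is why a failure that happens later, when gRPC writes the response, would not show up in these logs:

```go
package middleware

import (
	"context"
	"log"

	"google.golang.org/grpc"
)

// errorLoggingUnaryInterceptor logs any error returned by the application handler.
// Failures that happen after this returns (for example, gRPC failing to write the
// response because the client connection is gone) never pass through here.
func errorLoggingUnaryInterceptor(
	ctx context.Context,
	req any,
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (any, error) {
	resp, err := handler(ctx, req)
	if err != nil {
		log.Printf("rpc %s failed: %v", info.FullMethod, err)
	}
	return resp, err
}

// Registered on the server alongside the OpenTelemetry stats handler, e.g.:
//
//	grpc.NewServer(
//		grpc.ChainUnaryInterceptor(errorLoggingUnaryInterceptor),
//		grpc.StatsHandler(otelgrpc.NewServerHandler()),
//	)
```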
Interestingly, the trace does seem to indicate that the client gives up on the request, since the span is 14ms on the client side but 340ms on the server side.
I will follow up with verbose logging output if I find anything relevant.