
gRPC server rarely returns UNAVAILABLE on seemingly successful requests #7602

Closed
Sovietaced opened this issue Sep 9, 2024 · 15 comments

@Sovietaced

Sovietaced commented Sep 9, 2024

What version of gRPC are you using?

1.64.1

What version of Go are you using (go version)?

1.22.1

What operating system (Linux, Windows, …) and version?

Linux

What is the issue?

In our production environment, our telemetry indicates that one of our gRPC servers returns code UNAVAILABLE about once per day. What is particularly interesting is that all evidence indicates the request completes processing through the middleware and the application layer but somehow fails in what seems to be the gRPC internals, outside of our control.

I say this because we have middleware which logs any error returned from the application layer, and we see no error logs. Interestingly, the metrics/tracing is done via OpenTelemetry, which is registered as a stats handler, and it does see the error. We front the gRPC server with nginx ingress, which reports an HTTP 499; that status appears to be reported when a client gives up on a request.
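For reference, the middleware is roughly equivalent to the following (a minimal sketch, not our exact code; the name `loggingUnaryInterceptor` is just illustrative). It only sees errors returned by the application handler, never the status the gRPC transport ultimately writes on the wire.

```go
package server

import (
	"context"
	"log"

	"google.golang.org/grpc"
)

// loggingUnaryInterceptor is a minimal sketch of the error-logging middleware.
// It only observes errors returned by the application handler; it never sees
// the status the gRPC transport ultimately writes on the wire.
func loggingUnaryInterceptor(
	ctx context.Context,
	req any,
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (any, error) {
	resp, err := handler(ctx, req)
	if err != nil {
		log.Printf("handler %s returned error: %v", info.FullMethod, err)
	}
	return resp, err
}
```

It is wired in with grpc.UnaryInterceptor(...), separate from the OpenTelemetry stats handler.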

Interestingly, the trace does seem to indicate that the client gives up on the request, since the span is 14ms on the client side but 340ms on the server side.

[Screenshot: trace showing a 14ms client-side span vs a 340ms server-side span]

I will follow up with verbose logging output if I find anything relevant.

@dfawley
Member

dfawley commented Sep 10, 2024

We probably need more information to be able to help here.

In our production environment, our telemetry indicates that one of our gRPC servers returns code UNAVAILABLE about once per day. What is particularly interesting is that all evidence indicates the request completes processing through the middleware and the application layer but somehow fails in what seems to be the gRPC internals, outside of our control.

What is your telemetry? You say it's otel -- is that google.golang.org/grpc/stats/opentelemetry or something else? On what side (client/server) is it reporting UNAVAILABLE? What about your middleware?

Interestingly, the trace does seem to indicate that the client gives up on the request, since the span is 14ms on the client side but 340ms on the server side.

Do you know what status the server is reporting for the RPC in this case? Is it a success? We'd really like to see some client-side debug logs here if you are able to get them. Possibly the client starts the RPC, then the connection is lost, but the server doesn't notice that until after it processes the RPC?

@Sovietaced
Author

Sovietaced commented Sep 10, 2024

What is your telemetry? You say it's otel -- is that google.golang.org/grpc/stats/opentelemetry or something else?

This is the stats handler implementation we are using: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/google.golang.org/grpc/otelgrpc/stats_handler.go
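We register it roughly like this (a sketch assuming default otelgrpc options; `newInstrumentedServer` is just an illustrative name):

```go
package server

import (
	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
)

// newInstrumentedServer shows how the otelgrpc stats handler is wired in.
// As a stats handler it observes the final RPC status recorded by the
// transport, which is why it can see codes our interceptors never return.
func newInstrumentedServer(opts ...grpc.ServerOption) *grpc.Server {
	opts = append(opts, grpc.StatsHandler(otelgrpc.NewServerHandler()))
	return grpc.NewServer(opts...)
}
```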

Do you know what status the server is reporting for the RPC in this case?

The server reports UNAVAILABLE, but the server-side middleware does not see any errors. So yeah, it is possible that by the time the server tries to write the response, the connection is gone. It's just a bit annoying for us since we alert on a few status codes (INTERNAL, UNAVAILABLE) and this alert fires off once a day.

I just confirmed that the client side is getting CANCELLED due to the context timing out or being cancelled. It seems strange to me that this results in UNAVAILABLE on the server side though.

We are running with full logging enabled on the server side; we can do the same on the client side if the logs are not helpful the next time this issue occurs.
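For reference, "full logging" here means the standard grpc-go knobs (GRPC_GO_LOG_SEVERITY_LEVEL=info, GRPC_GO_LOG_VERBOSITY_LEVEL=99), roughly the programmatic equivalent of the sketch below:

```go
package logging

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

// init enables verbose grpc-go logging programmatically. The same effect can
// be achieved with the environment variables above; both routes produce the
// [core] and [transport] log lines referenced later in this thread.
func init() {
	// Info and warnings to stdout, errors to stderr, verbosity level 99.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2WithVerbosity(os.Stdout, os.Stdout, os.Stderr, 99))
}
```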

@dfawley dfawley assigned arjan-bal and unassigned Sovietaced Sep 10, 2024
@purnesh42H purnesh42H added the Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. label Sep 12, 2024
@arjan-bal arjan-bal added Area: Server Includes Server, Streams and Server Options. and removed Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. labels Sep 12, 2024
@arjan-bal
Contributor

arjan-bal commented Sep 12, 2024

@Sovietaced, based on the fact that the nginx ingress reports an HTTP 499, can you confirm whether the client is closing the transport (and not just cancelling the request context) before the server sends back a response?
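In other words, these are two different client-side behaviours that can look similar from the server (a rough sketch only; `pb.DemoClient`, its `Demo` method, and the import path are hypothetical):

```go
package client

import (
	"context"
	"time"

	"google.golang.org/grpc"

	pb "example.com/demo/pb" // hypothetical generated package
)

// cancelVsClose contrasts the two failure modes. Sketch only.
func cancelVsClose(conn *grpc.ClientConn, client pb.DemoClient, req *pb.DemoRequest) {
	// 1) Cancel only the RPC context: the individual stream is reset, but the
	//    underlying connection stays open.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	_, _ = client.Demo(ctx, req)

	// 2) Close the ClientConn: the whole transport is torn down, which is when
	//    the server can fail with "transport is closing" while writing the
	//    final status.
	_ = conn.Close()
}
```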

@dfawley
Member

dfawley commented Sep 12, 2024

This is the stats handler implementation we are using: open-telemetry/opentelemetry-go-contrib@main/instrumentation/google.golang.org/grpc/otelgrpc/stats_handler.go

We provide one as I described earlier, which we recommend using instead. I hope you understand that we cannot support this one contributed to the otel repo.

It seems strange to me that this results in UNAVAILABLE on the server side though.

Server-side status codes mostly only make sense if the server actually responds to the RPC. If our library is converting client cancelation into UNAVAILABLE errors, we'd need some help determining where that's happening (ideally in the form of some minimal reproduction code).

There's also always the potential for races where the server responds with a status code but the client cancels the RPC before it receives that response.
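A minimal reproduction along these lines would help (sketch only; `pb.NewEchoClient`, `pb.EchoRequest`, and the import path are hypothetical generated names): a client deadline shorter than the server handler's processing time, logging the code the client actually observes.

```go
package repro

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/status"

	pb "example.com/echo/pb" // hypothetical generated package
)

// callWithShortDeadline triggers the suspected race: the client gives up
// after ~15ms while the server handler takes ~300ms, so the server may
// complete its handler and only then fail to write the final status.
func callWithShortDeadline(conn *grpc.ClientConn) {
	client := pb.NewEchoClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Millisecond)
	defer cancel()

	_, err := client.Echo(ctx, &pb.EchoRequest{})
	// status.Code returns codes.OK for a nil error, otherwise the code carried
	// by the RPC error, i.e. the code the client actually observed.
	log.Printf("client observed: %v", status.Code(err))
}
```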

@vtolstov

I have the same error, but in my case the frontend is grpc-java calling some endpoint.
My repro case:

  1. Send a first request to the endpoint with invalid data, so my backend returns a gRPC error like Unauthorized.
  2. Immediately send a second request to the same endpoint with valid data; my backend responds with code 0 (OK), but the nginx ingress logs a 499 status code and grpc-java receives a gRPC error.

@Sovietaced
Author

We provide one as I described earlier, which we recommend using instead. I hope you understand that we cannot support this one contributed to the otel repo.

I understand; however, if you look at the stats_handler in the OTEL repo, it seems to merely observe the status codes passed to it, so I don't think the stats_handler is the issue.

There's also always the potential for races where the server responds with a status code but the client cancels the RPC before it receives that response.

Yeah, given that this seems to happen only once a day, I assume this is some weird race. Coincidentally, now that I have enabled the additional logging environment variables, it hasn't reproduced yet :)

@Sovietaced
Author

Sovietaced commented Sep 12, 2024

Alright. The issue just happened twice and I have server-side logs.


2024/09/12 22:16:06 WARNING: [core] [Server #6]grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2024/09/12 22:16:06 INFO: [transport] [server-transport 0xc0002f0c00] loopyWriter exiting with error: transport closed by client
2024/09/12 22:16:06 INFO: [transport] [server-transport 0xc0002f0c00] Closing: EOF

From what I can see, this log line is unique to the issue:

[Server #6]grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"

The client in this case is a server making a call to an auth service (prior to handling the incoming RPC request). Our error-logging middleware logs the following:

{"time":"2024-09-12T22:16:06.629011334Z","level":"ERROR","source":{"<redacted>","file":"<redacted>","line":38},"msg":"encountered unexpected error: failed to call auth service: rpc error: code = Canceled desc = context canceled","error_message":"failed to call auth service: rpc error: code = Canceled desc = context canceled","method":"<redacted>"}

The client stats_handler reports status code 1 (UNKNOWN), which is strange. I'll have to look into that behavior more.
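For context, the auth call is made with the incoming request's context, roughly like this (sketch only; `authpb.AuthClient`, `CheckRequest`, and the import path are hypothetical), which is why a caller disconnect surfaces as "context canceled" on that call:

```go
package server

import (
	"context"
	"fmt"

	"google.golang.org/grpc"

	authpb "example.com/auth/pb" // hypothetical generated package
)

// authUnaryInterceptor calls the auth service before invoking the handler.
// Because the incoming request context is propagated, a client disconnect
// cancels the outbound auth call, producing the
// "rpc error: code = Canceled desc = context canceled" seen in the log above.
func authUnaryInterceptor(authClient authpb.AuthClient) grpc.UnaryServerInterceptor {
	return func(
		ctx context.Context,
		req any,
		info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler,
	) (any, error) {
		if _, err := authClient.Check(ctx, &authpb.CheckRequest{}); err != nil {
			return nil, fmt.Errorf("failed to call auth service: %w", err)
		}
		return handler(ctx, req)
	}
}
```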

@dfawley
Member

dfawley commented Sep 13, 2024

From your logs it sounds like the server attempted to write a status of OK, but because the connection was lost before it could be written, the client sees UNAVAILABLE instead. If that's what is happening, then that's to be expected, and there's not really anything else that we could do in these situations.

@Sovietaced
Author

From your logs it sounds like the server attempted to write a status of OK, but because the connection was lost before it could be written, the client sees UNAVAILABLE instead. If that's what is happening, then that's to be expected, and there's not really anything else that we could do in these situations.

The server is reporting UNAVAILABLE and the clients are reporting UNKNOWN.

@dfawley
Member

dfawley commented Sep 13, 2024

Thanks for the correction. In that case we'd want to see some logs on the client to determine what's going on. It's possible something else happened, like the headers were received but the data was cut off due to the connection loss, and then the codec errored while decoding it (just a guess). Since we observed a connection loss, it's not surprising that there are differences between the client and server, but there could always be a bug in the code we're returning.

@Sovietaced
Author

Since we observed a connection loss, it's not surprising that there are differences between the client and server, but there could always be a bug in the code we're returning.

Alright. I will update all of our clients with the increased logging to see what we can find.

github-actions bot commented Sep 19, 2024

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Sep 19, 2024
@Sovietaced
Author

Sovietaced commented Sep 20, 2024

The issue reproduced today, but unfortunately it was on one of the client services that didn't have the more verbose logging enabled. I will say that we have removed nginx-ingress from the code path, so it is no longer a factor and this traffic is now cluster-local.

Also, during this last reproduction it appears that the server side ran into the issue while the client side was panicking and shutting down. Since the client panicked, I cannot get any useful logs.

[Screenshots: server-side error and client-side panic during the reproduction]

@dfawley
Member

dfawley commented Sep 25, 2024

I still don't believe there's enough to go on.


github-actions bot commented Oct 1, 2024

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Oct 1, 2024
@github-actions github-actions bot closed this as not planned Oct 8, 2024