gRPC-JS client hangs when target server restarts #2785

Open
sheinbergon opened this issue Jul 4, 2024 · 1 comment

sheinbergon commented Jul 4, 2024

Problem description

The gRPC-JS client freezes on server disconnections, requiring a full client restart.

Reproduction steps

We have a server (its API is gRPC, built with malijs), let's call it A, which calls a gRPC target proxied by an NLB and an nginx gRPC ingress, let's call it B. If we restart the target instances of B, the client sometimes hangs, leaving application A unable to function: every flow into A that requires A to call B hangs.

The moment we restart A, everything works again, which suggests this is a client-side state issue.

Environment

  • Debian 11/12 based containers
  • Node version: 18.20.x and 22.2.x
  • NVM-based installation
  • gRPC-JS 1.9.15
  • Client options: { 'grpc.keepalive_time_ms': 10_000, 'grpc.max_receive_message_length': 64 * 1024 * 1024 }
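
For reference, here's a minimal sketch of how such a client might be constructed with the options above; the proto loading, service name, and use of TLS credentials are illustrative assumptions, not taken from our codebase:

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Hypothetical proto file and service name; only the channel options
// below are the ones from this report.
const packageDefinition = protoLoader.loadSync('service_b.proto');
const proto = grpc.loadPackageDefinition(packageDefinition);

const client = new proto.serviceb.ServiceB(
  'service-b.internal.com:443',
  grpc.credentials.createSsl(),
  {
    'grpc.keepalive_time_ms': 10_000,
    'grpc.max_receive_message_length': 64 * 1024 * 1024,
  }
);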

Additional context

We've activated gRPC tracing on server A.
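
(For context: grpc-js reads the GRPC_VERBOSITY and GRPC_TRACE environment variables when the library is loaded. A sketch of enabling the tracers that appear in the logs below, assuming they are set in code rather than in the container environment:)

// Must run before '@grpc/grpc-js' is first required; the tracer list
// is read when the library loads. Tracer names match the log lines below.
process.env.GRPC_VERBOSITY = 'DEBUG';
process.env.GRPC_TRACE =
  'channel,subchannel,subchannel_refcount,pick_first,' +
  'resolving_load_balancer,connectivity_state,dns_resolver';
const grpc = require('@grpc/grpc-js');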

Here's what a normal server selection trace looks like:

D 2024-07-04T09:37:26.487Z | v1.9.15 29 | subchannel | (157) 10.41.10.107:443 READY -> IDLE
D 2024-07-04T09:37:26.487Z | v1.9.15 29 | subchannel_refcount | (157) 10.41.10.107:443 refcount 2 -> 1
D 2024-07-04T09:37:26.487Z | v1.9.15 29 | pick_first | READY -> IDLE
D 2024-07-04T09:37:26.487Z | v1.9.15 29 | resolving_load_balancer | dns:service-b.internal.com:443 READY -> IDLE
D 2024-07-04T09:37:26.487Z | v1.9.15 29 | connectivity_state | (2) dns:service-b.internal.com:443 READY -> IDLE
D 2024-07-04T09:37:26.487Z | v1.9.15 29 | dns_resolver | Looking up DNS hostname service-b.internal.com
D 2024-07-04T09:37:26.487Z | v1.9.15 29 | resolving_load_balancer | dns:service-b.internal.com:443 IDLE -> CONNECTING
D 2024-07-04T09:37:26.487Z | v1.9.15 29 | connectivity_state | (2) dns:service-b.internal.com:443 IDLE -> CONNECTING
D 2024-07-04T09:37:26.492Z | v1.9.15 29 | dns_resolver | Resolved addresses for target dns:service-b.internal.com:443: [10.41.33.32:443,10.41.22.168:443,10.41.10.107:443]
D 2024-07-04T09:37:26.492Z | v1.9.15 29 | subchannel | (202) 10.41.33.32:443 Subchannel constructed with options {}
D 2024-07-04T09:37:26.492Z | v1.9.15 29 | subchannel_refcount | (202) 10.41.33.32:443 refcount 0 -> 1
D 2024-07-04T09:37:26.492Z | v1.9.15 29 | subchannel | (203) 10.41.22.168:443 Subchannel constructed with options {}
D 2024-07-04T09:37:26.492Z | v1.9.15 29 | subchannel_refcount | (203) 10.41.22.168:443 refcount 0 -> 1
D 2024-07-04T09:37:26.492Z | v1.9.15 29 | subchannel_refcount | (202) 10.41.33.32:443 refcount 1 -> 2
D 2024-07-04T09:37:26.492Z | v1.9.15 29 | subchannel_refcount | (203) 10.41.22.168:443 refcount 1 -> 2
D 2024-07-04T09:37:26.492Z | v1.9.15 29 | subchannel_refcount | (157) 10.41.10.107:443 refcount 1 -> 2
D 2024-07-04T09:37:26.492Z | v1.9.15 29 | pick_first | Start connecting to subchannel with address 10.41.33.32:443
D 2024-07-04T09:37:26.492Z | v1.9.15 29 | pick_first | IDLE -> CONNECTING
D 2024-07-04T09:37:26.492Z | v1.9.15 29 | resolving_load_balancer | dns:service-b.internal.com:443 CONNECTING -> CONNECTING
D 2024-07-04T09:37:26.492Z | v1.9.15 29 | connectivity_state | (2) dns:service-b.internal.com:443 CONNECTING -> CONNECTING
D 2024-07-04T09:37:26.492Z | v1.9.15 29 | subchannel | (202) 10.41.33.32:443 IDLE -> CONNECTING
D 2024-07-04T09:37:26.499Z | v1.9.15 29 | subchannel | (202) 10.41.33.32:443 CONNECTING -> READY
D 2024-07-04T09:37:26.499Z | v1.9.15 29 | pick_first | Pick subchannel with address 10.41.33.32:443
D 2024-07-04T09:37:26.499Z | v1.9.15 29 | subchannel_refcount | (202) 10.41.33.32:443 refcount 2 -> 3
D 2024-07-04T09:37:26.499Z | v1.9.15 29 | subchannel_refcount | (202) 10.41.33.32:443 refcount 3 -> 2
D 2024-07-04T09:37:26.499Z | v1.9.15 29 | subchannel_refcount | (203) 10.41.22.168:443 refcount 2 -> 1
D 2024-07-04T09:37:26.499Z | v1.9.15 29 | subchannel_refcount | (157) 10.41.10.107:443 refcount 2 -> 1
D 2024-07-04T09:37:26.499Z | v1.9.15 29 | pick_first | CONNECTING -> READY
D 2024-07-04T09:37:26.499Z | v1.9.15 29 | resolving_load_balancer | dns:service-b.internal.com:443 CONNECTING -> READY
D 2024-07-04T09:37:26.499Z | v1.9.15 29 | connectivity_state | (2) dns:service-b.internal.com:443 CONNECTING -> READY
D 2024-07-04T09:37:30.637Z | v1.9.15 29 | subchannel_refcount | (157) 10.41.10.107:443 refcount 1 -> 0

As expected, connections enter the TRANSIENT_FAILURE state once we kill the target B pods. When they become responsive again, we get these messages:

D 2024-07-04T09:47:31.084Z | v1.9.15 29 | subchannel | (299) 10.41.22.168:443 CONNECTING -> TRANSIENT_FAILURE
D 2024-07-04T09:47:31.084Z | v1.9.15 29 | subchannel | (299) 10.41.22.168:443 TRANSIENT_FAILURE -> IDLE
D 2024-07-04T09:47:31.084Z | v1.9.15 29 | subchannel | (299) 10.41.22.168:443 IDLE -> CONNECTING
D 2024-07-04T09:47:31.084Z | v1.9.15 29 | subchannel | (300) 10.41.33.32:443 CONNECTING -> TRANSIENT_FAILURE
D 2024-07-04T09:47:31.084Z | v1.9.15 29 | subchannel | (300) 10.41.33.32:443 TRANSIENT_FAILURE -> IDLE
D 2024-07-04T09:47:31.084Z | v1.9.15 29 | subchannel | (300) 10.41.33.32:443 IDLE -> CONNECTING
D 2024-07-04T09:47:31.093Z | v1.9.15 29 | subchannel | (300) 10.41.33.32:443 CONNECTING -> READY
D 2024-07-04T09:47:31.094Z | v1.9.15 29 | pick_first | Pick subchannel with address 10.41.33.32:443
D 2024-07-04T09:47:31.094Z | v1.9.15 29 | subchannel_refcount | (300) 10.41.33.32:443 refcount 2 -> 3
D 2024-07-04T09:47:31.094Z | v1.9.15 29 | subchannel_refcount | (275) 10.41.10.107:443 refcount 2 -> 1
D 2024-07-04T09:47:31.094Z | v1.9.15 29 | subchannel_refcount | (300) 10.41.33.32:443 refcount 3 -> 2
D 2024-07-04T09:47:31.094Z | v1.9.15 29 | subchannel_refcount | (299) 10.41.22.168:443 refcount 2 -> 1
D 2024-07-04T09:47:31.094Z | v1.9.15 29 | pick_first | TRANSIENT_FAILURE -> READY
D 2024-07-04T09:47:31.094Z | v1.9.15 29 | resolving_load_balancer | dns:service-b.internal.com:443 TRANSIENT_FAILURE -> READY
D 2024-07-04T09:47:31.094Z | v1.9.15 29 | connectivity_state | (2) dns:service-b.internal.com:443 TRANSIENT_FAILURE -> READY
D 2024-07-04T09:47:31.109Z | v1.9.15 29 | subchannel | (299) 10.41.22.168:443 CONNECTING -> READY

From that moment on, we stop seeing these messages altogether, as though the client library has become stuck.

We still see these traces:

D 2024-07-04T09:43:57.000Z | v1.9.15 29 | channel | (2) dns:service-b.internal.com:443 createResolvingCall [272] method="...", deadline=Infinity
D 2024-07-04T09:43:57.001Z | v1.9.15 29 | channel | (2) dns:service-b.internal.com:443 createRetryingCall [273] method="..."
D 2024-07-04T09:43:57.001Z | v1.9.15 29 | channel | (2) dns:service-b.internal.com:443 createLoadBalancingCall [274] method="..."

but no request is actually being sent; the client is completely frozen. Any help would be much appreciated.
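
(Side note: the createResolvingCall traces above show deadline=Infinity. A call with no deadline waits forever instead of failing, so setting a per-call deadline at least surfaces the hang as an error. A sketch, with someUnaryMethod standing in for a real method; this works around the symptom without fixing the stuck channel:)

// 5-second budget for this call; without it, deadline=Infinity means
// a call on a stuck channel hangs indefinitely.
const deadline = new Date(Date.now() + 5_000);
client.someUnaryMethod(request, { deadline }, (err, response) => {
  if (err && err.code === grpc.status.DEADLINE_EXCEEDED) {
    // Times out instead of hanging; the caller can retry, alert,
    // or recreate the channel here.
  }
});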

@murgatroid99 (Member) commented

It looks like you enabled only connection management trace logs, but we will also need the trace logs for individual requests to determine what is causing them not to be sent. It's probably simplest to just use GRPC_TRACE=all to get all the logs we might need.
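
(A sketch of that configuration; set it in the container environment or before '@grpc/grpc-js' is first required:)

// Enables every grpc-js tracer; GRPC_VERBOSITY=DEBUG is still needed
// for the D-level lines shown above.
process.env.GRPC_TRACE = 'all';
process.env.GRPC_VERBOSITY = 'DEBUG';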
