
Intermittent 502s when running SpiceDB behind AWS ALB #2059

Open
alexshanabrook opened this issue Sep 6, 2024 · 1 comment
Labels
kind/bug Something is broken or regressed

Comments

@alexshanabrook

What platforms are affected?

linux

What architectures are affected?

amd64

What SpiceDB version are you using?

v1.34.0-amd64

Steps to Reproduce

When making basic CheckPermission requests in a loop through an AWS ALB to SpiceDB, after some amount of time (usually within ~20 minutes) we see the following error:

Err terminated with errors error="rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 502 (Bad Gateway)"

The error resolves on subsequent requests without any changes on our side, but in the meantime a request has failed, and the error keeps coming up at seemingly arbitrary points if we let the loop continue. The SpiceDB target group protocol is set to gRPC.
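
A minimal sketch of the kind of loop we're running, using the authzed-go client; the endpoint, token, and object/permission names below are placeholders rather than our real setup:

```go
package main

import (
	"context"
	"log"
	"time"

	v1 "github.com/authzed/authzed-go/proto/authzed/api/v1"
	"github.com/authzed/authzed-go/v1"
	"github.com/authzed/grpcutil"
)

func main() {
	systemCerts, err := grpcutil.WithSystemCerts(grpcutil.VerifyCA)
	if err != nil {
		log.Fatalf("failed to load system certs: %v", err)
	}

	// Dial SpiceDB through the ALB. Endpoint and token are placeholders.
	client, err := authzed.NewClient(
		"spicedb.internal.example:443",
		grpcutil.WithBearerToken("sometoken"),
		systemCerts,
	)
	if err != nil {
		log.Fatalf("failed to create client: %v", err)
	}

	for {
		// Basic permission check; after ~20 minutes some calls fail with
		// "Unavailable ... 502 (Bad Gateway)".
		_, err := client.CheckPermission(context.Background(), &v1.CheckPermissionRequest{
			Resource:   &v1.ObjectReference{ObjectType: "document", ObjectId: "firstdoc"},
			Permission: "view",
			Subject:    &v1.SubjectReference{Object: &v1.ObjectReference{ObjectType: "user", ObjectId: "alice"}},
		})
		if err != nil {
			log.Printf("check failed: %v", err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```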

Expected Result

We expect not to see these transient errors running SpiceDB behind an ALB.

Looking at the access logs for our ALB, we see the 502. In the log entry, request_processing_time and target_processing_time are both set, meaning the request reached SpiceDB and the load balancer may have received headers back from it, but response_processing_time is -1, meaning the load balancer never received a complete response from the target. AWS suggests this can happen when the target closed the connection while the load balancer had an outstanding request, or when the target response is malformed or contains invalid HTTP headers.

We tried setting the --grpc-max-conn-age flag to a large value to rule out SpiceDB's keep-alive/connection age being shorter than the load balancer's timeout, but we still saw the same errors.
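
For context, the client-side counterpart to that server flag is gRPC keepalive pings. A minimal sketch with grpc-go, with illustrative values chosen to stay under the ALB's default 60-second idle timeout (not a confirmed fix, and the hostname is a placeholder):

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive pings an idle connection so it never sits quiet
// longer than the ALB's idle timeout (60 seconds by default).
func dialWithKeepalive(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // TLS omitted for brevity
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after 30s of inactivity
			Timeout:             10 * time.Second, // wait up to 10s for the ack
			PermitWithoutStream: true,             // ping even with no in-flight RPCs
		}),
	)
}

func main() {
	conn, err := dialWithKeepalive("spicedb.internal.example:50051") // placeholder target
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```

Note that pinging more often than the server's keepalive enforcement policy allows can itself cause the server to close the connection, so the two sides need to agree.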

Actual Result

Error:

(Screenshot, 2024-09-05: the Unavailable / 502 Bad Gateway error quoted above.)
@tstirrat15 (Contributor)

This is a known issue: gRPC is not designed to be run through a load balancer, even one supposedly tailor-made for gRPC, and many users (including myself at my old company) have run into this.

gRPC clients are thick clients: they expect a full list of the nodes they can talk to, establish a long-lived persistent connection to each of those nodes, and round-robin outgoing requests across them. A load balancer in the middle contravenes some of those expectations.
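
To make that concrete, client-side load balancing in grpc-go looks roughly like this; the dns:/// target is a placeholder for a name (headless service, Cloud Map, etc.) that resolves to every SpiceDB node:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// With the dns:/// scheme, the client resolves every A record behind the
	// name and keeps a long-lived connection to each backend; round_robin
	// then spreads requests across those connections.
	conn, err := grpc.Dial(
		"dns:///spicedb.internal.example:50051", // placeholder hostname
		grpc.WithTransportCredentials(insecure.NewCredentials()), // TLS omitted for brevity
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	// Hand conn to the SpiceDB client stubs as usual.
}
```

Put a single ALB VIP in front of the pool and the client sees exactly one backend and one connection, which is where the mismatch comes from.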

We ran into the same issue at my old company. What ended up solving the 502s was getting off the ALB and using CloudMap to give the clients a full list of the available ECS nodes. That came with its own problem: any time the SpiceDB nodes rolled, it took a not-insignificant amount of time for the new nodes to be picked up by DNS, and in the meantime SpiceDB appeared to be unavailable.

The permanent fix was moving to EKS and getting SpiceDB and its clients in the same cluster. This had the added benefit of allowing SpiceDB's horizontal dispatch mechanism to work, which significantly increased cache hit rate. We know it isn't an option for every organization, but it is our recommendation.

There's more discussion in a discord thread here: https://discord.com/channels/844600078504951838/1240169726143893556
