
Intermittent 502s when running SpiceDB behind AWS ALB #2059

Open
alexshanabrook opened this issue Sep 6, 2024 · 1 comment
Labels
kind/bug Something is broken or regressed

Comments

@alexshanabrook

What platforms are affected?

linux

What architectures are affected?

amd64

What SpiceDB version are you using?

v1.34.0-amd64

Steps to Reproduce

When making basic CheckPermission requests in a loop through an AWS ALB to SpiceDB, after some amount of time (usually within ~20 minutes) we see the following error:

Err terminated with errors error="rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 502 (Bad Gateway)"

The error resolves on subsequent requests without any changes on our side, but in the meantime a request has failed, and the error keeps coming up at seemingly arbitrary points if we let the loop continue. The SpiceDB target group protocol is set to gRPC.
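
A minimal sketch of the kind of loop we're running, using the authzed-go client; the endpoint, token, and object/permission names below are placeholders rather than our real setup:

```go
package main

import (
	"context"
	"log"
	"time"

	v1 "github.com/authzed/authzed-go/proto/authzed/api/v1"
	"github.com/authzed/authzed-go/v1"
	"github.com/authzed/grpcutil"
)

func main() {
	systemCerts, err := grpcutil.WithSystemCerts(grpcutil.VerifyCA)
	if err != nil {
		log.Fatalf("failed to load system certs: %v", err)
	}

	// Dial SpiceDB through the ALB. Endpoint and token are placeholders.
	client, err := authzed.NewClient(
		"spicedb.internal.example:443",
		grpcutil.WithBearerToken("sometoken"),
		systemCerts,
	)
	if err != nil {
		log.Fatalf("failed to create client: %v", err)
	}

	for {
		// Basic permission check; after ~20 minutes some calls fail with
		// "Unavailable ... 502 (Bad Gateway)".
		_, err := client.CheckPermission(context.Background(), &v1.CheckPermissionRequest{
			Resource:   &v1.ObjectReference{ObjectType: "document", ObjectId: "firstdoc"},
			Permission: "view",
			Subject:    &v1.SubjectReference{Object: &v1.ObjectReference{ObjectType: "user", ObjectId: "alice"}},
		})
		if err != nil {
			log.Printf("check failed: %v", err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```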

Expected Result

We expect not to see these transient errors running SpiceDB behind an ALB.

Looking at the access logs for our ALB, we see the 502. In the log entry, request_processing_time and target_processing_time are both set, meaning the request reached SpiceDB and the load balancer may have received headers back from it, but response_processing_time is -1, meaning the load balancer never received a complete response from the target. AWS suggests this can happen when the target closed the connection while the load balancer had an outstanding request, or when the target response is malformed or contains invalid HTTP headers.

We tried setting the --grpc-max-conn-age flag to a large value to rule out SpiceDB's keep-alive/connection age being shorter than the load balancer's timeout, but we still saw the same errors.
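
For context, the client-side counterpart to that server flag is gRPC keepalive pings. A minimal sketch with grpc-go, with illustrative values chosen to stay under the ALB's default 60-second idle timeout (not a confirmed fix, and the hostname is a placeholder):

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive pings an idle connection so it never sits quiet
// longer than the ALB's idle timeout (60 seconds by default).
func dialWithKeepalive(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // TLS omitted for brevity
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after 30s of inactivity
			Timeout:             10 * time.Second, // wait up to 10s for the ack
			PermitWithoutStream: true,             // ping even with no in-flight RPCs
		}),
	)
}

func main() {
	conn, err := dialWithKeepalive("spicedb.internal.example:50051") // placeholder target
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```

Note that pinging more often than the server's keepalive enforcement policy allows can itself cause the server to close the connection, so the two sides need to agree.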

Actual Result

Error:

(Screenshot, 2024-09-05: the Unavailable / 502 Bad Gateway error quoted above.)
@tstirrat15 (Contributor)

This is a known issue: gRPC is not designed to be run through a load balancer, even one supposedly tailor-made for gRPC, and many users (including myself at my old company) have run into this.

gRPC clients are thick clients: they expect a full list of the nodes they can talk to, establish a long-lived persistent connection to each of those nodes, and round-robin outgoing requests across them. A load balancer in the middle contravenes some of those expectations.
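
To make that concrete, client-side load balancing in grpc-go looks roughly like this; the dns:/// target is a placeholder for a name (headless service, Cloud Map, etc.) that resolves to every SpiceDB node:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// With the dns:/// scheme, the client resolves every A record behind the
	// name and keeps a long-lived connection to each backend; round_robin
	// then spreads requests across those connections.
	conn, err := grpc.Dial(
		"dns:///spicedb.internal.example:50051", // placeholder hostname
		grpc.WithTransportCredentials(insecure.NewCredentials()), // TLS omitted for brevity
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	// Hand conn to the SpiceDB client stubs as usual.
}
```

Put a single ALB VIP in front of the pool and the client sees exactly one backend and one connection, which is where the mismatch comes from.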

We ran into the same issue at my old company. What ended up solving the 502s was getting off the ALB and using CloudMap to give the clients a full list of the available ECS nodes. That came with its own problem: any time the SpiceDB nodes rolled, it took a not-insignificant amount of time for the new nodes to be picked up by DNS, and in the meantime SpiceDB appeared to be unavailable.

The permanent fix was moving to EKS and getting SpiceDB and its clients in the same cluster. This had the added benefit of allowing SpiceDB's horizontal dispatch mechanism to work, which significantly increased cache hit rate. We know it isn't an option for every organization, but it is our recommendation.

There's more discussion in a discord thread here: https://discord.com/channels/844600078504951838/1240169726143893556
