When we send basic check permissions requests in a loop through an AWS ALB to SpiceDB, the following error appears after some amount of time (usually within ~20 minutes):
Err terminated with errors error="rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 502 (Bad Gateway)"
This error resolves on later requests without any changes on our side, but in the meantime a request has failed, and the error keeps coming up at seemingly arbitrary points if we let the loop continue. The SpiceDB target group protocol is set to gRPC.
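For anyone trying to reproduce this, here is a rough sketch of the kind of loop described above. It is not the reporter's actual code: `run_check_loop` and `fake_check` are hypothetical names, and `fake_check` is only a stand-in for a real check permissions call made through the ALB. The point is to record when each failure happens instead of stopping at the first one.

```python
import time
from typing import Callable, List, Tuple


def run_check_loop(
    check: Callable[[], None],
    iterations: int,
    pause_seconds: float = 0.0,
) -> List[Tuple[int, float, str]]:
    """Call `check` repeatedly and record every failure instead of stopping.

    Returns a list of (iteration index, seconds since start, error text).
    """
    failures = []
    start = time.monotonic()
    for i in range(iterations):
        try:
            check()
        except Exception as exc:  # a real client would catch its RPC error type
            failures.append((i, time.monotonic() - start, str(exc)))
        if pause_seconds:
            time.sleep(pause_seconds)
    return failures


if __name__ == "__main__":
    # Stand-in for a real permission check (assumption): fails on every 4th call.
    calls = {"n": 0}

    def fake_check() -> None:
        calls["n"] += 1
        if calls["n"] % 4 == 0:
            raise RuntimeError("502 (Bad Gateway)")

    for index, elapsed, message in run_check_loop(fake_check, 10):
        print(f"iteration {index} at {elapsed:.3f}s: {message}")
```

Logging the elapsed time for each failure makes it easier to line the errors up with the ALB access log entries afterwards.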
Expected Result
We expect not to see these transient errors running SpiceDB behind an ALB.
Looking into the access logs for our ALB, we see the 502. In the log, request_processing_time and target_processing_time are set, meaning the request reached SpiceDB. We're not sure how to interpret the combination: target_processing_time being set indicates that the load balancer may have received headers from SpiceDB, but response_processing_time is -1, meaning the load balancer didn't receive a response from the target. AWS suggests that this can happen when either the target closed the connection while the load balancer had an outstanding request, or the target response is malformed or contains invalid HTTP headers.
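In case it helps others inspect the same fields, below is a minimal sketch for pulling the three timing values out of an ALB access log entry. It assumes the standard space-separated layout in which the first twelve fields are the ones listed in `FIELDS`. The sample line is made up for illustration (hostnames, IPs, and timings are not from our logs), and `parse_alb_entry` and `summarize_timings` are hypothetical helper names.

```python
import shlex

# Leading fields of an ALB access log entry, in order (assumed layout).
FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes",
]

TIMING_FIELDS = (
    "request_processing_time",
    "target_processing_time",
    "response_processing_time",
)


def parse_alb_entry(line: str) -> dict:
    """Split one log line (quoted fields stay intact) and name the leading fields."""
    parts = shlex.split(line)
    return dict(zip(FIELDS, parts))


def summarize_timings(entry: dict) -> dict:
    """Describe each timing field; -1 means the value was not recorded."""
    summary = {}
    for name in TIMING_FIELDS:
        value = float(entry[name])
        summary[name] = "not recorded (-1)" if value == -1 else f"{value}s"
    return summary


# Made-up sample entry, for illustration only.
SAMPLE = (
    'grpcs 2024-01-01T00:00:00.000000Z app/example-lb/0123456789abcdef '
    '10.0.0.1:50000 10.0.1.1:50051 0.001 0.002 -1 502 - 100 200 '
    '"POST https://spicedb.example.com:443/authzed.api.v1.PermissionsService/CheckPermission HTTP/2.0" '
    '"grpc-go/1.0"'
)

if __name__ == "__main__":
    entry = parse_alb_entry(SAMPLE)
    print(entry["elb_status_code"], summarize_timings(entry))
```

Filtering the log for entries where `elb_status_code` is 502 and comparing these three fields should show whether every failure has the same shape as the one described above.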
We tried setting the --grpc-max-conn-age flag to a large number to rule out the possibility that SpiceDB's keep-alive is shorter than the timeout on the load balancer, but still saw the same errors.
Actual Result
Error:
This is a known issue - gRPC is not designed to be run through a load balancer, even one that is supposed to be tailor-made for gRPC, and we've had many users (including myself at my old company) run into this.
gRPC clients are thick clients that expect to have a full list of the nodes that they can talk to, and then they establish a long-lived persistent connection to each of those nodes and round-robin outgoing requests across those nodes. Load balancers contravene some of those expectations.
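To make the "thick client" idea concrete, here is a toy sketch of what is described above: resolve every backend address for a name, then rotate requests across them. This is not gRPC's implementation, and it opens no connections. `resolve_all` and `round_robin` are hypothetical names, and the addresses in the usage example are invented.

```python
import itertools
import socket
from typing import Iterator, List, Tuple

Backend = Tuple[str, int]


def resolve_all(host: str, port: int) -> List[Backend]:
    """Return every distinct (address, port) the resolver reports for `host`."""
    backends: List[Backend] = []
    for info in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM):
        backend = (info[4][0], info[4][1])
        if backend not in backends:
            backends.append(backend)
    return backends


def round_robin(backends: List[Backend]) -> Iterator[Backend]:
    """Yield backends in rotation, one per outgoing request."""
    if not backends:
        raise ValueError("no backends to balance across")
    return itertools.cycle(backends)


if __name__ == "__main__":
    # Invented node addresses; a real client would get these from resolve_all().
    nodes = [("10.0.1.1", 50051), ("10.0.1.2", 50051), ("10.0.1.3", 50051)]
    picker = round_robin(nodes)
    for request_number in range(5):
        print(request_number, next(picker))
```

Behind a load balancer the client only ever resolves the balancer's own address, so this rotation happens over the wrong set of nodes, which is the mismatch described above.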
We ran into the same issue as you at my old company. The thing that ended up solving the 502 issue was getting off of an ALB and using CloudMap to give the clients a full list of available ECS nodes. This came with its own problems, namely that any time the SpiceDB nodes rolled, it took a not-insignificant amount of time for the new nodes to be picked up by DNS, and in the meantime it would appear that SpiceDB was unavailable.
The permanent fix was moving to EKS and getting SpiceDB and its clients in the same cluster. This had the added benefit of allowing SpiceDB's horizontal dispatch mechanism to work, which significantly increased cache hit rate. We know it isn't an option for every organization, but it is our recommendation.
What platforms are affected?
linux
What architectures are affected?
amd64
What SpiceDB version are you using?
v1.34.0-amd64