Remote Cache Warning GRPC #765
Comments
Do you see any errors in bazel-remote's logs when the client shows these timeouts? Do you have iowait monitoring on the bazel-remote machine and on the S3 storage (if it's something you're running locally)? If you see iowait spikes, then maybe your storage bandwidth is saturated.
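If no iowait monitoring is in place yet, a rough check on a Linux host is to sample the aggregate `cpu` line of `/proc/stat` twice and compare the counters. The sketch below is only an illustration and not part of bazel-remote; the function names and the 5-second interval are assumptions.

```python
import time


def parse_cpu_line(stat_text):
    """Return the numeric fields of the aggregate 'cpu' line from /proc/stat text."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(x) for x in parts[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_fraction(before_text, after_text):
    """Fraction of CPU time spent in iowait between two /proc/stat samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    # First eight fields: user nice system idle iowait irq softirq steal.
    # Guest time is already counted in user/nice, so it is left out of the total.
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return deltas[4] / total  # index 4 is iowait


def sample_iowait(interval_s=5.0, path="/proc/stat"):
    """Hypothetical helper: read /proc/stat twice, interval_s seconds apart (Linux only)."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval_s)
    with open(path) as f:
        after = f.read()
    return iowait_fraction(before, after)
```

Calling `sample_iowait()` around the time the client reports timeouts gives a quick signal; a sustained high value would point at the disk rather than at CPU.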
@mostynb, yeah, we see a lot of our requests increased when we added
I don't believe so, I can take a look here. In general, is it better to scale vertically or horizontally? Currently we have 3 replicas on ECS.
Are there any more details provided in the logs besides bytestream read failed and the resource/blob name? If so, could you share a few of them here?
The REAPIv2 cache service has strong coherence requirements, and bazel doesn't behave nicely when those assumptions fail. For example, bazel builds can fail if they make a request to one cache server, then make a request to another server (with a different set of blobs) during the same build. This makes horizontal scaling risky, unless you arrange things in such a way that a given client only talks to a single cache server during a single build. Evicting items from bazel-remote's proxy backends can also break these assumptions. To avoid this we would need to figure out a way to implement some sort of LRU-like eviction for S3 (but I don't have an AWS account to do this work myself).
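One way to arrange for a given client to talk to only one cache server is to pick the server deterministically from a stable client identifier. The sketch below uses rendezvous (highest-random-weight) hashing; it is not a bazel-remote feature, and `pick_cache_server`, the client ID, and the server addresses are hypothetical names for illustration. It would live in whatever routing layer sits in front of the replicas.

```python
import hashlib


def pick_cache_server(client_id, servers):
    """Map a client to one server so all of its requests hit the same cache."""
    if not servers:
        raise ValueError("servers must not be empty")

    def score(server):
        # Hash the (server, client) pair; the server with the highest score wins.
        digest = hashlib.sha256(f"{server}|{client_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(servers, key=score)


# Hypothetical replica addresses and client identifier.
servers = ["cache-0:9092", "cache-1:9092", "cache-2:9092"]
chosen = pick_cache_server("ci-runner-17", servers)
```

With this scheme, removing a replica only remaps the clients that were assigned to it. A build that is in flight when its own replica disappears can still hit the coherence problem described above.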
Hi 👋 We've been using this remote cache backed by S3 (blob-based S3 storage) and have recently been seeing gRPC timeouts. We have 3 instances of the service running with 4 vCPUs, on version 2.3.9.
Our CPU and memory usage peak at only 80% and 30% respectively. I don't have a reliable repro for this, but was wondering if you had any insight as to what could be going wrong here?