
Bug: Increased latencies on store-gateway after updating to 2.14.2 #10291

Open
ecktom opened this issue Dec 20, 2024 · 8 comments
Labels: bug (Something isn't working), component/store-gateway

Comments


ecktom commented Dec 20, 2024

What is the bug?

Hi guys, we've recently upgraded Mimir from 2.13.0 to 2.14.2 and noticed a sharp increase in latencies for the store-gateway on the read path that we cannot explain.

The store-gateway average went up from 50ms to 1s, and the p99 from 1s to 25s.

Image

We have not set any special configuration for the store-gateway. Has anyone experienced similar issues?

How to reproduce it?

What did you think would happen?

What was your environment?

  • Helm Chart 5.5.1
  • Mimir 2.14.2
  • Microservices mode

Any additional context to share?

No response

ecktom added the bug label on Dec 20, 2024
narqo (Contributor) commented Dec 24, 2024

Could you expand on your environment: which object storage do you run on (S3, GCS, something on-prem)? Could you identify any particular object storage operations where the latency increased (the "Mimir / Reads" dashboard should have the "Block object storage (store-gateway)" set of panels)? Does the caching still work as expected? Anything particular that the "Mimir / Reads resources" dashboard shows?

Also, do you happen to have traces that could reveal the source of latency?

ecktom (Author) commented Jan 6, 2025

Hi @narqo.

Happy new year, and thanks for coming back to this issue.

We're running on S3. This is what the object storage related metrics of the Mimir / Reads dashboard look like for the day of the update, which happened around 11:30.
Image

Caching worked fine as far as I can tell. We've since turned off the chunks-cache due to high data transfer costs, with no noticeable impact on the high store-gateway latencies.
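
(For context: in the mimir-distributed Helm chart the chunks-cache is one of the built-in memcached sub-components, so "turning it off" roughly corresponds to the values toggle below. This is a minimal sketch based on the chart's default key names, not the actual values used here.)

chunks-cache:
  enabled: false   # when disabled, the store-gateway fetches every chunk byte range directly from S3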

Nothing fancy on the Reads resources dashboard. The ruler has consumed more CPU since the update, and the store-gateway seems to hit GOMEMLIMIT after running for some time; however, that was also the case before the update.

Image

We haven't set up anything to receive traces in Jaeger format within this cluster. Let me see if I can get this running and provide you with some traces.

56quarters (Contributor) commented:

> Caching worked fine as far as I can tell. We've since turned off the chunks-cache due to high data transfer costs, with no noticeable impact on the high store-gateway latencies.

Turning off the chunks-cache should have had a massive impact, to the point of making querying nearly unusable, which makes me think it has somehow not been working since the upgrade. Can you share the portion of the "Mimir / Reads" dashboard that includes the chunks and index caching info? It should include RPS, latency, and hit ratios.

ecktom (Author) commented Jan 6, 2025

Sure thing, here we go, again for the 9th of December, the day we did the upgrade around 11:30.
Image

On the 18th of December around 1pm we turned off the chunks-cache; attached are some screenshots from the Mimir / Reads dashboard.
Image
Image

Turning off the chunks-cache did show some effect on the store-gateway p50, though.
Image

ecktom (Author) commented Jan 6, 2025

What also looks really weird is the latency of op: GetRange within "Blocks object store (store-gateway accesses)". It shows years?
Image

If I set a max value on the panel, it shows clearer numbers, and you can also recognize the two events (the update and turning off the chunks-cache). The values were within the range of 50-150ms before the update.
Image
Image

56quarters (Contributor) commented:

> What also looks really weird is the latency of op: GetRange within "Blocks object store (store-gateway accesses)". It shows years?

This was a bug in the object storage client. Annoying but ultimately not causing a problem.

More concerning to me is that it doesn't seem like the index-cache is being used at all in your screenshots. That cache also has a large impact on query latency. Can you confirm that you've set up the index-cache and that it's being used?
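
(For reference, a minimal values.yaml sketch of what enabling the index-cache looks like in the mimir-distributed Helm chart, assuming the chart's built-in memcached sub-components; the key names follow the chart's defaults, not this cluster's actual values:)

index-cache:
  enabled: true    # memcached the store-gateway uses for postings/series lookups
chunks-cache:
  enabled: true    # chunk byte ranges fetched from object storage
metadata-cache:
  enabled: true    # bucket index / block metadata lookups
results-cache:
  enabled: true    # query-frontend results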

ecktom (Author) commented Jan 7, 2025

> Can you confirm that you've set up the index-cache and that it's being used?

We have not yet set up the index-cache. We currently have the metadata-cache and results-cache in place. However, this was also the case before updating to 2.14.

ecktom (Author) commented Jan 9, 2025

We've turned on the index-cache without seeing much of an effect.

I somehow failed to get Mimir traces into our Tempo instance using the following environment variables on the store-gateway:

- name: JAEGER_AGENT_HOST               # Jaeger agent/receiver the spans are sent to
  value: alloy-tracing.monitoring.svc.cluster.local
- name: JAEGER_AGENT_PORT               # Jaeger thrift_compact, carried over UDP
  value: "6831"
- name: JAEGER_SAMPLER_TYPE             # constant sampler ...
  value: const
- name: JAEGER_SAMPLER_PARAM            # ... with param 1, i.e. sample every trace
  value: "1"

and using Alloy's Jaeger receiver:

otelcol.receiver.jaeger "jaeger" {
  protocols {
    // listens on UDP 0.0.0.0:6831 by default, matching JAEGER_AGENT_PORT above
    thrift_compact {}
  }

  output {
    metrics = []
    logs    = []
    traces  = [otelcol.processor.discovery.default.input]
  }
}

Is there something obvious I'm missing here? Live debugging for Alloy looked like nothing was ever received.
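
(One thing worth double-checking, purely an assumption and not something confirmed in this thread: thrift_compact is carried over UDP on port 6831, so the alloy-tracing Service has to expose that port with protocol: UDP; Kubernetes Service ports default to TCP, in which case the spans would never reach the receiver. A minimal sketch of the relevant Service fragment, reusing the host name from the env vars above:)

apiVersion: v1
kind: Service
metadata:
  name: alloy-tracing
  namespace: monitoring
spec:
  ports:
    - name: jaeger-thrift-compact
      port: 6831
      targetPort: 6831
      protocol: UDP   # UDP is required for Jaeger thrift_compact; the Service default is TCP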

With debug logs turned on for the store-gateway, I noticed the following entries when filtering for duration > 10s:

ts=2025-01-09T16:02:37.444557027Z caller=spanlogger.go:122 level=debug user=gridx caller=bucket.go:693 msg="sent streaming chunks" num_series=10932 num_chunks=257016 duration=20.204561294s
ts=2025-01-09T16:02:36.236715039Z caller=spanlogger.go:122 level=debug user=gridx caller=bucket.go:693 msg="sent streaming chunks" num_series=29724 num_chunks=29724 duration=39.350662269s
ts=2025-01-09T16:02:34.268375165Z caller=spanlogger.go:122 level=debug user=gridx caller=bucket.go:693 msg="sent streaming chunks" num_series=31488 num_chunks=783036 duration=14.197317936s

and sometimes also

ts=2025-01-09T16:02:41.50077997Z caller=grpc_logging.go:97 level=warn method=/gatewaypb.StoreGateway/Series duration=46.470386739s msg=gRPC err="rpc error: code = Unknown desc = rpc error: code = Unknown desc = send streaming chunks response: rpc error: code = Canceled desc = context canceled"

Does this help to narrow it down?
