
Bug: Increased latencies on store-gateway after updating to 2.14.2 #10291

Open
ecktom opened this issue Dec 20, 2024 · 8 comments
Labels: bug (Something isn't working), component/store-gateway

Comments


ecktom commented Dec 20, 2024

What is the bug?

Hi guys, we've recently upgraded Mimir from 2.13.0 to 2.14.2 and noticed a sharp increase in latencies for the store-gateway on the read path that we cannot explain.

The store-gateway average went up from 50ms to 1s, and the p99 from 1s to 25s.

Image

We have not set any special configuration for the store-gateway. Has anyone experienced similar issues?

How to reproduce it?

What did you think would happen?

What was your environment?

  • Helm Chart 5.5.1
  • Mimir 2.14.2
  • Microservices mode

Any additional context to share?

No response

ecktom added the bug label on Dec 20, 2024
narqo (Contributor) commented Dec 24, 2024

Could you expand on your environment: which object storage do you run on (S3, GCS, something on-prem)? Could you identify any particular object storage operations where the latency increased (the "Mimir / Reads" dashboard should have the "Block object storage (store-gateway)" set of panels)? Does the caching still work as expected? Anything particular that the "Mimir / Reads resources" dashboard shows?

Also, do you happen to have traces that could reveal the source of latency?

ecktom (Author) commented Jan 6, 2025

Hi @narqo.

Happy new year, and thanks for coming back to this issue.

We're running on S3. This is what the object storage related metrics of the Mimir / Reads dashboard look like for the day of the update, which happened around 11:30.
Image

Caching worked fine as far as I can tell. We've since turned off the chunks-cache due to high data transfer costs, with no noticeable impact on the high store-gateway latencies.
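
(For context: in the mimir-distributed Helm chart the chunks-cache is one of the built-in memcached sub-components, so "turning it off" roughly corresponds to the values toggle below. This is a minimal sketch based on the chart's default key names, not the actual values used here.)

chunks-cache:
  enabled: false   # when disabled, the store-gateway fetches every chunk byte range directly from S3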

Nothing fancy on the Reads resources dashboard. The ruler has consumed more CPU since the update, and the store-gateway seems to hit GOMEMLIMIT after running for some time; however, that was also the case before the update.

Image

We haven't set up anything to receive traces in Jaeger format within this cluster. Let me see if I can get this running and provide you with some traces.

56quarters (Contributor) commented:

> Caching worked fine as far as I can tell. We've since turned off the chunks-cache due to high data transfer costs, with no noticeable impact on the high store-gateway latencies.

Turning off the chunks-cache should have had a massive impact, to the point of making querying nearly unusable, which makes me think it has somehow not been working since the upgrade. Can you share the portion of the "Mimir / Reads" dashboard that includes the chunks and index caching info? It should include RPS, latency, and hit ratios.

ecktom (Author) commented Jan 6, 2025

Sure thing, here we go, again for the 9th of December, the day we did the upgrade around 11:30.
Image

On the 18th of December around 1pm we turned off the chunks-cache; attached are some screenshots from the Mimir / Reads dashboard.
Image
Image

Turning off the chunks-cache did show some effect on the store-gateway p50, though.
Image

ecktom (Author) commented Jan 6, 2025

What also looks really weird is the latency of op: GetRange within "Blocks object store (store-gateway accesses)". It shows years?
Image

If I set a max value on the panel, it shows clearer numbers, and you can also recognize the two events (the update and turning off the chunks-cache). The values were within the range of 50-150ms before the update.
Image
Image

56quarters (Contributor) commented:

> What also looks really weird is the latency of op: GetRange within "Blocks object store (store-gateway accesses)". It shows years?

This was a bug in the object storage client. Annoying but ultimately not causing a problem.

More concerning to me is that it doesn't seem like the index-cache is being used at all in your screenshots. That cache also has a large impact on query latency. Can you confirm that you've set up the index-cache and that it's being used?
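
(For reference, a minimal values.yaml sketch of what enabling the index-cache looks like in the mimir-distributed Helm chart, assuming the chart's built-in memcached sub-components; the key names follow the chart's defaults, not this cluster's actual values:)

index-cache:
  enabled: true    # memcached the store-gateway uses for postings/series lookups
chunks-cache:
  enabled: true    # chunk byte ranges fetched from object storage
metadata-cache:
  enabled: true    # bucket index / block metadata lookups
results-cache:
  enabled: true    # query-frontend results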

ecktom (Author) commented Jan 7, 2025

> Can you confirm that you've set up the index-cache and that it's being used?

We have not yet set up the index-cache. We currently have the metadata-cache and results-cache in place. However, this was also the case before updating to 2.14.

ecktom (Author) commented Jan 9, 2025

We've turned on the index-cache without seeing much of an effect.

I somehow failed to get Mimir traces into our Tempo instance using the following environment variables on the store-gateway:

- name: JAEGER_AGENT_HOST               # Jaeger agent/receiver the spans are sent to
  value: alloy-tracing.monitoring.svc.cluster.local
- name: JAEGER_AGENT_PORT               # Jaeger thrift_compact, carried over UDP
  value: "6831"
- name: JAEGER_SAMPLER_TYPE             # constant sampler ...
  value: const
- name: JAEGER_SAMPLER_PARAM            # ... with param 1, i.e. sample every trace
  value: "1"

and using Alloy's Jaeger receiver:

otelcol.receiver.jaeger "jaeger" {
  protocols {
    // listens on UDP 0.0.0.0:6831 by default, matching JAEGER_AGENT_PORT above
    thrift_compact {}
  }

  output {
    metrics = []
    logs    = []
    traces  = [otelcol.processor.discovery.default.input]
  }
}

Is there something obvious I'm missing here? Live debugging for Alloy looked like nothing was ever received.
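
(One thing worth double-checking, purely an assumption and not something confirmed in this thread: thrift_compact is carried over UDP on port 6831, so the alloy-tracing Service has to expose that port with protocol: UDP; Kubernetes Service ports default to TCP, in which case the spans would never reach the receiver. A minimal sketch of the relevant Service fragment, reusing the host name from the env vars above:)

apiVersion: v1
kind: Service
metadata:
  name: alloy-tracing
  namespace: monitoring
spec:
  ports:
    - name: jaeger-thrift-compact
      port: 6831
      targetPort: 6831
      protocol: UDP   # UDP is required for Jaeger thrift_compact; the Service default is TCP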

With debug logs turned on for the store-gateway, I noticed the following entries when filtering for duration > 10s:

ts=2025-01-09T16:02:37.444557027Z caller=spanlogger.go:122 level=debug user=gridx caller=bucket.go:693 msg="sent streaming chunks" num_series=10932 num_chunks=257016 duration=20.204561294s
ts=2025-01-09T16:02:36.236715039Z caller=spanlogger.go:122 level=debug user=gridx caller=bucket.go:693 msg="sent streaming chunks" num_series=29724 num_chunks=29724 duration=39.350662269s
ts=2025-01-09T16:02:34.268375165Z caller=spanlogger.go:122 level=debug user=gridx caller=bucket.go:693 msg="sent streaming chunks" num_series=31488 num_chunks=783036 duration=14.197317936s

and sometimes also

ts=2025-01-09T16:02:41.50077997Z caller=grpc_logging.go:97 level=warn method=/gatewaypb.StoreGateway/Series duration=46.470386739s msg=gRPC err="rpc error: code = Unknown desc = rpc error: code = Unknown desc = send streaming chunks response: rpc error: code = Canceled desc = context canceled"

Does this help to narrow it down?
