Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Mitigate ingester race between Query & GetChunkIDs #15178

Merged
merged 4 commits into from
Nov 28, 2024

Conversation

benclive
Copy link
Contributor

What this PR does / why we need it:

Fixes a race condition between the Query and GetChunkIDs calls when using partition-ingesters.

The bug occurs when first Querying logs from an ingester that just flushed a chunk (returning 0 logs), followed by calling GetChunkIDs on an ingester that has not flushed a chunk (returning 0 chunk refs). The bug manifests by returning 0 results when some log lines do exist in other ingesters.

  • This solution is to query all the available ingesters for GetChunkIDs in order to make sure we receive any recent chunks that have just been flushed.
  • There is still a small window where a shutting down ingester would flush all it's chunk and then be unavailable during restart to serve the GetChunkIDs request. This is the same with the classic ingesters and would need some more work to fully mitigate, so I propose this as an interim solution while a more complete solution can be discussed & implemented.

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@benclive benclive requested a review from a team as a code owner November 28, 2024 12:41
@benclive benclive changed the title Fix ingester race for chunkids fix: Mitigate ingester race between Query & GetChunkIDs Nov 28, 2024
@@ -35,6 +35,10 @@ import (
util_log "github.com/grafana/loki/v3/pkg/util/log"
)

var defaultQuorumConfig = ring.DoUntilQuorumConfig{
// Nothing here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think redundant comment. If we do want a comment I would write something like this (finishing the sentence):

Suggested change
// Nothing here
// We use the usable, default value of `DoUntilQuorumConfig` as ...
var defaultQuorumConfig = ring.DoUntilQuorumConfig{}

@@ -79,7 +83,8 @@ func newIngesterQuerier(querierConfig Config, clientCfg client.Config, ring ring
}

// forAllIngesters runs f, in parallel, for all ingesters
func (q *IngesterQuerier) forAllIngesters(ctx context.Context, f func(context.Context, logproto.QuerierClient) (interface{}, error)) ([]responseFromIngesters, error) {
// waitForAllResponses param can be used to require results from all ingesters in the replication set. If this is set to false, the call will return as soon as we have a quorum by zone. Only valid for partition-ingesters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can improve this comment, let's discuss offline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, we will follow up with some better comments in another PR.

}

// forGivenIngesterSets runs f, in parallel, for given ingester sets
func (q *IngesterQuerier) forGivenIngesterSets(ctx context.Context, replicationSet []ring.ReplicationSet, f func(context.Context, logproto.QuerierClient) (interface{}, error)) ([]responseFromIngesters, error) {
// Enable minimize requests so we initially query a single ingester per replication set, as each replication-set is one partition.
// waitForAllResponses param can be used to require results from all ingesters in all replication sets. If this is set to false, the call will return as soon as we have a quorum by zone.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Let's see if we can improve this comment. If I was unfamiliar with how this works, I would be surprised because results from all ingesters in all replication sets contradicts quorum by zone which by definition is a majority, not all.

@benclive benclive merged commit bd46e4c into main Nov 28, 2024
60 checks passed
@benclive benclive deleted the fix-ingester-race-for-chunkids branch November 28, 2024 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants