Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry handover error during graceful failover #7028

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

yux0
Copy link
Contributor

@yux0 yux0 commented Dec 20, 2024

What changed?

Retry handover error during graceful failover

Why?

Namespace handover error is transient during graceful failover we can retry on this error.

This is the best effort to reduce the returned error. If the retry exhausted, it still return the handover to caller.

The backoff retry handles the context deadline duration and retry.

How did you test it?

Unit tests.

Potential risks

Documentation

Is hotfix candidate?

@yux0 yux0 requested a review from a team as a code owner December 20, 2024 23:57
@yux0 yux0 requested review from yycptt and hai719 December 20, 2024 23:57
err := ni.checkReplicationState(nsEntry, info.FullMethod)
if err != nil {
var nsErr error
nsEntry, nsErr = ni.namespaceRegistry.RefreshNamespaceById(nsEntry.ID())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm this will keep refreshing the namespace? I mean all requests will perform a refresh upon retry.

also if the handover error is from history service then the refresh here is unnecessary

return err
}
return ni.checkReplicationState(namespaceEntry, fullMethod)
// Only retry on ErrNamespaceHandover, other errors will be handled by the retry interceptor
err := backoff.ThrottleRetryContext(ctx, op, handoverRetryPolicy, common.IsNamespaceHandoverError)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also worry this will cause server api latency alert to fire during the handover, we should probably exclude this backoff from nouserlatency metric.

return err
}
return ni.checkReplicationState(namespaceEntry, fullMethod)
// Only retry on ErrNamespaceHandover, other errors will be handled by the retry interceptor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we just let retry interceptor handle the error? and this interceptor just block until ns is no longer in handover state?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants