lease: Fix incorrect gRPC Unavailable on client cancel during LeaseKeepAlive forwarding #21050
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: zhijun42. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Hi @zhijun42. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
My recommendation would be to first send a PR with only tests to show the current behavior. Such tests should be beneficial by themselves. Only after the tests are merged should we send a PR that changes the behavior; with the tests already merged, it should be very easy to show how your proposal impacts the output.
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
... and 23 files with indirect coverage changes

@@ Coverage Diff @@
## main #21050 +/- ##
==========================================
+ Coverage 68.43% 68.54% +0.11%
==========================================
Files 429 429
Lines 35213 35211 -2
==========================================
+ Hits 24098 24137 +39
+ Misses 9709 9678 -31
+ Partials 1406 1396 -10

Continue to review full report in Codecov by Sentry.
Sounds good! This is great advice and makes code review easier. Will definitely do so going forward.
Re CI failures:
This could be due to an unstable CI environment and a slow machine? I can't reproduce it on my local machine, and it is unrelated to my LeaseKeepAlive changes.
/retest
Helps investigate the long-standing issue #13632 (a GitHub action mistakenly closed it due to inactivity).
Problem
Some users reported that `grpc_server_handled_total{grpc_code="Unavailable"}` was unexpectedly inflating for LeaseKeepAlive requests even when the cluster is healthy.
Fix
The function `LeaseServer.LeaseKeepAlive` always turns `context.Canceled` (gRPC `codes.Canceled`) into `rpctypes.ErrGRPCNoLeader` (gRPC `codes.Unavailable`), even when it is the client that initiates the cancellation. As a result, the gRPC metrics are counted incorrectly. The old comment is wrong:
`// the only server-side cancellation is noleader for now.`
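For context, here is a condensed sketch of the pre-fix shape of this handler (the select-on-stream-context structure in `server/etcdserver/api/v3rpc/lease.go`); it is illustrative and may not match the source line for line:

```go
// Condensed, illustrative sketch of the pre-fix handler; not a verbatim copy of lease.go.
func (ls *LeaseServer) LeaseKeepAlive(stream pb.Lease_LeaseKeepAliveServer) (err error) {
	errc := make(chan error, 1)
	go func() {
		// Worker loop: reads keep-alive requests and renews leases, forwarding
		// to the leader through EtcdServer.LeaseRenew when this member is a follower.
		errc <- ls.leaseKeepAlive(stream)
	}()

	select {
	case err = <-errc:
	case <-stream.Context().Done():
		// the only server-side cancellation is noleader for now.
		err = stream.Context().Err()
		if err == context.Canceled {
			// A client-initiated cancel is rewritten into an Unavailable error here,
			// which is what inflates grpc_code="Unavailable" for LeaseKeepAlive.
			err = rpctypes.ErrGRPCNoLeader
		}
	}
	return err
}
```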
In fact, there is no server-side cancellation along the worker path `EtcdServer.LeaseRenew`. The only scenario in which this function returns `errors.ErrCanceled` is when the client cancels the request and the Done signal then propagates into this function. The fix is pretty straightforward.
To validate the fix, I added two test cases that send a LeaseKeepAlive request to one follower in the cluster and, while the follower is forwarding the request to the leader, block the leader's `ServeHTTP` path via a go failpoint. While the leader is blocked, one test case cancels the request, and the other waits until the forwarding request times out. Both cases should receive the expected errors; a rough sketch of the cancellation case follows.
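The sketch below shows what the cancellation case could look like; the cluster and client helpers (`newThreeNodeCluster`, `FollowerClient`) and the failpoint name (`beforeServeHTTP`) are hypothetical placeholders, not the exact code added in this PR:

```go
package lease_test // hypothetical test package for this sketch

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	"go.etcd.io/etcd/api/v3/v3rpc/rpctypes"
	gofail "go.etcd.io/gofail/runtime"
)

// Sketch only: newThreeNodeCluster, FollowerClient, and the "beforeServeHTTP"
// failpoint name are hypothetical placeholders.
func TestLeaseKeepAliveCanceledByClient(t *testing.T) {
	clus := newThreeNodeCluster(t) // hypothetical helper: starts a 3-member cluster
	defer clus.Close()

	// Block the leader's ServeHTTP path so the follower's forwarded
	// LeaseKeepAlive request hangs on the leader (sleep value is in milliseconds).
	require.NoError(t, gofail.Enable("beforeServeHTTP", `sleep(8000)`))
	defer gofail.Disable("beforeServeHTTP")

	follower := clus.FollowerClient(t) // hypothetical helper: clientv3 client for a follower
	lease, err := follower.Grant(context.Background(), 60)
	require.NoError(t, err)

	// Cancel the request from the client side while it is stuck being forwarded.
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(time.Second)
		cancel()
	}()

	_, err = follower.KeepAliveOnce(ctx, lease.ID)
	require.Error(t, err)
	// Expected with the fix: the failure reflects the client's own cancellation,
	// not rpctypes.ErrNoLeader (gRPC code Unavailable).
	require.NotErrorIs(t, err, rpctypes.ErrNoLeader)
}
```

The timeout case would be analogous: instead of cancelling, the client waits for the forwarded request to exceed its deadline and asserts the corresponding error.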