
Conversation

@zhijun42

Helps investigate the long-standing issue #13632 (a GitHub action mistakenly closed it due to inactivity).

Problem

Some users reported that grpc_server_handled_total{grpc_code="Unavailable"} was unexpectedly inflating for LeaseKeepAlive requests even when the cluster is healthy.

Fix

The function LeaseServer.LeaseKeepAlive always turns context.Canceled (which is gRPC codes.Canceled) into rpctypes.ErrGRPCNoLeader (which is gRPC codes.Unavailable), even when it is the client that initiates the cancellation. As a result, the gRPC metrics are counted incorrectly.

The old comment is wrong: // the only server-side cancellation is noleader for now.

In fact, there is no server-side cancellation along the EtcdServer.LeaseRenew path. The only scenario in which this function returns errors.ErrCanceled is when the client cancels the request and the Done signal propagates into it.
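To make the metric-level effect concrete, here is a minimal, self-contained Go sketch; it is not the actual patch. errGRPCNoLeader and errGRPCCanceled are local stand-ins for the rpctypes values, and the real mapping lives in server/etcdserver/api/v3rpc/lease.go. The sketch only illustrates which gRPC code a client-initiated cancellation ends up being counted as, before and after a change of the kind described here.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Local stand-ins for the rpctypes errors mentioned above; they exist here
// only to keep the sketch self-contained.
var (
	errGRPCNoLeader = status.Error(codes.Unavailable, "etcdserver: no leader")
	errGRPCCanceled = status.Error(codes.Canceled, "etcdserver: request canceled")
)

// oldMapping mirrors the behavior described above: every context.Canceled
// coming out of the keep-alive path is rewritten into a "no leader" error,
// so a client-initiated cancellation is counted as grpc_code="Unavailable".
func oldMapping(err error) error {
	if errors.Is(err, context.Canceled) {
		return errGRPCNoLeader
	}
	return err
}

// newMapping sketches the intent of the fix: a cancellation that the client
// initiated surfaces as codes.Canceled rather than codes.Unavailable.
// (The exact error the patch returns may differ.)
func newMapping(err error) error {
	if errors.Is(err, context.Canceled) {
		return errGRPCCanceled
	}
	return err
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // the client tears down its LeaseKeepAlive stream

	err := ctx.Err() // context.Canceled

	fmt.Println(status.Code(oldMapping(err))) // Unavailable
	fmt.Println(status.Code(newMapping(err))) // Canceled
}
```

Whatever concrete error the patch returns, the point is that a client-side cancellation should no longer be recorded under grpc_code="Unavailable".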

The fix is pretty straightforward. To validate it, I added two test cases in which I send a LeaseKeepAlive request to a follower in the cluster and, while the follower is forwarding the request to the leader, block the leader's ServeHTTP path with a gofail failpoint.

While the forward is blocked, one test case cancels the request and the other waits until the forwarded request times out. Both cases should receive the expected errors; the flow is sketched below.
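For illustration, here is a rough sketch of what the cancellation scenario looks like from the client side. It assumes a 3-member cluster is already running, the follower serves clients at followerEndpoint, the leader exposes a gofail HTTP endpoint at leaderGofailAddr, and a failpoint named leaseForwardBlock sits in the lease-forwarding ServeHTTP path; all of these names and addresses are hypothetical, and the actual test cases in this PR are written against the etcd test framework rather than this standalone program.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"net/http"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const (
	followerEndpoint = "http://127.0.0.1:2379"  // assumed follower client URL
	leaderGofailAddr = "http://127.0.0.1:22381" // assumed GOFAIL_HTTP address of the leader
	failpointName    = "leaseForwardBlock"      // hypothetical failpoint in the ServeHTTP forward path
)

// armFailpoint asks the leader's gofail HTTP endpoint to make the failpoint
// sleep, which effectively blocks the follower-to-leader lease forward.
func armFailpoint() error {
	req, err := http.NewRequest(http.MethodPut,
		leaderGofailAddr+"/"+failpointName, strings.NewReader(`sleep(10000)`))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{followerEndpoint},
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		log.Fatalf("create client: %v", err)
	}
	defer cli.Close()

	// Grant a lease first; this goes through raft and is not affected by the
	// blocked lease-forwarding path.
	lease, err := cli.Grant(context.Background(), 60)
	if err != nil {
		log.Fatalf("grant lease: %v", err)
	}

	if err := armFailpoint(); err != nil {
		log.Fatalf("arm failpoint: %v", err)
	}

	// Scenario 1: cancel while the follower is still forwarding the renewal
	// to the blocked leader. The client should observe a cancellation, not a
	// "no leader" / Unavailable error.
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(200 * time.Millisecond) // let the forward get stuck first
		cancel()
	}()
	_, err = cli.KeepAliveOnce(ctx, lease.ID)
	fmt.Printf("client cancel: err=%v canceled=%v\n", err, errors.Is(err, context.Canceled))
}
```

The second scenario is the same flow without the client-side cancel: the keep-alive is left to run until the follower-to-leader forward times out, and the returned error is asserted instead.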

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zhijun42
Once this PR has been reviewed and has the lgtm label, please assign fuweid for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @zhijun42. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius
Member

/ok-to-test

My recommendation would be to first send a PR with only tests to show the current behavior. Such tests should be beneficial by themselves. Only after the tests are merged would we send a PR that changes the behavior, and with the tests already merged it should be very easy to show how your proposal impacts the output.

@codecov

codecov bot commented Dec 25, 2025

Codecov Report

❌ Patch coverage is 71.42857% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.54%. Comparing base (7d7f951) to head (e9da9a7).
⚠️ Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| server/etcdserver/v3_server.go | 69.23% | 4 Missing ⚠️ |
Additional details and impacted files
| Files with missing lines | Coverage Δ |
|---|---|
| server/etcdserver/api/v3rpc/lease.go | 81.81% <ø> (+4.60%) ⬆️ |
| server/etcdserver/api/v3rpc/util.go | 67.74% <ø> (ø) |
| server/lease/leasehttp/http.go | 64.18% <100.00%> (+2.02%) ⬆️ |
| server/etcdserver/v3_server.go | 75.47% <69.23%> (-0.09%) ⬇️ |

... and 23 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #21050      +/-   ##
==========================================
+ Coverage   68.43%   68.54%   +0.11%     
==========================================
  Files         429      429              
  Lines       35213    35211       -2     
==========================================
+ Hits        24098    24137      +39     
+ Misses       9709     9678      -31     
+ Partials     1406     1396      -10     


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


@zhijun42
Author

first send a PR with only tests to show the current behavior.

Sounds good! This is great advice and makes code review easier. I will definitely do so going forward.

@zhijun42
Author

Re CI failures:

  1. etcd-robustness-amd64: There's a flaky test go.etcd.io/etcd/tests/v3/robustness: TestRobustnessExploratory/EtcdTrafficDeleteLeases/ClusterOfSize1/compactBeforeSetFinishedCompact=panic that failed due to:
traffic.go:177: Requiring minimal 100.000000 qps before failpoint injection for test results to be reliable, got 59.120913 qps

This could be due to an unstable CI environment or a slow machine; I can't reproduce it on my local machine, and it is unrelated to my LeaseKeepAlive changes.

  2. etcd-coverage-report: The failed test is again unrelated to my changes.
=== Failed
=== FAIL: integration TestCacheLaggingWatcher/pipeline_overflow (0.06s)
getWatch failed, will retry after 50ms: cache: stale event batch (rev 32 < latest 42)
    cache_test.go:1123: gotEvents=6, wantEvents<1
getWatch failed, will retry after 50ms: context canceled
    --- FAIL: TestCacheLaggingWatcher/pipeline_overflow (0.06s)

@zhijun42
Author

/retest

