
Conversation

@zhijun42

Helps investigate the long-standing issue #13632 (a GitHub action mistakenly closed it due to inactivity).

Problem

Some users reported that grpc_server_handled_total{grpc_code="Unavailable"} was unexpectedly inflating for LeaseKeepAlive requests even when the cluster is healthy.

Fix

The function LeaseServer.LeaseKeepAlive always turns context.Canceled (which is gRPC codes.Canceled) into rpctypes.ErrGRPCNoLeader (which is gRPC codes.Unavailable), even when it is the client that initiates the cancellation. As a result, the gRPC metrics are counted incorrectly.

The old comment is wrong: // the only server-side cancellation is noleader for now.

In fact, there is no server-side cancellation along the EtcdServer.LeaseRenew path. The only scenario in which this function returns errors.ErrCanceled is when the client cancels the request and the Done signal propagates into it.
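To make the metric-level effect concrete, here is a minimal, self-contained Go sketch; it is not the actual patch. errGRPCNoLeader and errGRPCCanceled are local stand-ins for the rpctypes values, and the real mapping lives in server/etcdserver/api/v3rpc/lease.go. The sketch only illustrates which gRPC code a client-initiated cancellation ends up being counted as, before and after a change of the kind described here.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Local stand-ins for the rpctypes errors mentioned above; they exist here
// only to keep the sketch self-contained.
var (
	errGRPCNoLeader = status.Error(codes.Unavailable, "etcdserver: no leader")
	errGRPCCanceled = status.Error(codes.Canceled, "etcdserver: request canceled")
)

// oldMapping mirrors the behavior described above: every context.Canceled
// coming out of the keep-alive path is rewritten into a "no leader" error,
// so a client-initiated cancellation is counted as grpc_code="Unavailable".
func oldMapping(err error) error {
	if errors.Is(err, context.Canceled) {
		return errGRPCNoLeader
	}
	return err
}

// newMapping sketches the intent of the fix: a cancellation that the client
// initiated surfaces as codes.Canceled rather than codes.Unavailable.
// (The exact error the patch returns may differ.)
func newMapping(err error) error {
	if errors.Is(err, context.Canceled) {
		return errGRPCCanceled
	}
	return err
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // the client tears down its LeaseKeepAlive stream

	err := ctx.Err() // context.Canceled

	fmt.Println(status.Code(oldMapping(err))) // Unavailable
	fmt.Println(status.Code(newMapping(err))) // Canceled
}
```

Whatever concrete error the patch returns, the point is that a client-side cancellation should no longer be recorded under grpc_code="Unavailable".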

The fix is pretty straightforward. To validate it, I added two test cases in which I send a LeaseKeepAlive request to a follower in the cluster and, while the follower is forwarding the request to the leader, block the leader's ServeHTTP path with a gofail failpoint.

While the forward is blocked, one test case cancels the request and the other waits until the forwarded request times out. Both cases should receive the expected errors; the flow is sketched below.
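For illustration, here is a rough sketch of what the cancellation scenario looks like from the client side. It assumes a 3-member cluster is already running, the follower serves clients at followerEndpoint, the leader exposes a gofail HTTP endpoint at leaderGofailAddr, and a failpoint named leaseForwardBlock sits in the lease-forwarding ServeHTTP path; all of these names and addresses are hypothetical, and the actual test cases in this PR are written against the etcd test framework rather than this standalone program.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"net/http"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const (
	followerEndpoint = "http://127.0.0.1:2379"  // assumed follower client URL
	leaderGofailAddr = "http://127.0.0.1:22381" // assumed GOFAIL_HTTP address of the leader
	failpointName    = "leaseForwardBlock"      // hypothetical failpoint in the ServeHTTP forward path
)

// armFailpoint asks the leader's gofail HTTP endpoint to make the failpoint
// sleep, which effectively blocks the follower-to-leader lease forward.
func armFailpoint() error {
	req, err := http.NewRequest(http.MethodPut,
		leaderGofailAddr+"/"+failpointName, strings.NewReader(`sleep(10000)`))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{followerEndpoint},
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		log.Fatalf("create client: %v", err)
	}
	defer cli.Close()

	// Grant a lease first; this goes through raft and is not affected by the
	// blocked lease-forwarding path.
	lease, err := cli.Grant(context.Background(), 60)
	if err != nil {
		log.Fatalf("grant lease: %v", err)
	}

	if err := armFailpoint(); err != nil {
		log.Fatalf("arm failpoint: %v", err)
	}

	// Scenario 1: cancel while the follower is still forwarding the renewal
	// to the blocked leader. The client should observe a cancellation, not a
	// "no leader" / Unavailable error.
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(200 * time.Millisecond) // let the forward get stuck first
		cancel()
	}()
	_, err = cli.KeepAliveOnce(ctx, lease.ID)
	fmt.Printf("client cancel: err=%v canceled=%v\n", err, errors.Is(err, context.Canceled))
}
```

The second scenario is the same flow without the client-side cancel: the keep-alive is left to run until the follower-to-leader forward times out, and the returned error is asserted instead.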

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zhijun42
Once this PR has been reviewed and has the lgtm label, please assign fuweid for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @zhijun42. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius
Member

/ok-to-test

My recommendation would be to first send a PR with only tests to show the current behavior. Such tests should be beneficial by themselves. Only after the tests are merged would we send a PR that changes the behavior, and with the tests already merged it should be very easy to show how your proposal impacts the output.

@codecov

codecov bot commented Dec 25, 2025

Codecov Report

❌ Patch coverage is 71.42857% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.54%. Comparing base (7d7f951) to head (e9da9a7).
⚠️ Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| server/etcdserver/v3_server.go | 69.23% | 4 Missing ⚠️ |
Additional details and impacted files
| Files with missing lines | Coverage Δ |
|---|---|
| server/etcdserver/api/v3rpc/lease.go | 81.81% <ø> (+4.60%) ⬆️ |
| server/etcdserver/api/v3rpc/util.go | 67.74% <ø> (ø) |
| server/lease/leasehttp/http.go | 64.18% <100.00%> (+2.02%) ⬆️ |
| server/etcdserver/v3_server.go | 75.47% <69.23%> (-0.09%) ⬇️ |

... and 23 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #21050      +/-   ##
==========================================
+ Coverage   68.43%   68.54%   +0.11%     
==========================================
  Files         429      429              
  Lines       35213    35211       -2     
==========================================
+ Hits        24098    24137      +39     
+ Misses       9709     9678      -31     
+ Partials     1406     1396      -10     


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


@zhijun42
Author

first send a PR with only tests to show the current behavior.

Sounds good! This is great advice and makes code review easier. I will definitely do so going forward.

@zhijun42
Author

Re CI failures:

  1. etcd-robustness-amd64: There's a flaky test go.etcd.io/etcd/tests/v3/robustness: TestRobustnessExploratory/EtcdTrafficDeleteLeases/ClusterOfSize1/compactBeforeSetFinishedCompact=panic that failed due to:
traffic.go:177: Requiring minimal 100.000000 qps before failpoint injection for test results to be reliable, got 59.120913 qps

This could be due to an unstable CI environment or a slow machine; I can't reproduce it on my local machine, and it is unrelated to my LeaseKeepAlive changes.

  2. etcd-coverage-report: The failed test is again unrelated to my changes.
=== Failed
=== FAIL: integration TestCacheLaggingWatcher/pipeline_overflow (0.06s)
getWatch failed, will retry after 50ms: cache: stale event batch (rev 32 < latest 42)
    cache_test.go:1123: gotEvents=6, wantEvents<1
getWatch failed, will retry after 50ms: context canceled
    --- FAIL: TestCacheLaggingWatcher/pipeline_overflow (0.06s)

@zhijun42
Author

/retest

