
Conversation

@silentred (Contributor) commented Aug 15, 2025

This PR is for proposal #20396.
Based on the previous discussion, we agreed that LeaseRevoke should have a higher priority when being applied, and we remain cautious about Compact. Therefore, I've prepared a table comparing the potential impacts of applying versus not applying these requests; a small illustration of the event-with-lease scenario follows the table.

| Request | Effect of Apply | Effect of No Apply | Is Critical | Typical Scenario |
| --- | --- | --- | --- | --- |
| LeaseRevoke | Cleans up a few keys, which is supposed to happen. | Keys never stop growing, which may eventually crash the server; rebooting takes a very long time because of the large db size. An upstream watcher may be OOM-killed from writing too many resources to its cache. | YES | K8S apiserver recording events |
| Compact | Cleans obsolete KVs in the index and db, but makes applying slower. | treeIndex and db size keep growing and may reach the disk quota; requires SREs to recover the service. | TBD | Default periodic compaction |
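
For readers less familiar with the last column: the apiserver typically stores each event with a lease so it expires after its TTL. Below is a minimal clientv3 sketch of that key-attached-to-a-lease pattern; the endpoint, key, value, and TTL are placeholders and not taken from this PR. If LeaseRevoke entries cannot be applied in time, keys written this way accumulate instead of expiring.

```go
// Sketch of the "key attached to a lease" pattern (placeholder names/values).
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Grant a lease with the desired TTL and attach the key to it; the key is
	// deleted once the lease expires and the LeaseRevoke entry is applied.
	lease, err := cli.Grant(ctx, 3600)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "/registry/events/default/example", "event-body",
		clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}
}
```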

I would like some advice: should I add a unit test or an e2e test? I am not quite sure how this feature should be tested.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: silentred
Once this PR has been reviewed and has the lgtm label, please assign fuweid for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @silentred. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@silentred (Contributor, Author)

@ahrtr would you mind taking a look?

silentred force-pushed the stability-enhancement branch from 099461d to 3a13982 on August 22, 2025 at 13:08
@ahrtr (Member) commented Aug 22, 2025

Did this actually happen in production or in dev? Have you tested the patch and seen the expected improvement?

@silentred (Contributor, Author) commented Aug 23, 2025

> Did this actually happen in production or in dev? Have you tested the patch and seen the expected improvement?

It happened several times in production. One of the cases is that K8S event keys grew to 50Mi, because leases could not be revoked successfully. This patch has not been deployed to our production yet.
I've simulated slow applying with the following changes: dcc092c

# keep sending PUT request with lease
bin/benchmark put --endpoints=http://11.166.81.153:3379 --clients=200 --conns=50 --rate=2000 --total=10000000 --key-space-size=10000000 --lease-reuse

This results in leases living longer than they should.
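
One way to observe this symptom from a client (a sketch, not part of the PR; the endpoint is the one used by the benchmark above) is to list the active leases and print their remaining TTLs while the benchmark runs. When LeaseRevoke applies lag behind, leases that should already have been revoked still show up in the list.

```go
// Sketch for observing lingering leases: list active leases and their TTLs.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://11.166.81.153:3379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	leases, err := cli.Leases(ctx)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("active leases: %d\n", len(leases.Leases))
	for _, ls := range leases.Leases {
		// Print granted vs. remaining TTL for each lease still known to the server.
		ttl, err := cli.TimeToLive(ctx, ls.ID)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("lease %016x: granted %ds, remaining %ds\n", ls.ID, ttl.GrantedTTL, ttl.TTL)
	}
}
```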

With this PR applied, everything works as expected.

@ahrtr (Member) commented Aug 23, 2025

Thanks for the feedback. Please resolve the review comments.

@serathius (Member)

> One of the cases is that K8S event keys grew to 50Mi, because leases could not be revoked successfully. This patch has not been deployed to our production yet.

Oh, that's a lot. I have never seen an etcd cluster remain stable after crossing 2M. Have you verified that this change really would fix your production issue? It might move the needle, but we are talking about a 25x difference here.

@silentred (Contributor, Author)

> Oh, that's a lot. I have never seen an etcd cluster remain stable after crossing 2M.

etcdserver was not working properly in that situation. The worst case we've been through is that we had to clear all data and set up a new empty cluster for K8S events.

> Have you verified that this change really would fix your production issue?

This patch has not been deployed to production; it is not a common issue, maybe once every six months. But I think my simulation test above could verify this patch works.

silentred force-pushed the stability-enhancement branch from 3a13982 to 4a79972 on August 24, 2025 at 07:32
@serathius (Member)

> The worst case we've been through is that we had to clear all data and set up a new empty cluster for K8S events.

That's the recommended practice. Here's my old talk about this issue: https://www.youtube.com/watch?v=aJVMWcVZOPQ&t=4m9s

> But I think my simulation test above could verify this patch works.

What test?

@silentred (Contributor, Author)

> > The worst case we've been through is that we had to clear all data and set up a new empty cluster for K8S events.
>
> That's the recommended practice. Here's my old talk about this issue: https://www.youtube.com/watch?v=aJVMWcVZOPQ&t=4m9s
>
> > But I think my simulation test above could verify this patch works.
>
> What test?

Thanks, I will check it.
Please take a look at this comment #20492 (comment)
