
Conversation

@silentred (Contributor) commented Aug 15, 2025

This PR is for proposal #20396.
Based on the previous discussion, we agreed that LeaseRevoke should have a higher priority when being applied, and we remain cautious about Compact. Therefore, I've prepared a table comparing the potential impacts of applying versus not applying these requests; a small illustration of the event-with-lease scenario follows the table.

| Request | Effect of Apply | Effect of No Apply | Is Critical | Typical Scenario |
| --- | --- | --- | --- | --- |
| LeaseRevoke | Cleans up a few keys, which is supposed to happen. | Keys never stop growing, which may eventually crash the server; rebooting takes a very long time because of the large db size. An upstream watcher may be OOM-killed from writing too many resources to its cache. | YES | K8S apiserver recording events |
| Compact | Cleans obsolete KVs in the index and db, but makes applying slower. | treeIndex and db size keep growing and may reach the disk quota; requires SREs to recover the service. | TBD | Default periodic compaction |
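
For readers less familiar with the last column: the apiserver typically stores each event with a lease so it expires after its TTL. Below is a minimal clientv3 sketch of that key-attached-to-a-lease pattern; the endpoint, key, value, and TTL are placeholders and not taken from this PR. If LeaseRevoke entries cannot be applied in time, keys written this way accumulate instead of expiring.

```go
// Sketch of the "key attached to a lease" pattern (placeholder names/values).
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Grant a lease with the desired TTL and attach the key to it; the key is
	// deleted once the lease expires and the LeaseRevoke entry is applied.
	lease, err := cli.Grant(ctx, 3600)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "/registry/events/default/example", "event-body",
		clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}
}
```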

I would like some advice: should I add a unit test or an e2e test? I am not quite sure how this feature should be tested.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: silentred
Once this PR has been reviewed and has the lgtm label, please assign fuweid for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @silentred. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@silentred (Contributor, Author)

@ahrtr would you mind taking a look?

silentred force-pushed the stability-enhancement branch from 099461d to 3a13982 on August 22, 2025 at 13:08
@ahrtr (Member) commented Aug 22, 2025

Did this actually happen in production or in dev? Have you tested the patch and seen the expected improvement?

@silentred (Contributor, Author) commented Aug 23, 2025

> Did this actually happen in production or in dev? Have you tested the patch and seen the expected improvement?

It happened several times in production. One of the cases is that K8S event keys grew to 50Mi, because leases could not be revoked successfully. This patch has not been deployed to our production yet.
I've simulated slow applying with the following changes: dcc092c

# keep sending PUT request with lease
bin/benchmark put --endpoints=http://11.166.81.153:3379 --clients=200 --conns=50 --rate=2000 --total=10000000 --key-space-size=10000000 --lease-reuse

This results in leases living longer than they should.
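
One way to observe this symptom from a client (a sketch, not part of the PR; the endpoint is the one used by the benchmark above) is to list the active leases and print their remaining TTLs while the benchmark runs. When LeaseRevoke applies lag behind, leases that should already have been revoked still show up in the list.

```go
// Sketch for observing lingering leases: list active leases and their TTLs.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://11.166.81.153:3379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	leases, err := cli.Leases(ctx)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("active leases: %d\n", len(leases.Leases))
	for _, ls := range leases.Leases {
		// Print granted vs. remaining TTL for each lease still known to the server.
		ttl, err := cli.TimeToLive(ctx, ls.ID)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("lease %016x: granted %ds, remaining %ds\n", ls.ID, ttl.GrantedTTL, ttl.TTL)
	}
}
```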

With this PR applied, everything works as expected.

@ahrtr (Member) commented Aug 23, 2025

Thanks for the feedback. Please resolve the review comments.

@serathius (Member)

> One of the cases is that K8S event keys grew to 50Mi, because leases could not be revoked successfully. This patch has not been deployed to our production yet.

Oh, that's a lot. I have never seen an etcd cluster remain stable after crossing 2M. Have you verified that this change really would fix your production issue? It might move the needle, but we are talking about a 25x difference here.

@silentred (Contributor, Author)

> Oh, that's a lot. I have never seen an etcd cluster remain stable after crossing 2M.

etcdserver was not working properly in that situation. The worst case we've been through is that we had to clear all data and set up a new empty cluster for K8S events.

> Have you verified that this change really would fix your production issue?

This patch has not been deployed to production; it is not a common issue, maybe once every six months. But I think my simulation test above could verify this patch works.

silentred force-pushed the stability-enhancement branch from 3a13982 to 4a79972 on August 24, 2025 at 07:32
@serathius (Member)

> The worst case we've been through is that we had to clear all data and set up a new empty cluster for K8S events.

That's the recommended practice. Here's my old talk about this issue: https://www.youtube.com/watch?v=aJVMWcVZOPQ&t=4m9s

> But I think my simulation test above could verify this patch works.

What test?

@silentred (Contributor, Author)

> > The worst case we've been through is that we had to clear all data and set up a new empty cluster for K8S events.
>
> That's the recommended practice. Here's my old talk about this issue: https://www.youtube.com/watch?v=aJVMWcVZOPQ&t=4m9s
>
> > But I think my simulation test above could verify this patch works.
>
> What test?

Thanks, I will check it.
Please take a look at this comment #20492 (comment)
