-
Notifications
You must be signed in to change notification settings - Fork 10.2k
stability enhancement during overload conditions #20492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: silentred The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @silentred. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@ahrtr would you mind to take a look? |
099461d
to
3a13982
Compare
Did this actually happen in production or in dev? Have you tested the patch and seen the expected improvement? |
It happened several times in production. One of cases is that K8S event keys growed to 50Mi, because leases cannot be revoked successfully. This patch has not been deployed to our production yet.
This results in leases living longer than it should be. If we patch this PR, then everything works fine. |
thx for the feedback, please resolve the review comments |
Oh, that's a lot. I have never seen etcd cluster be stable after crossing 2M. Have you verified that this change really would fix your production issue? It might move the needle, but we are talking here about 25x improvement. |
etcdserver was not working properly in that situation. The worst case we've been through is that we have to clear all data and re-setup a new empty cluster for K8S event.
This patch has not been deployed to production, it is not a common issue, maybe once half a year. But I think my simulation test above could verify this patch works. |
Signed-off-by: shenmu.wy <[email protected]>
3a13982
to
4a79972
Compare
That's recommended practice. Her my old talk about this issue https://www.youtube.com/watch?v=aJVMWcVZOPQ&t=4m9s
What test? |
Thanks, I will check it. |
The PR is for the proposal #20396
Based on the previous discussion, we agreed that
LeaseRevoke
should have higher priority to be applied, and we remain cautious aboutCompact
. Therefore, I've prepared a table comparing the potential impacts of applying and not-applyingCompact
.I would like to have some advices that should I add a unit test or an e2e test? I am not quite sure how this feature should be tested.