-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
etcdserver: terminate recvLoop on serverWatchStream.close() #18739
Conversation
Hi @veshij. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files
... and 22 files with indirect coverage changes @@ Coverage Diff @@
## main #18739 +/- ##
==========================================
+ Coverage 68.74% 68.75% +0.01%
==========================================
Files 420 420
Lines 35488 35508 +20
==========================================
+ Hits 24395 24415 +20
- Misses 9659 9664 +5
+ Partials 1434 1429 -5 Continue to review full report in Codecov by Sentry.
|
/ok-to-test |
Can you provide a test? |
tbh I'n not quite sure yet how to trigger this particular issue in a test env. |
Let me look into preparing a test. |
I think the e2e test should just mimic how your production run etcd, and verify the result by comparing the metrics pointed out in #18704 (comment) |
@veshij since this issue is a little hard to reproduce & verify in normal e2e or integration test, and the fix is simple & safe, so it's accepted to approve & merge the fix firstly. The solution of adding two metrics sendLoopCount and recvLoopCount as pointed out in #18704 (comment) is doable, but it's a little weird to expose such two internal metrics to users, so we may not want to proceed with that approach. The other solution is to calculate the count of goroutine using pprof, but we need to investigate how to implement it and ensure it's more generic and reusable. We can revisit the test later. Please signoff the commit. And also can you please backport the fix to 3.5 and 3.4? Thanks |
Please read https://github.com/etcd-io/etcd/pull/18739/checks?check_run_id=31525554356 |
Signed-off-by: Oleg Guba <[email protected]>
signed off. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did couple of attempts on writing a test, but they all had some major issue. I think this change by itself is correct as was confirmed in #18704 (comment) to prevent goroutine leak. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahrtr, serathius, veshij The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks. Please also update the changelogs for both 3.4 and 3.5 |
Under some conditions serverWatchStream.close() leaves recvLoop goroutine blocked on sending data to ctrlStream channel:
corresponding code:
etcd/server/etcdserver/api/v3rpc/watch.go
Lines 349 to 360 in 40b4715
Reading from the
ctrlStream
channel is implemented in sendLoop, which is terminated onclosec
closure:etcd/server/etcdserver/api/v3rpc/watch.go
Lines 529 to 531 in 40b4715
Fixes #18704