etcdserver: terminate recvLoop on serverWatchStream.close() #18739

Merged on Oct 24, 2024 (1 commit)

Conversation

veshij
Contributor

@veshij veshij commented Oct 14, 2024

Under some conditions, serverWatchStream.close() leaves the recvLoop goroutine blocked on a send to the ctrlStream channel:

goroutine profile: total 177241
43832 @ 0x43fe6e 0x40a6a5 0x40a2f7 0xd9bb8d 0xd9af66 0x473181
#	0xd9bb8c	go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*serverWatchStream).recvLoop+0x70c	external/io_etcd_go_etcd_server_v3/etcdserver/api/v3rpc/watch.go:348
#	0xd9af65	go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*watchServer).Watch.func2+0x45		external/io_etcd_go_etcd_server_v3/etcdserver/api/v3rpc/watch.go:191

Corresponding code:

if err == nil {
	sws.ctrlStream <- &pb.WatchResponse{
		Header:   sws.newResponseHeader(sws.watchStream.Rev()),
		WatchId:  id,
		Canceled: true,
	}
	sws.mu.Lock()
	delete(sws.progress, mvcc.WatchID(id))
	delete(sws.prevKV, mvcc.WatchID(id))
	delete(sws.fragment, mvcc.WatchID(id))
	sws.mu.Unlock()
}

The ctrlStream channel is read in sendLoop, which returns when closec is closed. After that point nothing receives from ctrlStream, so a pending send in recvLoop blocks forever:

case <-sws.closec:
	return
}

Fixes #18704

@k8s-ci-robot

Hi @veshij. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov-commenter

codecov-commenter commented Oct 14, 2024


Codecov Report

Attention: Patch coverage is 46.51163% with 46 lines in your changes missing coverage. Please review.

Project coverage is 68.75%. Comparing base (04efee2) to head (1168125).
Report is 35 commits behind head on main.

Current head 1168125 differs from pull request most recent head 7760735

Please upload reports for the commit 7760735 to get more accurate results.

Files with missing lines                       Patch %   Lines
etcdctl/ctlv3/command/global.go                0.00%     12 Missing ⚠️
etcdctl/ctlv3/command/make_mirror_command.go   0.00%     10 Missing ⚠️
etcdctl/ctlv3/command/ep_command.go            0.00%     6 Missing ⚠️
client/v3/mock/mockserver/mockserver.go        0.00%     4 Missing ⚠️
client/v3/config.go                            87.50%    1 Missing and 1 partial ⚠️
client/v3/snapshot/v3_snapshot.go              0.00%     2 Missing ⚠️
etcdctl/ctlv3/ctl.go                           0.00%     2 Missing ⚠️
pkg/netutil/netutil.go                         50.00%    2 Missing ⚠️
client/v3/client.go                            80.00%    0 Missing and 1 partial ⚠️
client/v3/experimental/recipes/key.go          50.00%    1 Missing ⚠️
... and 4 more


Additional details and impacted files
Files with missing lines                 Coverage Δ
client/v3/lease.go                       90.87% <100.00%> (ø)
client/v3/retry.go                       79.48% <100.00%> (+0.17%) ⬆️
client/v3/retry_interceptor.go           65.61% <100.00%> (ø)
client/v3/watch.go                       94.23% <100.00%> (+0.39%) ⬆️
pkg/expect/expect.go                     79.12% <100.00%> (ø)
pkg/featuregate/feature_gate.go          87.66% <100.00%> (ø)
pkg/flags/flag.go                        68.57% <100.00%> (ø)
server/etcdserver/api/v3rpc/watch.go     84.06% <100.00%> (-1.07%) ⬇️
client/v3/client.go                      84.93% <80.00%> (+0.04%) ⬆️
client/v3/experimental/recipes/key.go    75.34% <50.00%> (ø)
... and 12 more

... and 22 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18739      +/-   ##
==========================================
+ Coverage   68.74%   68.75%   +0.01%     
==========================================
  Files         420      420              
  Lines       35488    35508      +20     
==========================================
+ Hits        24395    24415      +20     
- Misses       9659     9664       +5     
+ Partials     1434     1429       -5     

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 04efee2...7760735.

@jmhbnz
Member

jmhbnz commented Oct 15, 2024

/ok-to-test

@serathius
Member

Can you provide a test?

@veshij veshij marked this pull request as ready for review October 16, 2024 18:39
@veshij
Contributor Author

veshij commented Oct 16, 2024

> Can you provide a test?

tbh I'm not quite sure yet how to trigger this particular issue in a test env.

@serathius
Member

Let me look into preparing a test.

@ahrtr
Member

ahrtr commented Oct 17, 2024

> > Can you provide a test?
>
> tbh I'm not quite sure yet how to trigger this particular issue in a test env.

I think the e2e test should just mimic how etcd runs in your production environment, and verify the result by comparing the metrics pointed out in #18704 (comment)

@ahrtr
Member

ahrtr commented Oct 21, 2024

@veshij since this issue is a little hard to reproduce & verify in a normal e2e or integration test, and the fix is simple & safe, it's acceptable to approve & merge the fix first.

The solution of adding two metrics, sendLoopCount and recvLoopCount, as pointed out in #18704 (comment) is doable, but it's a little weird to expose two such internal metrics to users, so we may not want to proceed with that approach. The other solution is to count goroutines using pprof, but we need to investigate how to implement it and make sure it's generic and reusable. We can revisit the test later.

Please sign off the commit. Also, can you please backport the fix to 3.5 and 3.4? Thanks

@ahrtr
Member

ahrtr commented Oct 21, 2024

> Please sign off the commit.

Please read https://github.com/etcd-io/etcd/pull/18739/checks?check_run_id=31525554356

@veshij
Contributor Author

veshij commented Oct 21, 2024

Signed off.
Will backport to 3.4 and 3.5.

Member

@ahrtr ahrtr left a comment


LGTM

Thanks

cc. @serathius

@serathius
Member

I made a couple of attempts at writing a test, but they all had major issues. I think this change by itself is correct: as confirmed in #18704 (comment), it prevents the goroutine leak.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, serathius, veshij


Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@serathius serathius merged commit 6d2b232 into etcd-io:main Oct 24, 2024
33 checks passed
@veshij
Contributor Author

veshij commented Oct 26, 2024

backport to 3.4: #18785
backport to 3.5: #18784

@ahrtr
Member

ahrtr commented Oct 26, 2024

Thanks. Please also update the changelogs for both 3.4 and 3.5.

Development

Successfully merging this pull request may close these issues.

Potential goroutine leak in serverWatchStream
6 participants