-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
etcd watch events starvation / lost multiplexed on a single watch stream #17529
Comments
May I get some early feedback on this issue if the analysis makes some sense to you? @serathius @ahrtr @fuweid @tjungblu @siyuanfoundation @jmhbnz @MadhavJivrajani while I am still working on the actual root cause. |
I have a reliable repro using etcd test framework alone without k8s. #17535... |
Please use #17535 for reproduce. Here is an example. etcd server log
Note: The simulated watch cache client is the first watcher so its watch id should be 0. Also I am using the customized key range start key
Test log
The problem is: If the shared watch response channel is full and this watcher next event revision is compacted, this slow/unsync'd watcher would skip all the events pending to send out due to this logic. etcd/server/mvcc/watcher_group.go Lines 253 to 260 in 4a5e9d1
So I would propose prioritize the watch response with CompactRevision is set over the other watch responses so client is guaranteed to observe this watch terminal error instead of some lost events. This approach aligns with the documented watch API semantics.
ref. https://etcd.io/docs/v3.5/learning/api/#watch-streams If this approach makes sense to you, I will create a PR to fix it. |
Expect copying transaction buffer comes from WithPrevKV #16839 which we have already seen causing problems. |
Thanks @chaochn47 added comment in your reproduce pull request #17535 (comment)
Agree. Based on my local test, at least we should skip to update // server/mvcc/watchable_store.go
package mvcc
func (s *watchableStore) syncWatchers() int {
// ...
for w := range wg.watchers {
if w.minRev < compactionRev {
// skip it and retry it later since w.ch is full now
continue
}
w.minRev = curRev + 1
// ...
}
}
gRPC.Send just puts the data in the buffer~. |
Do you mean it may cause lost of event? Is it possible to create a simple test (probably leverage failpoint?) to reproduce this? Regarding the etcd watch events starvation due to high load, did not get time to dig it so far. It's a performance enhancement. Per my immediate feeling, it isn't a low hang fruit. |
I think the underlying issue is not performance based, but the issue of the slow watchers never being dropped as mentioned in kubernetes/kubernetes#123448. A single watcher can just open a watch to all keys and never read a event. This would lead etcd to buffer gigabytes of memory, allowing clients to DOS the etcd. The performance comes only because of reproductions needing to work in sensible time. I think the issue happens also with low throughput but takes hours instead of minutes. |
the #17535 already reproduces the issue. |
Right. The slow watcher should have been cancelled but we coded like it has caught up with the progress of store.. There are two issues.
To resolve first issue, either shard watchers to another gRPC stream (which is done by kubernetes/kubernetes#123532) or improve the performance of stream watch multiplex with prevKV enabled (tracked in #16839). To resolve second issue, we should prioritize sending out compacted watch response to client and tracked by current report. |
Reopen to track the backport efforts. |
etcd/server/etcdserver/api/v3rpc/watch.go Lines 382 to 391 in 4a5e9d1
Just have some questions about the current logic, can correct me if my understanding is incorrect @fuweid @chaochn47
|
Signed-off-by: Marek Siarkowicz <[email protected]>
Signed-off-by: Marek Siarkowicz <[email protected]>
Signed-off-by: Marek Siarkowicz <[email protected]>
Before we close the issue I would want to track covering this class of issues (slow/blocked watcher) with robustness test. Need create a separate issue when I have time. |
Signed-off-by: Marek Siarkowicz <[email protected]>
Signed-off-by: Marek Siarkowicz <[email protected]>
Signed-off-by: Marek Siarkowicz <[email protected]>
Signed-off-by: Marek Siarkowicz <[email protected]>
Signed-off-by: Marek Siarkowicz <[email protected]>
Signed-off-by: Marek Siarkowicz <[email protected]>
Signed-off-by: Marek Siarkowicz <[email protected]>
Signed-off-by: Marek Siarkowicz <[email protected]>
Signed-off-by: Marek Siarkowicz <[email protected]>
Signed-off-by: Marek Siarkowicz <[email protected]>
It would be great when it became the default component of Linux distributions!! |
Reproduce #17529 in robustness tests
Let's close the issue and create dedicated issue for robustness testing . |
Bug report criteria
What happened?
Watch events starvation
When there are hundreds of watch initialized on a key range with
prevKV
option and multiplexed on the same watch stream, the watch response could be delayed as long as several minutes. And the out of sync watcher was not cancelled by providing a WatchResponse withCompactRevision
not equal to 0 first.It was also observed
etcd_debugging_mvcc_events_total
on one etcd server were accumulated atetcd_debugging_mvcc_pending_events_total <etcd-0> 8.740897e+06
andthere were 722 slow watchers.
etcd_debugging_mvcc_slow_watcher_total <etcd-0> 722
Watch events lost
moveVictims is to synchronize slow watchers to send out pending events in the
watchableStore
. The provider of victim watchers are either from synced watcher in notify or syncWatcher background routine due to the <-sws.watchStream.Chan() is clogged and the pending events could not be sent out. The reason of this blocked channel could be the consumer of this channel is slow and cannot kept up with the events generator.It seems like
eb
could be lost due to this newVictims only carries the lastwb
'seb
for one specified watch. But this theory needs to be further confirmed with additional logging.[Edited: The above theory is not correct since victims are added either from synced watchers or unsynced watchers, they are mutually exclusive..]
Environment set up
The etcd clients are running on 2 VMs but they somehow connects the same etcd server but handled by different gRPC server streams. The client runs
compaction
every minute and the revision in the compact request is one minute old one cached by last compaction response. There are 3 etcd servers.Log and metrics
I added some custom logging to print the watches that only have progress notify enabled which is what k8s watch cache reflector would use in
ListAndWatch
. The other pod watches were observed not having progress notify enabled.It is obvious that the pod deletion event was sent out at
02:22:03
in one watchStream while the same event was sent out at02:24:53
in the other watchStream.What did you expect to happen?
Based on the millions of accumulated pending events on server watchStream, I would like to understand the bottleneck on consuming the watch events, especially on sendLoop which runs on a single thread processing hundreds of watches event generation.
What's the maximum limit of number of watches per watchStream supports? We should clearly document this limit and shard the upcoming watchers to another watchStream if the limit is breached.
How can we reproduce it (as minimally and precisely as possible)?
I don't have a well-written test case to simulate the observed clogged pending events and we should be able to replicate the traffic and have a reproduce with the etcd test framework.
Here are the steps I followed by the reported issue in k8s.
3.6.0
which is a C++ based library that triggers direct etcd watch.500
to800
Anything else we need to know?
Follow up of kubernetes/kubernetes#123072 and possibly related to kubernetes/kubernetes#123448
Etcd version (please run commands below)
Relevant log output
No response
The text was updated successfully, but these errors were encountered: