Watch starvation can cause OOMs #16839

serathius · 2023-10-27T09:20:02Z

What would you like to be added?

Please read #16839 (comment) for up to date information about the problem

serathius · 2023-10-27T09:41:29Z

Profiles I promised. They are less extreme than onces we have seen originally. They are more recent, as we reduced the write throughput later. Still they show the problem:

32% of allocations done by Range to get PrevKey for Watch

57% of cpu used on GC

serathius · 2023-10-30T08:43:23Z

One clarification, access to prevKey in

etcd/server/storage/mvcc/kvstore_txn.go

Line 203 in a4f507c

_, created, ver, err := tw.s.kvindex.Get(key, rev)

doesn't touch bbolt, only the in memory index.

This means that prevKey is not necessarily accessed in apply loop (only if put has p.IgnoreValue || p.IgnoreLease || p.PrevKv). This means we have tradeoff:

Access prevKey in apply loop and risk that there is no watch requesting prevKey.
Access prevKey outside apply loop and copy buffer.

The tradeoff is not clear, we need a proper scalability testing to verify performance in different scenarios.

chaochn47 · 2024-03-07T16:49:30Z

xref. #17529 (comment).

The ConcurrentReadTxMode of read Txn could be as slow as half a second executed in a single threaded (single go routine) in the per serverWatchStream.sendLoop. This would heavily impact per serverWatchStream throughput.

It just reminds me of #12896 which changes to mvcc.SharedBufReadTxMode if the Txn contains a write operation to avoid extra overhead of copying buffer. Potential direction of improvement we can make in the future.

chaochn47 · 2024-03-07T17:01:15Z

The bigger issue about prevKV is for a single write, we copy the event for the matching watchers and for each watcher, we copy the entire transaction buffer just to get prevKV and it would duplicate 100 time (let's say there are 100 watchers).

Hence it is not efficient IMHO.

ahrtr · 2024-03-08T14:21:34Z

Reading a previous version for watch is executed outside of apply loop, meaning we can no longer use blocking read, but a ConcurrentReadTxnMode, which copies the transaction buffer. This is very costly in allocations.

Indeed there are two issues:

for the extra costly ConcurrentReadTxnMode read, can we mitigate it by using a SharedBufReadTxMode?
for the executed outside of apply loop issue, it makes the whole watch response processing not atomic any more. If a compaction happens in between, then we may not read the prevKV. It might not be a serious problem.

wojtek-t · 2024-03-11T12:46:26Z

/cc

serathius · 2024-03-11T20:13:06Z

Deleted the comment above because is was totally inaccurate.

tldr; Issue is not related to prevKV, but to overall watch efficiency during watch starvation. We need to implement #17563 to fix the etcd memory issue.

The main reason that enabling prevKV had such impact is fact that it doubles the event size, meaning it puts the twice as much constrain on the response stream. In my testing I was close to the response throughput limit (~1.6GB/s) even without prevKV, so enabling it caused the memory leak. I repeated the tests where I compared watch streams based on the response throughput (measured using etcd_network_client_grpc_sent_bytes_total metric).

In the repeated tests I have found that prevKV is totally not at fault, the issues is caused by inefficient behavior of syncWatchers when the watch stream is clogged. It will repeatably try to sync watchers every 100ms rereading all the events, but effectively syncing only small subset of watchers. In my testing with 100 watchers, I have observed that sync loop touched each revision 25 times, meaning we allocate and deallocate the etcd watch window 25 times. I also found that
for 100 watchers, reading prevKV 100 times per event is less costly than rereading watch history 25 times by syncWatchers.

prevKV	object size	response throughput[MB/s]	etcd memory [GB]	90%ile watch event latency[s]	99%ile watch event latency[s]	99.9%ile watch event latency[s]	90%ile put latency[s]	99%ile put latency[s]	99.9%ile put latency[s]
TRUE	50KB	1003	0.389	0.116	0.57	0.843	0.001	0.003	0.01
FALSE	100KB	1065	0.5637	0.099	0.495	0.6629	0.001	0.004	0.012
TRUE	100KB	1591	4	6.56	10.16	12.78	0.002	0.007	0.014
FALSE	200KB	1670	6.1	5.5	8.34	9.96	0.003	0.011	0.023

The table above shows 2 sets of tests for response through below the limit 1GB/s, and one above the limit 1.6GB/s. I adjusted the object size between enabled and disabled prevKV to get the similar response throughput.

Conclusions: Same as before, changes by @chaochn47 will not address the issue. No need to improve prevKV (at least for now), to address the issue of etcd memory bloating we need to improve the reuse of event history like in #17563

serathius · 2024-03-12T08:34:57Z

Have one more idea, reduce the frequency of syncWatchers to allow more batching.

serathius · 2024-03-12T19:10:32Z

Amazing just reducing the syncWatchers frequency improves watch latency and memory usage.

sync duration[s]	response throughput[MB/s]	etcd memory [GB]	50%ile watch event latency[s]	90%ile watch event latency[s]	99%ile watch event latency[s]	99.9%ile watch event latency[s]
0.1 (current)	1620	6.6	2.93	6.34	9.64	14.04
0.5	1671	3.2	2.83	5.46	8.4	11.04
1 (proposed)	1736	2.1	2.56	4.29	6.18	7.78
2	1728	1.7	3.28	5.19	6.53	7.44
5	1204	2.2	6.16	9.38	13.42	16.43

Looks like the improvements peek around 1s, improving the watch response throughput by 4%, reducing memory allocations by 4-5 times (base memory for puts is 850MB) and reducing the latency 30-45%. This already looks like great improvement.

A proper implementation of reusing events will be complicated, making backport questionable, thus I propose to a simple backportable solution to reduce impact of slow watchers.

Proposal: Reduce the syncLoop period from 0.1s to 1s

Want to confirm how the original value was proposed.

Note; there is no risk of syncLoop not catching up with unsynched watchers as it also includes shortcut if it cannot keep up

etcd/server/storage/mvcc/watchable_store.go

Lines 213 to 245 in ddf5471

    
           func (s *watchableStore) syncWatchersLoop() { 
        
           	defer s.wg.Done() 
        
           	waitDuration := 100 * time.Millisecond 
        
           	delayTicker := time.NewTicker(waitDuration) 
        
           	defer delayTicker.Stop() 
        
           	for { 
        
           		s.mu.RLock() 
        
           		st := time.Now() 
        
           		lastUnsyncedWatchers := s.unsynced.size() 
        
           		s.mu.RUnlock() 
        
           		unsyncedWatchers := 0 
        
           		if lastUnsyncedWatchers > 0 { 
        
           			unsyncedWatchers = s.syncWatchers() 
        
           		} 
        
           		syncDuration := time.Since(st) 
        
           		delayTicker.Reset(waitDuration) 
        
           		// more work pending? 
        
           		if unsyncedWatchers != 0 && lastUnsyncedWatchers > unsyncedWatchers { 
        
           			// be fair to other store operations by yielding time taken 
        
           			delayTicker.Reset(syncDuration) 
        
           		} 
        
           		select { 
        
           		case <-delayTicker.C: 
        
           		case <-s.stopc: 
        
           			return 
        
           		} 
        
           	} 
        
           }

serathius · 2024-03-12T20:06:57Z

Validated that 1s performs better than 0.1s for different number of streams:

object size	streams	sync duration[s]	response throughput[MB/s]	etcd memory [GB]	50%ile watch event latency[s]	90%ile watch event latency[s]	99%ile watch event latency[s]	99.9%ile watch event latency[s]
1MB	20	0.1	1579	14.9	3.47	6.32	8.22	9.4
1MB	20	1	1634	9	3.45	5.63	7.98	9.34
30KB	1000	0.1	2006	5.5	4.26	12.74	18.81	22.69
30KB	1000	1	2257	1.4	4.07	11.76	17.8	21.87
3KB	10000	0.1	2061	1.8	4.78	12.82	19.33	23.68
3KB	10000	1	2154	1.1	4.29	11.74	18.01	22.36

Tested all sensible number of watchers from minimal (minimal value that below max write size <1MB limit) up to 10`000, which is 20 times more than limit of watchers per sync. Adjusted object size to ensure that watch stream is congested.

serathius · 2024-03-12T20:16:22Z

synchWatchers period was always set to 100ms from first implementation of watch #3507 Didn't find any discussion pointing to why 100ms was picked.

This makes me feel convinced that we can change it to improve handling of watch stream starvation.

ahrtr · 2024-03-15T10:10:32Z

Thanks for the performance test results.

Proposal: Reduce the syncLoop period from 0.1s to 1s

I think it's more prudent to introduce a config item instead of hard coding the sync loop interval/frequency.

Since the history will never change (of course it can be compacted), so the idea of reusing events (#17563) should be OK.

We should also consider to limit the max data/events to read. Currently it always read all the data/events in range [minRev, curRev]. Obviously it may consume huge memory if there are huge number of events in the range.

etcd/server/storage/mvcc/watchable_store.go

Line 363 in d639abe

revs, vs := tx.UnsafeRange(schema.Key, minBytes, maxBytes, 0)

serathius added the type/feature label Oct 27, 2023

serathius mentioned this issue Oct 27, 2023

Handle Extremely High Throughput by holding back requests to etcd until the throughput decreases. #16837

Open

serathius added the area/performance label Oct 27, 2023

serathius changed the title ~~Enabling watch PrevKey causes high allocations~~ Watch PrevKey causes high allocations Oct 27, 2023

serathius mentioned this issue Mar 6, 2024

etcd watch events starvation / lost multiplexed on a single watch stream #17529

Closed

4 tasks

serathius mentioned this issue Mar 10, 2024

Improve watch latency benchmark #17562

Merged

serathius changed the title ~~Watch PrevKey causes high allocations~~ Watch starvation can cause OOMs Mar 11, 2024

serathius mentioned this issue Mar 13, 2024

Increase sync watchers period to 1 second #17583

Open

serathius mentioned this issue Jun 3, 2024

Slow watchers impact PUT latency #18109

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Watch starvation can cause OOMs #16839

Watch starvation can cause OOMs #16839

serathius commented Oct 27, 2023 •

edited

Loading

serathius commented Oct 27, 2023 •

edited

Loading

serathius commented Oct 30, 2023 •

edited

Loading

chaochn47 commented Mar 7, 2024 •

edited

Loading

chaochn47 commented Mar 7, 2024

ahrtr commented Mar 8, 2024

wojtek-t commented Mar 11, 2024

serathius commented Mar 11, 2024 •

edited

Loading

serathius commented Mar 12, 2024

serathius commented Mar 12, 2024 •

edited

Loading

serathius commented Mar 12, 2024 •

edited

Loading

serathius commented Mar 12, 2024 •

edited

Loading

ahrtr commented Mar 15, 2024

Watch starvation can cause OOMs #16839

Watch starvation can cause OOMs #16839

Comments

serathius commented Oct 27, 2023 • edited Loading

What would you like to be added?

serathius commented Oct 27, 2023 • edited Loading

serathius commented Oct 30, 2023 • edited Loading

chaochn47 commented Mar 7, 2024 • edited Loading

chaochn47 commented Mar 7, 2024

ahrtr commented Mar 8, 2024

wojtek-t commented Mar 11, 2024

serathius commented Mar 11, 2024 • edited Loading

serathius commented Mar 12, 2024

serathius commented Mar 12, 2024 • edited Loading

serathius commented Mar 12, 2024 • edited Loading

serathius commented Mar 12, 2024 • edited Loading

ahrtr commented Mar 15, 2024

serathius commented Oct 27, 2023 •

edited

Loading

serathius commented Oct 27, 2023 •

edited

Loading

serathius commented Oct 30, 2023 •

edited

Loading

chaochn47 commented Mar 7, 2024 •

edited

Loading

serathius commented Mar 11, 2024 •

edited

Loading

serathius commented Mar 12, 2024 •

edited

Loading

serathius commented Mar 12, 2024 •

edited

Loading

serathius commented Mar 12, 2024 •

edited

Loading