
fix: deadlock on start missing watches #604

Merged · 5 commits merged into argoproj:master on Jul 12, 2024

Conversation

agaudreault (Member) commented Jul 11, 2024

  • The callback passed to listResources must not acquire locks that other callers of listResources may already hold (see the sketch below).

Caused by #532
Fixes argoproj/argo-cd#18467
Fixes argoproj/argo-cd#18902
Fixes argoproj/argo-cd#16950
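
For context, a minimal Go sketch of the locking order this change aims for, assuming a weighted list semaphore; the names here (Resource, listSemaphore, loadInitialState) are placeholders and may not match cluster.go exactly. The callback only accumulates items, and the cluster lock is taken only after the semaphore slot has been released:

```go
// Sketch only: illustrates the intended lock ordering. Types and names are
// placeholders and may differ from the real gitops-engine code.
package cache

import (
	"context"
	"sync"

	"golang.org/x/sync/semaphore"
)

type Resource struct{ Name string }

type clusterCache struct {
	lock          sync.Mutex          // protects the cached resources
	listSemaphore *semaphore.Weighted // N=50, bounds concurrent list calls
	resources     map[string][]*Resource
}

// listResources runs the callback while holding a semaphore slot. The callback
// must not try to acquire c.lock: another goroutine may hold c.lock while
// waiting for a slot, which would deadlock.
func (c *clusterCache) listResources(ctx context.Context, list func() ([]*Resource, error)) ([]*Resource, error) {
	if err := c.listSemaphore.Acquire(ctx, 1); err != nil {
		return nil, err
	}
	defer c.listSemaphore.Release(1)
	return list()
}

// loadInitialState accumulates items under the semaphore only, then takes the
// cluster lock after the slot has been released to update the cache.
func (c *clusterCache) loadInitialState(ctx context.Context, gk string) error {
	items, err := c.listResources(ctx, func() ([]*Resource, error) {
		// pretend this pages through the Kubernetes API for group/kind gk
		return []*Resource{{Name: "example"}}, nil
	})
	if err != nil {
		return err
	}
	c.lock.Lock()
	defer c.lock.Unlock()
	c.resources[gk] = items
	return nil
}
```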

Signed-off-by: Alexandre Gaudreault <[email protected]>
jdroot commented Jul 11, 2024

Your initial comment has the same issue linked twice.

This looks good to me and I believe it would fix the issue. One concern I still have about startMissingWatches, though, is that it takes a lock and then makes network calls to Kubernetes. Typically that is not a great pattern, but I don't understand the code enough to feel comfortable re-architecting the lock usage. I was curious if you have any thoughts on that, since I think you know the code much better.

agaudreault (Member, Author) replied:

> This looks good to me and I believe it would fix the issue. One concern I still have about startMissingWatches, though, is that it takes a lock and then makes network calls to Kubernetes. Typically that is not a great pattern, but I don't understand the code enough to feel comfortable re-architecting the lock usage. I was curious if you have any thoughts on that, since I think you know the code much better.

I took a quick look at it, and I expected it to be similar to the sync() method, which also holds a lock. I also don't know enough about the code to decide whether it is necessary. This code has been there for quite some time. Maybe @alexmt has a few pointers for you if you decide to take a stab at it.

agaudreault marked this pull request as ready for review July 11, 2024 19:16
crenshaw-dev (Member) commented Jul 11, 2024

  • clusterCache.sync kicks off a goroutine for every monitored API to populate the cache and start watches
    • In that goroutine, we call clusterCache.listResources which acquires a piece of the N=50 list semaphore. That semaphore is intended to limit memory use spikes on cluster cache.
    • After listing resources, we kick off clusterCache.watchEvents in a goroutine for the API.
    • When running clusterCache.watchEvents, if we're not given a resource version, we load the initial state of the API.
    • After listing items, we take out the cluster cache lock and replace the items list for that API.
    • After releasing the lock, we set up a watch on the API.
    • When handling an event for the API, if the event is for a CRD but is NOT a delete event, we take out the cluster lock and run clusterCache.startMissingWatches.
    • clusterCache.startMissingWatches gets the current list of API resources and loops over them synchronously.
    • For each API resource, if it's not already marked as handled in clusterCache.apisMeta, we call clusterCache.loadInitialState.
    • clusterCache.loadInitialState calls clusterCache.listResources, which takes out a piece of the N=50 list semaphore. We're currently holding clusterCache.lock, so if the semaphore is full, we can't release clusterCache.lock. If all 50 list semaphore slots are held by goroutines that need clusterCache.lock, we're deadlocked (condensed in the sketch below).
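
Condensing that call chain into a small sketch (placeholder names, assuming a golang.org/x/sync/semaphore.Weighted-style list semaphore; not the real cluster.go code), neither side can make progress:

```go
// Sketch of the deadlock cycle described above; not the real code.
package cache

import (
	"context"
	"sync"

	"golang.org/x/sync/semaphore"
)

func deadlockSketch(ctx context.Context) {
	var clusterLock sync.Mutex
	listSem := semaphore.NewWeighted(50)

	// Initial sync: one goroutine per API. Assume all 50 slots get taken.
	for i := 0; i < 50; i++ {
		go func() {
			_ = listSem.Acquire(ctx, 1) // holds a list slot
			defer listSem.Release(1)
			// ... lists resources from Kubernetes ...
			clusterLock.Lock() // blocks: the CRD event handler holds the lock
			defer clusterLock.Unlock()
			// write the listed items into the cache
		}()
	}

	// CRD event handler: startMissingWatches runs under the cluster lock.
	clusterLock.Lock()
	defer clusterLock.Unlock()
	// loadInitialState -> listResources needs a slot, but all 50 slots are
	// held by goroutines that are themselves waiting for clusterLock above.
	_ = listSem.Acquire(ctx, 1) // deadlock
	defer listSem.Release(1)
}
```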

crenshaw-dev (Member) commented:

Okay, so putting it into plain English.

The cluster cache is being rebuilt. A bunch of goroutines have been kicked off to pull initial resource lists. Each of those takes a slot from the list semaphore (max 50). So those 50 routines are churning along, listing resources.

Lo and behold! A CRD update event arrives. We take out the cluster lock and run startMissingWatches to make sure the new CRD's resources are being watched appropriately.

As part of startMissingWatches, we loop through any currently-unwatched APIs and load their initial state.

Loading the initial state involves listing resources. Listing resources requires acquiring the list semaphore. So we're stuck waiting for those other 50 lists to finish.

The other 50 lists finish, and they're ready to write their resource state to the cache. But the cache lock has been taken out to handle startMissingWatches.

So the initial cache loading is blocked waiting on the CRD event to be handled, and the CRD event handler code is stuck waiting for the initial cache loading to free up the list semaphore. Deadlock.

```go
return c.listResources(ctx, resClient, func(listPager *pager.ListPager) error {
	var items []*Resource
	err := listPager.EachListItem(ctx, metav1.ListOptions{}, func(obj runtime.Object) error {
```
Member commented:

By moving this outside the semaphore, we increase the simultaneous memory usage, right? Will this increase the memory spike on cache construction?

agaudreault (Member, Author) commented Jul 12, 2024:

Yes. I don't think it is a problem for the initial listing, because the cache is empty. But on relisting, it might cause up to 2x more memory usage, because we can have more than 50 lists of items waiting for the lock to update the cache.

In plain English: if there is a new Kubernetes CRD (startMissingWatches holding the lock) or any other kind of long-running operation holding the lock during relisting, then the items will be held in memory.

agaudreault (Member, Author) commented Jul 12, 2024:

The only other way I can think of to process the chunks of items with a limit of 50 API lists in memory is to add the exception @jdroot proposed in #599: if you already hold the lock, you can bypass the 50-slot semaphore, and because startMissingWatches does not process all watches in parallel like sync does, it is not that impactful.

I just added a unit test that can reproduce the deadlock, and also a few others for scenarios that were already fixed in previous PRs. Maybe it will help find out if startMissingWatches needs to hold the lock for so long, which would also solve the issue elegantly.
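
For illustration only, a rough sketch of that bypass idea; placeholder names, not the actual #599 change:

```go
// Sketch of the "bypass the semaphore if the caller already holds the lock"
// idea; placeholder names, not the actual #599 implementation.
package cache

import (
	"context"
	"sync"

	"golang.org/x/sync/semaphore"
)

type cacheSketch struct {
	lock    sync.Mutex          // the per-cluster cache lock
	listSem *semaphore.Weighted // N=50 list budget, caps memory spikes
}

func (c *cacheSketch) list(ctx context.Context, callerHoldsLock bool, doList func() error) error {
	if !callerHoldsLock {
		// Normal path (e.g. the parallel sync): bound concurrent lists.
		if err := c.listSem.Acquire(ctx, 1); err != nil {
			return err
		}
		defer c.listSem.Release(1)
	}
	// A caller that already holds c.lock (e.g. startMissingWatches handling a
	// CRD event) skips the semaphore, breaking the lock/semaphore cycle at the
	// cost of temporarily exceeding the 50-list budget.
	return doList()
}
```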

Member commented:

Okay I think I'm in favor of merging to fix the deadlock and then deal with any memory consumption regression if it appears.

Comment:

> Maybe it will help find out if startMissingWatches needs to hold the lock for so long, which would also solve the issue elegantly.

This would be the ideal fix, I think. Minimizing lock-holding time is generally good practice. I see what you mean about the memory issue this change causes; I wonder how much that will amount to in practice, and how spiky it will be.

Member commented:

imo since the semaphore is per-cluster, we're probably not mitigating spikes very well anyway. Personal experience from monitoring memory use in rollouts backs up that feeling.

Comment:

We deployed v2.11.5 last night, which I believe includes this commit. Our clusters are quite large (some over 1,000 nodes, with thousands of namespaces and tens of thousands of pods), and we noticed no meaningful spike in memory (either on average or during reconciliation). So the fix looks pretty solid from that perspective. It was difficult to reproduce the problem, though, so I am not sure I have a good way to validate that the deadlock no longer happens.

Member commented:

Nice. The main thing to look for is whether the workqueue depth ever gets stuck with a "floor", i.e. a persistent minimum equal to the deadlocked cluster's queue depth.

Comment:

We have monitors and pages set up in case we hit issues, but the absence of an issue isn't always proof that the problem is fixed, haha.

crenshaw-dev (Member) left a comment:

Knowing the potential impact on startup memory spikes, I say we ship the fix. We'll deal with any memory spike regressions if they appear.

crenshaw-dev merged commit a0c23b4 into argoproj:master on Jul 12, 2024
2 checks passed
crenshaw-dev (Member) commented:

@agaudreault I think we should include this in 2.12, but probably not cherry-pick it back further. The memory regression is a bigger concern than deadlocks imo for already-stable releases.

agaudreault added a commit to agaudreault/gitops-engine that referenced this pull request Jul 15, 2024
* fix: deadlock on start missing watches
* revert error
* add unit test to validate some deadlock scenarios
* test name
* clarify comment

Signed-off-by: Alexandre Gaudreault <[email protected]>
crenshaw-dev pushed a commit that referenced this pull request Jul 15, 2024
agaudreault deleted the fix-clus-event-deadlock branch July 18, 2024 21:31
mpelekh pushed a commit to mpelekh/gitops-engine that referenced this pull request Aug 27, 2024