
fix: deadlock on start missing watches #604

Merged · 5 commits merged into argoproj:master on Jul 12, 2024

Conversation

agaudreault (Member) commented Jul 11, 2024

  • The callback passed to listResources must not acquire locks that other callers of listResources may already hold (see the sketch below).

Caused by #532
Fixes argoproj/argo-cd#18467
Fixes argoproj/argo-cd#18902
Fixes argoproj/argo-cd#16950
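
For context, a minimal Go sketch of the locking order this change aims for, assuming a weighted list semaphore; the names here (Resource, listSemaphore, loadInitialState) are placeholders and may not match cluster.go exactly. The callback only accumulates items, and the cluster lock is taken only after the semaphore slot has been released:

```go
// Sketch only: illustrates the intended lock ordering. Types and names are
// placeholders and may differ from the real gitops-engine code.
package cache

import (
	"context"
	"sync"

	"golang.org/x/sync/semaphore"
)

type Resource struct{ Name string }

type clusterCache struct {
	lock          sync.Mutex          // protects the cached resources
	listSemaphore *semaphore.Weighted // N=50, bounds concurrent list calls
	resources     map[string][]*Resource
}

// listResources runs the callback while holding a semaphore slot. The callback
// must not try to acquire c.lock: another goroutine may hold c.lock while
// waiting for a slot, which would deadlock.
func (c *clusterCache) listResources(ctx context.Context, list func() ([]*Resource, error)) ([]*Resource, error) {
	if err := c.listSemaphore.Acquire(ctx, 1); err != nil {
		return nil, err
	}
	defer c.listSemaphore.Release(1)
	return list()
}

// loadInitialState accumulates items under the semaphore only, then takes the
// cluster lock after the slot has been released to update the cache.
func (c *clusterCache) loadInitialState(ctx context.Context, gk string) error {
	items, err := c.listResources(ctx, func() ([]*Resource, error) {
		// pretend this pages through the Kubernetes API for group/kind gk
		return []*Resource{{Name: "example"}}, nil
	})
	if err != nil {
		return err
	}
	c.lock.Lock()
	defer c.lock.Unlock()
	c.resources[gk] = items
	return nil
}
```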

Signed-off-by: Alexandre Gaudreault <[email protected]>
jdroot commented Jul 11, 2024

Your initial comment has the same issue linked twice.

This looks good to me and I believe it would fix the issue. One concern I still have about startMissingWatches, though, is that it takes a lock and then makes network calls to Kubernetes. Typically that is not a great pattern, but I don't understand the code enough to feel comfortable re-architecting the lock usage. I was curious if you have any thoughts on that, since I think you know the code much better.

agaudreault (Member, Author) replied:

> This looks good to me and I believe it would fix the issue. One concern I still have about startMissingWatches, though, is that it takes a lock and then makes network calls to Kubernetes. Typically that is not a great pattern, but I don't understand the code enough to feel comfortable re-architecting the lock usage. I was curious if you have any thoughts on that, since I think you know the code much better.

I took a quick look at it, and I expected it to be similar to the sync() method, which also holds a lock. I also don't know enough about the code to decide whether it is necessary. This code has been there for quite some time. Maybe @alexmt has a few pointers for you if you decide to take a stab at it.

agaudreault marked this pull request as ready for review July 11, 2024 19:16
crenshaw-dev (Member) commented Jul 11, 2024

  • clusterCache.sync kicks off a goroutine for every monitored API to populate the cache and start watches
    • In that goroutine, we call clusterCache.listResources which acquires a piece of the N=50 list semaphore. That semaphore is intended to limit memory use spikes on cluster cache.
    • After listing resources, we kick off clusterCache.watchEvents in a goroutine for the API.
    • When running clusterCache.watchEvents, if we're not given a resource version, we load the initial state of the API.
    • After listing items, we take out the cluster cache lock and replace the items list for that API.
    • After releasing the lock, we set up a watch on the API.
    • When handling an event for the API, if the event is for a CRD but is NOT a delete event, we take out the cluster lock and run clusterCache.startMissingWatches.
    • clusterCache.startMissingWatches gets the current list of API resources and loops over them synchronously.
    • For each API resource, if it's not already marked as handled in clusterCache.apisMeta, we call clusterCache.loadInitialState.
    • clusterCache.loadInitialState calls clusterCache.listResources, which takes out a piece of the N=50 list semaphore. We're currently holding clusterCache.lock, so if the semaphore is full, we can't release clusterCache.lock. If all 50 list semaphore slots are held by goroutines that need clusterCache.lock, we're deadlocked (condensed in the sketch below).
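
Condensing that call chain into a small sketch (placeholder names, assuming a golang.org/x/sync/semaphore.Weighted-style list semaphore; not the real cluster.go code), neither side can make progress:

```go
// Sketch of the deadlock cycle described above; not the real code.
package cache

import (
	"context"
	"sync"

	"golang.org/x/sync/semaphore"
)

func deadlockSketch(ctx context.Context) {
	var clusterLock sync.Mutex
	listSem := semaphore.NewWeighted(50)

	// Initial sync: one goroutine per API. Assume all 50 slots get taken.
	for i := 0; i < 50; i++ {
		go func() {
			_ = listSem.Acquire(ctx, 1) // holds a list slot
			defer listSem.Release(1)
			// ... lists resources from Kubernetes ...
			clusterLock.Lock() // blocks: the CRD event handler holds the lock
			defer clusterLock.Unlock()
			// write the listed items into the cache
		}()
	}

	// CRD event handler: startMissingWatches runs under the cluster lock.
	clusterLock.Lock()
	defer clusterLock.Unlock()
	// loadInitialState -> listResources needs a slot, but all 50 slots are
	// held by goroutines that are themselves waiting for clusterLock above.
	_ = listSem.Acquire(ctx, 1) // deadlock
	defer listSem.Release(1)
}
```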

crenshaw-dev (Member) commented:

Okay, so putting it into plain English.

The cluster cache is being rebuilt. A bunch of goroutines have been kicked off to pull initial resource lists. Each of those takes a slot from the list semaphore (max 50). So those 50 routines are churning along, listing resources.

Lo and behold! A CRD update event arrives. We take out the cluster lock and run startMissingWatches to make sure the new CRD's resources are being watched appropriately.

As part of startMissingWatches, we loop through any currently-unwatched APIs and load their initial state.

Loading the initial state involves listing resources. Listing resources requires acquiring the list semaphore. So we're stuck waiting for those other 50 lists to finish.

The other 50 lists finish, and they're ready to write their resource state to the cache. But the cache lock has been taken out to handle startMissingWatches.

So the initial cache loading is blocked waiting on the CRD event to be handled, and the CRD event handler code is stuck waiting for the initial cache loading to free up the list semaphore. Deadlock.

```go
return c.listResources(ctx, resClient, func(listPager *pager.ListPager) error {
	var items []*Resource
	err := listPager.EachListItem(ctx, metav1.ListOptions{}, func(obj runtime.Object) error {
```
Member commented:

By moving this outside the semaphore, we increase the simultaneous memory usage, right? Will this increase the memory spike on cache construction?

agaudreault (Member, Author) commented Jul 12, 2024:

Yes. I don't think it is a problem for the initial listing, because the cache is empty. But on relisting, it might cause up to 2x more memory usage, because we can have more than 50 lists of items waiting for the lock to update the cache.

In plain English: if there is a new Kubernetes CRD (startMissingWatches holding the lock) or any other kind of long-running operation holding the lock during relisting, then the items will be held in memory.

agaudreault (Member, Author) commented Jul 12, 2024:

The only other way I can think of to process the chunks of items with a limit of 50 API lists in memory is to add the exception @jdroot proposed in #599: if you already hold the lock, you can bypass the 50-slot semaphore, and because startMissingWatches does not process all watches in parallel like sync does, it is not that impactful.

I just added a unit test that can reproduce the deadlock, and also a few others for scenarios that were already fixed in previous PRs. Maybe it will help find out if startMissingWatches needs to hold the lock for so long, which would also solve the issue elegantly.
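
For illustration only, a rough sketch of that bypass idea; placeholder names, not the actual #599 change:

```go
// Sketch of the "bypass the semaphore if the caller already holds the lock"
// idea; placeholder names, not the actual #599 implementation.
package cache

import (
	"context"
	"sync"

	"golang.org/x/sync/semaphore"
)

type cacheSketch struct {
	lock    sync.Mutex          // the per-cluster cache lock
	listSem *semaphore.Weighted // N=50 list budget, caps memory spikes
}

func (c *cacheSketch) list(ctx context.Context, callerHoldsLock bool, doList func() error) error {
	if !callerHoldsLock {
		// Normal path (e.g. the parallel sync): bound concurrent lists.
		if err := c.listSem.Acquire(ctx, 1); err != nil {
			return err
		}
		defer c.listSem.Release(1)
	}
	// A caller that already holds c.lock (e.g. startMissingWatches handling a
	// CRD event) skips the semaphore, breaking the lock/semaphore cycle at the
	// cost of temporarily exceeding the 50-list budget.
	return doList()
}
```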

Member commented:

Okay I think I'm in favor of merging to fix the deadlock and then deal with any memory consumption regression if it appears.

Comment:

> Maybe it will help find out if startMissingWatches needs to hold the lock for so long, which would also solve the issue elegantly.

This would be the ideal fix, I think. Minimizing lock-holding time is generally good practice. I see what you mean about the memory issue this change causes; I wonder how much that will amount to in practice, and how spiky it will be.

Member commented:

imo since the semaphore is per-cluster, we're probably not mitigating spikes very well anyway. Personal experience from monitoring memory use in rollouts backs up that feeling.

Comment:

We deployed v2.11.5 last night, which I believe includes this commit. Our clusters are quite large (some over 1,000 nodes, with thousands of namespaces and tens of thousands of pods), and we noticed no meaningful spike in memory (either on average or during reconciliation). So the fix looks pretty solid from that perspective. It was difficult to reproduce the problem, though, so I am not sure I have a good way to validate that the deadlock no longer happens.

Member commented:

Nice. The main thing to look for is whether the workqueue depth ever gets stuck with a "floor", i.e. a persistent minimum equal to the deadlocked cluster's queue depth.

Comment:

We have monitors and pages set up in case we hit issues, but the absence of an issue isn't always proof that the problem is fixed, haha.

crenshaw-dev (Member) left a comment:

Knowing the potential impact on startup memory spikes, I say we ship the fix. We'll deal with any memory spike regressions if they appear.

crenshaw-dev merged commit a0c23b4 into argoproj:master on Jul 12, 2024
2 checks passed
crenshaw-dev (Member) commented:

@agaudreault I think we should include this in 2.12, but probably not cherry-pick it back further. The memory regression is a bigger concern than deadlocks imo for already-stable releases.

agaudreault added a commit to agaudreault/gitops-engine that referenced this pull request Jul 15, 2024
* fix: deadlock on start missing watches
* revert error
* add unit test to validate some deadlock scenarios
* test name
* clarify comment

Signed-off-by: Alexandre Gaudreault <[email protected]>
crenshaw-dev pushed a commit that referenced this pull request Jul 15, 2024
agaudreault deleted the fix-clus-event-deadlock branch July 18, 2024 21:31
mpelekh pushed a commit to mpelekh/gitops-engine that referenced this pull request Aug 27, 2024