Resource tree slow refresh #8172

Open
klamkma opened this issue Jan 13, 2022 · 40 comments · May be fixed by #20329
Labels
bug/in-triage (This issue needs further triage to be correctly classified) · bug (Something isn't working) · component:core (Syncing, diffing, cluster state cache) · component:server · type:bug

Comments

@klamkma

klamkma commented Jan 13, 2022

Hello,

Describe the bug

We have a big Kubernetes cluster with almost 3000 Argo CD applications.
We are currently running Argo CD 2.2.2.
Since upgrading to version 2, we have noticed that refreshing the resource tree for applications is much slower. For example:
I click on "Restart" for a Deployment
The ReplicaSet appears immediately
The new Pod sometimes appears only after 40 seconds
I've tried increasing the --status-processors, --operation-processors, and --kubectl-parallelism-limit values for the controller, but it does not help.
Any idea what we could do? Which component is responsible for this refresh? Is it argocd-server?
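For reference, this is roughly how we set those controller options, via the argocd-cmd-params-cm ConfigMap (the keys below are the documented equivalents of the CLI flags as far as I know, and the values are just examples; please double-check them against the docs for your Argo CD version):

# Sketch: application controller tuning we experimented with (example values only).
# controller.status.processors         -> --status-processors
# controller.operation.processors      -> --operation-processors
# controller.kubectl.parallelism.limit -> --kubectl-parallelism-limit
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.status.processors: "50"
  controller.operation.processors: "25"
  controller.kubectl.parallelism.limit: "20"

If I remember correctly, the application controller pods need a restart to pick up changes to this ConfigMap.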

To Reproduce

I click on "Restart" for a deployment
ReplicaSet appears immediately
New pod appears sometimes after 40 seconds

Expected behavior

Pods should appear faster.

Version

argocd: v2.2.2+03b17e0
  BuildDate: 2022-01-01T06:27:52Z
  GitCommit: 03b17e0233e64787ffb5fcf65c740cc2a20822ba
  GitTreeState: clean
  GoVersion: go1.16.11
  Compiler: gc
  Platform: linux/amd64

Thank you.

klamkma added the bug (Something isn't working) label on Jan 13, 2022
@yydzhou

yydzhou commented Feb 2, 2022

Same here. Previously we had 4800+ applications and Argo CD handled them pretty well, aside from some slowness in application listing. After some re-org we now have 3000+ applications. However, since upgrading to v2, refresh and sync have become very, very slow. A refresh, which is supposed to finish in a few seconds, can run up to 2 minutes, and waiting for a sync is even slower. Compared to the previous experience, I believe there is a lot of room for performance tuning and improvement.

@alexmt
Collaborator

alexmt commented Feb 3, 2022

It is really difficult to troubleshoot this remotely. The controller might be CPU throttled, the repo server might need to be scaled up, or the control-plane Kubernetes API server might be slow.

@klamkma, @yydzhou, if possible, could we have an interactive session (e.g. a Zoom call) and debug it together? Afterwards we could document the changes we've made to help anyone else who faces this issue.

@yeya24
Contributor

yeya24 commented Feb 3, 2022

Thank you, @alexmt. It would be great to have a debugging session together with @yydzhou.

@klamkma
Author

klamkma commented Feb 23, 2022

Hello, I'm available for a session too. Thank you @alexmt.

@klamkma
Author

klamkma commented Mar 16, 2022

Hi again,

I enabled ARGOCD_ENABLE_GRPC_TIME_HISTOGRAM.
Could you give me some tips on how to use it to investigate performance issues?

Thank you

@leotomas837

Any update on this? We are experiencing the same issue. It may be a duplicate of this issue.

There is enough RAM, CPU, and disk space, and we tried multiplying the number of controller and server replicas by 4 just to see if it helps, but it does not help at all.

@jujubetsz

I have the same problem: 2.5k apps, Helm, Argo CD v2.6.6, one very big cluster (HML). I can't see any problem like throttling, OOMs, or resource starvation. I did all the recommended tuning for high performance, and Argo CD has a pool of big nodes just for itself. Tomorrow I will try to debug the Kubernetes cluster to see if the control plane is OK.

@AnubhavSabarwal

Is there any solution for this? We have somewhere around 6000 applications and Argo CD version 2.7.2.

  1. Sync and refresh are very slow.
  2. Restarting or deleting ReplicaSets or Deployments doesn't show up in the Argo CD UI.
  3. Whenever you delete a Deployment, Pod, or ReplicaSet, the Argo CD UI always says it doesn't exist.

@klamkma
Author

klamkma commented May 24, 2023

Hi, for us the UI improved hugely after enabling --enable-gzip, but Pod refresh is still very slow.
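For anyone wanting to try the same thing, --enable-gzip can also be set via argocd-cmd-params-cm (a sketch; the server.enable.gzip key should correspond to the flag, but verify it against your version's docs):

# Sketch: enable gzip compression for argocd-server responses.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  server.enable.gzip: "true"

argocd-server needs a restart afterwards to pick up the change.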

@evs-ops

evs-ops commented Jan 8, 2024

Any news? We have the same problem with an even smaller setup of about 1000 apps and 6 clusters.
I think it might be related to the fact that we have about 5 or 6 plugins, but that's not a huge cluster.
Any thoughts?

@jujubetsz

jujubetsz commented Jan 8, 2024

@evs-ops, Hi.

I've tried every possible tuning option and version of Argo CD and got no improvements. Since my cluster runs on OpenStack/Rancher inside my company's cloud, I'm now improving the cluster itself: upgrading the Kubernetes version, etcd performance, etc. I'm doing this because I'm seeing lots of timeouts to Kubernetes in the application controller, and because none of the tuning worked. Logs:

time="2024-01-08T15:31:23Z" level=info msg="Failed to watch Deployment.apps on https://x.x.x.x:443: Resyncing Deployment.apps on https://x.x.x.x:443 due to timeout, retrying in 1s" server="https://kubernetes.default.svc"
time="2024-01-08T15:34:15Z" level=info msg="Failed to watch Secret on https://x.x.x.x:443: Resyncing Secret on https://x.x.x.x:443 due to timeout, retrying in 1s" server="https://kubernetes.default.svc"
time="2024-01-08T15:35:20Z" level=info msg="Failed to watch ReplicationController on https://x.x.x.x:443: Resyncing ReplicationController on https://x.x.x.x:443 due to timeout, retrying in 1s" server="https://kubernetes.default.svc"

The symptoms I'm experiencing are:

Navigation in the Argo CD web UI is fast as expected, but if I delete a Pod, for example, nothing happens. The box for that Pod persists in the Argo CD frontend, but if I watch the namespace using kubectl, the Pod is being killed and a new Pod is being scheduled. After several minutes (10-15m) the new Pod shows up in the Argo CD frontend. This happens with every object owned by Argo CD.

@evs-ops

evs-ops commented Jan 10, 2024

Hi,
Very similar to my problem. I can delete something and it probably takes about 5 to 10 minutes to show up. A refresh takes no less than 2 minutes and up to 5 minutes.
It's new to me, since in my previous roles I used Argo CD and it was lightning fast :(

@jujubetsz

Hi,

Having more clusters to manage is not a bad thing from my point of view. You can have one application-controller replica for each cluster. Here are some docs and posts that may help you:

https://www.infracloud.io/blogs/sharding-clusters-across-argo-cd-application-controller-replicas/
https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/#argocd-application-controller

Did you try that?
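In case it's useful, here is a rough sketch of what that sharding setup looks like based on those docs (a partial patch of the application controller StatefulSet only; env var names and fields should be verified against the HA docs linked above):

# Sketch: one controller shard per managed cluster, per the HA docs above.
# Partial StatefulSet patch, not a complete manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3                      # e.g. one replica (shard) per cluster
  template:
    spec:
      containers:
      - name: argocd-application-controller
        env:
        - name: ARGOCD_CONTROLLER_REPLICAS
          value: "3"               # must match spec.replicas so sharding works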

Another question: are your clusters managed (GKE, EKS, etc.), or are they like mine: self-deployed and self-managed?

@jujubetsz

jujubetsz commented Jan 26, 2024

@evs-ops,

I bumped my version to v2.10.0-rc4 in order to test the jitter implementation for reconciliation. You can check the proposal and description in issues/14241.

The results so far are incredible: no delay at all in the Argo CD UI. If I delete a Pod, the new Pod appears instantly, so I recommend trying it if possible. I bumped the version this morning and have had a stable environment so far. I will update this thread if something new happens.

Some general info about my environment:

2.7k Apps
Lots of monorepos; each team/tribe has one, ranging from 10 to 200 apps
Only one Cluster
Kubernetes v1.25 running in Openstack/Rancher in private cloud
ArgoCD components have tons of resources to use
Reconciliation timeout: 600s
Reconciliation jitter: 180s
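(For anyone who wants to reproduce this, the two reconciliation values above live in argocd-cm; a sketch, with key names as I understand them from the 2.10 docs, so double-check for your version:)

# Sketch: reconciliation timeout plus the jitter introduced in 2.10.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: 600s
  timeout.reconciliation.jitter: 180s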

[Grafana screenshots attached: git-requests, cpu-usage-total, network-total, reconciliation-activity, cluster-events, reconciliation-performance]

@machine3

machine3 commented Feb 4, 2024

Have you found the reason?

@ritheshgm

+1

When running 3,000 applications and engaging in activities such as syncing 200 applications, clicking "Restart" for a deployment immediately displays the ReplicaSet, but new pods may take up to two minutes to appear.

@machine3

Does anyone have any ideas for solving the problem, or a temporary solution?

@CryptoTr4der

CryptoTr4der commented Mar 8, 2024

Same problem here. Refresh is very slow (~3-5 minutes) per application, even with version 2.10.2.
1 Git repo (monorepo) with ~50 applications
CMP argocd-vault-plugin deployed as a sidecar

Tried many things, but nothing helps at the moment.

@gazidizdaroglu
Contributor

gazidizdaroglu commented Jun 25, 2024

When running 3,000 applications and engaging in activities such as syncing 200 applications, clicking "Restart" for a deployment immediately displays the ReplicaSet, but new pods may take up to two minutes to appear.

+1

alexmt added the bug/in-triage (This issue needs further triage to be correctly classified), component:core (Syncing, diffing, cluster state cache), and type:bug labels on Jun 25, 2024
@daftping
Contributor

We are encountering a similar issue. In large clusters where Argo CD monitors numerous resources, it is significantly slow in processing watches—taking approximately 7 minutes in our case. Consequently, the Argo CD UI displays outdated information and adversely affects several functionalities that depend on sync waves, such as PruneLast. Eventually, the volume of events from the cluster overwhelmed the system, causing Argo CD to stall completely.

To mitigate this, we disabled tracking of Pods and ReplicaSets, which unfortunately diminishes one of the primary advantages of the Argo CD UI. We also disregarded all irrelevant events and attempted to optimize various settings in the application controller. However, scaling the application controller vertically showed no effect, and horizontal scaling is not feasible for a single cluster due to sharding constraints.
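For anyone considering the same mitigation: this kind of exclusion is usually configured via resource.exclusions in argocd-cm. A sketch (not necessarily our exact config; adjust groups, kinds, and clusters to your needs):

# Sketch: stop watching and tracking Pods and ReplicaSets cluster-wide.
# Note that they will disappear from the resource tree in the UI.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.exclusions: |
    - apiGroups:
      - ""
      kinds:
      - Pod
    - apiGroups:
      - apps
      kinds:
      - ReplicaSet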

@CryptoTr4der

We removed all Argo CD config management plugins (switched from argocd-vault-plugin to vault-secrets-webhook) and now everything seems to work smoothly.

@gazidizdaroglu
Contributor

Hey, this thread can help you as well!

https://cloud-native.slack.com/archives/C01TSERG0KZ/p1721141931660909

@machine3

Hey, this thread can help you as well!

https://cloud-native.slack.com/archives/C01TSERG0KZ/p1721141931660909

I'm sorry, I can't access the link you provided. Could you please share some details with me?

@mpelekh

mpelekh commented Aug 9, 2024

We are encountering a similar issue. In large clusters where Argo CD monitors numerous resources, it is significantly slow in processing watches—taking approximately 7 minutes in our case. Consequently, the Argo CD UI displays outdated information and adversely affects several functionalities that depend on sync waves, such as PruneLast. Eventually, the volume of events from the cluster overwhelmed the system, causing Argo CD to stall completely.

To mitigate this, we disabled tracking of Pods and ReplicaSets, which unfortunately diminishes one of the primary advantages of the Argo CD UI. We also disregarded all irrelevant events and attempted to optimize various settings in the application controller. However, scaling the application controller vertically showed no effect, and horizontal scaling is not feasible for a single cluster due to sharding constraints.

We are observing precisely the same issue you described, on Argo CD v2.10.9.
@daftping, did you find a way to resolve the issue without disabling tracking of Pods and ReplicaSets?

@andrii-korotkov-verkada
Contributor

The fix is on master and will be part of v2.13. It optimizes the DFS used to build the resource tree from O(<tree_size> * <namespace_resource_count>) to O(<namespace_resource_count>).

@mpelekh

mpelekh commented Aug 9, 2024

Hi @andrii-korotkov-verkada, thanks for replying.
Do you mean the following fixes?

Thanks for your contribution. IterateHierarchyV2 looks promising.

I actually patched v2.10.9 with the above commits. It helped, but it did not fully solve the problem.

Even though the patches significantly improve performance, Argo CD still cannot handle the load from large clusters.

The screenshot shows one of our largest clusters, running the v2.10.9 build patched with the above commits.

  • till 12:50, pods and replica sets are disabled from tracking
  • from 12:50 to 13:34, pods and replica sets are enabled to be tracked
  • after 13:34, pods and replica sets are disabled from tracking

As can be seen, once Pods and ReplicaSets are enabled for tracking, the cluster event count falls close to zero and the reconciliation time increases drastically.

[Screenshots attached: 2024-08-09 at 20:40:44 and 20:51:00]

Number of pods in cluster: ~76k
Number of rs in cluster: ~52k

@andrii-korotkov-verkada Do you have any ideas on what can be improved?

@crenshaw-dev
Member

Are you hitting CPU throttling?

@mpelekh

mpelekh commented Aug 9, 2024

@crenshaw-dev No, we don't set CPU limits at all and still have plenty of resources on the node.

We found that the potential reason is lock contention.

Here, I added a few more metrics and found out that when the number of events is significant, sometimes it takes ~5 minutes to acquire a lock, which leads to a delay in reconciliation.
mpelekh/gitops-engine@560ef00#diff-9c9e197d543705f08c9b1bc2dc404a55506cfc2935a988e6007d248257aadb1aR1372

[Screenshot attached: 2024-08-09 at 21:11:33]

NOTE: These metrics were collected on 2.10.9 patched with the following commits:

@andrii-korotkov-verkada
Contributor

I made an attempt at this in argoproj/gitops-engine#602, but the benchmark showed neutral-to-regression results in terms of throughput. Maybe average latency could still improve, though; I don't know.

@crenshaw-dev
Member

I'm curious how much of a performance win you saw from just IterateHierarchy, @mpelekh. Those changes are mostly useful for situations where you have a ton of resources in a single namespace.

Am also super curious if Andrii's locking improvements help with this. If so, that's a strong case for merging those changes.

@mpelekh

mpelekh commented Aug 9, 2024

I'm curious how much of a performance win you saw from just IterateHierarchy

@crenshaw-dev The comparison is as follows:

Large cluster

Number of pods in cluster: ~76k
Number of rs in cluster: ~52k

v2.10.9 without improvements, with only the additional metrics added (deployed at 18:00 according to the Grafana charts)

Pods and replica sets are enabled to be watched at 18:25.

[Screenshots attached: 2024-08-09 at 22:42:54, 22:42:16, 22:41:50]

v2.10.9 with improvements (argoproj/gitops-engine@6b2984e and 267f243) and additional metrics (deployed at 21:30 according to Grafana charts)

  • till 22:50, pods and replica sets are disabled from tracking
  • from 22:50 to 23:34, pods and replica sets are enabled to be tracked
  • after 23:34, pods and replica sets are disabled from tracking

[Screenshots attached: 2024-08-09 at 22:57:32 and 22:57:12]

Even in this enormous cluster, a clear performance improvement can be observed with v2.10.9 plus IterateHierarchyV2 (the cluster event count no longer drops completely to zero; it stays at ~3-5k).

Smaller cluster

Here are the results from a smaller cluster (compared to the previous one). Pods and replica sets are watched.
Pods: ~18k
ReplicaSets: ~17k

[Screenshots attached: 2024-08-09 at 22:14:33, 22:14:44, 22:15:00, 22:15:27]

Before 21:45, the v2.10.9 version from upstream was running.
After 21:45, the patched v2.10.9 version with additional metrics and with the following commits was running.

This is the case when IterateHierarchyV2 improves performance significantly.

@crenshaw-dev
Member

Gotcha. So IterateHierarchy gets us ~90% of the way there, but on a huge cluster we'll still have significant lock contention.

@mpelekh

mpelekh commented Aug 9, 2024

Am also super curious if Andrii's locking improvements help with this. If so, that's a strong case for merging those changes.

@crenshaw-dev I am going to create a patched v2.10.9 image with additional metrics and the following fixes:

I will share the results once I test it in a large cluster.
FYI @andrii-korotkov-verkada

@andrii-korotkov-verkada
Contributor

If the pods and replica sets are excluded from tracking, would they not even show up in Argo UI, or would it just make them potentially stale?

@mpelekh

mpelekh commented Aug 10, 2024

If the pods and replica sets are excluded from tracking, would they not even show up in Argo UI, or would it just make them potentially stale?

@andrii-korotkov-verkada If the Pods and ReplicaSets are excluded from tracking, they are not visible in the Argo CD UI at all; only the Deployment is visible, with nothing underneath.

@mpelekh

mpelekh commented Aug 13, 2024

@crenshaw-dev I am going to create a patched v2.10.9 image with additional metrics and the following fixes:

I will share the results once I test it in a large cluster. FYI @andrii-korotkov-verkada

As we agreed, I tested the patched v2.10.9 build with the following fixes:

tl;dr

The results are almost the same as with the IterateHierarchyV2 improvement alone.
Once the pods and replica sets are enabled for tracking, the Cluster Event Count falls close to zero.
The logs demonstrate that even though we added the changes that optimized the lock usage, we still have significant lock contention.

Details

The patched image has been deployed to one of the largest clusters, where Pods and ReplicaSets are excluded from tracking.

[Screenshot attached: 2024-08-13 at 15:38:23]

Please take a look at where one of the additional logs has been added - https://github.com/mpelekh/gitops-engine/blob/e773bed14ca188333ce5f3aa9ca08ab582eff360/pkg/cache/cluster.go#L1429.
This log shows how much time it takes to acquire a lock.

The results are as follows.

Pods and ReplicaSets are disabled from tracking

time to gather logs - from 2024-08-12T15:13:10Z to 2024-08-12T15:24:21Z
total number of processed events during that time - 76482
from 0ms to 1000ms - 75678
from 1000ms to 10000ms - 795
from 10000ms to 20000ms - 9
from 20000ms to 30000ms - 0
from 30000ms to 40000ms - 0
from 40000ms to 50000ms - 0
from 50000ms to 60000ms - 0
from 60000ms and higher - 0

Enable ReplicaSets to be tracked. The pods are still excluded.

APIs count increased to 85
Resource count became ~130k

time to gather logs - from 2024-08-12T15:25:41Z to 2024-08-12T15:52:27Z
total number of processed events during that time - 123501
from 0ms to 1000ms - 120403
from 1000ms to 10000ms - 3085
from 10000ms to 20000ms - 4
from 20000ms to 30000ms - 3
from 30000ms to 40000ms - 6
from 40000ms to 50000ms - 0
from 50000ms to 60000ms - 0
from 60000ms and higher - 0

Include ReplicaSets and Pods for watching

[Screenshot attached: 2024-08-12 at 19:24:29]

time to gather logs - from 2024-08-12T16:18:58Z to 2024-08-12T16:38:10Z
total number of processed events during that time - 14856
from 0ms to 1000ms - 11006
from 1000ms to 10000ms - 3812
from 10000ms to 20000ms - 8
from 20000ms to 30000ms - 15
from 30000ms to 40000ms - 1
from 40000ms to 50000ms - 5
from 50000ms to 60000ms - 4
from 60000ms and higher - 5

We see that the number of processed events decreased significantly when the ReplicaSets and Pods were included for watching.
The logs demonstrate that even though we added the changes that optimized the lock usage, we still have significant lock contention.

Do you have any thoughts on how we can better optimize the lock usage so that we can handle such a huge number of resources (~210k when ReplicaSets and Pods are included)?

FYI @crenshaw-dev @andrii-korotkov-verkada

@mpelekh

mpelekh commented Aug 15, 2024

Has anyone tried replacing the global lock approach with fine-grained locking to avoid lock contention?

@andrii-korotkov-verkada
Contributor

Unfortunately, I don't have ideas on how to optimize the lock usage. The approach I'm learning about now is a cell architecture with multiple clusters, #19607.

@Ga13Ou

Ga13Ou commented Sep 17, 2024

We are experiencing slowness on our bigger clusters as well. It would be really helpful for debugging if those temporary lock metrics were added to the metrics exported by Argo CD.

@mpelekh

mpelekh commented Oct 4, 2024

@crenshaw-dev This PR argoproj/gitops-engine#629 resolves the problem described above.
FYI @andrii-korotkov-verkada

mpelekh added several commits to mpelekh/argo-cd that referenced this issue (Oct 10-11, 2024)
mpelekh linked a pull request on Oct 10, 2024 that will close this issue