The timeout of git fetch in repo server correlates with unbounded growth of ephemeral storage use, up to tens of Gi #18831

Status: Open
Opened by andrii-korotkov-verkada (Contributor) on Jun 26, 2024 · 3 comments
Labels: bug, bug/in-triage, component:api, component:core, component:repo-server, component:server, type:bug

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

At some point there were quite a lot of log entries like:

Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unknown desc = failed to initialize repository resources: rpc error: code = Internal desc = Failed to fetch default: `git fetch origin --tags --force --prune` failed timeout after 1m30s

The disk usage of the repo server was growing unbounded, and even with a large ephemeral storage request and limit the pods were evicted rather quickly.
After increasing the exec timeout to 2m30s (a sketch of the change is below), the timeouts were gone and ephemeral storage use stabilized at ~2.3Gi instead of 50Gi+.
I'm fairly confident the two are correlated, since that was the only repo-server-related change I made at the time.
It looks like partially fetched data is not cleaned up when the fetch times out.
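
For reference, a minimal sketch of raising that timeout, assuming it is controlled by the repo server's ARGOCD_EXEC_TIMEOUT environment variable and that the install uses the default namespace and deployment name:

```shell
# Raise the repo-server command execution timeout from the 90s default to 2m30s.
# ARGOCD_EXEC_TIMEOUT and the argocd/argocd-repo-server names are assumptions
# based on a default install; adjust them for your environment.
kubectl -n argocd set env deployment/argocd-repo-server ARGOCD_EXEC_TIMEOUT=2m30s
```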

To Reproduce

Use a sufficiently large repository (multiple Gi) with frequent updates, so that `git fetch` hits the exec timeout on the repo server; see the sketch below for one way to force this.
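
One way to force the condition without waiting for organic load is to lower the same timeout and watch the repo server's scratch space; a hedged sketch, assuming ARGOCD_EXEC_TIMEOUT controls the timeout and clones live under /tmp:

```shell
# Make `git fetch` of the large repo reliably time out.
kubectl -n argocd set env deployment/argocd-repo-server ARGOCD_EXEC_TIMEOUT=30s

# Watch the repo-server's /tmp grow across fetch retries (Ctrl-C to stop).
kubectl -n argocd exec deploy/argocd-repo-server -- \
  sh -c 'while true; do du -sh /tmp; sleep 30; done'
```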

Expected behavior

Ephemeral storage usage stays bounded: partially fetched data is cleaned up when `git fetch` times out.

Screenshots

[Screenshot attached in the original issue, 2024-06-26 9:37 AM]

Version

A custom build from master plus #18694, around the time of the v2.12.0-rc1 release.

Logs

Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unknown desc = failed to initialize repository resources: rpc error: code = Internal desc = Failed to fetch default: `git fetch origin --tags --force --prune` failed timeout after 1m30s
todaywasawesome (Contributor) commented:

Can you try exec-ing into the container to look at the files and figure out which files are getting stuck here? OOM is ok but not great. Maybe we can add cleanup of the specific files to prevent the leak in the first place. (via @crenshaw-dev)
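
A possible way to do that against a still-running pod; the paths and file name patterns below are assumptions (default install, clones under /tmp, leftover temporary pack files from interrupted fetches):

```shell
# Show which directories under the repo-server's /tmp hold the space.
kubectl -n argocd exec -it deploy/argocd-repo-server -- \
  sh -c 'du -sh /tmp/* 2>/dev/null | sort -h | tail -n 20'

# Interrupted fetches tend to leave temporary pack files behind; list candidates.
kubectl -n argocd exec -it deploy/argocd-repo-server -- \
  sh -c 'find /tmp -name "tmp_pack_*" -o -name "*.pack" 2>/dev/null | head -n 50'
```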

andrii-korotkov-verkada (Contributor, Author) commented:

> OOM is ok but not great

nit: OODisk rather than OOM — the pods are evicted for ephemeral storage use, not memory.

I'm trying to figure out a repro. I'd probably need to reduce the exec timeout during off-business hours and exec into a live pod, since that isn't possible with a pod that has already been evicted.

christianh814 (Member) commented:

@andrii-korotkov-verkada you might be able to exec into a failed pod with `kubectl debug`.
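
A hedged sketch of that approach; the pod name is a placeholder, the container name assumes a default install, and whether it still works after the pod has actually been evicted depends on what state the kubelet kept:

```shell
# Attach an ephemeral debug container that shares the repo-server container's
# process namespace (pod/container names are placeholders for a default install).
kubectl -n argocd debug -it pod/argocd-repo-server-xxxxx \
  --image=busybox:1.36 --target=argocd-repo-server -- sh

# Inside the debug shell, the target container's filesystem is reachable via
# /proc/<repo-server pid>/root, e.g.:
#   ps aux
#   du -sh /proc/<pid>/root/tmp/*
```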
