The timeout of git fetch in repo server correlates with unbounded growth of ephemeral storage use, up to tens of Gi #18831

Status: Open
Opened by andrii-korotkov-verkada (Contributor) on Jun 26, 2024 · 3 comments
Labels: bug, bug/in-triage, component:api, component:core, component:repo-server, component:server, type:bug

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

At some point there were quite a lot of log entries like:

Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unknown desc = failed to initialize repository resources: rpc error: code = Internal desc = Failed to fetch default: `git fetch origin --tags --force --prune` failed timeout after 1m30s

The disk usage of the repo server was growing unbounded, and even with a large ephemeral storage request and limit the pods were evicted rather quickly.
After increasing the exec timeout to 2m30s (a sketch of the change is below), the timeouts were gone and ephemeral storage use stabilized at ~2.3Gi instead of 50Gi+.
I'm fairly confident the two are correlated, since that was the only repo-server-related change I made at the time.
It looks like partially fetched data is not cleaned up when the fetch times out.
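
For reference, a minimal sketch of raising that timeout, assuming it is controlled by the repo server's ARGOCD_EXEC_TIMEOUT environment variable and that the install uses the default namespace and deployment name:

```shell
# Raise the repo-server command execution timeout from the 90s default to 2m30s.
# ARGOCD_EXEC_TIMEOUT and the argocd/argocd-repo-server names are assumptions
# based on a default install; adjust them for your environment.
kubectl -n argocd set env deployment/argocd-repo-server ARGOCD_EXEC_TIMEOUT=2m30s
```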

To Reproduce

Use a sufficiently large repository (multiple Gi) with frequent updates, so that `git fetch` hits the exec timeout on the repo server; see the sketch below for one way to force this.
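
One way to force the condition without waiting for organic load is to lower the same timeout and watch the repo server's scratch space; a hedged sketch, assuming ARGOCD_EXEC_TIMEOUT controls the timeout and clones live under /tmp:

```shell
# Make `git fetch` of the large repo reliably time out.
kubectl -n argocd set env deployment/argocd-repo-server ARGOCD_EXEC_TIMEOUT=30s

# Watch the repo-server's /tmp grow across fetch retries (Ctrl-C to stop).
kubectl -n argocd exec deploy/argocd-repo-server -- \
  sh -c 'while true; do du -sh /tmp; sleep 30; done'
```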

Expected behavior

Ephemeral storage usage stays bounded: partially fetched data is cleaned up when `git fetch` times out.

Screenshots

[Screenshot attached in the original issue, 2024-06-26 9:37 AM]

Version

A custom build from master plus #18694, around the time of the v2.12.0-rc1 release.

Logs

Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unknown desc = failed to initialize repository resources: rpc error: code = Internal desc = Failed to fetch default: `git fetch origin --tags --force --prune` failed timeout after 1m30s
todaywasawesome (Contributor) commented:

Can you try exec-ing into the container to look at the files and figure out which files are getting stuck here? OOM is ok but not great. Maybe we can add cleanup of the specific files to prevent the leak in the first place. (via @crenshaw-dev)
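
A possible way to do that against a still-running pod; the paths and file name patterns below are assumptions (default install, clones under /tmp, leftover temporary pack files from interrupted fetches):

```shell
# Show which directories under the repo-server's /tmp hold the space.
kubectl -n argocd exec -it deploy/argocd-repo-server -- \
  sh -c 'du -sh /tmp/* 2>/dev/null | sort -h | tail -n 20'

# Interrupted fetches tend to leave temporary pack files behind; list candidates.
kubectl -n argocd exec -it deploy/argocd-repo-server -- \
  sh -c 'find /tmp -name "tmp_pack_*" -o -name "*.pack" 2>/dev/null | head -n 50'
```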

andrii-korotkov-verkada (Contributor, Author) commented:

> OOM is ok but not great

nit: OODisk rather than OOM — the pods are evicted for ephemeral storage use, not memory.

I'm trying to figure out a repro. I'd probably need to reduce the exec timeout during off-business hours and exec into a live pod, since that isn't possible with a pod that has already been evicted.

christianh814 (Member) commented:

@andrii-korotkov-verkada you might be able to exec into a failed pod with `kubectl debug`.
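
A hedged sketch of that approach; the pod name is a placeholder, the container name assumes a default install, and whether it still works after the pod has actually been evicted depends on what state the kubelet kept:

```shell
# Attach an ephemeral debug container that shares the repo-server container's
# process namespace (pod/container names are placeholders for a default install).
kubectl -n argocd debug -it pod/argocd-repo-server-xxxxx \
  --image=busybox:1.36 --target=argocd-repo-server -- sh

# Inside the debug shell, the target container's filesystem is reachable via
# /proc/<repo-server pid>/root, e.g.:
#   ps aux
#   du -sh /proc/<pid>/root/tmp/*
```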
