CreateVolume times out before task can complete, starts another in an infinite loop #3024

Open
braunsonm opened this issue Sep 3, 2024 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

braunsonm commented Sep 3, 2024

/kind bug

What happened:
When creating a volume from a snapshot that is ~2TB in size, the CSI driver times out, "gives up" on the current vSphere task, and starts another task to create the volume again.

This seems to happen about 30-35 minutes after the PVC is created, while it is still in the Pending state. My task in vSphere does complete after 40 minutes, but by then the CSI driver has already created a new task. The loop then starts over and repeats until almost all disk space in the datastore is used.

What you expected to happen:

The CSI driver should not create multiple tasks while the original task is still in progress, or it should at least have a configurable timeout.

How to reproduce it (as minimally and precisely as possible):

  1. Create a PVC large enough that restoring it from a snapshot takes more than 30 minutes (2TB in my case), and take a snapshot of it.
  2. Create a new PVC from that snapshot.
  3. Notice that while vSphere works on the task to create the container volume, the CnsVolumeOperationRequest will give up waiting and create a new task after about 30 minutes.

Anything else we need to know?:

Is there any way to configure this timeout value? I don't see an obvious way to do so in the code right now.

Environment:

  • csi-vsphere version: v3.1.2
  • vsphere-cloud-controller-manager version: 1.28.0
  • Kubernetes version: 1.28.10
  • vSphere version: 7.0.3.01700
  • OS (e.g. from /etc/os-release): Ubuntu 22.04
@k8s-ci-robot added the kind/bug label Sep 3, 2024
braunsonm commented Sep 3, 2024

Digging a bit in the logs, I can see the following message coming from the MonitorCreateVolumeTask function: taskResult is empty for CreateVolume task: "task-xxxxxx", opID: "xxxxx"

This message repeats a few times for the same task ID over 30 minutes and then never appears again, because a new task is created in the CnsVolumeOperationRequest. As mentioned, the original task does complete eventually, but it takes slightly longer than whatever timeout makes the CSI driver give up waiting. The result is an infinite loop and orphaned FCDs in vSphere.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 2, 2024
@braunsonm

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Dec 2, 2024