
PVCs are "Bound" via dynamic provisioning -- MountVolume.MountDevice failed / not attached to node #2954

Open
wiredcolony opened this issue Jul 17, 2024 · 12 comments

Comments

@wiredcolony

wiredcolony commented Jul 17, 2024

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
I am unable to get my PVCs to mount into their respective pods. The volumes are present under "Container Volumes" in vSphere and the PVCs and PVs are Bound, so provisioning does not seem to be the issue. The VMs have the advanced parameters "disk.EnableUUID = TRUE" and "ctkEnabled = FALSE". The StorageClass in Kubernetes is set as the default, uses a custom vSphere storage policy, and has "allowVolumeExpansion = true".
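For reference, the StorageClass is along these lines (a minimal sketch; the class name and storage policy name are placeholders, not my exact values):

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-sc                                        # placeholder name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # set as the cluster default
provisioner: csi.vsphere.vmware.com
allowVolumeExpansion: true
parameters:
  storagepolicyname: "my-vsphere-policy"                  # placeholder for the custom vSphere storage policy
EOF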

What you expected to happen:
After the PVC is bound, I expect the PV to mount into the pod so the pod/containers leave the "ContainerCreating" state.

How to reproduce it (as minimally and precisely as possible):
Create a StorageClass -> create a PVC -> create a pod / run a Helm chart that requires PVs.
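Concretely, the PVC and test pod in the reproduction look roughly like this (a sketch based on the task-pv-pod seen in the logs below; the claim name, size, and image are placeholders):

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim               # placeholder claim name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi                  # placeholder size
  # storageClassName omitted so the default StorageClass is used
---
apiVersion: v1
kind: Pod
metadata:
  name: task-pv-pod
spec:
  containers:
    - name: app
      image: nginx                  # placeholder image
      volumeMounts:
        - mountPath: /usr/share/nginx/html
          name: task-pv-storage
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: task-pv-claim
EOF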

Anything else we need to know?:
vSphere file services are enabled.

Logs kubelet:

8054 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/csi.vsphere.vmware.com^fd5422b7-29b8-4d5c-aa36-a74eb2678b94 podName: nodeName:}" failed. No retries permitted until 2024-07-16 23:55:13.605158725 +0000 UTC m=+1236.629690811 (durationBeforeRetry 2m2s). Error: MountVolume.MountDevice failed for volume "pvc-4490fddc-6cb1-4305-8ec3-3d81b22e0a45" (UniqueName: "kubernetes.io/csi/csi.vsphere.vmware.com^fd5422b7-29b8-4d5c-aa36-a74eb2678b94") pod "task-pv-pod" (UID: "11fd6ace-c0c7-4ac8-930a-679d2b10e778") : rpc error: code = NotFound desc = disk: 6000c29ba0cd1d1581a00e6f2f53df29 not attached to node
Jul 16 23:54:45 kadm-prac-wk-1 kubelet[8054]: E0716 23:54:45.056752    8054 pod_workers.go:1298] "Error syncing pod, skipping" err="unmounted volumes=[task-pv-storage], unattached volumes=[], failed to process volumes=[]: context deadline exceeded" pod="default/task-pv-pod" podUID="11fd6ace-c0c7-4ac8-930a-679d2b10e778"
Jul 16 23:55:13 kadm-prac-wk-1 kubelet[8054]: I0716 23:55:13.697301    8054 operation_generator.go:622] "MountVolume.WaitForAttach entering for volume \"pvc-4490fddc-6cb1-4305-8ec3-3d81b22e0a45\" (UniqueName: \"kubernetes.io/csi/csi.vsphere.vmware.com^fd5422b7-29b8-4d5c-aa36-a74eb2678b94\") pod \"task-pv-pod\" (UID: \"11fd6ace-c0c7-4ac8-930a-679d2b10e778\") DevicePath \"\"" pod="default/task-pv-pod"
Jul 16 23:55:13 kadm-prac-wk-1 kubelet[8054]: I0716 23:55:13.698733    8054 operation_generator.go:632] "MountVolume.WaitForAttach succeeded for volume \"pvc-4490fddc-6cb1-4305-8ec3-3d81b22e0a45\" (UniqueName: \"kubernetes.io/csi/csi.vsphere.vmware.com^fd5422b7-29b8-4d5c-aa36-a74eb2678b94\") pod \"task-pv-pod\" (UID: \"11fd6ace-c0c7-4ac8-930a-679d2b10e778\") DevicePath \"csi-adc9f2a040f8d03ab6654b5346eabd7a77d6ec61532fe9babc0abfc38b2c08f9\"" pod="default/task-pv-pod"
Jul 16 23:55:13 kadm-prac-wk-1 kubelet[8054]: E0716 23:55:13.701674    8054 csi_attacher.go:366] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = NotFound desc = disk: 6000c29ba0cd1d1581a00e6f2f53df29 not attached to node

Logs CSI-Controller:

1 reflector.go:800] k8s.io/client-go/informers/factory.go:159: Watch close - *v1.PersistentVolume total 10 items received
I0716 23:54:35.994760       1 reflector.go:800] k8s.io/client-go/informers/factory.go:159: Watch close - *v1.CSINode total 9 items received
I0716 23:54:58.619468       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state
I0716 23:54:58.633808       1 csi_handler.go:740] Found NodeID 420d48db-149f-3532-6595-40ccba2ed96a in CSINode kadm-prac-wk-1
I0716 23:55:58.633874       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state
I0716 23:55:58.647525       1 csi_handler.go:740] Found NodeID 420d48db-149f-3532-6595-40ccba2ed96a in CSINode kadm-prac-wk-1
I0716 23:56:58.647774       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state
I0716 23:56:58.661844       1 csi_handler.go:740] Found NodeID 420d48db-149f-3532-6595-40ccba2ed96a in CSINode kadm-prac-wk-1
I0716 23:57:57.991286       1 reflector.go:377] k8s.io/client-go/informers/factory.go:159: forcing resync
I0716 23:57:57.991332       1 controller.go:210] Started VA processing "csi-adc9f2a040f8d03ab6654b5346eabd7a77d6ec61532fe9babc0abfc38b2c08f9"
I0716 23:57:57.991343       1 csi_handler.go:224] CSIHandler: processing VA "csi-adc9f2a040f8d03ab6654b5346eabd7a77d6ec61532fe9babc0abfc38b2c08f9"
I0716 23:57:57.991349       1 csi_handler.go:246] "csi-adc9f2a040f8d03ab6654b5346eabd7a77d6ec61532fe9babc0abfc38b2c08f9" is already attached
I0716 23:57:57.991353       1 csi_handler.go:240] CSIHandler: finished processing "csi-adc9f2a040f8d03ab6654b5346eabd7a77d6ec61532fe9babc0abfc38b2c08f9"
I0716 23:57:57.992445       1 reflector.go:377] k8s.io/client-go/informers/factory.go:159: forcing resync
I0716 23:57:58.661952       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state
I0716 23:57:58.675563       1 csi_handler.go:740] Found NodeID 420d48db-149f-3532-6595-40ccba2ed96a in CSINode kadm-prac-wk-1

Logs CSI node (I just noticed the errors here; I am not sure what they mean):

I0716 23:17:42.078188       1 main.go:135] Version: v2.10.1
I0716 23:17:42.078253       1 main.go:136] Running node-driver-registrar in mode=
I0716 23:17:42.078258       1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0716 23:17:42.082382       1 connection.go:215] Connecting to unix:///csi/csi.sock
I0716 23:17:44.263539       1 main.go:164] Calling CSI driver to discover driver name
I0716 23:17:44.263563       1 connection.go:244] GRPC call: /csi.v1.Identity/GetPluginInfo
I0716 23:17:44.263566       1 connection.go:245] GRPC request: {}
I0716 23:17:44.266892       1 connection.go:251] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v3.3.0"}
I0716 23:17:44.266903       1 connection.go:252] GRPC error: <nil>
I0716 23:17:44.266908       1 main.go:173] CSI driver name: "csi.vsphere.vmware.com"
I0716 23:17:44.266924       1 node_register.go:55] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I0716 23:17:44.267028       1 node_register.go:64] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I0716 23:17:44.267144       1 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
I0716 23:17:44.462797       1 main.go:90] Received GetInfo call: &InfoRequest{}
I0716 23:17:58.301288       1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
I0716 23:34:37.095442       1 main.go:90] Received GetInfo call: &InfoRequest{}
I0716 23:34:37.124833       1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

Environment:

  • csi-vsphere version: 3.3.0
  • vsphere-cloud-controller-manager version: 1.30.1 (1.29.1 gave an image pull error, which might be my issue; going to try a 1.29.x version as the next step)
  • Kubernetes version: 1.29.6
  • vSphere version: 8.0.2
  • OS (e.g. from /etc/os-release): Ubuntu 22.04
  • Kernel (e.g. uname -a): 5.15.0-116-generic
  • Install tools: kubeadm
  • Others: Containerd

Thank you.

@wiredcolony
Author

I just finished creating another cluster with the vSphere CPI and CSI. All versions are within the version skew limits. I am still getting the same issue as described above: all PVCs bind, but I still get a mounting error.

Here is the test pod events:

Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               7m22s                 default-scheduler        Successfully assigned default/task-pv-pod to kadm-wk-1-a
  Normal   SuccessfulAttachVolume  7m21s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-00ecf48b-5f4d-4998-9a3b-8e5a40713b08"
  Warning  FailedMount             64s (x11 over 7m16s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-00ecf48b-5f4d-4998-9a3b-8e5a40713b08" : rpc error: code = NotFound desc = disk: 6000c295272288d128b6e94eed08a9c9 not attached to node

I suspected permission issues, so I gave the user admin privileges. I don't know what else to try. Thanks in advance for the help.
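For reference, the checks on the Kubernetes side so far look roughly like this (a sketch; the namespace and object names are the upstream defaults and may differ depending on how the driver was installed):

# is the attach actually recorded on the Kubernetes side?
kubectl get volumeattachments
kubectl describe volumeattachment <csi-attachment-name>     # should show Attached: true and any attach errors

# logs from the CSI node plugin pod on the affected worker around mount time
kubectl -n vmware-system-csi get pods -o wide | grep vsphere-csi-node
kubectl -n vmware-system-csi logs <vsphere-csi-node-pod-on-that-worker> -c vsphere-csi-node --tail=100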

P.S.--Kubernetes version in new cluster is now 1.30.3.

@wiredcolony wiredcolony changed the title PVCs bound via dynamic provisioning -- volumes will not mount pods. PVCs are "Bound" via dynamic provisioning -- MountVolume.MountDevice failed / not attached to node Jul 21, 2024
@wiredcolony
Author

Updated: CSI Version 3.3.1

UPDATE:
I was able to rule out vSphere-side issues (permissions, configuration, etc.) by successfully creating, attaching, and mounting a PVC/PV to a pod in a cluster built with Rancher (RKE 1.28.11).

I am trying to figure out what to do next. I noticed that the CSI in the Rancher cluster runs 3 controllers, whereas my cluster runs only 1. I tried running a couple more to mimic Rancher, but I believe I need to add 2 more control-plane nodes for that to work (I tried already and the extra pods stay in a "Pending" state). That is odd, because the Rancher cluster also has only one control-plane node and still runs 3.

Next steps: add 2 control-plane nodes and run 2 more replicas of the CSI controller for a total of 3, and see whether that has any effect.
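A sketch of that scaling step, assuming the upstream deployment name and the vmware-system-csi namespace (both may differ per install):

kubectl -n vmware-system-csi scale deployment vsphere-csi-controller --replicas=3
kubectl -n vmware-system-csi get pods -l app=vsphere-csi-controller -o wide

If the deployment pins its replicas to control-plane nodes (and spreads them across nodes), the extra replicas would stay Pending until more control-plane nodes exist, which would match the behavior described above.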

@wiredcolony
Author

wiredcolony commented Aug 1, 2024

Still curious about this issue and would appreciate any pointers on how to troubleshoot or resolve it. Adding two more control-plane nodes and CSI controllers had zero effect. Thank you.

@zeddit

zeddit commented Aug 3, 2024

@wiredcolony I am facing the same issue; have you found a solution?
I am running RKE2 1.28.11 with vsphere-csi version rancher-vsphere-csi-3.1.2-rancher400 (3.1.2-rancher4).

@wiredcolony
Author

@wiredcolony I am facing the same issue; have you found a solution?

I am running RKE2 1.28.11 with vsphere-csi version rancher-vsphere-csi-3.1.2-rancher400 (3.1.2-rancher4).

I have not. However, my issue is with vanilla Kubernetes installed via kubeadm.

I also have a Rancher server, and PV provisioning and mounting worked perfectly there with the same vSphere setup. I tested it with RKE2 and the embedded Rancher VMware CSI/CPI (not the Rancher apps/Helm CSI/CPI).

Double-check your vSphere permissions, or try the vSphere admin username and password to quickly rule permissions in or out. If the admin account works, you know it is a permission issue.

Also, make sure the username is the full username "[email protected]" when entering it in Rancher.
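On a kubeadm/vanilla install, the credentials the driver actually uses can be dumped from the config secret to verify the username format (a sketch; vmware-system-csi and vsphere-config-secret are the upstream default names and may differ):

kubectl -n vmware-system-csi get secret vsphere-config-secret \
  -o jsonpath='{.data.csi-vsphere\.conf}' | base64 -d
# the [VirtualCenter "..."] section should show the full user including the domain suffix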

Hopefully someone from the CSI team takes the case. I would love to figure out what is going on. Let me know if you have any questions or get it working. Thanks!

@zeddit

zeddit commented Aug 3, 2024

@wiredcolony thanks very much for your patient reply, I really learned a lot! My problem is solved: the root cause was that I had not set disk.enableUUID=TRUE on my VMs.

Besides, have you ever tried a hybrid cluster, by which I mean one where some nodes are not running on top of vSphere? Can such nodes be added to the Kubernetes cluster and use the vsphere-csi storage class? If so, are there any network connectivity requirements for those non-vSphere nodes? Thanks again.

@wiredcolony
Author

@wiredcolony thanks very much for your patient reply, I really learned a lot! My problem is solved: the root cause was that I had not set disk.enableUUID=TRUE on my VMs.

Besides, have you ever tried a hybrid cluster, by which I mean one where some nodes are not running on top of vSphere? Can such nodes be added to the Kubernetes cluster and use the vsphere-csi storage class? If so, are there any network connectivity requirements for those non-vSphere nodes? Thanks again.

Glad you got things up and running. A missing disk.enableUUID=TRUE will definitely cause mounting issues. All my nodes are part of VMware. I thought the issue might be versioning, but it persists. Hopefully someone can give me some troubleshooting tips soon.

@pbs-jyu

pbs-jyu commented Aug 7, 2024

I'm having the same problem with "MountVolume.MountDevice failed for volume "pvc-xx" : rpc error: code = NotFound desc = disk: xxx not attached to node". I have not figured out why yet, but I found something that may lead to a fix. While testing a cluster restore with Velero, I ended up having to restore the cluster nodes from node/VM backups, and I saw that the complained-about disk xxx was actually still attached to the original (bad) cluster nodes/VMs. I was also using the same datastore for two different Kubernetes clusters for testing.

@pbs-jyu

pbs-jyu commented Aug 7, 2024

The following fixed part of my problem.

kubectl describe pod:
Warning FailedMount 2m50s (x781 over 26h) kubelet MountVolume.MountDevice failed for volume "pvc-xx9" : rpc error: code = NotFound desc = disk: xx not attached to node

kubectl describe volumeattachments csi-xxx:
Status:
  Attach Error:
    Message: rpc error: code = Internal desc = failed to attach disk: "sss" with node: "xx" err failed to attach cns volume: ... "Volume with ID xx is not registered as CNS Volume. Error message: The input volume xx is not registered as a CNS volume.

kubectl get volumeattachments
kubectl edit volumeattachments csi-xxx
# in the editor, remove (comment out) the finalizer so the object can be deleted:
#   finalizers:
#   - external-attacher/csi-vsphere-vmware-com
kubectl delete volumeattachments csi-xxx

The VolumeAttachment is then recreated automatically. Run kubectl describe volumeattachments csi-xxx again; you should no longer see the errors.

Referenced: https://knowledge.broadcom.com/external/article/327470/persistent-volumes-cannot-attach-to-a-ne.html
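For reference, an equivalent non-interactive way to drop the finalizer looks roughly like this (a sketch; csi-xxx stands for the real VolumeAttachment name):

# clear the finalizer so the stale VolumeAttachment can actually be removed
kubectl patch volumeattachment csi-xxx --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete volumeattachment csi-xxx
# the VolumeAttachment is then recreated automatically on the next sync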

@pbs-jyu
Copy link

pbs-jyu commented Aug 7, 2024

kubectl delete volumeattachments csi-xxx did not work; the error comes back again. I am not sure why the volumes sometimes show up under Datastore > Monitor > Container Volumes in vCenter and sometimes disappear on their own while I am doing nothing.

@wiredcolony
Author

My issue was solved by making sure the VM advanced parameters have "disk.EnableUUID = True". Mine was set to "disk.EnabledUUID = True" (i.e. 'Enabled' instead of the correct 'Enable'). Credit to Reddit user PlexingtonSteel: https://www.reddit.com/r/kubernetes/comments/1e8v7nr/comment/llbcsko/?context=3.
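For anyone else hitting this, the exact key spelling can also be checked and corrected from the CLI with govc (a sketch; the VM inventory path is a placeholder, and the VM typically needs a power cycle for the change to take effect):

# list the VM's advanced (ExtraConfig) settings and check the key name
govc vm.info -e /dc1/vm/kadm-prac-wk-1 | grep -i uuid

# set the correctly spelled key
govc vm.change -vm /dc1/vm/kadm-prac-wk-1 -e disk.enableUUID=TRUE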
