Common PVC cleanup job can be assigned to incorrect node in multi-node cluster #1269
Comments
It seems like this annotation is supposed to be applied from Che, as configured from this Che Cluster CR field. However, if multiple nodes are selected, it's possible that the cleanup job may be assigned to a different node than the node where the PVC is mounted.
Can we use pod affinity [doc] for the cleanup pod so that it's scheduled on the same node as the workspace pod?
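To make the pod affinity suggestion concrete, here is a minimal sketch in Python that adds a required pod affinity rule to a pod spec represented as a plain dict. The field names (`affinity`, `podAffinity`, `requiredDuringSchedulingIgnoredDuringExecution`, `labelSelector`, `topologyKey`) and the `kubernetes.io/hostname` topology key are standard Kubernetes; the helper name `with_pod_affinity` and the label used in the usage example are assumptions for illustration, not what the operator currently does.

```python
import json


def with_pod_affinity(pod_spec, workspace_labels):
    """Return a copy of pod_spec that must be co-located with pods matching workspace_labels.

    Hypothetical helper: pod_spec is a plain dict shaped like a Kubernetes PodSpec.
    """
    spec = dict(pod_spec)
    spec["affinity"] = {
        "podAffinity": {
            # "required" makes this a hard constraint at scheduling time.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": dict(workspace_labels)},
                    # Co-locate per node: pods on the same hostname share this topology.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }
    return spec


if __name__ == "__main__":
    # The label key and value below are assumptions; use whatever labels the workspace pods carry.
    cleanup_spec = with_pod_affinity(
        {"restartPolicy": "Never"},
        {"example.com/workspace-id": "workspace-abc"},
    )
    print(json.dumps(cleanup_spec, indent=2))
```

One caveat with this approach: a required affinity rule can only be satisfied while a matching pod is running, so if the workspace pod is already gone when cleanup starts, the cleanup pod would likely stay Pending rather than land on the right node.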
Description
In a multi-node cluster, deleting a DevWorkspace that uses the per-user/common PVC strategy can result in the PVC cleanup pod being scheduled on a node other than the one where the PVC is mounted. Since PVCs are created as ReadWriteOnce, only a single node can mount the PVC, so the cleanup pod fails to start with a PVC mount error. This causes the DevWorkspace to remain in a terminating state indefinitely.
Because the node a pod is scheduled on cannot be changed after the pod has been created, the only way to get the workspace deleted is to delete the cleanup pod and let it be re-created automatically, repeating until it is assigned to the node where the PVC is mounted.
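The check behind that workaround can be sketched as follows. This is an illustrative Python snippet operating on pod objects as plain dicts (for example, parsed from `kubectl get pods -o json`); the function names are hypothetical, and the claim name in the usage example is an assumption rather than something stated in this issue.

```python
def nodes_mounting_claim(pods, claim_name):
    """Collect the nodes of scheduled pods that reference the given PVC."""
    nodes = set()
    for pod in pods:
        spec = pod.get("spec", {})
        node = spec.get("nodeName")
        if not node:
            continue  # pod has not been scheduled yet
        for volume in spec.get("volumes", []):
            pvc = volume.get("persistentVolumeClaim")
            if pvc and pvc.get("claimName") == claim_name:
                nodes.add(node)
    return nodes


def cleanup_pod_is_misplaced(cleanup_pod, other_pods, claim_name):
    """True if the cleanup pod sits on a node where no other pod mounts the PVC."""
    node = cleanup_pod.get("spec", {}).get("nodeName")
    mounting = nodes_mounting_claim(other_pods, claim_name)
    # Only report a mismatch when both sides are known.
    return bool(node) and bool(mounting) and node not in mounting


if __name__ == "__main__":
    claim = "claim-devworkspace"  # assumed PVC name, for illustration only
    workspace = {
        "spec": {
            "nodeName": "node-a",
            "volumes": [{"persistentVolumeClaim": {"claimName": claim}}],
        }
    }
    cleanup = {"spec": {"nodeName": "node-b"}}
    print(cleanup_pod_is_misplaced(cleanup, [workspace], claim))
```

When the check reports a mismatch, deleting the cleanup pod gives the scheduler another chance to place its replacement on the correct node.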
What's odd is that we are already applying a node selector label to the cleanup pod. Perhaps there are cases where the namespace is missing the node selector annotation? CC: @musienko-maxim
How To Reproduce
Does not always occur; requires a multi-node cluster.
Expected behavior
The cleanup-workspace pod is scheduled on the same node where the PVC is mounted and terminates successfully. The devworkspace gets terminated successfully.
Additional context
Encountered this while testing on @musienko-maxim 's OCP 4.15 test cluster.