PodGroup stuck in Pending Phase #3910
Comments
@lowang-bh Thanks for the pointer. Even after updating the queue with a larger capacity,
the PodGroup still does not get scheduled and is stuck in the Pending phase.
There is a single node in this NodePool with nothing running on it, so in theory this should have caused a scale-up.
The spec in the PodGroup shows it needs at least 112c and 1608g of memory. Does your cluster have enough idle resources with an overcommit factor equal to 15.0 (the real idle resources multiplied by 15 should be larger than 112C-608G)?
Thank you for your response @lowang-bh. Yes, when idle the cluster in total has 112 vCPU cores and 492 GB of RAM, and autoscaling is enabled as well in one of the node pools where Spark jobs are scheduled. With an overcommit factor of 15 the idle resources should be ~7 TB of RAM. I also tried an overcommit factor of 20.0, but that did not work either and produced the same issue as before.
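For reference, assuming the overcommit factor is applied as a straight multiplier on the cluster's idle resources, the headroom should be roughly 112 cores × 15 = 1680 cores and 492 GB × 15 ≈ 7.4 TB of memory, both comfortably above the 112C / 1608G that the PodGroup asks for.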
@lowang-bh Just following up on this issue: have you had a chance to look at it? It is currently affecting some of our jobs in production. Thank you!
Please paste the scheduler log.
@lowang-bh Attaching some of the scheduler logs with log level = 5.
I may be mistaken, but it seems that capacity.go:310 is not respecting the overcommit factor? In the log line from capacity.go:304, the queue capability is different from the capability I gave the queue; instead it is using the cluster's idle resources (I have 3 nodes of 32C/64GB RAM and 1 node of 16C/300GB RAM, for a total of 112C and 492GB of RAM).
Hope this is helpful. Let me know if anything else is needed.
queue myqueue capability <cpu 112000.00...>: this is the real capability, which cannot exceed the cluster's physical resources.
The cluster has auto-scaling enabled, so even though the cluster's physical resources are lower than what the job is asking for, shouldn't that trigger a scale-out? If not, what should be the solution in cases like these, where the cluster has autoscaling, scales down to 1 node when idle, and can scale up to multiple nodes when required?
Thanks for using Volcano! Please feel free to comment on #3855 if it is convenient for you, so that we can know each other better and gain more help.
@Monokaix Thank you for your reply! We are still testing Volcano out. Our requirement is that Volcano must work with the cluster autoscaler, since our resources are elastic. Once we figure out whether this PodGroup resource issue can be fixed, I am definitely open to adding our use case in #3855 :D
Yes, I think this is exactly what is happening in my case, even though I have enabled auto-scaling. In my case the queue capacity is larger than the cluster's physical resources when idle, but Volcano seems to use the cluster's current physical resources, as seen in the capacity.go log lines above.
Do you have any idea how we could potentially fix this, or whether this is a Volcano bug or intended behavior?
By the way:
I removed Thanks! |
Description
We are currently running the Kubeflow spark-operator on Linode LKE (Linode Kubernetes Engine) with auto-scaling enabled.
I have noticed that when trying to trigger a large scale-up (i.e., trying to use all possible nodes in a node pool), the Spark pods scheduled through Volcano are stuck in the Pending phase and the PodGroup is also stuck in Pending.
If I reduce the RAM and CPU requests by half, the scale-up is triggered and the PodGroup succeeds. I am not sure why, or how the calculation is done on the Volcano side.
My volcano config is:
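Roughly the following (a minimal sketch; the plugin tiers and the overcommit-factor value are representative of the setup rather than a verbatim copy of my ConfigMap, and the capacity plugin is listed because the capacity.go lines in the comments come from it):

```yaml
# volcano-scheduler.conf (illustrative, not a verbatim copy)
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: overcommit
    arguments:
      overcommit-factor: 15.0   # the factor discussed in this issue
  - name: drf
  - name: predicates
  - name: capacity              # queue capability checks (capacity.go)
  - name: nodeorder
  - name: binpack
```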
My queue:
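Something along these lines (the capability values are illustrative; myqueue is the queue name that shows up in the scheduler log quoted in the comments):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: myqueue
spec:
  weight: 1
  reclaimable: true
  capability:
    cpu: "200"      # illustrative values, set larger than the job's request
    memory: 2000Gi
```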
My K8s cluster has auto-scaling enabled in the node pool, with min 1 and max 10 nodes (16C / 300GB RAM each).
My spark job conf:
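The job is submitted through the spark-operator with Volcano as the batch scheduler, roughly like this (the image, application file, and driver/executor sizes below are placeholders rather than my exact values):

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-spark-job
spec:
  type: Python
  mode: cluster
  image: "spark:3.5.0"                                    # placeholder image
  mainApplicationFile: "local:///opt/spark/app/main.py"   # placeholder path
  sparkVersion: "3.5.0"
  batchScheduler: volcano        # hand the pods to the Volcano scheduler
  batchSchedulerOptions:
    queue: myqueue               # the Volcano queue from above
  driver:
    cores: 4
    memory: "8g"
  executor:
    instances: 13                # placeholder sizing
    cores: 8
    memory: "100g"
```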
I have tried various volcano-scheduler.conf options but the same error persists.
PodGroup reports:
1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
queue resource quota insufficient
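These messages appear in the PodGroup status and events; they can be viewed with something like `kubectl describe podgroup <name> -n <namespace>`, since the podgroups resource is installed as part of the Volcano CRDs.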
Is anyone aware of how to fix this issue? I removed the gang scheduling plugin as per #2558, but that did not work.
Describe the results you received and expected
PodGroup stuck in Pending; I expected the job to trigger a node scale-up and be scheduled.
What version of Volcano are you using?
1.10
Any other relevant information
k8s 1.29