
PodGroup stuck in Pending Phase #3910

Open
caushie-akamai opened this issue Dec 21, 2024 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@caushie-akamai

caushie-akamai commented Dec 21, 2024

Description

We are currently running the Kubeflow spark-operator on Linode LKE Kubernetes with auto-scaling enabled.

I have noticed that when I try to trigger a large scale-up (i.e. one that would use all possible nodes in a node pool), the Volcano pods are stuck in the Pending phase and the PodGroup is also stuck in Pending.
If I reduce the RAM and CPU requests by half, the scale-up is triggered and the PodGroup succeeds. I am not sure why, or how this calculation is done on the Volcano side.
My Volcano config is:

actions: "enqueue, allocate, preempt, backfill"
tiers:
- plugins:
  - name: priority
  - name: conformance
- plugins:
  - name: overcommit
    arguments:
      overcommit-factor: 15.0
  - name: drf
    enablePreemptable: false
  - name: predicates
  - name: capacity
  - name: nodeorder
  - name: binpack

My queue:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: myqueue
spec:
  reclaimable: true
  weight: 1
  capability:
    cpu: "200"
    memory: "1200G"
status:
  state: Open

My K8s cluster has auto-scaling enabled in the node pool, with Min 1 and Max 10 nodes (16C, 300 GB RAM each)

My spark job conf:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-sleep-2
  namespace: spark
spec:
  type: Python
  mode: cluster
  image: "redacted-private-repo/spark:k8s-3.5.1"
  imagePullSecrets:
    - "regcred"
  imagePullPolicy: Always
  mainApplicationFile: "local:///opt/spark/work-dir/sleep_forever.py"
  sparkVersion: "3.5.1"
  batchScheduler: volcano
  batchSchedulerOptions:
    priorityClassName: urgent
    queue: myqueue
  restartPolicy:
    type: Never
  driver:
    cores: 2
    memory: "8G"
    labels:
      version: 3.5.1
    serviceAccount: spark
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "node.kubernetes.io/instance-type"
                  operator: In
                  values:
                    - "g6-dedicated-32"
  executor:
    cores: 15
    instances: 8
    memory: "200G"
    labels:
      version: 3.5.1
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "node.kubernetes.io/instance-type"
                  operator: In
                  values:
                    - "g7-highmem-16"

I have tried various volcano-scheduler.conf options but the same error persists.
PodGroup reports:
1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
queue resource quota insufficient

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-spark-sleep-2-pg
  namespace: spark
.
.
.
.
status:
  phase: Pending
spec:
  minMember: 1
  minResources:
    cpu: '122'
    memory: 1608G
  priorityClassName: urgent
  queue: myqueue

Is anyone aware of how to fix this issue? I removed the gang scheduling plugin as per #2558, but that did not work.

Describe the results you received and expected

PodGroup stuck on Pending

What version of Volcano are you using?

1.10

Any other relevant information

k8s 1.29

caushie-akamai added the kind/bug label Dec 21, 2024
@lowang-bh
Member

1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
queue resource quota insufficient

Your quota in the queue's spec.capability is not enough.

@caushie-akamai
Author

caushie-akamai commented Dec 22, 2024

@lowang-bh Thanks for the pointer. Even after updating the queue with a larger capability:

spec:
  capability:
    cpu: '200'
    memory: 2000G
  reclaimable: true
  weight: 1

The PodGroup still does not get scheduled and is stuck in Pending (note that the cluster has autoscaling enabled and can go up to 3TB of RAM):

1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable

resource in cluster is overused

There is a single node in this NodePool with nothing running in it, so in theory this should have caused a scale-up.
Do you have any idea what could be causing this?

@lowang-bh
Member

The spec in the PodGroup shows it needs at least 122C and 1608G memory. Does your cluster have enough idle resources given an overcommit factor of 15.0 (real idle resources multiplied by 15 should be larger than 122C / 1608G)?

spec:
  minMember: 1
  minResources:
    cpu: '122'
    memory: 1608G

@caushie-akamai
Author

caushie-akamai commented Dec 22, 2024

Thank you for your response @lowang-bh

Yes, when idle the cluster has 112 vCPU cores and 492 GB of RAM in total, and autoscaling is enabled in one of the node pools where the Spark jobs are scheduled.
That node pool has autoscaling with Min Nodes = 1 and Max Nodes = 10 (16C / 300GB VMs).

With an overcommit factor of 15 the idle resources should be ~7TB of RAM. I also tried an overcommit factor of 20.0, but that did not work either and gave the same issue as before.
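
For concreteness, here is the maintainer's check spelled out with the numbers from this thread (my rough arithmetic against the PodGroup minResources quoted above, not the scheduler's exact accounting):

# idle cluster            : 112 C,  ~492 G memory
# overcommit-factor 15.0  : 112 * 15 = 1680 C,  492 * 15 = 7380 G
# PodGroup minResources   : 122 C,  1608 G  -> well within the overcommitted totals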

@caushie-akamai
Author

@lowang-bh Just following up on this issue. Have you had a chance to look at it? It is currently affecting some of our jobs in production.

Thank you!

@lowang-bh
Member

Please paste the scheduler logs.

@caushie-akamai
Author

caushie-akamai commented Dec 28, 2024

@lowang-bh Attaching some of the scheduler logs with log level = 5:

I1228 20:17:24.109606       1 cache.go:1347] The priority of job <spark/spark-spark-sleep-2-pg> is <urgent/10>
I1228 20:17:24.109625       1 cache.go:1383] There are <1> Jobs, <2> Queues and <4> Nodes in total for scheduling.
I1228 20:17:24.109639       1 session.go:190] Open Session 0c47cbe6-a664-4c83-9133-fc96d1e446df with <1> Job and <2> Queues
I1228 20:17:24.110044       1 overcommit.go:75] Enter overcommit plugin ...
I1228 20:17:24.110070       1 overcommit.go:142] Leaving overcommit plugin.
I1228 20:17:24.110081       1 drf.go:191] Total Allocatable cpu 112000.00, memory 518674661376.00, pods 440.00, attachable-volumes-csi-linodebs.csi.linode.com 252.00, ephemeral-storage 3975709572275000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00
I1228 20:17:24.110148       1 capacity.go:82] The total resource is <cpu 112000.00, memory 518674661376.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, pods 440.00, attachable-volumes-csi-linodebs.csi.linode.com 252.00, ephemeral-storage 3975709572275000.00>
I1228 20:17:24.110168       1 capacity.go:90] The total guarantee resource is <cpu 0.00, memory 0.00>
I1228 20:17:24.110177       1 capacity.go:93] Considering Job <spark/spark-spark-sleep-2-pg>.
I1228 20:17:24.110187       1 capacity.go:127] Added Queue <myqueue> attributes.
I1228 20:17:24.110199       1 capacity.go:160] Queue myqueue allocated <cpu 0.00, memory 0.00> request <cpu 0.00, memory 0.00> inqueue <cpu 0.00, memory 0.00> elastic <cpu 0.00, memory 0.00>
I1228 20:17:24.110211       1 capacity.go:173] The attributes of queue <myqueue> in capacity: deserved <cpu 0.00, memory 0.00>, realCapability <cpu 112000.00, memory 518674661376.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, pods 440.00, attachable-volumes-csi-linodebs.csi.linode.com 252.00, ephemeral-storage 3975709572275000.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 0.00, memory 0.00>, elastic <cpu 0.00, memory 0.00>, share <0.00>
I1228 20:17:24.110269       1 binpack.go:165] Enter binpack plugin ...
I1228 20:17:24.110286       1 binpack.go:183] resources [] record in weight but not found on any node
I1228 20:17:24.110295       1 binpack.go:167] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], cpu[1], memory[1] ...
I1228 20:17:24.110302       1 enqueue.go:45] Enter Enqueue ...
I1228 20:17:24.110309       1 enqueue.go:63] Added Queue <myqueue> for Job <spark/spark-spark-sleep-2-pg>
I1228 20:17:24.110319       1 enqueue.go:74] Added Job <spark/spark-spark-sleep-2-pg> into Queue <myqueue>
I1228 20:17:24.110325       1 enqueue.go:79] Try to enqueue PodGroup to 1 Queues
I1228 20:17:24.110332       1 overcommit.go:125] Sufficient resources, permit job <spark/spark-spark-sleep-2-pg> to be inqueue
I1228 20:17:24.110348       1 capacity.go:304] job spark-spark-sleep-2-pg min resource <cpu 132000.00, memory 1708000000000.00>, queue myqueue capability <cpu 112000.00, memory 518674661376.00, hugepages-2Mi 0.00, pods 440.00, attachable-volumes-csi-linodebs.csi.linode.com 252.00, ephemeral-storage 3975709572275000.00, hugepages-1Gi 0.00> allocated <cpu 0.00, memory 0.00> inqueue <cpu 0.00, memory 0.00> elastic <cpu 0.00, memory 0.00>
I1228 20:17:24.110370       1 capacity.go:310] job spark-spark-sleep-2-pg inqueue false
I1228 20:17:24.110408       1 enqueue.go:104] Leaving Enqueue ...
I1228 20:17:24.110428       1 allocate.go:56] Enter Allocate ...
I1228 20:17:24.110440       1 allocate.go:85] Job <spark/spark-spark-sleep-2-pg> Queue <myqueue> skip allocate, reason: job status is pending.
I1228 20:17:24.110450       1 allocate.go:75] Try to allocate resource to 0 Queues
I1228 20:17:24.110458       1 allocate.go:77] Leaving Allocate ...
I1228 20:17:24.110467       1 preempt.go:53] Enter Preempt ...
I1228 20:17:24.110485       1 statement.go:379] Committing operations ...
I1228 20:17:24.110492       1 preempt.go:223] Leaving Preempt ...
I1228 20:17:24.110499       1 backfill.go:53] Enter Backfill ...
I1228 20:17:24.110507       1 backfill.go:108] Leaving Backfill ...

I may be mistaken, but it seems that capacity.go:310 is not respecting the overcommit factor. In the log line from capacity.go:304, the queue capability is different from the capability I gave the queue; instead it is using the cluster's idle resources (I have 3 nodes of 32C / 64GB RAM and 1 node of 16C / 300GB RAM, for a total of 112C and 492GB of RAM).

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: myqueue
spec:
  reclaimable: true
  weight: 1
  capability:
    cpu: "200"
    memory: "3000G"
status:
  state: Open

Hope this is helpful. Let me know if anything else is needed.

@lowang-bh
Member

@lowang-bh Attaching some of the scheduler logs with Log Level =5

I1228 20:17:24.110348       1 capacity.go:304] job spark-spark-sleep-2-pg min resource <cpu 132000.00, memory 1708000000000.00>, queue myqueue capability <cpu 112000.00, memory 518674661376.00, hugepages-2Mi 0.00, pods 440.00, attachable-volumes-csi-linodebs.csi.linode.com 252.00, ephemeral-storage 3975709572275000.00, hugepages-1Gi 0.00> allocated <cpu 0.00, memory 0.00> inqueue <cpu 0.00, memory 0.00> elastic <cpu 0.00, memory 0.00>
I1228 20:17:24.110370       1 capacity.go:310] job spark-spark-sleep-2-pg inqueue false
I1228 20:17:24.110408       1 enqueue.go:104] Leaving Enqueue ...

queue myqueue capability <cpu 112000.00...>: this is the real capability, which cannot exceed the cluster's physical resources.
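
To make that clamp concrete, this is my reading of the capacity plugin log lines quoted above (an interpretation of the logs, not a statement of the exact formula):

# queue spec.capability                 : cpu 200, memory 3000G
# cluster total (capacity.go:82)        : cpu 112000 (milli), memory 518674661376 (~483 GiB)
# queue realCapability (capacity.go:173): cpu 112000, memory 518674661376   <- clamped to the cluster's physical total
# job minResources (capacity.go:304)    : cpu 132000, memory 1708000000000  <- exceeds realCapability, hence "inqueue false"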

@caushie-akamai
Author

caushie-akamai commented Dec 29, 2024

The cluster has auto-scaling enabled, so even though the cluster's physical resources are lower than what the job is asking for, shouldn't that trigger a scale-out (given also that cluster resources * overcommit < minResources, I guess)? In theory that should be taken into consideration, right?

If not, what should the solution be in cases like these, where the cluster has autoscaling, scales down to 1 node when idle, and can scale up to multiple nodes when required?

@Monokaix
Member

The cluster has auto-scaling enabled, so even though the cluster's physical resources are lower than what the job is asking for, shouldn't that trigger a scale-out (given also that cluster resources * overcommit < minResources, I guess)? In theory that should be taken into consideration, right?

If not, what should the solution be in cases like these, where the cluster has autoscaling, scales down to 1 node when idle, and can scale up to multiple nodes when required?

Thanks for using Volcano! Please feel free to comment on #3855 if it is convenient for you, so that we can get to know each other better and gain more help.

  1. First of all, the autoscaler and the scheduler need to work in coordination: the autoscaler walks through a simulated scheduling pass to decide whether it needs to expand, so scaling up and down is predicated on both of them agreeing that there are not enough resources in the cluster.
  2. The enqueue action is enabled by default, which means the pg stays in the Pending state and the Pods under the vcjob will not be created; the autoscaler therefore cannot see any Pending Pods and will not trigger a scale-out. You can check whether the Pods are created (see the config sketch below).
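
If the goal is to have pending pods visible to the cluster autoscaler, one option the comments point at (only a sketch, assuming dropping the enqueue action is acceptable for these workloads; a later comment also notes that enqueue and preempt should not be combined) is the original scheduler config with the enqueue action removed:

actions: "allocate, preempt, backfill"
tiers:
- plugins:
  - name: priority
  - name: conformance
- plugins:
  - name: overcommit
    arguments:
      overcommit-factor: 15.0
  - name: drf
    enablePreemptable: false
  - name: predicates
  - name: capacity
  - name: nodeorder
  - name: binpack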

@caushie-akamai
Author

@Monokaix Thank you for your reply! We are still testing Volcano out. Our requirement is that Volcano must work with the autoscaler, since our resources are elastic. Once we figure out whether this PodGroup resource issue can be fixed, I am definitely open to adding our use case to #3855 :D

  1. The enqueue action is enabled by default, which means the pg stays in the Pending state and the Pods under the vcjob will not be created; the autoscaler therefore cannot see any Pending Pods and will not trigger a scale-out. You can check whether the Pods are created.

Yes, I think this is exactly what is happening in my case, even though I have enabled auto-scaling. The queue capability is larger than the cluster's physical resources when idle, but Volcano seems to use the cluster's current physical resources, as seen in: queue myqueue capability <cpu 112000.00, memory 518674661376.00... So the PG is in the Pending state and the autoscaler can't see the pending pods to trigger a scale-out. I can actually see the pod, but it is in the Pending state:

$ kubectl get pods
spark-sleep-2-driver                                0/1     Pending     0          20s

$ kubectl describe pods spark-sleep-2-driver
.
.
.
Status:               Pending
.
.
Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedScheduling  55s   volcano  pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable

Do you have any idea how we could potentially fix this, or whether this is a Volcano bug or intended behavior?

@hwdef
Member

hwdef commented Dec 31, 2024

By the way:
enqueue conflicts with preempt; you can only keep one of them.
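
In other words, only one of the two should appear in the actions line, for example (a sketch only, keeping the rest of the config unchanged):

# keep enqueue, drop preempt:
actions: "enqueue, allocate, backfill"
# or keep preempt, drop enqueue (the variant sketched earlier in this thread):
actions: "allocate, preempt, backfill"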

@caushie-akamai
Author

I removed preempt and the issue still persists. Why would enqueue be in conflict with preempt?

Thanks!
