Open
Labels
kind/bug: Categorizes issue or PR as related to a bug.
Description
When a cluster contains both GPU and NPU (Ascend 310P) cards, the NPU pod stays in the Pending state.
Steps to reproduce the issue
- Prepare a cluster with both GPU and NPU cards. In my case, there are two nodes:
  NodeA: T4 * 2
  NodeB: Ascend310P * 2
- Prepare two workloads:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  annotations:
    volcano.sh/vgpu-mode: "hami-core" # (Optional, 'hami-core' or 'mig')
spec:
  schedulerName: volcano
  containers:
    - name: cuda-container
      image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/pytorch:2.1.2-cuda12.1-cudnn8-runtime-ubuntu22.04
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1
          volcano.sh/vgpu-memory: 1000
          volcano.sh/vgpu-cores: 10
---
apiVersion: v1
kind: Pod
metadata:
  name: npu-pod-310p
spec:
  schedulerName: volcano
  containers:
    - name: npu-container
      image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          cpu: "1"
          memory: 1000Mi
          huawei.com/Ascend310P: "1"
          huawei.com/Ascend310P-memory: "3072"
        requests:
          cpu: "1"
          memory: 1000Mi
          huawei.com/Ascend310P: "1"
          huawei.com/Ascend310P-memory: "3072"
- The GPU pod runs normally, while the NPU pod stays in the Pending state:
Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedScheduling  13m   volcano  pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
The NPU pod schedules normally after I remove huawei.com/Ascend310P-memory. I'm not sure whether this is the same problem as #4778.
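For reference, the variant of the NPU workload that does get scheduled looks roughly like the sketch below. It is the same manifest as in the reproduction steps with only the huawei.com/Ascend310P-memory entries dropped; it assumes nothing else in the cluster or scheduler configuration changes, and it is a workaround rather than a fix.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: npu-pod-310p
spec:
  schedulerName: volcano
  containers:
    - name: npu-container
      image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          cpu: "1"
          memory: 1000Mi
          huawei.com/Ascend310P: "1"
          # huawei.com/Ascend310P-memory: "3072" removed here (workaround)
        requests:
          cpu: "1"
          memory: 1000Mi
          huawei.com/Ascend310P: "1"
          # huawei.com/Ascend310P-memory: "3072" removed here (workaround)
```

With this variant the pod no longer requests a specific amount of NPU memory, so it does not cover the case where the memory limit is actually needed.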
Describe the results you received and expected
Received: the NPU pod stays in the Pending state. Expected: the NPU pod is scheduled normally.
What version of Volcano are you using?
latest
Any other relevant information
No response