Problem
We are running an NVIDIA A100 GPU with MIG (Multi-Instance GPU) enabled. The GPU is partitioned as follows:
- 1x 4g.40gb
- 1x 2g.20gb
- 1x 1g.20gb
We are using the gpu-burn container image to fully stress the 4g.40gb MIG instance and expect to observe ~100% GPU utilization in DCGM metrics. However, although the pod runs successfully, GPU utilization does not reach 100% in DCGM exporter/monitoring metrics.
MIG Configuration
```yaml
mig-enabled: true
mig-devices:
  "4g.40gb": 1
  "2g.20gb": 1
  "1g.20gb": 1
```
Pod Manifest
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-stress-burn-single
  namespace: mlops-development
spec:
  restartPolicy: Never
  containers:
    - name: gpu
      image: iankoulski/gpuburn
      command: ["/app/gpu_burn"]
      args:
        - "-tc"    # Tensor Core enabled
        - "-d"     # Double precision enabled
        - "14400"  # 4 hours (in seconds)
      resources:
        limits:
          nvidia.com/mig-4g.40gb: "1"
        requests:
          nvidia.com/mig-4g.40gb: "1"
```
Expected Behavior
- While gpu-burn is running, DCGM metrics should show ~100% GPU utilization for the allocated MIG instance (checked roughly as sketched below)
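For reference, a minimal sketch of how the exporter output can be filtered per MIG instance. It assumes dcgm-exporter has been port-forwarded to localhost:9400 (its default port) and that the metric and label names below match the exporter's default counter config; adjust both for your environment.

```python
# Sketch: scrape dcgm-exporter and print utilization-related series per MIG instance.
# Assumes the exporter is reachable at localhost:9400 (default port).
import urllib.request

METRICS_URL = "http://localhost:9400/metrics"  # port-forwarded address, adjust as needed

UTIL_METRICS = (
    "DCGM_FI_DEV_GPU_UTIL",           # classic device-level utilization
    "DCGM_FI_PROF_GR_ENGINE_ACTIVE",  # profiling metric, reported per instance
)

def main():
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    for line in body.splitlines():
        if line.startswith("#"):
            continue
        if any(line.startswith(m) for m in UTIL_METRICS):
            # Keep only series that carry a MIG instance id label.
            if "GPU_I_ID" in line:
                print(line)

if __name__ == "__main__":
    main()
```

If those assumptions hold, each MIG compute instance should show up as a separate series labelled with GPU_I_ID / GPU_I_PROFILE.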
Actual Behavior
- The pod starts and gpu_burn runs successfully.
- However, GPU utilization in the DCGM metrics stays low and never reaches ~100%.
Questions / Suspicions
- Does the DCGM exporter report utilization differently for MIG devices?
- Are DCGM metrics calculated at the physical GPU level instead of per MIG instance? (a quick NVML check for this is sketched below)
- Is a multi-process workload required, or are additional flags needed?
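To narrow down the second question, a small NVML check (a sketch, using the nvidia-ml-py / pynvml package directly on the node hosting the A100) can show whether the classic utilization counter is even defined for a MIG device handle:

```python
# Sketch: check whether the legacy utilization counter is exposed on MIG device handles.
# Requires nvidia-ml-py (pynvml); run on the node with the MIG-enabled A100.
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

count = pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)
for i in range(count):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
    except pynvml.NVMLError:
        continue  # MIG slot not populated
    name = pynvml.nvmlDeviceGetName(mig)
    try:
        util = pynvml.nvmlDeviceGetUtilizationRates(mig)
        print(name, util.gpu)
    except pynvml.NVMLError as err:
        # A not-supported error here would mean the classic utilization
        # counter is simply not defined per MIG instance.
        print(name, "utilization query failed:", err)

pynvml.nvmlShutdown()
```

If the utilization query fails for the MIG handles, that would suggest the classic GPU-utilization metric is only meaningful for the whole GPU and a profiling metric such as DCGM_FI_PROF_GR_ENGINE_ACTIVE is what should be graphed per instance.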
Additional Context
- GPU: NVIDIA A100
- MIG: Enabled
- Environment: Kubernetes with NVIDIA device plugin
- Monitoring: DCGM Exporter
Screenshots
