Problem
We are running an NVIDIA A100 GPU with MIG (Multi-Instance GPU) enabled. The GPU is partitioned as follows:
- 1x 4g.40gb
- 1x 2g.20gb
- 1x 1g.20gb
We are using the gpu-burn container image to fully stress the 4g.40gb MIG instance and expect to observe ~100% GPU utilization in DCGM metrics. However, although the pod runs successfully, GPU utilization does not reach 100% in DCGM exporter/monitoring metrics.
MIG Configuration
```yaml
mig-enabled: true
mig-devices:
  "4g.40gb": 1
  "2g.20gb": 1
  "1g.20gb": 1
```
Pod Manifest
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-stress-burn-single
  namespace: mlops-development
spec:
  restartPolicy: Never
  containers:
    - name: gpu
      image: iankoulski/gpuburn
      command: ["/app/gpu_burn"]
      args:
        - "-tc"    # Tensor Core enabled
        - "-d"     # Double precision enabled
        - "14400"  # 4 hours (in seconds)
      resources:
        limits:
          nvidia.com/mig-4g.40gb: "1"
        requests:
          nvidia.com/mig-4g.40gb: "1"
```
Expected Behavior
- While gpu-burn is running, DCGM metrics should show ~100% GPU utilization for the allocated MIG instance (checked roughly as sketched below)
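For reference, a minimal sketch of how the exporter output can be filtered per MIG instance. It assumes dcgm-exporter has been port-forwarded to localhost:9400 (its default port) and that the metric and label names below match the exporter's default counter config; adjust both for your environment.

```python
# Sketch: scrape dcgm-exporter and print utilization-related series per MIG instance.
# Assumes the exporter is reachable at localhost:9400 (default port).
import urllib.request

METRICS_URL = "http://localhost:9400/metrics"  # port-forwarded address, adjust as needed

UTIL_METRICS = (
    "DCGM_FI_DEV_GPU_UTIL",           # classic device-level utilization
    "DCGM_FI_PROF_GR_ENGINE_ACTIVE",  # profiling metric, reported per instance
)

def main():
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    for line in body.splitlines():
        if line.startswith("#"):
            continue
        if any(line.startswith(m) for m in UTIL_METRICS):
            # Keep only series that carry a MIG instance id label.
            if "GPU_I_ID" in line:
                print(line)

if __name__ == "__main__":
    main()
```

If those assumptions hold, each MIG compute instance should show up as a separate series labelled with GPU_I_ID / GPU_I_PROFILE.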
Actual Behavior
- The pod starts and gpu_burn runs successfully.
- However, GPU utilization in the DCGM metrics stays low and never reaches ~100%.
Questions / Suspicions
- Does the DCGM exporter report utilization differently for MIG devices?
- Are DCGM metrics calculated at the physical GPU level instead of per MIG instance? (a quick NVML check for this is sketched below)
- Is a multi-process workload required, or are additional flags needed?
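To narrow down the second question, a small NVML check (a sketch, using the nvidia-ml-py / pynvml package directly on the node hosting the A100) can show whether the classic utilization counter is even defined for a MIG device handle:

```python
# Sketch: check whether the legacy utilization counter is exposed on MIG device handles.
# Requires nvidia-ml-py (pynvml); run on the node with the MIG-enabled A100.
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

count = pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)
for i in range(count):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
    except pynvml.NVMLError:
        continue  # MIG slot not populated
    name = pynvml.nvmlDeviceGetName(mig)
    try:
        util = pynvml.nvmlDeviceGetUtilizationRates(mig)
        print(name, util.gpu)
    except pynvml.NVMLError as err:
        # A not-supported error here would mean the classic utilization
        # counter is simply not defined per MIG instance.
        print(name, "utilization query failed:", err)

pynvml.nvmlShutdown()
```

If the utilization query fails for the MIG handles, that would suggest the classic GPU-utilization metric is only meaningful for the whole GPU and a profiling metric such as DCGM_FI_PROF_GR_ENGINE_ACTIVE is what should be graphed per instance.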
Additional Context
- GPU: NVIDIA A100
- MIG: Enabled
- Environment: Kubernetes with NVIDIA device plugin
- Monitoring: DCGM Exporter
Screenshots
