GPU utilization does not reach 100% in DCGM metrics when running GPU Burn on A100 MIG (4g.40gb; 2g.20gb) #2022

@dogukan1047

Description

Problem

We are running an NVIDIA A100 GPU with MIG (Multi-Instance GPU) enabled. The GPU is partitioned as follows:

  • 1x 4g.40gb
  • 1x 2g.20gb
  • 1x 1g.20gb

We run the gpu-burn container image to fully stress the 4g.40gb MIG instance and expect to see ~100% GPU utilization in the DCGM metrics. The pod runs successfully, but the GPU utilization reported by DCGM Exporter never reaches 100%.

MIG Configuration

mig-enabled: true
mig-devices:
  "4g.40gb": 1
  "2g.20gb": 1
  "1g.20gb": 1

Pod Manifest

apiVersion: v1
kind: Pod
metadata:
  name: gpu-stress-burn-single
  namespace: mlops-development
spec:
  restartPolicy: Never
  containers:
    - name: gpu
      image: iankoulski/gpuburn
      command: ["/app/gpu_burn"]
      args:
        - "-tc"        # Tensor Core enabled
        - "-d"         # Double Precision enabled
        - "14400"      # 4 hours (in seconds)
      resources:
        limits:
          nvidia.com/mig-4g.40gb: "1"
        requests:
          nvidia.com/mig-4g.40gb: "1"

Expected Behavior

  • While gpu-burn is running:
    • DCGM metrics should show ~100% GPU utilization for the allocated MIG instance

Actual Behavior

  • Pod starts and gpu_burn runs successfully
  • However:
    • GPU utilization in DCGM metrics appears low
    • It never reaches ~100%

Questions / Suspicions

  1. Does DCGM exporter report utilization differently for MIG devices?
  2. Are DCGM metrics calculated at physical GPU level instead of per-MIG instance?
  3. Is a multi-process workload or additional flags required?
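
To help narrow down questions 1 and 2, here is a minimal sketch of how we could compare a per-instance reading with a whole-GPU figure. It assumes the exporter is configured to emit the profiling metric DCGM_FI_PROF_GR_ENGINE_ACTIVE with a GPU_I_PROFILE label per MIG instance; we have not confirmed that for our deployment. The sample text and its values are made up for illustration, and weighting by compute slices is only our guess at how a physical-GPU-level number might relate to the instances.

```python
import re

# Hypothetical sample of dcgm-exporter text output; the values are invented.
SAMPLE = """\
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active.
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1"} 0.97
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_PROFILE="2g.20gb",GPU_I_ID="5"} 0.0
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_PROFILE="1g.20gb",GPU_I_ID="13"} 0.0
"""

# Compute slices per profile, taken from the leading "<n>g" of each profile name.
SLICES = {"4g.40gb": 4, "2g.20gb": 2, "1g.20gb": 1}

LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse labelled Prometheus text samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines without a label set
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def per_instance(samples, metric):
    """Map MIG profile name to the value of the given metric."""
    return {
        labels["GPU_I_PROFILE"]: value
        for name, labels, value in samples
        if name == metric and "GPU_I_PROFILE" in labels
    }


def slice_weighted(values, slices):
    """Average the per-instance values, weighted by compute slices (an assumption)."""
    total = sum(slices.values())
    return sum(values[profile] * slices[profile] for profile in values) / total


if __name__ == "__main__":
    values = per_instance(parse_metrics(SAMPLE), "DCGM_FI_PROF_GR_ENGINE_ACTIVE")
    print("per instance:", values)
    print("slice-weighted whole-GPU figure:", round(slice_weighted(values, SLICES), 3))
```

If the dashboard shows a whole-GPU figure, a fully busy 4g.40gb instance would then top out near 4/7 of the scale even though the instance itself is saturated, which might explain a reading that never approaches 100%.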

Additional Context

  • GPU: NVIDIA A100
  • MIG: Enabled
  • Environment: Kubernetes with NVIDIA device plugin
  • Monitoring: DCGM Exporter

Screenshots

Image
