
NeMo ASR Inference Fails on RTX 4000 Pro Blackwell Without Disabling CUDA Memory Caching #15338

@ShakibSanatgar

Description

Describe the bug

NeMo Canary-1B-v2 ASR model inference fails on NVIDIA RTX 4000 Pro (Blackwell architecture, compute capability 12.0) unless PYTORCH_NO_CUDA_MEMORY_CACHING=1 is set. The same model works correctly on RTX 3080 (Ampere, compute capability 8.6) without this workaround.

The model loads successfully on both GPUs; the issue only manifests during model.transcribe().
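For reference, the GPU and compute capability actually visible inside the container can be confirmed with a short check using standard torch.cuda calls (the printed values below are illustrative):

import torch

# Confirm which device and compute capability the process sees.
print(torch.cuda.get_device_name(0))
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")  # 8.6 on RTX 3080, 12.0 on the Blackwell card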

Steps/Code to reproduce bug

import torch
from nemo.collections.asr.models import EncDecMultiTaskModel

# Load model
model = EncDecMultiTaskModel.from_pretrained(
    "nvidia/canary-1b-v2",
    map_location="cuda",
)
model.eval()

# Configure beam decoding
decode_cfg = model.cfg.multitask_decoding
decode_cfg.strategy = "beam"
decode_cfg.beam.beam_size = 5
model.change_decoding_strategy(decode_cfg)

# Transcribe - FAILS on Blackwell without PYTORCH_NO_CUDA_MEMORY_CACHING=1
with torch.inference_mode():
    outputs = model.transcribe(
        audio=["test_audio.wav"],
        batch_size=1,
        source_lang="de",
        target_lang="de",
    )
print(outputs[0].text)

Expected behavior

The model should transcribe audio successfully on Blackwell GPUs without requiring PYTORCH_NO_CUDA_MEMORY_CACHING=1.

Environment overview

  • Environment location: Docker
  • Method of NeMo install: NVIDIA Docker image
  • Docker pull: docker pull nvcr.io/nvidia/nemo:25.11.01
  • Docker run:
docker run --gpus '"device=1"' \
  --shm-size=16gb \
  --ulimit memlock=-1 \
  -e CUDA_VISIBLE_DEVICES=0 \
  nvcr.io/nvidia/nemo:25.11.01

Environment details

Using NVIDIA docker image nvcr.io/nvidia/nemo:25.11.01 (pre-installed PyTorch, CUDA, NeMo).

Additional context

GPU            Architecture  VRAM   Compute Capability  Status
RTX 3080       Ampere        10 GB  8.6                 Works
RTX 4000 Pro   Blackwell     24 GB  12.0                Fails without workaround

Workaround:

export PYTORCH_NO_CUDA_MEMORY_CACHING=1
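
The same workaround can also be applied from Python, as a sketch: the variable just has to be set before PyTorch initializes its CUDA allocator, so the safe ordering is to set it before importing torch (passing it to the container via -e PYTORCH_NO_CUDA_MEMORY_CACHING=1 in docker run is equivalent):

import os

# Must be set before PyTorch's CUDA caching allocator is initialized,
# i.e. before the first CUDA allocation; setting it before importing
# torch is the safe ordering.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch
from nemo.collections.asr.models import EncDecMultiTaskModel
# ... load the model and transcribe as in the reproduction script above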

This suggests an incompatibility between PyTorch's CUDA caching allocator and Blackwell-architecture GPUs.
