NeMo ASR Inference Fails on Blackwell GPU Without PYTORCH_NO_CUDA_MEMORY_CACHING=1
Describe the bug
NeMo Canary-1B-v2 ASR model inference fails on NVIDIA RTX 4000 Pro (Blackwell architecture, compute capability 12.0) unless PYTORCH_NO_CUDA_MEMORY_CACHING=1 is set. The same model works correctly on RTX 3080 (Ampere, compute capability 8.6) without this workaround.
The model loads successfully on both GPUs; the issue only manifests during model.transcribe().
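For reference, a minimal check of which architecture PyTorch reports for the active GPU (assuming device index 0 is the card under test):

import torch

# Print the device name and compute capability PyTorch sees for device 0
# (8.6 on the RTX 3080 / Ampere, 12.0 on the RTX 4000 Pro / Blackwell).
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"compute capability {major}.{minor}")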
Steps/Code to reproduce bug
import torch
from nemo.collections.asr.models import EncDecMultiTaskModel

# Load model
model = EncDecMultiTaskModel.from_pretrained(
    "nvidia/canary-1b-v2",
    map_location="cuda",
)
model.eval()

# Configure beam decoding
decode_cfg = model.cfg.multitask_decoding
decode_cfg.strategy = "beam"
decode_cfg.beam.beam_size = 5
model.change_decoding_strategy(decode_cfg)

# Transcribe - FAILS on Blackwell without PYTORCH_NO_CUDA_MEMORY_CACHING=1
with torch.inference_mode():
    outputs = model.transcribe(
        audio=["test_audio.wav"],
        batch_size=1,
        source_lang="de",
        target_lang="de",
    )

print(outputs[0].text)

Expected behavior
Model should transcribe audio successfully on Blackwell GPUs without requiring PYTORCH_NO_CUDA_MEMORY_CACHING=1.
Environment overview
- Environment location: Docker
- Method of NeMo install: NVIDIA Docker image
- Docker pull:
docker pull nvcr.io/nvidia/nemo:25.11.01
- Docker run:
docker run --gpus '"device=1"' \
  --shm-size=16gb \
  --ulimit memlock=-1 \
  -e CUDA_VISIBLE_DEVICES=0 \
  nvcr.io/nvidia/nemo:25.11.01

Environment details
Using NVIDIA docker image nvcr.io/nvidia/nemo:25.11.01 (pre-installed PyTorch, CUDA, NeMo).
Additional context
| GPU | Architecture | VRAM | Compute Capability | Status |
|---|---|---|---|---|
| RTX 3080 | Ampere | 10GB | 8.6 | Works |
| RTX 4000 Pro | Blackwell | 24GB | 12.0 | Fails without workaround |
Workaround:
export PYTORCH_NO_CUDA_MEMORY_CACHING=1

This suggests a compatibility issue between PyTorch's CUDA caching memory allocator and Blackwell architecture GPUs.
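A minimal sketch of applying the workaround from inside a Python script, assuming the variable needs to be present in the environment before PyTorch performs its first CUDA allocation (inside the container it can equally be passed via docker run -e PYTORCH_NO_CUDA_MEMORY_CACHING=1):

import os

# Workaround: disable PyTorch's CUDA caching allocator.
# Assumption: the variable must be set before the first CUDA allocation,
# so export it before importing torch / loading the model.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch  # imported only after the variable is set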