
NeMo ASR Inference Fails on RTX 4000 Pro Blackwell Without Disabling CUDA Memory Caching #15338

@ShakibSanatgar

Description

Describe the bug

NeMo Canary-1B-v2 ASR model inference fails on NVIDIA RTX 4000 Pro (Blackwell architecture, compute capability 12.0) unless PYTORCH_NO_CUDA_MEMORY_CACHING=1 is set. The same model works correctly on RTX 3080 (Ampere, compute capability 8.6) without this workaround.

The model loads successfully on both GPUs; the issue only manifests during model.transcribe().
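For reference, the GPU and compute capability actually visible inside the container can be confirmed with a short check using standard torch.cuda calls (the printed values below are illustrative):

import torch

# Confirm which device and compute capability the process sees.
print(torch.cuda.get_device_name(0))
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")  # 8.6 on RTX 3080, 12.0 on the Blackwell card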

Steps/Code to reproduce bug

import torch
from nemo.collections.asr.models import EncDecMultiTaskModel

# Load model
model = EncDecMultiTaskModel.from_pretrained(
    "nvidia/canary-1b-v2",
    map_location="cuda",
)
model.eval()

# Configure beam decoding
decode_cfg = model.cfg.multitask_decoding
decode_cfg.strategy = "beam"
decode_cfg.beam.beam_size = 5
model.change_decoding_strategy(decode_cfg)

# Transcribe - FAILS on Blackwell without PYTORCH_NO_CUDA_MEMORY_CACHING=1
with torch.inference_mode():
    outputs = model.transcribe(
        audio=["test_audio.wav"],
        batch_size=1,
        source_lang="de",
        target_lang="de",
    )
print(outputs[0].text)

Expected behavior

The model should transcribe audio successfully on Blackwell GPUs without requiring PYTORCH_NO_CUDA_MEMORY_CACHING=1.

Environment overview

  • Environment location: Docker
  • Method of NeMo install: NVIDIA Docker image
  • Docker pull: docker pull nvcr.io/nvidia/nemo:25.11.01
  • Docker run:
docker run --gpus '"device=1"' \
  --shm-size=16gb \
  --ulimit memlock=-1 \
  -e CUDA_VISIBLE_DEVICES=0 \
  nvcr.io/nvidia/nemo:25.11.01

Environment details

Using NVIDIA docker image nvcr.io/nvidia/nemo:25.11.01 (pre-installed PyTorch, CUDA, NeMo).

Additional context

GPU            Architecture  VRAM   Compute Capability  Status
RTX 3080       Ampere        10 GB  8.6                 Works
RTX 4000 Pro   Blackwell     24 GB  12.0                Fails without workaround

Workaround:

export PYTORCH_NO_CUDA_MEMORY_CACHING=1
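
The same workaround can also be applied from Python, as a sketch: the variable just has to be set before PyTorch initializes its CUDA allocator, so the safe ordering is to set it before importing torch (passing it to the container via -e PYTORCH_NO_CUDA_MEMORY_CACHING=1 in docker run is equivalent):

import os

# Must be set before PyTorch's CUDA caching allocator is initialized,
# i.e. before the first CUDA allocation; setting it before importing
# torch is the safe ordering.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch
from nemo.collections.asr.models import EncDecMultiTaskModel
# ... load the model and transcribe as in the reproduction script above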

This suggests an incompatibility between PyTorch's CUDA caching allocator and Blackwell-architecture GPUs.
