Describe the bug
When performing batched inference with EncDecMultiTaskModel (Canary), the transcribe method fails with an AssertionError if a list of strings is passed to the source_lang or target_lang arguments.
Even when the language list has the same length as the audio list, the PromptFormatter validates the entire list against a single prompt slot, raising a modality mismatch. As a result, users are forced to write temporary manifest files to disk to achieve mixed-language batching, which is inefficient for online inference; a sketch of that workaround is included below.
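For context, the only way to run a mixed-language batch today is to write a manifest to disk, one JSON line per utterance, and pass its path to transcribe. The following is a minimal sketch of that workaround; the manifest field names are assumed from the Canary manifest format, and it assumes transcribe accepts a manifest path.
# Workaround sketch (not the proposed fix): write a per-utterance manifest so each
# entry carries its own source_lang/target_lang, then pass the manifest path to
# transcribe(). Field names are assumed from the Canary manifest format; additional
# fields (e.g. duration) may also be required.
import json
import tempfile
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-180m-flash")
entries = [
    {"audio_filepath": "test_1.wav", "source_lang": "en", "target_lang": "en",
     "taskname": "asr", "pnc": "yes"},
    {"audio_filepath": "test_2.wav", "source_lang": "fr", "target_lang": "fr",
     "taskname": "asr", "pnc": "yes"},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
    manifest_path = f.name

results = model.transcribe(manifest_path, batch_size=2)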
Steps/Code to reproduce bug
import torch
from nemo.collections.asr.models import EncDecMultiTaskModel
model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-180m-flash")
audio_list = ["test_1.wav", "test_2.wav"]
# This triggers: AssertionError: slot='source_lang' received value=['en', 'fr']
# which does not match modality <class 'nemo.collections.common.prompts.formatter.Text'>
results = model.transcribe(
audio=audio_list,
source_lang=["en", "fr"],
target_lang=["en", "fr"],
pnc="yes"
)
Expected behavior
The transcribe method should accept lists for source_lang and target_lang of the same length as the audio list. It should map each language code to the corresponding audio file in the batch, allowing for efficient mixed-language inference without manifest-file overhead.
Environment overview (please complete the following information)
- Environment location: Cloud (AWS)
- Method of NeMo install: pip install "nemo_toolkit[asr]"
Environment details
- OS version: Ubuntu 22.04
- PyTorch version: 2.10 (User specified)
- Python version: 3.10
- CUDA version: 12.8