
Mixed-language inference with EncDecMultiTaskModel (Canary) #15349

@yonas-g

Description


Describe the bug

When performing batched inference using EncDecMultiTaskModel (Canary), the transcribe method fails with an AssertionError if a list of strings is passed to the source_lang or target_lang arguments.

Even when the language list matches the audio list in length, the PromptFormatter validates the entire list against a single prompt slot and raises a modality mismatch. As a workaround, users must write temporary manifest files to disk to achieve mixed-language batching, which is inefficient for online inference.
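For reference, the disk-based workaround looks roughly like this. This is a minimal sketch: the field names (audio_filepath, source_lang, target_lang, taskname, pnc) follow the Canary manifest schema, and it assumes transcribe accepts a manifest path; the helper name is illustrative.

```python
import json
import tempfile

def write_canary_manifest(audio_files, source_langs, target_langs, pnc="yes"):
    """Write a temporary NeMo-style manifest (one JSON object per line)
    so each utterance can carry its own language pair."""
    assert len(audio_files) == len(source_langs) == len(target_langs)
    f = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False)
    for audio, src, tgt in zip(audio_files, source_langs, target_langs):
        entry = {
            "audio_filepath": audio,
            "source_lang": src,
            "target_lang": tgt,
            # ASR when languages match, speech translation otherwise
            "taskname": "asr" if src == tgt else "s2t_translation",
            "pnc": pnc,
        }
        f.write(json.dumps(entry) + "\n")
    f.close()
    return f.name

# manifest = write_canary_manifest(audio_list, ["en", "fr"], ["en", "fr"])
# results = model.transcribe(manifest)  # extra disk round-trip per request
```

Every online request pays a file-creation round-trip, which is exactly the overhead this issue asks to remove.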

Steps/Code to reproduce bug

import torch
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-180m-flash")

audio_list = ["test_1.wav", "test_2.wav"]

# This triggers: AssertionError: slot='source_lang' received value=['en', 'fr'] 
# which does not match modality <class 'nemo.collections.common.prompts.formatter.Text'>
results = model.transcribe(
    audio=audio_list,
    source_lang=["en", "fr"], 
    target_lang=["en", "fr"],
    pnc="yes"
)

Expected behavior

The transcribe method should accept lists for source_lang and target_lang of the same length as the audio list. It should map each language code to the corresponding audio file in the batch, allowing for efficient mixed-language inference without manifest-file overhead.
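The requested mapping could be sketched as follows. This is illustrative only, not NeMo API: the helpers are hypothetical and show the broadcast/validate semantics the issue asks for (a single code applies to all files; a list must match the audio list in length).

```python
def broadcast_lang(value, n_audio):
    """Accept either a single language code or a per-utterance list,
    returning exactly one code per audio file."""
    if isinstance(value, str):
        return [value] * n_audio
    if len(value) != n_audio:
        raise ValueError(f"expected {n_audio} language codes, got {len(value)}")
    return list(value)

def build_prompts(audio_files, source_lang, target_lang):
    """Pair each audio file with its own (source, target) languages."""
    n = len(audio_files)
    srcs = broadcast_lang(source_lang, n)
    tgts = broadcast_lang(target_lang, n)
    return [
        {"audio": a, "source_lang": s, "target_lang": t}
        for a, s, t in zip(audio_files, srcs, tgts)
    ]

# build_prompts(["a.wav", "b.wav"], ["en", "fr"], "en")
# pairs a.wav with en->en and b.wav with fr->en in one batch
```

With per-utterance prompts built this way, each sample in the batch gets its own language slot instead of the whole list being validated against one slot.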

Environment overview (please complete the following information)

  • Environment location: Cloud (AWS)
  • Method of NeMo install: pip install "nemo_toolkit[asr]"

Environment details

  • OS version: Ubuntu 22.04
  • PyTorch version: 2.10 (User specified)
  • Python version: 3.10
  • CUDA version: 12.8
