-
I'm testing out canary-1b using the instructions on Hugging Face, trying to transcribe an hour-long podcast. My workflow is to split the audio into X-second chunks (sketched below) and then transcribe each chunk with the model.
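For reference, the chunking step looks roughly like this (a minimal sketch: the soundfile dependency, the file names, and the mono WAV input are illustrative assumptions, not the exact code):

```python
# Rough sketch of the chunking step (assumes a mono WAV input;
# soundfile, file names, and chunk_secs are illustrative).
import soundfile as sf

audio, sr = sf.read("podcast.wav")
chunk_secs = 30  # X in the discussion below
samples_per_chunk = chunk_secs * sr

input_files = []
for i, start in enumerate(range(0, len(audio), samples_per_chunk)):
    path = f"chunk_{i:04d}.wav"
    sf.write(path, audio[start:start + samples_per_chunk], sr)
    input_files.append(path)
```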
Here is the snippet for the transcription step:

```python
from nemo.collections.asr.models import EncDecMultiTaskModel

# load model
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')

# update decode params
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)

predicted_text = canary_model.transcribe(
    paths2audio_files=input_files,
    batch_size=1,
)
```

I notice that with higher X (e.g., 180 seconds) the predicted text starts out fine and then degenerates into repeated words like "yeah yeah yeah yeah". As I lower X, the problem happens less often, but for it not to occur at all in any of my chunks I seem to have to go all the way down to 30 seconds. Am I doing something wrong here? Is there some kind of repetition penalty I can apply during decoding?
-
If you're performing long-audio inference, it's better to use the buffered inference scripts than to depend on the model itself for long-form inference (even if it's capable of it).
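The chunked inference scripts live under examples/asr/asr_chunked_inference in the NeMo repo; the exact script names and options vary across NeMo versions, so check that directory in your checkout. As a rough Python stand-in for what they do (a sketch only, reusing canary_model and input_files from the question above):

```python
# Rough stand-in for buffered inference (a sketch, not NeMo's actual script).
# Assumes canary_model and input_files (<= 30 s chunks) from the question.
# The real scripts also overlap consecutive buffers and merge the overlap so
# words are not clipped at chunk boundaries; this naive join can clip them.
texts = canary_model.transcribe(
    paths2audio_files=input_files,
    batch_size=1,
)
full_transcript = " ".join(texts)
print(full_transcript)
```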
Canary is an AED (attention-based encoder-decoder) model, like Whisper. It is not trained with an alignment loss (CTC, RNNT, TDT) but with next-token prediction. The decoder has never seen text corresponding to more than 30-40 seconds of audio, so it loses attention tracking once the input stretches to 1-2 minutes, which is why the output degenerates into repetitions.