-
I'm testing out canary-1b using the instructions on Hugging Face, trying to transcribe an hour-long podcast. My workflow is to split the audio into X-second chunks (sketched below) and then transcribe each chunk with the model.
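For reference, the chunking step looks roughly like this (a minimal sketch: the soundfile dependency, the file names, and the mono WAV input are illustrative assumptions, not the exact code):

```python
# Rough sketch of the chunking step (assumes a mono WAV input;
# soundfile, file names, and chunk_secs are illustrative).
import soundfile as sf

audio, sr = sf.read("podcast.wav")
chunk_secs = 30  # X in the discussion below
samples_per_chunk = chunk_secs * sr

input_files = []
for i, start in enumerate(range(0, len(audio), samples_per_chunk)):
    path = f"chunk_{i:04d}.wav"
    sf.write(path, audio[start:start + samples_per_chunk], sr)
    input_files.append(path)
```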
Here is the snippet for the transcription step:

```python
from nemo.collections.asr.models import EncDecMultiTaskModel

# load model
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')

# update decode params
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)

predicted_text = canary_model.transcribe(
    paths2audio_files=input_files,
    batch_size=1,
)
```

I notice that with higher X (e.g., 180 seconds) the predicted text starts out fine and then degenerates into repeated words like "yeah yeah yeah yeah". As I lower X, the problem happens less often, but for it not to occur at all in any of my chunks I seem to have to go all the way down to 30 seconds. Am I doing something wrong here? Is there some kind of repetition penalty I can apply during decoding?
-
If you're performing long-audio inference, it's better to use the buffered inference scripts than to depend on the model itself for long-form inference (even if it's capable of it).
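The chunked inference scripts live under examples/asr/asr_chunked_inference in the NeMo repo; the exact script names and options vary across NeMo versions, so check that directory in your checkout. As a rough Python stand-in for what they do (a sketch only, reusing canary_model and input_files from the question above):

```python
# Rough stand-in for buffered inference (a sketch, not NeMo's actual script).
# Assumes canary_model and input_files (<= 30 s chunks) from the question.
# The real scripts also overlap consecutive buffers and merge the overlap so
# words are not clipped at chunk boundaries; this naive join can clip them.
texts = canary_model.transcribe(
    paths2audio_files=input_files,
    batch_size=1,
)
full_transcript = " ".join(texts)
print(full_transcript)
```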
Canary is an AED (attention-based encoder-decoder) model, like Whisper. It is not trained with an alignment loss (CTC, RNNT, TDT) but with next-token prediction. The decoder has never seen text corresponding to more than 30-40 seconds of audio, so it loses attention tracking once the input stretches to 1-2 minutes, which is why the output degenerates into repetitions.