Word timestamps of each individual word in the inference #987
Hey! Here https://github.com/flashlight/flashlight/blob/master/flashlight/app/asr/Decode.cpp#L641 you will have the per-frame token indices in rawTokenPrediction, so you can do any postprocessing and print the computed word timings there. The only thing to keep in mind when converting back to the original time scale is the model stride.
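A minimal sketch of what such postprocessing could look like at that point, assuming rawTokenPrediction is a std::vector<int> of per-frame token indices and that you have some way to map an index back to its token string (e.g. via tokenDict). The function name dumpFrameTokens, the idxToToken callback and the stride parameters are placeholders for illustration, not the actual flashlight API:

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Print each output frame's token with its approximate start time in the
// original audio. msPerFrame = feature stride (e.g. 10ms) * total model stride.
void dumpFrameTokens(
    const std::vector<int>& rawTokenPrediction,
    const std::function<std::string(int)>& idxToToken, // e.g. wraps tokenDict
    double frameStrideMs, // --framestridems, typically 10
    int modelStride) {    // total stride inside the model architecture
  const double msPerFrame = frameStrideMs * modelStride;
  for (std::size_t i = 0; i < rawTokenPrediction.size(); ++i) {
    std::cout << i * msPerFrame << "ms\t" << idxToToken(rawTokenPrediction[i])
              << "\n";
  }
}

int main() {
  // toy example with a made-up 2-token dictionary
  std::vector<std::string> toyDict = {"_he", "llo"};
  dumpFrameTokens(
      {0, 0, 1},
      [&](int i) { return toyDict[i]; },
      /*frameStrideMs=*/10.0,
      /*modelStride=*/8);
  // prints: 0ms _he / 80ms _he / 160ms llo
}
```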
Hi @tlikhomanenko, thank you very much for the response. Could you please explain how to convert the per-frame token indices to word timings? Please provide an example if possible. I think my model has the frame stride set to 10ms. Thanks
Well, I can only help navigate Decode.cpp (not the online inference, if that is what you are referring to). The quick questions I have before going further: what values did you set for the flags? Also, what is the model architecture (what stride happens inside the model itself)?
Hi @tlikhomanenko, okay, thank you. I am referring to Decode.cpp. Please find the info below. Please find the model architecture below: Stride: I haven't changed anything; it is whatever the default value is in the streaming convnets recipe. More info:
Thank you
So you have the per-frame token predictions from https://github.com/flashlight/flashlight/blob/master/flashlight/app/asr/Decode.cpp#L641.
Now, for example, if the tokens are ["_hel", "_hel", "_hel", "lo", "lo", "lo", "_world", "_world"], then you have "hello" from 0-480ms and "world" from 480-640ms.
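A small, self-contained sketch of that grouping, assuming a word-piece token set where a piece starting with "_" begins a new word, and that each output frame covers 80ms of audio (10ms features times the streaming convnets model stride). The frame tokens are passed in as strings already looked up from tokenDict; CTC blanks and the decoder's extra silence frames are assumed to have been stripped beforehand (see further down in this thread):

```cpp
#include <iostream>
#include <string>
#include <vector>

struct WordSpan {
  std::string word;
  double startMs;
  double endMs;
};

// Group consecutive word-piece frames into words with start/end times.
std::vector<WordSpan> framesToWords(
    const std::vector<std::string>& frameTokens, double msPerFrame) {
  std::vector<WordSpan> words;
  for (std::size_t i = 0; i < frameTokens.size(); ++i) {
    const auto& tok = frameTokens[i];
    if (i > 0 && tok == frameTokens[i - 1]) {
      continue; // same piece repeated over several frames
    }
    if (tok.rfind("_", 0) == 0) { // piece starting with "_" opens a new word
      if (!words.empty()) {
        words.back().endMs = i * msPerFrame; // previous word ends here
      }
      words.push_back({tok.substr(1), i * msPerFrame, 0.0});
    } else if (!words.empty()) {
      words.back().word += tok; // continuation piece of the current word
    }
  }
  if (!words.empty()) {
    words.back().endMs = frameTokens.size() * msPerFrame;
  }
  return words;
}

int main() {
  auto words = framesToWords(
      {"_hel", "_hel", "_hel", "lo", "lo", "lo", "_world", "_world"}, 80.0);
  for (const auto& w : words) {
    std::cout << w.word << " " << w.startMs << "-" << w.endMs << "ms\n";
  }
  // prints: hello 0-480ms, world 480-640ms
}
```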
Thank you @tlikhomanenko for the response, I appreciate it. I did test the decoder after making the code changes. The word timings I calculated based on the info in the rawTokenPrediction and tokenDict data structures don't seem to match the timings of the words in the audio. Is the frame size of 80ms correct? Please correct me if I am wrong. Also, what does the "#" represent in the tokenDict entries? Here are the output details of the two audio files I tested with the decoder:
More info:
Well, "#" means the CTC blank token. Also, if I remember correctly (https://github.com/flashlight/flashlight/blob/master/flashlight/lib/text/decoder/LexiconDecoder.cpp#L257, https://github.com/flashlight/flashlight/blob/master/flashlight/lib/text/decoder/LexiconDecoder.cpp#L27), you need to remove the first and last silence tokens, as we add them artificially during decoding. Then the duration in frames seems similar to what you have in the audio, using 80ms per frame. I am not sure CTC is a good criterion for learning very good alignment; it is better to use ASG, and moreover to do it with letter tokens rather than with word-pieces. The overall task is to predict the transcription correctly, not the alignment, although alignment is still necessary to do ASR well.
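A rough pre-processing sketch along those lines for the raw per-frame prediction. Here silIdx and blankIdx are placeholders for the indices of the silence token and of "#" in your token dictionary, and folding blank frames into the preceding token is just one simple heuristic so that every frame still counts toward a duration; none of this is prescribed by flashlight itself:

```cpp
#include <vector>

// Drop the artificially added first/last silence frames and fold CTC blank
// frames into the previous token so frame counts (and durations) are kept.
std::vector<int> cleanRawPrediction(
    std::vector<int> frames, int silIdx, int blankIdx) {
  if (!frames.empty() && frames.front() == silIdx) {
    frames.erase(frames.begin()); // silence prepended by the decoder
  }
  if (!frames.empty() && frames.back() == silIdx) {
    frames.pop_back(); // silence appended by the decoder
  }
  for (std::size_t i = 1; i < frames.size(); ++i) {
    if (frames[i] == blankIdx) {
      frames[i] = frames[i - 1]; // attribute blank frames to the current token
    }
  }
  return frames;
}
```

The cleaned indices can then be looked up in tokenDict and grouped into word spans as in the earlier sketch.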
Hi @tlikhomanenko, thank you for your comments.
Yep, it looks correct to me. Again, the total duration after removing the first and last frame now looks correct. The problem with the segmentation is what I said about the model itself and word-pieces. About config changes, please have a look at this model for example https://github.com/flashlight/wav2letter/tree/main/recipes/lexicon_free or the more recent one with a transformer https://github.com/flashlight/wav2letter/tree/main/recipes/slimIPL - they are trained with letters: you need to change the tokens and the lexicon, and decrease the stride in the model itself (because it is too large otherwise; it should be 2 or 3). You should not fork the model, because that only resets the optimizer, not the model itself. Also, I would first check whether the Viterbi path gives a meaningful alignment without the decoder; otherwise it is definitely a problem of the word-piece usage. Also have a look at the tool here https://github.com/flashlight/flashlight/tree/master/flashlight/app/asr/tools/alignment to perform alignment without a language model.
Hi @tlikhomanenko, I realized later that you might be referring to the total duration. Thank you for your comments. Okay, I did explore the other wav2letter recipes and figured out that the lexicon_free, conv_glu and learnable frontend recipes use the ASG criterion. I also ran the decoder with the lexicon_free recipe pre-trained models (AM & LM) and files (tokens, lexicon and so on). Is the frame size used in the lexicon_free arch 10ms? (Arch file: https://github.com/flashlight/wav2letter/blob/main/recipes/lexicon_free/librispeech/am.arch). The "framestridems" is set to 10 in the base AM model and I assume the stride is "1"? If so, then the word timings reported seem more accurate compared to the streaming convnets pre-trained recipe models/files. A few questions I have, please address them:
I did run the AM alone (using the Test binary) for the streaming_convnets recipe with my models, same config as shown in the previous comments. The timing is almost the same as the decoder results (not correct). Thank you for this; I used the "align" executable in wav2letter v0.2 with my streaming convnets recipe models, same config as mentioned in the previous comments in this thread. It didn't help actually, the timing was off, and I am not sure if I interpreted it correctly. Please see the screenshot below:
Yep, correct: the stride of the arch is 1 and the data preprocessing is 10ms, so a frame after the network corresponds to 10ms of audio.
Striding can happen in conv and pooling layers, so you can simply check those types of layers to see whether they have a stride.
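In other words, the time covered by one output frame is the feature stride (framestridems) multiplied by the product of all conv/pool strides in the arch file. A tiny illustration of that arithmetic; the layer strides below are made-up placeholders, not the actual streaming convnets or lexicon_free arch:

```cpp
#include <iostream>
#include <vector>

int main() {
  double frameStrideMs = 10.0; // --framestridems
  std::vector<int> layerStrides = {2, 2, 2}; // strides of conv/pool layers (example)
  int totalStride = 1;
  for (int s : layerStrides) {
    totalStride *= s; // total model stride is the product of layer strides
  }
  std::cout << "ms per output frame: " << frameStrideMs * totalStride << "\n";
  // e.g. 10ms * 8 = 80ms for a total stride of 8,
  // or 10ms * 1 = 10ms for an arch with no striding layers (like lexicon_free).
}
```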
Where do you see this parameter? I don't see it in the lexfree train config.
Yes, and potentially this should work, because I believe the arch itself is good and can be used for another token set and stride too. The main question is what you are doing and why. Do you need an online alignment model (because streaming convnets are online in the sense of not using a large future context)? Otherwise, you can retrain the lexfree model with SpecAugment and use it, or use the recent RASR transformer model with CTC, or retrain the RASR transformer model with ASG. There are a lot of options.
If you use the same type of architecture and only change its params, like the stride, number of layers, etc., then it should work (cc @vineelpratap). About inference decoding - not sure, cc @xuqiantong; maybe you need to change the decoding with respect to ASG (no blanks, but repetition tokens). I would test whether the RASR transformer model works well for alignment: if yes, then retrain streaming convnets with CTC as before, but with letter tokens and a stride of 2-3 instead. This will give an online model which you can use, plus better alignment.
Thank you for your comments.
Hi @tlikhomanenko, I changed the total stride from 7 to 3 in the streaming convnets recipe architecture and trained it from scratch on the Librispeech data with letter tokens. The AM model seems to have trained fine, but the word timings reported by the AM/decoder are still bad. Please find the modified architecture below; could you please let me know whether the changes I made make sense? Note: I tried with a total stride of 4 and of 2 as well, with no luck. I also did experiments removing the 2nd/3rd/4th PD+CN+R+DO+LN+TDS layer blocks in the corresponding training experiments.
Original Arch File:
Modified Arch File:
Question:
Is there a way to accurately calculate or compute the individual word-level timing of each word, i.e. when it appears after the start of the audio?
Note:
I referred to the following existing ticket, #809, but it looks like there is no solution in that ticket. Could you please help by pointing me to the right resource that would help me find accurate word-level timings?
Thanks