Inference Run Examples

Tutorial for running the examples

Table of contents

Download the example trained models from AWS S3
Download LibriSpeech audio samples from openslr.org
Quickly run Streaming ASR Examples using Docker (super easy)
- Simple Streaming ASR Example using Docker
- Multithreaded Streaming ASR Example using Docker
Download and build wav2letter-inference
Simple Streaming ASR Example
Examine the transcription quality
Multithreaded Streaming ASR Example
Interactive Streaming ASR Example

Overview

The supplied examples can be used directly to quickly bootstrap a demo. There are two executables and two libraries. One executable is convenient for transcribing a single audio file or stream. The other is convenient for transcribing a large number of audio files.

simple_streaming_asr_example can be used to quickly transcribe a single audio file.
multithreaded_streaming_asr_example can be used to quickly transcribe many audio files.

The two libraries are useful for bootstrapping your own executables.

AudioToWords has convenient functions to transcribe an audio file or audio stream.
examples/Util.h has utilities to read input from any stream into the inference pipeline. It also has TimeElapsedReporter, a scoped performance measurement utility.
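
To illustrate what "scoped performance measurement" means, here is a minimal Python analogue. It is not the C++ TimeElapsedReporter API; the name time_elapsed_reporter and its out parameter are hypothetical, and the message wording only imitates the log lines shown later in this tutorial. The idea is that timing starts when a scope is entered and the elapsed time is reported when the scope exits.

```python
import time
from contextlib import contextmanager


@contextmanager
def time_elapsed_reporter(name, out=print):
    """Report when a task starts and how long it took once the scope exits."""
    out("Started {} ...".format(name))
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, like a C++ destructor would.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        out("Completed {} elapsed time={:.0f} milliseconds".format(name, elapsed_ms))


with time_elapsed_reporter("tokens file loading"):
    time.sleep(0.01)  # stand-in for the real work being measured
```

The benefit of the scoped form is that the report cannot be forgotten on an early return or an error path.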

Download the example trained models from AWS S3

~$> mkdir model
~$> cd model
~/model$> for f in acoustic_model.bin tds_streaming.arch decoder_options.json feature_extractor.bin language_model.bin lexicon.txt tokens.txt ; do wget http://dl.fbaipublicfiles.com/wav2letter/inference/examples/model/${f} ; done

~/model$> ls -sh
total 270M
254M acoustic_model.bin  
1.0K tds_streaming.arch	 
512 decoder_options.json   
512 feature_extractor.bin   
13M language_model.bin	
4.0M lexicon.txt   
82K tokens.txt
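
Before moving on, it can help to confirm that every model file arrived. The following is a small sketch, assuming the model directory is ~/model as above; the function name missing_model_files is made up for this illustration. It reports any of the seven files that are absent or empty, which is a common symptom of an interrupted download.

```python
from pathlib import Path

# The seven files fetched by the download loop above.
EXPECTED_MODEL_FILES = [
    "acoustic_model.bin",
    "tds_streaming.arch",
    "decoder_options.json",
    "feature_extractor.bin",
    "language_model.bin",
    "lexicon.txt",
    "tokens.txt",
]


def missing_model_files(model_dir):
    """Return the expected file names that are absent or empty in model_dir."""
    model_dir = Path(model_dir).expanduser()
    missing = []
    for name in EXPECTED_MODEL_FILES:
        path = model_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means all files are present and non-empty.
    print(missing_model_files("~/model"))
```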

Download LibriSpeech audio samples from openslr.org

~$> mkdir audio
~$> cd audio
~/audio$> wget -qO- http://www.openslr.org/resources/12/dev-clean.tar.gz | tar xvz
~/audio$> find LibriSpeech/dev-clean -type f -name "*.flac" -exec sox {} {}".wav"  \;
~/audio$> find "$(pwd)"/LibriSpeech/dev-clean -type f -name "*.wav" > LibriSpeech-dev-clean-wav-all.lst

We should have 2703 audio files.

~/audio$> wc -l LibriSpeech-dev-clean-wav-all.lst
2703 LibriSpeech-dev-clean-wav-all.lst
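
If find is not available, the list file can also be produced from Python. This is a sketch only: write_wav_list is a hypothetical helper, and it assumes the .wav files have already been created by the sox conversion step above. It writes one absolute path per line and returns the number of files found.

```python
from pathlib import Path


def write_wav_list(root, list_path):
    """Write the absolute path of every .wav file under root to list_path.

    Returns the number of paths written.
    """
    root = Path(root).expanduser().resolve()
    paths = sorted(str(p) for p in root.rglob("*.wav") if p.is_file())
    Path(list_path).expanduser().write_text("".join(p + "\n" for p in paths))
    return len(paths)
```

Called as write_wav_list("~/audio/LibriSpeech/dev-clean", "~/audio/LibriSpeech-dev-clean-wav-all.lst"), it should return the same count that wc -l reports for the full dev-clean set.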

Quickly run Streaming ASR Examples using Docker

Simple Streaming ASR Example using Docker

The wav2letter inference Docker image lets you quickly run the inference binaries without installing or building anything! You only need Docker installed and an internet connection, so that the image can be downloaded automatically on the first run.

sudo docker run --rm -v ~:/root/host/ -it --ipc=host --name w2l -a stdin -a stdout -a stderr wav2letter/wav2letter:inference-latest sh -c "cat /root/host/audio/LibriSpeech/dev-clean/777/126732/777-126732-0070.flac.wav | /root/wav2letter/build/inference/inference/examples/simple_streaming_asr_example --input_files_base_path /root/host/model"

Multithreaded Streaming ASR Example using Docker

The wav2letter inference Docker image also lets you quickly run the high-performance multithreaded inference binary without installing or building anything! In this example we transcribe a single audio file into a transcription file.

# Create output directory
mkdir -p ~/audio/LibriSpeech-dev-clean-transcribed

# Transcribe a single file named 777-126732-0070.flac.wav
sudo docker run --rm -v ~:/root/host/ -it --ipc=host --name w2l -a stdin -a stdout -a stderr \
wav2letter/wav2letter:inference-latest sh -c \
"/root/wav2letter/build/inference/inference/examples/multithreaded_streaming_asr_example \
--input_files_base_path /root/host/model \
--input_audio_files /root/host/audio/LibriSpeech/dev-clean/777/126732/777-126732-0070.flac.wav \
--output_files_base_path /root/host/audio/LibriSpeech-dev-clean-transcribed"

Download and build wav2letter-inference

Download and build kenlm

~$> git clone https://github.com/kpu/kenlm.git
~$> cd kenlm
~/kenlm$> mkdir build
~/kenlm$> cd build
~/kenlm/build$> cmake ..
~/kenlm/build$> make -j $(nproc)

Export MKL path. On most Linux-based systems, it is installed at /opt/intel/mkl.

export MKLROOT=/opt/intel/mkl

Download and build wav2letter-inference

~$> git clone https://github.com/facebookresearch/wav2letter.git
~$> cd wav2letter
~/wav2letter$> mkdir build
~/wav2letter$> cd build
~/wav2letter/build$> KENLM_ROOT_DIR=~/kenlm/build cmake .. -DW2L_BUILD_LIBRARIES_ONLY=ON -DW2L_BUILD_INFERENCE=ON -DW2L_LIBRARIES_USE_CUDA=OFF

Simple Streaming ASR Example

simple_streaming_asr_example can be used in a Unix pipe to print the transcription of a wav stream.

~/wav2letter/build$> make simple_streaming_asr_example -j $(nproc)
~/wav2letter/build$> cat ~/audio/LibriSpeech/dev-clean/777/126732/777-126732-0070.flac.wav | inference/inference/examples/simple_streaming_asr_example --input_files_base_path ~/model

Started features model file loading ...
Completed features model file loading elapsed time=46557 microseconds

Started acoustic model file loading ...
Completed acoustic model file loading elapsed time=2058 milliseconds

Started tokens file loading ...
Completed tokens file loading elapsed time=1318 microseconds

Tokens loaded - 9998 tokens
Started decoder options file loading ...
Completed decoder options file loading elapsed time=388 microseconds

Started create decoder ...
[Letters] 9998 tokens loaded.
[Words] 200001 words loaded.
Completed create decoder elapsed time=884 milliseconds

Started converting audio input from stdin to text... ...
Creating LexiconDecoder instance.
#start (msec), end(msec), transcription
0,1000,
1000,2000,he was out of his
2000,3000,mind with something
3000,4000,he overheard about eating
4000,5000,people's flesh
5000,6000,and drinking blood
6000,7000,what's the good of
7000,7315,of talking like that
Completed converting audio input from stdin to text... elapsed time=1302 milliseconds
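
The output mixes log lines with comma-separated rows of the form start,end,transcription. As a rough sketch, the rows can be picked out and joined into one transcript with a few lines of Python. The helper names below are hypothetical, and the parser assumes only the row format visible in the output above.

```python
def parse_transcription_lines(lines):
    """Yield (start_ms, end_ms, text) for each 'start,end,text' row."""
    for line in lines:
        parts = line.strip().split(",", 2)
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # log line, blank line, or the '#start ...' header
        yield int(parts[0]), int(parts[1]), parts[2]


def join_transcription(lines):
    """Concatenate the non-empty chunk texts into a single transcript."""
    return " ".join(text for _, _, text in parse_transcription_lines(lines) if text)


sample = [
    "Creating LexiconDecoder instance.",
    "#start (msec), end(msec), transcription",
    "0,1000,",
    "1000,2000,he was out of his",
    "2000,3000,mind with something",
]
print(join_transcription(sample))
```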

Examine the transcription quality

We can examine the transcription quality by comparing the output with the reference transcription of the audio file, which we look up by file number. For example, for the file ~/audio/LibriSpeech/dev-clean/777/126732/777-126732-0070.flac.wav the file number is 0070.

~/wav2letter/build$> grep 0070 ~/audio/LibriSpeech/dev-clean/777/126732/777-126732.trans.txt
777-126732-0070 HE WAS OUT OF HIS MIND WITH SOMETHING HE OVERHEARD ABOUT EATING PEOPLE'S FLESH AND DRINKING BLOOD WHAT'S THE GOOD OF TALKING LIKE THAT
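
To put a number on the comparison, one common measure is word error rate: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a plain dynamic-programming implementation written for this tutorial (it is not part of wav2letter); the function name is an assumption, and case is ignored because the reference is upper case while the output is lower case.

```python
def word_edit_distance(reference, hypothesis):
    """Return (edit_distance, reference_length), both counted in words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


reference = ("HE WAS OUT OF HIS MIND WITH SOMETHING HE OVERHEARD ABOUT EATING "
             "PEOPLE'S FLESH AND DRINKING BLOOD WHAT'S THE GOOD OF TALKING LIKE THAT")
hypothesis = ("he was out of his mind with something he overheard about eating "
              "people's flesh and drinking blood what's the good of of talking like that")
errors, ref_len = word_edit_distance(reference, hypothesis)
print("errors={} reference_words={} WER={:.3f}".format(errors, ref_len, errors / ref_len))
```

For this file the only difference is the repeated "of" at the chunk boundary, so the distance is one insertion over 24 reference words.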

simple_streaming_asr_example can also transcribe a single file when pointed at it directly with --input_audio_file:

~/wav2letter/build$> make simple_streaming_asr_example -j $(nproc)
~/wav2letter/build$> inference/inference/examples/simple_streaming_asr_example --input_files_base_path ~/model --input_audio_file ~/audio/LibriSpeech/dev-clean/777/126732/777-126732-0076.flac.wav
...
#start (msec), end(msec), transcription
0,1000,
1000,2000,i wish he
2000,3000,had never been to school
3000,4000,missus
4000,4260,began again brusquely
Completed converting audio input file=/home/audio/LibriSpeech/dev-clean/777/126732/777-126732-0076.flac.wav to text... elapsed time=914 milliseconds

Multithreaded Streaming ASR Example

multithreaded_streaming_asr_example can convert a large list of audio files using multiple threads.

For running this example quickly on a laptop we can start with a smaller number of files. Here we prepare an audio file list with 50 files.

~/wav2letter/build$> head -50 ~/audio/LibriSpeech-dev-clean-wav-all.lst > ~/audio/LibriSpeech-dev-clean-wav-50-files.lst

Now let's run wav2letter on the entire list.

~/wav2letter/build$> mkdir -p ~/audio/LibriSpeech-dev-clean-transcribed
~/wav2letter/build$> make multithreaded_streaming_asr_example -j $(nproc)

~/wav2letter/build$> inference/inference/examples/multithreaded_streaming_asr_example --input_audio_file_of_paths ~/audio/LibriSpeech-dev-clean-wav-50-files.lst --output_files_base_path ~/audio/LibriSpeech-dev-clean-transcribed --input_files_base_path ~/model --max_num_threads $(nproc)

...
Will process 50 files.
Started features model file loading ...
Completed features model file loading elapsed time=44232 microseconds

Started acoustic model file loading ...
Completed acoustic model file loading elapsed time=2205 milliseconds

Started tokens file loading ...
Completed tokens file loading elapsed time=1521 microseconds

Tokens loaded - 9998 tokens
Started decoder options file loading ...
Completed decoder options file loading elapsed time=342 microseconds

Started create decoder ...
[Letters] 9998 tokens loaded.
[Words] 200001 words loaded.
Completed create decoder elapsed time=804 milliseconds

Started converting audio input files to text ...
Creating thread pool with 80 threads.
audioFileToWordsFile() processing audioFileToWordsFile() processing 1/50 input=2//home/audio/LibriSpeech/dev-clean/2277/149896/2277-149896-0005.flac.wav50 output=/home/audio/LibriSpeech-dev-clean-transcribed/2277-149896-0005.flac.wav.txt
audioFileToWordsFile() processing audioFileToWordsFile() processing 4/50 input=/home/audio/LibriSpeech/dev-clean/2277/149896/2277-149896-0006.flac.wav output=/home/audio/LibriSpeech-dev-clean-transcribed/2277-149896-0006.flac.wav.txt
audioFileToWordsFile() processing 50/50 input=/home/audio/LibriSpeech/dev-clean/2277/149897/2277-149897-0004.flac.wav output=/home/audio/LibriSpeech-dev-clean-transcribed/2277-149897-0004.flac.wav.txt
...
Completed converting audio input files to text elapsed time=18648 milliseconds
Completed create decoder elapsed time=908 milliseconds
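
In the log above, each output file appears to be named after its input file with .txt appended and placed in the output directory. Assuming that naming holds in general (it is inferred from the log, not from a documented contract), a short Python sketch can list which inputs still lack a transcription, for example after an interrupted run. Both function names are hypothetical.

```python
from pathlib import Path


def expected_output_path(audio_path, output_dir):
    """Output path inferred from the log: <output_dir>/<input file name>.txt"""
    return Path(output_dir) / (Path(audio_path).name + ".txt")


def untranscribed(list_file, output_dir):
    """Return the audio paths in list_file that have no output file yet."""
    lines = Path(list_file).read_text().splitlines()
    audio_paths = [line.strip() for line in lines if line.strip()]
    return [p for p in audio_paths
            if not expected_output_path(p, output_dir).is_file()]
```

The returned paths can be written to a new list file and passed to --input_audio_file_of_paths for a second run.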

Interactive Streaming ASR Example

interactive_streaming_asr_example can be used to quickly poke around and transcribe audio files on the fly. Interactive mode loads the modules once (loading the modules takes 10-40 seconds) and opens a simple command line shell. From that shell you may transcribe audio files chosen on the fly and redirect the output. This executable is also very convenient to use from Python using popen(). In the following example we make and run the interactive mode example and ask it to transcribe a single audio file.

~/wav2letter/build$> make interactive_streaming_asr_example -j $(nproc)
~/wav2letter/build$> inference/inference/examples/interactive_streaming_asr_example --input_files_base_path ~/model 
...
Entering interactive command line shell. enter '?' for help.
------------------------------------------------------------
$>input=/home/audio/LibriSpeech/dev-clean/777/126732/777-126732-0011.flac.wav
Transcribing file:/home/audio/LibriSpeech/dev-clean/777/126732/777-126732-0011.flac.wav to:stdout
Creating LexiconDecoder instance.
#start (msec), end(msec), transcription
0,1000,
1000,2000,the
2000,3000,possessor of property had not
3000,4000,only to face the awakened
4000,5000,proletariat
5000,6000,but they had also
6000,7000,to fight amongst
7000,7450,themselves
#finish transcribing

The output can be redirected to a file, as in:

$>output=/tmp/txt

This creates the file, or appends to it if it already exists.

How can I use this from Python?

You can run interactive_streaming_asr_example as a child process from Python and use it to transcribe a sequence of audio files:

import os
import signal
from subprocess import Popen, PIPE


model_path = "path/to/model/dir"
w2l_bin = "path/to/interactive_streaming_asr_example"
# start_new_session=True puts the child in its own process group, so the
# os.killpg() call at the end does not also terminate this Python script.
w2l_process = Popen('{} --input_files_base_path={}'.format(w2l_bin, model_path),
                    stdin=PIPE, stdout=PIPE, stderr=PIPE,
                    shell=True, start_new_session=True)

paths_to_audio_file = ["path/to/audio/file", "path/to/audio/file2", "path/to/audio/file3"]  # ...
for path_to_audio_file in paths_to_audio_file:
    # write to the stdin of the process
    # make sure to flush and add \n to the string
    w2l_process.stdin.write("input={}\n".format(path_to_audio_file).encode())
    w2l_process.stdin.flush()

    while True:
        # read one line from the process stdout
        output = w2l_process.stdout.readline()
        if output == b'':
            # an empty read means the process exited or closed its stdout
            raise RuntimeError("interactive_streaming_asr_example exited unexpectedly")
        if output == b'#finish transcribing\n':
            # finished transcribing this audio file
            break
        print(output.decode().rstrip())
# finish the process
os.killpg(os.getpgid(w2l_process.pid), signal.SIGTERM)

The output file and the end-of-transcription token can both be changed:

w2l_process.stdin.write(b"output=/tmp/out.txt\n")
w2l_process.stdin.write(b"endtoken=DONE\n")
w2l_process.stdin.write("input={}\n".format(path_to_audio_file).encode())
w2l_process.stdin.flush()
# The output is saved in `/tmp/out.txt`, and its last line is `DONE` instead of '#finish transcribing'.
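
The read-until-end-token loop is easy to factor into a helper that can be tested without the binary. The following is a sketch under stated assumptions: read_transcription is a hypothetical name, and the end token defaults to the '#finish transcribing' line shown above. It works on any binary stream with a readline() method, so an in-memory buffer stands in for the process stdout here.

```python
import io


def read_transcription(stream, end_token=b"#finish transcribing"):
    """Collect lines from a binary stream until end_token is seen.

    Raises EOFError if the stream ends first, which would otherwise
    leave a plain `while True` loop spinning on empty reads.
    """
    lines = []
    while True:
        raw = stream.readline()
        if raw == b"":
            raise EOFError("stream closed before the end token was seen")
        line = raw.rstrip(b"\r\n")
        if line == end_token:
            return [l.decode("utf-8", errors="replace") for l in lines]
        lines.append(line)


# An in-memory stream standing in for w2l_process.stdout.
fake_stdout = io.BytesIO(b"0,1000,\n1000,2000,the\n#finish transcribing\n")
print(read_transcription(fake_stdout))
```

With a custom end token set through endtoken=DONE, the same helper would be called with end_token=b"DONE".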