rnnt

1. Problem

Speech recognition accepts raw audio samples and produces a corresponding character transcription, without an external language model.

2. Directions

Open run.sh. Set the stage variable to "-1". Set "work_dir" to a path backed by a disk with at least 30 GB of space. Most space is used by loadgen logs, not the data or model. You need conda and a C/C++ compiler on your PATH. I used conda 4.8.2. This script is responsible for downloading dependencies, data, and the model.

Run ./run.sh from this directory. Note that stage 3 runs all of the scenarios for the reference implementation, which will take a long time, so you may want to exit before then.

As you complete individual stages, you can set the "stage" variable to a higher number to restart from a later stage.

3. Dataset/Environment

Publication/Attribution

"OpenSLR LibriSpeech Corpus" provides over 1000 hours of speech data in the form of raw audio. We use dev-clean, which is approximately 5 hours. We remove all samples with a length exceeding 15 seconds.

Data preprocessing

Log filterbanks of size 80 are extracted every 10 milliseconds, from windows of size 20 milliseconds. Note that every three consecutive filterbank frames are concatenated together ("feature splicing"), so the model's effective frame rate is actually 30 milliseconds.

No dithering takes place.

This is not typical preprocessing, since it takes place as part of the model's measured runtime, not before the model runs.
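
The sketch below illustrates the feature extraction described above. It is not the reference implementation's code: the use of librosa, the padded FFT size of 512, and the function names are assumptions made for illustration.

import numpy as np
import librosa

SAMPLE_RATE = 16000                      # LibriSpeech wavs are 16 kHz
WIN_LENGTH = int(0.02 * SAMPLE_RATE)     # 20 ms analysis window -> 320 samples
HOP_LENGTH = int(0.01 * SAMPLE_RATE)     # 10 ms stride -> 160 samples
N_FFT = 512                              # FFT size padded beyond the window (assumption)
N_MELS = 80                              # 80 log filterbanks
STACK = 3                                # "feature splicing": concatenate 3 frames

def extract_features(waveform: np.ndarray) -> np.ndarray:
    """Return stacked log-mel features of shape (num_frames // STACK, N_MELS * STACK)."""
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=SAMPLE_RATE,
        n_fft=N_FFT, win_length=WIN_LENGTH, hop_length=HOP_LENGTH,
        n_mels=N_MELS)
    logmel = np.log(mel + 1e-20).T        # (num_frames, 80); no dithering is applied
    # Concatenate every 3 consecutive frames, giving an effective 30 ms frame rate.
    usable = (logmel.shape[0] // STACK) * STACK
    return logmel[:usable].reshape(-1, N_MELS * STACK)

# Example: a 6.59 s silent waveform just to exercise the function.
features = extract_features(np.zeros(105440, dtype=np.float32))
print(features.shape)                     # e.g. (220, 240)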

Test data order

Look at dev-clean-wav.json generated by run.sh. It looks like this:

[
  {
    "files": [
      {
        "channels": 1,
        "sample_rate": 16000.0,
        "bitrate": 16,
        "duration": 6.59,
        "num_samples": 105440,
        "encoding": "Signed Integer PCM",
        "silent": false,
        "fname": "dev-clean-wav/2277/149896/2277-149896-0000.wav",
        "speed": 1
      }
    ],
    "original_duration": 6.59,
    "original_num_samples": 105440,
    "transcript": "he was in a fevered state of mind owing to the blight his wife's action threatened to cast upon his entire future"
  },
  {
    "files": [
      {
        "channels": 1,
        "sample_rate": 16000.0,
        "bitrate": 16,
        "duration": 7.145,
        "num_samples": 114320,
        "encoding": "Signed Integer PCM",
        "silent": false,
        "fname": "dev-clean-wav/2277/149896/2277-149896-0001.wav",
        "speed": 1
      }
    ],
    "original_duration": 7.145,
    "original_num_samples": 114320,
    "transcript": "he would have to pay her the money which she would now regularly demand or there would be trouble it did not matter what he did"
  },
  ...
]

The data is loaded into memory. Then all samples with a duration above 15 seconds are filtered out. Then the first object in the array is assigned query id 0, the second is assigned query id 1, and so on. The unfiltered file is provided in the directory containing this README in case you do not want to recreate it.
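
A minimal sketch of that ordering, assuming the dev-clean-wav.json manifest shown above; the loading code itself is illustrative, not the reference implementation:

import json

MAX_DURATION_SECONDS = 15.0

with open("dev-clean-wav.json") as f:
    manifest = json.load(f)

# Drop samples longer than 15 seconds, then assign query ids in array order.
kept = [s for s in manifest if s["original_duration"] < MAX_DURATION_SECONDS]
query_id_to_sample = {i: s for i, s in enumerate(kept)}

print(len(manifest), "samples before filtering,", len(kept), "after")
print("query id 0 ->", query_id_to_sample[0]["files"][0]["fname"])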

4. Model

This is a variant of the model described in sections 3.1 and 6.2 of:

@article{he2018streaming,
  title={Streaming End-to-End Speech Recognition for Mobile Devices},
  author={Yanzhang He and Tara N. Sainath and Rohit Prabhavalkar and Ian McGraw and Raziel Alvarez and Ding Zhao and David Rybach and Anjuli Kannan and Yonghui Wu and Ruoming Pang and Qiao Liang and Deepti Bhatia and Yuan Shangguan and Bo Li and Golan Pundak and Khe Chai Sim and Tom Bagby and Shuo-yiin Chang and Kanishka Rao and Alexander Gruenstein},
  journal={arXiv preprint arXiv:1811.06621},
  year={2018}
}

The differences are as follows:

  1. The model has 45.3 million parameters, rather than 120 million parameters.
  2. The LSTMs are not followed by projection layers.
  3. No layer normalization is used.
  4. Hidden dimensions are smaller.
  5. The prediction network is made of two LSTMs, rather than seven.
  6. The labels are characters, rather than word pieces.
  7. No quantization is done at this time for inference.
  8. A greedy decoder is used, rather than a beam search decoder. This greatly reduces inference complexity; a sketch of the greedy decoding loop follows this list.
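
The following sketch shows the shape of greedy RNN-T decoding. The encoder output, prediction network, and joint network here are random stand-ins for illustration only; the reference model's components, vocabulary, and state handling differ.

import numpy as np

VOCAB = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz '")   # character labels (illustrative)
BLANK = 0
HIDDEN = 8

rng = np.random.default_rng(0)
W_joint = rng.standard_normal((2 * HIDDEN, len(VOCAB)))       # stand-in joint network

def predict(label: int) -> np.ndarray:
    """Stand-in prediction network: embeds the last emitted label."""
    vec = np.zeros(HIDDEN)
    vec[label % HIDDEN] = 1.0
    return vec

def greedy_decode(encoder_out: np.ndarray, max_symbols_per_step: int = 30) -> str:
    """encoder_out: (T, HIDDEN) acoustic frames. Returns the decoded string."""
    hypothesis = []
    pred = predict(BLANK)                  # prediction state before any emission
    for frame in encoder_out:              # advance one acoustic frame at a time
        emitted = 0
        while emitted < max_symbols_per_step:
            logits = np.concatenate([frame, pred]) @ W_joint
            k = int(np.argmax(logits))
            if k == BLANK:                 # blank: move on to the next frame
                break
            hypothesis.append(k)           # non-blank: emit and update the predictor
            pred = predict(k)
            emitted += 1
    return "".join(VOCAB[k] for k in hypothesis)

print(greedy_decode(rng.standard_normal((5, HIDDEN))))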

5. Quality

Quality metric

A Word Error Rate (WER) of 7.452253714852645%, measured across all words in the output text of all samples less than 15 seconds in length in the dev-clean set, using a greedy decoder and a fully FP32 model.
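
WER itself is the word-level Levenshtein distance between hypothesis and reference, divided by the number of reference words, pooled over the whole set. A self-contained sketch (not the benchmark's scoring code):

def word_errors(reference: str, hypothesis: str):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))            # distance from empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (r != h)) # substitution (0 cost if words match)
        prev = cur
    return prev[-1], len(ref)

def corpus_wer(pairs) -> float:
    """pairs: iterable of (reference, hypothesis); errors are pooled over all words."""
    errors, words = 0, 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors, words = errors + e, words + n
    return 100.0 * errors / words

print(corpus_wer([("he was in a fevered state of mind",
                   "he was in fevered state of mine")]))   # -> 25.0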