"mozilla-deepspeech-0.8.2" is a speech recognition neural network pre-trained by Mozilla based on DeepSpeech architecture (CTC decoder with beam search and n-gram language model) with changed neural network topology.
For details on the original DeepSpeech architecture, see the paper at https://arxiv.org/abs/1412.5567.
For details on this model, see https://github.com/mozilla/DeepSpeech/releases/tag/v0.8.2.
Metric | Value |
---|---|
Type | Speech recognition |
GFlops per audio frame | 0.0472 |
GFlops per second of audio | 2.36 |
MParams | 47.2 |
Source framework | TensorFlow* |
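The two complexity figures in the table are related by the frame rate. The short calculation below is only an illustration derived from the numbers above: it recovers the implied number of audio frames per second and, taking the chunk length of 16 frames from the input shape described in this document, the duration of audio covered by one chunk. The constant names are made up for this sketch.

```python
# Figures copied from the specification table.
GFLOPS_PER_FRAME = 0.0472
GFLOPS_PER_SECOND = 2.36

# N in the input shape [1x16x19x26] (number of frames per chunk).
FRAMES_PER_CHUNK = 16

# Frames per second implied by the two complexity figures.
frames_per_second = round(GFLOPS_PER_SECOND / GFLOPS_PER_FRAME)

# Step between consecutive frames, in milliseconds.
frame_step_ms = 1000 / frames_per_second

# Audio duration covered by one chunk, in seconds.
chunk_seconds = FRAMES_PER_CHUNK / frames_per_second

print(frames_per_second, frame_step_ms, chunk_seconds)
```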
Metric | Value | Parameters |
---|---|---|
WER @ LibriSpeech test-clean | 8.39% | with LM, beam_width = 32, Python CTC decoder, accuracy checker |
WER @ LibriSpeech test-clean | 6.13% | with LM, beam_width = 500, C++ CTC decoder, accuracy checker |
WER @ LibriSpeech test-clean | 6.15% | with LM, beam_width = 500, C++ CTC decoder, demo |
NB: beam_width=32 is a low value for a CTC decoder. It was chosen to keep evaluation time reasonable with the Python CTC decoder in Accuracy Checker. Increasing beam_width improves the WER metric but slows down decoding. The speech recognition demo has a faster C++ CTC decoder module.
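For readers unfamiliar with the metric, the sketch below computes WER for a single utterance using the standard definition: word-level edit distance divided by the number of reference words. It is not the Accuracy Checker implementation. The function name `word_error_rate` is hypothetical, and any text normalization applied before scoring is not described here, so none is performed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = distance between the reference words seen so far
    # (none yet) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A corpus-level figure such as the ones in the table is normally obtained by summing errors and reference words over all utterances rather than averaging per-utterance ratios.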
Inputs of the original model:

- Audio MFCC coefficients, name: `input_node`, shape: [1x16x19x26], format: [BxNxTxC], where:
  - B - batch size, fixed to 1
  - N - `input_lengths` (see below) - number of audio frames in this section of audio
  - T - context frames: along with the current frame, the network expects 9 preceding frames and 9 succeeding frames. The absent context frames are filled with zeros.
  - C - 26 MFCC coefficients per frame

  See `accuracy-check.yml` for all audio preprocessing and feature extraction parameters.
- Number of audio frames, INT32 value, name: `input_lengths`, shape: [1].
- LSTM in-state (c) vector, name: `previous_state_c`, shape: [1x2048], format: [BxC].
- LSTM input (h, a.k.a. hidden state) vector, name: `previous_state_h`, shape: [1x2048], format: [BxC].

When splitting a long audio into chunks, the last two inputs must be fed with the corresponding outputs from the previous chunk. Chunks must be processed in order, from early to late audio positions.
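The T dimension described above (9 preceding frames, the current frame, 9 succeeding frames, with zeros where context is absent) can be sketched with NumPy as follows. This is a minimal illustration, not the preprocessing code used by Accuracy Checker or the demo. The function name `add_context` is hypothetical, and the sketch assumes that windows are built over the whole utterance, so that only frames near the true start and end of the audio receive zero context.

```python
import numpy as np

N_CONTEXT = 9   # context frames on each side of the current frame
N_MFCC = 26     # MFCC coefficients per frame


def add_context(mfcc: np.ndarray, n_context: int = N_CONTEXT) -> np.ndarray:
    """Turn [num_frames, C] features into [num_frames, 2 * n_context + 1, C] windows."""
    num_frames, n_coeff = mfcc.shape
    window = 2 * n_context + 1
    if num_frames == 0:
        return np.zeros((0, window, n_coeff), dtype=mfcc.dtype)

    # Zero frames stand in for the absent context at both ends.
    pad = np.zeros((n_context, n_coeff), dtype=mfcc.dtype)
    padded = np.concatenate([pad, mfcc, pad], axis=0)

    # Window t covers original frames t - n_context .. t + n_context.
    return np.stack([padded[t:t + window] for t in range(num_frames)], axis=0)


if __name__ == "__main__":
    features = np.random.rand(40, N_MFCC).astype(np.float32)
    print(add_context(features).shape)  # (40, 19, 26)
```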
Inputs of the converted model:

- Audio MFCC coefficients, name: `input_node`, shape: [1x16x19x26], format: [BxNxTxC], where:
  - B - batch size, fixed to 1
  - N - number of audio frames in this section of audio, fixed to 16
  - T - context frames: along with the current frame, the network expects 9 preceding frames and 9 succeeding frames. The absent context frames are filled with zeros.
  - C - 26 MFCC coefficients per frame

  See `accuracy-check.yml` for all audio preprocessing and feature extraction parameters.
- LSTM in-state (c) vector, name: `previous_state_c`, shape: [1x2048], format: [BxC].
- LSTM input (h, a.k.a. hidden state) vector, name: `previous_state_h`, shape: [1x2048], format: [BxC].

When splitting a long audio into chunks, the last two inputs must be fed with the corresponding outputs from the previous chunk. Chunks must be processed in order, from early to late audio positions.
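Because N is fixed to 16 here, longer audio has to be cut into 16-frame chunks. The sketch below shows one way to do that for an array of context windows shaped [num_frames, 19, 26]. The function name `split_into_chunks` is hypothetical, and zero-padding the final, shorter chunk is an assumption of this sketch rather than something stated in this document. The number of valid frames is returned with each chunk so that outputs for padded frames can be discarded.

```python
import numpy as np

CHUNK_FRAMES = 16  # N, fixed by the input shape [1x16x19x26]


def split_into_chunks(windows: np.ndarray, chunk_frames: int = CHUNK_FRAMES):
    """Yield (chunk, valid_frames) pairs in audio order.

    Each chunk has shape [1, chunk_frames, T, C].
    """
    num_frames = windows.shape[0]
    for start in range(0, num_frames, chunk_frames):
        piece = windows[start:start + chunk_frames]
        valid = piece.shape[0]
        if valid < chunk_frames:
            # Assumption: pad the last chunk with zero frames up to the fixed length.
            padding = np.zeros(
                (chunk_frames - valid,) + piece.shape[1:], dtype=piece.dtype
            )
            piece = np.concatenate([piece, padding], axis=0)
        # Add the batch dimension B = 1.
        yield piece[np.newaxis], valid


if __name__ == "__main__":
    windows = np.ones((40, 19, 26), dtype=np.float32)
    for chunk, valid in split_into_chunks(windows):
        print(chunk.shape, valid)
```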
Outputs of the original model:

- Per-frame probabilities (after softmax) for every symbol in the alphabet, name: `logits`, shape: [16x1x29], format: [NxBxC], where:
  - N - number of audio frames in this section of audio
  - B - batch size, fixed to 1
  - C - alphabet size, including the CTC blank symbol

  The per-frame probabilities are to be decoded with a CTC decoder. The alphabet is: 0 = space, 1...26 = "a" to "z", 27 = apostrophe, 28 = CTC blank symbol.

  NB: `logits` contains probabilities after softmax, despite its name.
- LSTM out-state vector, name: `new_state_c`, shape: [1x2048], format: [BxC]. See Inputs.
- LSTM output vector, name: `new_state_h`, shape: [1x2048], format: [BxC]. See Inputs.
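To make the alphabet layout concrete, here is a minimal best-path (greedy) CTC decoder for the `logits` output. Note that this is a deliberately simpler technique than the beam search with an n-gram language model used for the accuracy figures in this document, so its transcripts will generally be worse. The function name `greedy_ctc_decode` is hypothetical.

```python
import numpy as np

# Index 0 = space, 1..26 = "a".."z", 27 = apostrophe; index 28 is the CTC blank.
ALPHABET = " abcdefghijklmnopqrstuvwxyz'"
BLANK = 28


def greedy_ctc_decode(probs: np.ndarray) -> str:
    """Best-path decoding of per-frame probabilities shaped [N, 1, 29] or [N, 29]."""
    probs = probs.reshape(-1, len(ALPHABET) + 1)
    best = probs.argmax(axis=1)  # most probable symbol in each frame

    chars = []
    prev = BLANK
    for idx in best:
        # Collapse repeated symbols, then drop blanks.
        if idx != prev and idx != BLANK:
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)
```

A blank between two identical symbols keeps both of them, which is how CTC represents doubled letters such as the "ll" in "hello".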
Outputs of the converted model:

- Per-frame probabilities (after softmax) for every symbol in the alphabet, name: `logits`, shape: [16x1x29], format: [NxBxC], where:
  - N - number of audio frames in this section of audio, fixed to 16
  - B - batch size, fixed to 1
  - C - alphabet size, including the CTC blank symbol

  The per-frame probabilities are to be decoded with a CTC decoder. The alphabet is: 0 = space, 1...26 = "a" to "z", 27 = apostrophe, 28 = CTC blank symbol.

  NB: `logits` contains probabilities after softmax, despite its name.
- LSTM out-state vector, name: `cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BlockLSTM/TensorIterator.2` (for `new_state_c`), shape: [1x2048], format: [BxC]. See Inputs.
- LSTM output vector, name: `cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BlockLSTM/TensorIterator.1` (for `new_state_h`), shape: [1x2048], format: [BxC]. See Inputs.
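The loop below sketches how the two state outputs can be routed back into `previous_state_c` and `previous_state_h` when chunks are processed in order. It is framework-agnostic: `infer` is a hypothetical callable that takes a dictionary of input arrays and returns a dictionary of output arrays keyed by the tensor names listed above, and `run_chunks` is likewise a name invented for this sketch. Starting from an all-zero state for the first chunk is an assumption, not something stated in this document.

```python
import numpy as np

STATE_SIZE = 2048
ALPHABET_SIZE = 29

_PREFIX = "cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BlockLSTM/"

# State output name -> input that receives it for the next chunk.
STATE_FEEDBACK = {
    _PREFIX + "TensorIterator.2": "previous_state_c",
    _PREFIX + "TensorIterator.1": "previous_state_h",
}


def run_chunks(infer, chunks):
    """Run chunks from early to late audio, carrying the LSTM state between them.

    Returns per-frame probabilities shaped [total_frames, 29].
    """
    # Assumption: the first chunk starts from an all-zero LSTM state.
    state = {
        name: np.zeros((1, STATE_SIZE), dtype=np.float32)
        for name in STATE_FEEDBACK.values()
    }
    all_probs = []
    for chunk in chunks:
        outputs = infer({"input_node": chunk, **state})
        all_probs.append(outputs["logits"].reshape(-1, ALPHABET_SIZE))
        # Outputs of this chunk become the state inputs of the next one.
        state = {inp: outputs[out] for out, inp in STATE_FEEDBACK.items()}
    if not all_probs:
        return np.zeros((0, ALPHABET_SIZE), dtype=np.float32)
    return np.concatenate(all_probs, axis=0)
```

Keeping the name mapping in one dictionary means the loop itself does not change if the state tensors are named differently in another build of the model.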
The original model is distributed under the Mozilla Public License, Version 2.0. A copy of the license is provided in MPL-2.0-Mozilla-Deepspeech.txt.