# MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

<p align="center">
    <a href="modelscope.md">ModelScope</a>&nbsp;|&nbsp;<a href="https://arxiv.org/abs/2212.00500">Paper</a>
</p>

We propose MMSpeech, a novel multi-modal multi-task encoder-decoder pre-training framework for Mandarin automatic speech recognition (ASR), which employs a multi-task learning framework of five self-supervised and supervised tasks on speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement over other pre-training methods.

<p align="center">
    <br>
    <img src="examples/mmspeech.png" width="700" />
    <br>
</p>
<br>

## Datasets & Checkpoints
| Model | Model Size | Unlabeled Speech | Unlabeled Text | Labeled Speech-Text | Pre-Training | Fine-Tuning |
|:---------------|:----------:|:--------------------------------------------------:|:---------------------------------------------:|:---------------------------------------:|:------------:|:-----------:|
| MMSpeech-Base1 | 210M | [AISHELL-2](https://www.aishelltech.com/aishell_2) | [M6-Corpus](https://arxiv.org/abs/2103.00823) | [AISHELL-1](http://www.openslr.org/33/) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base1_pretrain.pt) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base1_aishell1.pt) |
| MMSpeech-Base2 | 210M | [WenetSpeech](https://wenet.org.cn/WenetSpeech/) | M6-Corpus | AISHELL-1 | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base2_pretrain.pt) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base2_aishell1.pt) |
| MMSpeech-Large | 609M | WenetSpeech | M6-Corpus | AISHELL-1 | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_large_pretrain.pt) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_large_aishell1.pt) |
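
The checkpoints can be fetched directly from the URLs in the table above. A minimal sketch using only the Python standard library (the local filename is just an example):

```python
import urllib.request

# URL of the MMSpeech-Base1 fine-tuned checkpoint, copied from the table above.
url = "https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base1_aishell1.pt"

# Download into the working directory; the checkpoints are hundreds of MB.
urllib.request.urlretrieve(url, "ofa_mmspeech_base1_aishell1.pt")
```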

## Results on AISHELL-1
- Comparison of MMSpeech-Base1 with models of the same encoder size trained on the same amount of unlabeled speech data. Results are CER (%).

| Model | dev (w/o LM) | dev (with LM) | test (w/o LM) | test (with LM) |
|:---------------------------------|:------------:|:-------------:|:-------------:|:--------------:|
| w/o pre-training | 6.4 | 5.2 | 6.8 | 5.7 |
| Data2Vec | 3.8 | 3.7 | 4.1 | 3.9 |
| MMSpeech-Base1 | 2.4 | 2.1 | 2.6 | 2.3 |
| MMSpeech-Base1 (w/o Fine-Tuning) | 2.5 | 2.3 | 2.6 | 2.3 |

- Comparison of MMSpeech-Base2 with models of the same encoder size trained on the same amount of unlabeled speech data. Results are CER (%).

| Model | dev (with LM) | test (with LM) |
|:-----------------|:-------------:|:--------------:|
| Wav2vec 2.0-Base | 4.2 | 4.7 |
| HuBERT-Base | 4.1 | 4.3 |
| MMSpeech-Base2 | 2.0 | 2.1 |

- Comparison of MMSpeech-Large with models of the same encoder size trained on the same amount of unlabeled speech data. Results are CER (%).

| Model | dev (with LM) | test (with LM) |
|:------------------|:-------------:|:--------------:|
| Wav2vec 2.0-Large | 3.8 | 4.1 |
| HuBERT-Large | 3.1 | 3.3 |
| MMSpeech-Large | 1.6 | 1.9 |

## Quick start
### Data preparation

Input files for all tasks contain three tab-delimited ("\t") columns: "speech_id", "wav_path", and "text".
- "wav_path" is the path to the wav file.
- "text" is the raw text input.
- Pseudo-codes can be obtained by following the steps in [wav2seq](https://github.com/asappresearch/wav2seq).

| Data | Task | speech_id_col | wav_path_col | text_col |
|:----------------------|:--------:|:-------------:|:------------:|:------------:|
| unlabeled speech data | S2C, MSP | speech_id | wav_path | pseudo-codes |
| unlabeled text data | P2T | speech_id | unused | text |
| speech-text data | S2T | speech_id | wav_path | text |
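
As a concrete illustration, the minimal Python sketch below writes an S2T manifest in this format; the utterance ID, wav path, and transcript are placeholders, not shipped data:

```python
import csv

# Placeholder S2T rows: (speech_id, wav_path, text). Swap in your own data.
rows = [
    ("BAC009S0002W0122", "/data/aishell1/wav/BAC009S0002W0122.wav", "而对楼市成交抑制作用最大的限购"),
]

# Manifests are plain TSV: one example per line, columns separated by "\t".
with open("train_s2t.tsv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for speech_id, wav_path, text in rows:
        writer.writerow([speech_id, wav_path, text])
```

For the unlabeled-speech tasks (S2C, MSP), the "text" column would instead hold the wav2seq pseudo-codes.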

We also provide an example config yaml for the input fbank features [here](http://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/data/fbank_config.yaml).
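
If you need to inspect or regenerate such features, Kaldi-compatible fbank features can be computed with torchaudio. This is only a sketch: the 80 mel bins are our assumption, so consult the linked config for the actual settings.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Load one utterance; AISHELL-style wavs are 16 kHz mono.
waveform, sample_rate = torchaudio.load("BAC009S0002W0122.wav")

# Kaldi-compatible log-mel fbank features, shape (num_frames, num_mel_bins).
# num_mel_bins=80 is an assumed value, not read from the released config.
feats = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sample_rate)
print(feats.shape)
```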

### Training
```commandline
cd run_scripts/mmspeech
sh mmspeech_cn_base_stage1.sh
sh mmspeech_cn_base_stage2.sh
sh mmspeech_cn_base_stage3.sh
```
### Evaluation
```commandline
cd run_scripts/mmspeech
sh evaluate_mmspeech_base.sh
```