
Commit 4cd2a78

Author: 实一
Commit message: mmspeech update
1 parent 24d2b8f commit 4cd2a78

26 files changed, +56398 -7 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@ We support the inference of OFA in Huggingface Transformers. Check the [README](
# News
+* 2022.12.7: Released MMSpeech, an ASR pre-training method based on OFA. Check our paper [here](https://arxiv.org/abs/2212.00500)! Please see [README_mmspeech.md](README_mmspeech.md) for further details.
* 2022.8.16: Released the **Chinese** version of OFA. **OFA-CN** only needs switching to `bpe_dir=../../utils/BERT_CN_dict` and `bpe=bert` and using our provided Chinese checkpoints in [checkpoints_cn.md](checkpoints_cn.md). Temporarily, we only provide base-size and large-size pretrained checkpoints and finetuned checkpoints on [MUGE Caption](https://tianchi.aliyun.com/muge) and the Chinese version of RefCOCO(-/+/g) (to release soon).
* 2022.8.5: Released support of **prompt tuning** for OFA. Check our paper [here](https://arxiv.org/abs/2208.02532)! Please see [prompt_tuning.md](prompt_tuning.md) for further details.
* 2022.7.7: Updated support of OFA on **huggingface transformers** (fixed bugs in forward, added the sequence generator from Fairseq to ensure performance, etc.). Refer to the doc [transformers.md](transformers.md) and the branch `feature/add_transformers`.

README_mmspeech.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

<p align="center">
    <a href="modelscope.md">ModelScope</a>&nbsp;|&nbsp;<a href="https://arxiv.org/abs/2212.00500">Paper</a>
</p>

We propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs a multi-task learning framework including five self-supervised and supervised tasks on speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement over other pre-training methods.

<p align="center">
    <br>
    <img src="examples/mmspeech.png" width="700" />
    <br>
</p>
<br>
## Datasets & Checkpoints
| Model | Model Size | Unlabeled Speech | Unlabeled Text | Labeled Speech-Text | Pre-Training | Fine-Tuning |
|:---------------|:----------:|:--------------------------------------------------:|:---------------------------------------------:|:---------------------------------------:|:---:|:---:|
| MMSpeech-Base1 | 210M | [AISHELL-2](https://www.aishelltech.com/aishell_2) | [M6-Corpus](https://arxiv.org/abs/2103.00823) | [AISHELL-1](http://www.openslr.org/33/) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base1_pretrain.pt) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base1_aishell1.pt) |
| MMSpeech-Base2 | 210M | [WenetSpeech](https://wenet.org.cn/WenetSpeech/) | M6-Corpus | AISHELL-1 | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base2_pretrain.pt) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base2_aishell1.pt) |
| MMSpeech-Large | 609M | WenetSpeech | M6-Corpus | AISHELL-1 | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_large_pretrain.pt) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_large_aishell1.pt) |
## Results on AISHELL-1
- Comparison of MMSpeech-Base1 against models with the same encoder size and the same amount of unlabeled speech data (CER, %).

| Model | dev (w/o LM) | dev (with LM) | test (w/o LM) | test (with LM) |
|:---------------------------------|:------------:|:-------------:|:-------------:|:--------------:|
| w/o pre-training | 6.4 | 5.2 | 6.8 | 5.7 |
| Data2Vec | 3.8 | 3.7 | 4.1 | 3.9 |
| MMSpeech-Base1 | 2.4 | 2.1 | 2.6 | 2.3 |
| MMSpeech-Base1 (w/o Fine-Tuning) | 2.5 | 2.3 | 2.6 | 2.3 |

- Comparison of MMSpeech-Base2 against models with the same encoder size and the same amount of unlabeled speech data (CER, %).

| Model | dev (with LM) | test (with LM) |
|:-----------------|:-------------:|:--------------:|
| Wav2vec 2.0-Base | 4.2 | 4.7 |
| HuBERT-Base | 4.1 | 4.3 |
| MMSpeech-Base2 | 2.0 | 2.1 |

- Comparison of MMSpeech-Large against models with the same encoder size and the same amount of unlabeled speech data (CER, %).

| Model | dev (with LM) | test (with LM) |
|:------------------|:-------------:|:--------------:|
| Wav2vec 2.0-Large | 3.8 | 4.1 |
| HuBERT-Large | 3.1 | 3.3 |
| MMSpeech-Large | 1.6 | 1.9 |

## Quick start
### Data preparation

Input files for all tasks contain three columns, "speech_id", "wav_path", and "text", delimited by "\t".
- "wav_path" is the path to the wav file.
- "text" is the raw text input.
- Pseudo codes can be obtained by following the steps in [wav2seq](https://github.com/asappresearch/wav2seq).

The table below shows what each column holds for each task; a minimal writer sketch follows it.

| Data | Task | speech_id column | wav_path column | text column |
|:----------------------|:--------:|:----------------:|:---------------:|:------------:|
| unlabeled speech data | S2C, MSP | speech_id | wav_path | pseudo codes |
| unlabeled text data | P2T | speech_id | unused | text |
| speech-text data | S2T | speech_id | wav_path | text |
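
As a reference, here is a minimal Python sketch of writing an input file in this format; the IDs, paths, and text strings are placeholders of our own, not data shipped with the repo.

```python
# Minimal sketch: write a three-column, tab-delimited input file
# ("speech_id \t wav_path \t text"). All values are placeholders.
rows = [
    # S2T (labeled speech-text data): text holds the transcript.
    ("spk001_utt0001", "/data/wavs/spk001_utt0001.wav", "欢迎使用语音识别"),
    # P2T (unlabeled text data): the wav_path column is unused.
    ("text_0001", "unused", "多模态多任务预训练"),
    # S2C / MSP (unlabeled speech data): text holds the pseudo codes
    # produced by the wav2seq pipeline (format assumed here).
    ("spk002_utt0042", "/data/wavs/spk002_utt0042.wav", "13 88 7 7 452 9"),
]
with open("train.tsv", "w", encoding="utf-8") as f:
    for speech_id, wav_path, text in rows:
        f.write(f"{speech_id}\t{wav_path}\t{text}\n")
```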
We also provide an example config YAML for fbank input features for your reference [here](http://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/data/fbank_config.yaml).
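
For orientation, below is a minimal sketch of extracting Kaldi-style fbank features with torchaudio; the 80-dimensional setting is an assumption on our part, so consult the linked fbank_config.yaml for the actual parameters.

```python
# Minimal sketch: compute Kaldi-style log-mel filterbank (fbank)
# features for one wav file. num_mel_bins=80 is an assumed value;
# the real configuration lives in the linked fbank_config.yaml.
import torchaudio

waveform, sample_rate = torchaudio.load("example.wav")  # placeholder path
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=80,
    sample_frequency=sample_rate,
)
print(fbank.shape)  # (num_frames, num_mel_bins)
```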
### Training
```commandline
cd run_scripts/mmspeech
sh mmspeech_cn_base_stage1.sh
sh mmspeech_cn_base_stage2.sh
sh mmspeech_cn_base_stage3.sh
```
### Evaluation
```commandline
cd run_scripts/mmspeech
sh evaluate_mmspeech_base.sh
```

criterions/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@
from .label_smoothed_cross_entropy import AdjustLabelSmoothedCrossEntropyCriterion
from .clip_scst_loss import ClipScstRewardCriterion
from .label_smoothed_encouraging_loss import AdjustLabelSmoothedEncouragingLossCriterion
+from .speech_pretrain_loss import SpeechPretrainLoss
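
For context, Fairseq criteria are typically defined and registered as in the hypothetical sketch below; the registered name `speech_pretrain_loss` and the loss body are our assumptions, not the repository's actual implementation in `speech_pretrain_loss.py`.

```python
# Hypothetical sketch of how a Fairseq criterion like SpeechPretrainLoss
# is usually registered, making it selectable at training time
# (e.g. --criterion speech_pretrain_loss). Name and body are assumptions.
from fairseq.criterions import FairseqCriterion, register_criterion

@register_criterion("speech_pretrain_loss")
class SpeechPretrainLossSketch(FairseqCriterion):
    def forward(self, model, sample, reduce=True):
        # A multi-task pre-training loss would combine the per-task
        # losses (e.g. S2C, MSP, P2T, S2T) into a single scalar here.
        raise NotImplementedError
```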
