Acoustic encoder -> Acoustic decoder -> Semantic encoder -> Decoder
Speech input -> Pre-net -> Transformer -> Label smoothing loss function
The pre-net is a module with two 2D-convolutional layers followed by two stacked attention layers. The Transformer (Vaswani et al., see the references below) is the main structure of the encoder. Label smoothing is a regularization of the classification loss that prevents the model from becoming over-confident on the training examples.
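As a minimal sketch (not the project's actual loss code), label smoothing can be applied in TensorFlow through the built-in label_smoothing argument of the categorical cross-entropy loss:

import tensorflow as tf

# Label smoothing mixes the one-hot target with a uniform distribution, so the
# model is not pushed toward fully confident (0/1) predictions.
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)

# Toy example: 3 classes, true class is index 0.
y_true = tf.constant([[1.0, 0.0, 0.0]])
y_pred = tf.constant([[0.8, 0.1, 0.1]])  # predicted probabilities
print(loss_fn(y_true, y_pred).numpy())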
Outputs -> Embeddings -> Positional encoding -> Transformer -> Softmax -> Output probabilities
Positional encoding provides relative positional information for each word in the sentence, so that words with similar meaning and position end up closer together in the d-dimensional embedding space.
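A short sketch of the standard sinusoidal positional encoding from "Attention Is All You Need"; the exact implementation in ./model/modules/ may differ:

import numpy as np
import tensorflow as tf

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, np.newaxis]   # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]     # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / np.float32(d_model))
    angles = pos * angle_rates                # (max_len, d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions: sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions: cosine
    return tf.cast(angles[np.newaxis, ...], tf.float32)  # (1, max_len, d_model)

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (1, 50, 512)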
They have a structure similar to the acoustic encoder and decoder. The semantic encoder is trained with a sparse categorical cross-entropy loss.
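For illustration, sparse categorical cross-entropy in TensorFlow takes integer class indices instead of one-hot targets, which avoids materializing one-hot vectors over a large target vocabulary:

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

logits = tf.constant([[2.0, 0.5, 0.1]])  # unnormalized scores for 3 classes
labels = tf.constant([0])                # true class given as an integer index
print(loss_fn(labels, logits).numpy())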
We can look at the model from two perspectives:
- Acoustic encoder, acoustic decoder, and semantic encoder can be regarded as one integrated encoder that extracts information from the source speech; the decoder then transforms that information into the target language.
- Acoustic encoder and acoustic decoder can be regarded as a transcriber that transcribes speech to text, while semantic encoder and decoder can be thought of as an NMT model that translates the text into the target language (see the sketch below).
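A hypothetical sketch of this second, cascaded view; the Transcriber and Translator classes and their method names below are illustrative, not the actual classes in ./model/modules/:

# Cascaded view: speech -> source text -> target text.
class CascadedSpeechTranslator:
    def __init__(self, transcriber, translator):
        self.transcriber = transcriber  # acoustic encoder + acoustic decoder
        self.translator = translator    # semantic encoder + decoder

    def translate(self, speech_features):
        source_text = self.transcriber.transcribe(speech_features)  # speech -> source text
        return self.translator.translate(source_text)               # source text -> target text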
A PyTorch version of the speech transformer model: https://github.com/kaituoxu/Speech-Transformer
A bash tutorial (in Chinese) on parallelizing batch jobs: https://jerkwin.github.io/2013/12/14/Bash%E8%84%9A%E6%9C%AC%E5%AE%9E%E7%8E%B0%E6%89%B9%E9%87%8F%E4%BD%9C%E4%B8%9A%E5%B9%B6%E8%A1%8C%E5%8C%96/
Converting FLAC files to WAV files: http://magento4newbies.blogspot.com/2014/11/converting-wav-files-to-flac-with.html
We reimplemented almost all of the classes and functions in TensorFlow under ./model/modules/ . The original source is at https://github.com/kaituoxu/Speech-Transformer/src/transformer
Specify the path of the speech corpus in ./config/hparams_transcriber.yaml and the path of the text corpus in ./config/hparams_translator.yaml. You can also set other hyper-parameters in these two files.
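For example, the configuration could be read with PyYAML (assumed to be installed; it is not in the dependency list below, and the key name here is hypothetical):

import yaml  # PyYAML, assumed available

# Load hyper-parameters for the transcriber stage. Check the actual YAML file
# for the real key names.
with open("./config/hparams_transcriber.yaml") as f:
    hparams = yaml.safe_load(f)

speech_corpus_path = hparams.get("speech_corpus_path")  # hypothetical key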
Then you just need to run
python run.py
Or run the two stages separately:
python transcriber.py
python translator.py
To evaluate the results, create a directory named results in the root directory, store the model outputs in results/candidate.txt and the ground truth in results/ref.txt, then run
python evaluation.py
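As a sketch of what such an evaluation could look like with nltk (which is among the dependencies below); the actual logic inside evaluation.py may differ (tokenization, smoothing, etc.):

from nltk.translate.bleu_score import corpus_bleu

# Corpus-level BLEU over the two files described above, with naive
# whitespace tokenization and one reference per hypothesis.
with open("results/candidate.txt") as f:
    candidates = [line.split() for line in f if line.strip()]
with open("results/ref.txt") as f:
    references = [[line.split()] for line in f if line.strip()]

print("BLEU:", corpus_bleu(references, candidates))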
python=3.6.1
nltk==3.5
tensorflow==2.3.1
tensorflow-datasets==4.1.0
speechpy==2.4
- The TED Open Translation Project is a set of subtitles in many languages.
- LibriSpeech is an English speech corpus.
R. Müller, S. Kornblith, and G. Hinton, "When Does Label Smoothing Help?", arXiv:1906.02629 [cs.LG], 2019.
Y. Liu, J. Zhu, J. Zhang, and C. Zong, "Bridging the Modality Gap for Speech-to-Text Translation", arXiv:2010.14920 [cs.CL], 2020.
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 2818-2826, doi: 10.1109/CVPR.2016.308.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention Is All You Need", Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
L. Dong, S. Xu, and B. Xu, "Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition", 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, 2018, pp. 5884-5888, doi: 10.1109/ICASSP.2018.8462506.