The goal of this repository is to make it easy to experiment with audio datasets and diffusion. It is powered by Hydra, score-based generative models (Song et al.), and Grad-TTS (Popov et al.).
At the moment it is primarily a revamp of Grad-TTS.
This repository requires Python 3.9.17.
pip install cython
pip install -r requirements.txt
cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..
Create a filelist where each line has the form f'{audio_path}|{transcription}|{speaker_id}', where speaker_id is an integer. Edit config/data/data.yaml to suit your dataset.
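A minimal sketch of building such a filelist (the paths, transcriptions, and speaker IDs below are illustrative, not part of this repository):

```python
# Example entries: (audio_path, transcription, speaker_id).
# Replace these with your own dataset's metadata.
entries = [
    ("wavs/0001.wav", "hello world", 0),
    ("wavs/0002.wav", "goodbye", 1),
]

# Write one pipe-separated line per utterance.
with open("filelist.txt", "w") as f:
    for audio_path, transcription, speaker_id in entries:
        f.write(f"{audio_path}|{transcription}|{speaker_id}\n")
```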
Run:
python train_multi_speaker.py --config-name=config +data=data
Edit the config as desired, or make use of Hydra's multirun utility.
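Hydra overrides use dotted key paths on the command line (e.g. `data.some_key=value`). The sketch below illustrates how such an override maps into a nested config, using a plain dictionary rather than Hydra itself; the key names are hypothetical, and real Hydra additionally handles typing, the `+`/`~` prefixes, and multirun sweeps:

```python
def apply_override(cfg: dict, override: str) -> dict:
    """Apply one Hydra-style dotted override to a nested dict.
    Illustration only: values are kept as strings."""
    key_path, value = override.split("=", 1)
    keys = key_path.split(".")
    node = cfg
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return cfg

# Hypothetical config and override, for illustration.
cfg = {"data": {"sample_rate": "16000"}}
apply_override(cfg, "data.sample_rate=22050")
```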
First, generate predictions for your dataset:
python generate_tts_preds.py --config-name=config +data=delete_this +eval=eval
Then calculate the log-F0 RMSE:
python evaluate_tts.py --config-name=config +data=data +eval=eval
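For reference, log-F0 RMSE is typically the root-mean-square error between the log fundamental-frequency contours of the reference and predicted speech, computed over frames where both are voiced. A simplified sketch (real evaluation pipelines, including this repository's, may also time-align the contours, e.g. with DTW):

```python
import math

def log_f0_rmse(f0_ref, f0_pred):
    """RMSE between log-F0 contours over frames where both
    contours are voiced (F0 > 0). Simplified sketch: assumes the
    two contours are already frame-aligned."""
    pairs = [(r, p) for r, p in zip(f0_ref, f0_pred) if r > 0 and p > 0]
    if not pairs:
        return float("nan")
    sq_err = [(math.log(r) - math.log(p)) ** 2 for r, p in pairs]
    return math.sqrt(sum(sq_err) / len(sq_err))

# Unvoiced frames (F0 == 0) are skipped; identical voiced frames give 0.0.
log_f0_rmse([220.0, 0.0, 230.0], [220.0, 110.0, 230.0])  # → 0.0
```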
Coming soon
@misc{Cross2023SpeechDiff,
  author       = {Mattias Cross},
  title        = {Speech diff, a framework for diffusion applied to speech},
  howpublished = {GitHub},
  year         = {2023},
  url          = {https://github.com/Mattias421/speech-diff}
}