A framework for using diffusion models with speech technology applications e.g. TTS, ASR

Mattias421/speechdiff

Speechdiff, a framework for diffusion applied to speech

The goal of this repository is to make it easy to experiment with audio datasets and diffusion. It is built on Hydra, score-based generative models (Song et al.), and Grad-TTS (Popov et al.).

At the moment it is a revamp of Grad-TTS.

Installation

Python 3.9.17

pip install cython
pip install -r requirements.txt
cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..

Train a Grad-TTS model

Create a filelist in the form f'{audio_path}|{transcription}|{speaker_id}' where speaker_id is an integer. Edit config/data/data.yaml to suit your dataset.
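
As a rough illustration of the filelist format, the sketch below builds lines of the form audio_path|transcription|speaker_id. The helper name format_filelist, the example paths and the one-utterance-per-line layout are assumptions made for this example, not code from this repository.

```python
def format_filelist(entries):
    """Turn (audio_path, transcription, speaker_id) tuples into 'path|text|id' lines."""
    lines = []
    for audio_path, transcription, speaker_id in entries:
        # '|' is the field separator, so it must not appear inside a field.
        if "|" in audio_path or "|" in transcription:
            raise ValueError(f"field contains '|': {audio_path!r}")
        # speaker_id has to be an integer; int() rejects anything else.
        lines.append(f"{audio_path}|{transcription}|{int(speaker_id)}")
    return lines


# Hypothetical entries; replace with your own audio paths and transcriptions.
example_entries = [
    ("wavs/spk0_0001.wav", "Hello world.", 0),
    ("wavs/spk1_0001.wav", "Diffusion models generate speech.", 1),
]

# Assumption: one utterance per line.
filelist_text = "\n".join(format_filelist(example_entries)) + "\n"
print(filelist_text, end="")
```

The resulting text can be saved with pathlib.Path(...).write_text(filelist_text, encoding="utf-8") to whichever path your data config points at.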

Run

python train_multi_speaker.py --config-name=config +data=data

Edit the config as desired, or use Hydra's multirun utility to sweep over several settings in one command.

Evaluate a TTS model

First, generate predictions for your dataset:

python generate_tts_preds.py --config-name=config +data=delete_this +eval=eval

Then calculate the log-F0 RMSE:

python evaluate_tts.py --config-name=config +data=data +eval=eval
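
For readers unfamiliar with the metric, here is a minimal sketch of one common way to define log-F0 RMSE: the root-mean-square difference between the natural log of reference and predicted F0, taken over frames that are voiced in both contours. It assumes the two contours are already extracted in Hz, frame-aligned, and use 0 for unvoiced frames. The function name log_f0_rmse is hypothetical, and evaluate_tts.py may extract, align, or mask F0 differently.

```python
import math


def log_f0_rmse(ref_f0, pred_f0):
    """RMSE of log-F0 over frames where both contours are voiced (F0 > 0)."""
    if len(ref_f0) != len(pred_f0):
        raise ValueError("contours must have the same number of frames")
    # Keep only frames voiced in both contours; log(0) is undefined.
    sq_errors = [
        (math.log(r) - math.log(p)) ** 2
        for r, p in zip(ref_f0, pred_f0)
        if r > 0 and p > 0
    ]
    if not sq_errors:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Toy contours in Hz; 0.0 marks an unvoiced frame (an assumption of this sketch).
ref = [0.0, 100.0, 110.0, 120.0]
pred = [0.0, 100.0, 110.0, 0.0]
print(log_f0_rmse(ref, pred))  # the two jointly voiced frames match, so this prints 0.0
```

Because the error is measured in the log domain, a prediction that is one octave off on every voiced frame gives a value of ln 2, regardless of the speaker's pitch range.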

Compute log-likelihoods

Coming soon

Citation

@Misc{Cross2023SpeechDiff,
  author =       {Mattias Cross},
  title =        {Speech diff, a framework for diffusion applied to speech},
  howpublished = {GitHub},
  year =         {2023},
  url =          {https://github.com/Mattias421/speech-diff}
}
