✨ Supervoice VALL-E 2

Feel free to join my Discord Server to discuss this model!

An independent VALL-E 2 reproduction for voice synthesis with voice cloning.

supervoice_valle.mp4

Features

  • ⚡️ Natural-sounding speech with human-level voice cloning
  • 🎤 High quality: 24 kHz audio
  • 🤹‍♂️ Versatile: synthesized voices have high variability
  • 📕 Only English is supported at the moment, but nothing stops us from adding more languages.

Tips and tricks

  • The network can clone voices, but they work best when they are in-domain, e.g. from LibriLight, LibriTTS, or similar sources.

Architecture

This reproduction tries to follow the papers as closely as possible, with a few minor changes:

  • Linear annealing replaced with cosine annealing (see the sketch below)
  • Codec grouping is not implemented
  • No padding masking is used during training, since it would train 5 times slower with flash attention

VALL-E 2 architecture
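The cosine annealing mentioned above most likely refers to the learning-rate schedule. A minimal sketch of what such a schedule looks like in PyTorch follows; the optimizer, learning rate, and step count are illustrative placeholders, not values taken from this repository.

import torch

# Illustrative parameters and optimizer - not the actual training setup of this repo.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=3e-4)

# Cosine annealing: the learning rate follows a half-cosine curve from the base lr
# down to eta_min over T_max steps, instead of decaying linearly.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000, eta_min=1e-6)

for step in range(100_000):
    # ... forward pass, loss.backward() would go here ...
    optimizer.step()
    scheduler.step()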

How to use

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model
model = torch.hub.load(repo_or_dir='ex3ndr/supervoice-vall-e-2', model='supervoice')
model = model.to(device)

# Synthesize
in_voice_1 = model.synthesize("voice_1", "What time is it, Steve?", top_p = 0.2).cpu()
in_voice_2 = model.synthesize("voice_2", "What time is it, Steve?", top_p = 0.2).cpu()

# Experimental voices
in_emo_1 = model.synthesize("emo_1", "What time is it, Steve?", top_p = 0.2).cpu()
in_emo_2 = model.synthesize("emo_2", "What time is it, Steve?", top_p = 0.2).cpu()
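The synthesize calls above return audio tensors on the CPU. Assuming the output is a mono waveform at the 24 kHz mentioned in the features list (the exact tensor shape is an assumption here), it could be written to disk with torchaudio like this:

import torchaudio

# Assumption: synthesize() returns a waveform tensor sampled at 24 kHz.
# torchaudio.save expects a (channels, frames) tensor, so add a channel
# dimension if the output is 1-D.
waveform = in_voice_1
if waveform.dim() == 1:
    waveform = waveform.unsqueeze(0)
torchaudio.save("voice_1.wav", waveform, 24000)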

License

MIT