forked from babysor/MockingBird
Commit e46cd60 (0 parents)
75 changed files with 6,984 additions and 0 deletions.

@@ -0,0 +1 @@
*.ipynb linguist-vendored

@@ -0,0 +1,20 @@
*.pyc
*.aux
*.log
*.out
*.synctex.gz
*.suo
*__pycache__
*.idea
*.ipynb_checkpoints
*.pickle
*.npy
*.blg
*.bbl
*.bcf
*.toc
*.wav
*.sh
encoder/saved_models/*
synthesizer/saved_models/*
vocoder/saved_models/*

@@ -0,0 +1,24 @@
MIT License

Modified & original work Copyright (c) 2019 Corentin Jemine (https://github.com/CorentinJ)
Original work Copyright (c) 2018 Rayhane Mama (https://github.com/Rayhane-mamah)
Original work Copyright (c) 2019 fatchord (https://github.com/fatchord)
Original work Copyright (c) 2015 braindead (https://github.com/braindead)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

@@ -0,0 +1,52 @@
## Real-Time Voice Cloning - Chinese/Mandarin
![WechatIMG2968](https://user-images.githubusercontent.com/7423248/128490653-f55fefa8-f944-4617-96b8-5cc94f14f8f6.png)

[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg?style=flat)](http://choosealicense.com/licenses/mit/)
> This repository is forked from [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning), which only supports English.
### [English](README.md) | 中文

## Features
🌍 **Chinese**: supports Mandarin, tested with the aidatatang_200zh dataset

🤩 **PyTorch**: works with PyTorch, tested on version 1.9.0 (latest as of August 2021) with Tesla T4 and RTX 2060 GPUs

🌍 **Windows + Linux**: tested on both Windows and Linux after minor fixes

🤩 **Easy & Awesome**: good results with only a newly trained synthesizer, reusing the pretrained encoder/vocoder

## Quick Start

### 1. Install Requirements
> Follow the original repository to check that your environment is ready.
**Python 3.7 or higher** is required to run the toolbox.

* Install [PyTorch](https://pytorch.org/get-started/locally/).
* Install [ffmpeg](https://ffmpeg.org/download.html#get-packages).
* Run `pip install -r requirements.txt` to install the remaining required packages.

### 2. Train the synthesizer with aidatatang_200zh
* Download the aidatatang_200zh dataset and unzip it; make sure you can access all the .wav files in the *train* folder.
* Preprocess the audio and mel spectrograms:
`python synthesizer_preprocess_audio.py <datasets_root>`

* Preprocess the embeddings:
`python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer`

* Train the synthesizer:
`python synthesizer_train.py mandarin <datasets_root>/SV2TTS/synthesizer`

* Go to the next step once the attention line appears and the loss meets your needs; check the training folder *synthesizer/saved_models/*.
> For reference, my attention line appeared after 18k steps and the loss dropped below 0.4 after 50k steps.

### 3. Launch the Toolbox
You can then try the toolbox:
`python demo_toolbox.py -d <datasets_root>`

## TODO
- Add a demo video
- Add support for more datasets
- Upload pretrained models
- 🙏 Contributions welcome

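As a small illustration of how the pretrained encoder is reused (see the Features section above), the sketch below computes a speaker embedding from a reference recording. It mirrors the calls made by the command-line demo script elsewhere in this commit; the checkpoint path is that demo's default, and `my_voice.wav` is a placeholder file name:

```python
from pathlib import Path
from encoder import inference as encoder

# Load the pretrained speaker encoder (default checkpoint path from the CLI demo).
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))

# "my_voice.wav" is a placeholder; use any short recording of the target voice.
wav = encoder.preprocess_wav(Path("my_voice.wav"))

# The result is an L2-normalized speaker embedding (a NumPy vector).
embed = encoder.embed_utterance(wav)
print(embed.shape)
```
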
@@ -0,0 +1,53 @@
![WechatIMG2968](https://user-images.githubusercontent.com/7423248/128490653-f55fefa8-f944-4617-96b8-5cc94f14f8f6.png)

[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg?style=flat)](http://choosealicense.com/licenses/mit/)
> This repository is forked from [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning), which only supports English.
> English | [中文](README-CN.md)
## Features
🌍 **Chinese**: supports Mandarin, tested with the aidatatang_200zh dataset

🤩 **PyTorch**: works with PyTorch, tested on version 1.9.0 (latest as of August 2021) with Tesla T4 and RTX 2060 GPUs

🌍 **Windows + Linux**: tested on both Windows and Linux after minor fixes

🤩 **Easy & Awesome**: good results with only a newly trained synthesizer, reusing the pretrained encoder/vocoder

## Quick Start

### 1. Install Requirements
> Follow the original repository to check that your environment is ready.
**Python 3.7 or higher** is needed to run the toolbox.

* Install [PyTorch](https://pytorch.org/get-started/locally/).
* Install [ffmpeg](https://ffmpeg.org/download.html#get-packages).
* Run `pip install -r requirements.txt` to install the remaining necessary packages.

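Optionally, you can confirm that PyTorch is installed correctly and sees your GPU before moving on. The snippet below is a minimal sanity check adapted from the startup test in the command-line demo script included in this commit; it is not required, and CPU-only setups still work, just more slowly:

```python
import torch

# Minimal environment check (adapted from the CLI demo's configuration test).
print("PyTorch version:", torch.__version__)
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    print("Found GPU: %s with %.1f GB memory" % (props.name, props.total_memory / 1e9))
else:
    print("No GPU detected; models will run on CPU.")
```
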
### 2. Train the synthesizer with aidatatang_200zh
* Download the aidatatang_200zh dataset and unzip it; make sure you can access all the .wav files in the *train* folder.
* Preprocess the audio and mel spectrograms:
`python synthesizer_preprocess_audio.py <datasets_root>`

* Preprocess the embeddings:
`python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer`

* Train the synthesizer (see the sketch below for a quick check that preprocessing produced output):
`python synthesizer_train.py mandarin <datasets_root>/SV2TTS/synthesizer`

* Go to the next step once the attention line appears and the loss meets your needs; check the training folder *synthesizer/saved_models/*.
> FYI, my attention came after 18k steps and loss became lower than 0.4 after 50k steps.
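Before launching a long training run, it can be worth verifying that the two preprocessing steps actually wrote output under `<datasets_root>/SV2TTS/synthesizer`, the directory the training command reads from. A minimal, hypothetical check (replace the placeholder path with your real datasets_root):

```python
from pathlib import Path

# Placeholder path: substitute your actual datasets_root.
syn_dir = Path("<datasets_root>") / "SV2TTS" / "synthesizer"

# The training command reads from this directory, so it should exist and contain files.
if not syn_dir.exists():
    raise SystemExit(f"{syn_dir} not found; re-run the preprocessing scripts.")
n_files = sum(1 for p in syn_dir.rglob("*") if p.is_file())
print(f"{syn_dir} contains {n_files} files.")
```
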
### 3. Launch the Toolbox
You can then try the toolbox:

`python demo_toolbox.py -d <datasets_root>`
or
`python demo_toolbox.py`

## TODO
- Add a demo video
- Add support for more datasets
- Upload pretrained models
- 🙏 Contributions welcome
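For reference, here is a minimal sketch of driving the three models from a script instead of the toolbox. It strings together the same calls as the command-line demo included in this commit; the checkpoint paths are that demo's defaults, and `reference.wav` / `cloned.wav` are placeholder file names:

```python
from pathlib import Path
import numpy as np
import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Default checkpoint locations from the CLI demo; adjust to your trained models.
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/pretrained/pretrained.pt"))
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

# 1. Embed a reference recording of the target voice ("reference.wav" is a placeholder).
wav = encoder.preprocess_wav(Path("reference.wav"))
embed = encoder.embed_utterance(wav)

# 2. Synthesize a mel spectrogram for the text (the synthesizer works on batches).
specs = synthesizer.synthesize_spectrograms(["欢迎使用语音克隆工具"], [embed])

# 3. Vocode the spectrogram into a waveform and save it.
generated_wav = vocoder.infer_waveform(specs[0])
generated_wav = np.pad(generated_wav, (0, synthesizer.sample_rate), mode="constant")
sf.write("cloned.wav", generated_wav.astype(np.float32), synthesizer.sample_rate)
```
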
@@ -0,0 +1,225 @@
from encoder.params_model import model_embedding_size as speaker_embedding_size
from utils.argutils import print_args
from utils.modelutils import check_model_paths
from synthesizer.inference import Synthesizer
from encoder import inference as encoder
from vocoder import inference as vocoder
from pathlib import Path
import numpy as np
import soundfile as sf
import librosa
import argparse
import torch
import sys
import os
from audioread.exceptions import NoBackendError

if __name__ == '__main__':
    ## Info & args
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument("-e", "--enc_model_fpath", type=Path,
                        default="encoder/saved_models/pretrained.pt",
                        help="Path to a saved encoder")
    parser.add_argument("-s", "--syn_model_fpath", type=Path,
                        default="synthesizer/saved_models/pretrained/pretrained.pt",
                        help="Path to a saved synthesizer")
    parser.add_argument("-v", "--voc_model_fpath", type=Path,
                        default="vocoder/saved_models/pretrained/pretrained.pt",
                        help="Path to a saved vocoder")
    parser.add_argument("--cpu", action="store_true", help=\
        "If True, processing is done on CPU, even when a GPU is available.")
    parser.add_argument("--no_sound", action="store_true", help=\
        "If True, audio won't be played.")
    parser.add_argument("--seed", type=int, default=None, help=\
        "Optional random number seed value to make toolbox deterministic.")
    parser.add_argument("--no_mp3_support", action="store_true", help=\
        "If True, disallows loading mp3 files to prevent audioread errors when ffmpeg is not installed.")
    args = parser.parse_args()
    print_args(args, parser)
    if not args.no_sound:
        import sounddevice as sd

    if args.cpu:
        # Hide GPUs from Pytorch to force CPU processing
        os.environ["CUDA_VISIBLE_DEVICES"] = ""

    if not args.no_mp3_support:
        try:
            librosa.load("samples/1320_00000.mp3")
        except NoBackendError:
            print("Librosa will be unable to open mp3 files if additional software is not installed.\n"
                  "Please install ffmpeg or add the '--no_mp3_support' option to proceed without support for mp3 files.")
            exit(-1)

    print("Running a test of your configuration...\n")

    if torch.cuda.is_available():
        device_id = torch.cuda.current_device()
        gpu_properties = torch.cuda.get_device_properties(device_id)
        ## Print some environment information (for debugging purposes)
        print("Found %d GPUs available. Using GPU %d (%s) of compute capability %d.%d with "
              "%.1fGb total memory.\n" %
              (torch.cuda.device_count(),
               device_id,
               gpu_properties.name,
               gpu_properties.major,
               gpu_properties.minor,
               gpu_properties.total_memory / 1e9))
    else:
        print("Using CPU for inference.\n")

    ## Remind the user to download pretrained models if needed
    check_model_paths(encoder_path=args.enc_model_fpath,
                      synthesizer_path=args.syn_model_fpath,
                      vocoder_path=args.voc_model_fpath)

    ## Load the models one by one.
    print("Preparing the encoder, the synthesizer and the vocoder...")
    encoder.load_model(args.enc_model_fpath)
    synthesizer = Synthesizer(args.syn_model_fpath)
    vocoder.load_model(args.voc_model_fpath)

    ## Run a test
    print("Testing your configuration with small inputs.")
    # Forward an audio waveform of zeroes that lasts 1 second. Notice how we can get the encoder's
    # sampling rate, which may differ.
    # If you're unfamiliar with digital audio, know that it is encoded as an array of floats
    # (or sometimes integers, but mostly floats in this project) ranging from -1 to 1.
    # The sampling rate is the number of values (samples) recorded per second, it is set to
    # 16000 for the encoder. Creating an array of length <sampling_rate> will always correspond
    # to an audio of 1 second.
    print("\tTesting the encoder...")
    encoder.embed_utterance(np.zeros(encoder.sampling_rate))

    # Create a dummy embedding. You would normally use the embedding that encoder.embed_utterance
    # returns, but here we're going to make one ourselves just for the sake of showing that it's
    # possible.
    embed = np.random.rand(speaker_embedding_size)
    # Embeddings are L2-normalized (this isn't important here, but if you want to make your own
    # embeddings it will be).
    embed /= np.linalg.norm(embed)
    # The synthesizer can handle multiple inputs with batching. Let's create another embedding to
    # illustrate that
    embeds = [embed, np.zeros(speaker_embedding_size)]
    texts = ["test 1", "test 2"]
    print("\tTesting the synthesizer... (loading the model will output a lot of text)")
    mels = synthesizer.synthesize_spectrograms(texts, embeds)

    # The vocoder synthesizes one waveform at a time, but it's more efficient for long ones. We
    # can concatenate the mel spectrograms to a single one.
    mel = np.concatenate(mels, axis=1)
    # The vocoder can take a callback function to display the generation. More on that later. For
    # now we'll simply hide it like this:
    no_action = lambda *args: None
    print("\tTesting the vocoder...")
    # For the sake of making this test short, we'll pass a short target length. The target length
    # is the length of the wav segments that are processed in parallel. E.g. for audio sampled
    # at 16000 Hertz, a target length of 8000 means that the target audio will be cut in chunks of
    # 0.5 seconds which will all be generated together. The parameters here are absurdly short, and
    # that has a detrimental effect on the quality of the audio. The default parameters are
    # recommended in general.
    vocoder.infer_waveform(mel, target=200, overlap=50, progress_callback=no_action)

    print("All tests passed! You can now synthesize speech.\n\n")

    ## Interactive speech generation
    print("This is a GUI-less example of interface to SV2TTS. The purpose of this script is to "
          "show how you can interface this project easily with your own. See the source code for "
          "an explanation of what is happening.\n")

    print("Interactive generation loop")
    num_generated = 0
    while True:
        try:
            # Get the reference audio filepath
            message = "Reference voice: enter an audio filepath of a voice to be cloned (mp3, " \
                      "wav, m4a, flac, ...):\n"
            in_fpath = Path(input(message).replace("\"", "").replace("\'", ""))

            if in_fpath.suffix.lower() == ".mp3" and args.no_mp3_support:
                print("Can't use mp3 files, please try again:")
                continue
            ## Computing the embedding
            # First, we load the wav using the function that the speaker encoder provides. This is
            # important: there is preprocessing that must be applied.

            # The following two methods are equivalent:
            # - Directly load from the filepath:
            preprocessed_wav = encoder.preprocess_wav(in_fpath)
            # - If the wav is already loaded:
            original_wav, sampling_rate = librosa.load(str(in_fpath))
            preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
            print("Loaded file successfully")

            # Then we derive the embedding. There are many functions and parameters that the
            # speaker encoder interfaces. These are mostly for in-depth research. You will typically
            # only use this function (with its default parameters):
            embed = encoder.embed_utterance(preprocessed_wav)
            print("Created the embedding")

            ## Generating the spectrogram
            text = input("Write a sentence (+-20 words) to be synthesized:\n")

            # If seed is specified, reset torch seed and force synthesizer reload
            if args.seed is not None:
                torch.manual_seed(args.seed)
                synthesizer = Synthesizer(args.syn_model_fpath)

            # The synthesizer works in batch, so you need to put your data in a list or numpy array
            texts = [text]
            embeds = [embed]
            # If you know what the attention layer alignments are, you can retrieve them here by
            # passing return_alignments=True
            specs = synthesizer.synthesize_spectrograms(texts, embeds)
            spec = specs[0]
            print("Created the mel spectrogram")

            ## Generating the waveform
            print("Synthesizing the waveform:")

            # If seed is specified, reset torch seed and reload vocoder
            if args.seed is not None:
                torch.manual_seed(args.seed)
                vocoder.load_model(args.voc_model_fpath)

            # Synthesizing the waveform is fairly straightforward. Remember that the longer the
            # spectrogram, the more time-efficient the vocoder.
            generated_wav = vocoder.infer_waveform(spec)

            ## Post-generation
            # There's a bug with sounddevice that makes the audio cut one second earlier, so we
            # pad it.
            generated_wav = np.pad(generated_wav, (0, synthesizer.sample_rate), mode="constant")

            # Trim excess silences to compensate for gaps in spectrograms (issue #53)
            generated_wav = encoder.preprocess_wav(generated_wav)

            # Play the audio (non-blocking)
            if not args.no_sound:
                try:
                    sd.stop()
                    sd.play(generated_wav, synthesizer.sample_rate)
                except sd.PortAudioError as e:
                    print("\nCaught exception: %s" % repr(e))
                    print("Continuing without audio playback. Suppress this message with the \"--no_sound\" flag.\n")
                except:
                    raise

            # Save it on the disk
            filename = "demo_output_%02d.wav" % num_generated
            print(generated_wav.dtype)
            sf.write(filename, generated_wav.astype(np.float32), synthesizer.sample_rate)
            num_generated += 1
            print("\nSaved output as %s\n\n" % filename)

        except Exception as e:
            print("Caught exception: %s" % repr(e))
            print("Restarting\n")