forked from babysor/MockingBird
Commit e46cd60 (0 parents)
75 changed files with 6,984 additions and 0 deletions.

@@ -0,0 +1 @@
*.ipynb linguist-vendored

@@ -0,0 +1,20 @@
*.pyc
*.aux
*.log
*.out
*.synctex.gz
*.suo
*__pycache__
*.idea
*.ipynb_checkpoints
*.pickle
*.npy
*.blg
*.bbl
*.bcf
*.toc
*.wav
*.sh
encoder/saved_models/*
synthesizer/saved_models/*
vocoder/saved_models/*

@@ -0,0 +1,24 @@
MIT License

Modified & original work Copyright (c) 2019 Corentin Jemine (https://github.com/CorentinJ)
Original work Copyright (c) 2018 Rayhane Mama (https://github.com/Rayhane-mamah)
Original work Copyright (c) 2019 fatchord (https://github.com/fatchord)
Original work Copyright (c) 2015 braindead (https://github.com/braindead)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

@@ -0,0 +1,52 @@
## Real-Time Voice Cloning - Chinese/Mandarin
![WechatIMG2968](https://user-images.githubusercontent.com/7423248/128490653-f55fefa8-f944-4617-96b8-5cc94f14f8f6.png)

[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg?style=flat)](http://choosealicense.com/licenses/mit/)
> This repository is forked from [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning), which only supports English.
### [English](README.md) | 中文

## Features
🌍 **Chinese**: supports Mandarin, tested with the aidatatang_200zh dataset

🤩 **PyTorch**: works with PyTorch, tested on version 1.9.0 (latest as of August 2021) with Tesla T4 and RTX 2060 GPUs

🌍 **Windows + Linux**: tested on both Windows and Linux after minor fixes

🤩 **Easy & Awesome**: good results with only a newly trained synthesizer, reusing the pretrained encoder/vocoder

## Quick Start

### 1. Install Requirements
> Follow the original repository to check that your environment is ready.
**Python 3.7 or higher** is required to run the toolbox.

* Install [PyTorch](https://pytorch.org/get-started/locally/).
* Install [ffmpeg](https://ffmpeg.org/download.html#get-packages).
* Run `pip install -r requirements.txt` to install the remaining required packages.

### 2. Train the synthesizer with aidatatang_200zh
* Download the aidatatang_200zh dataset and unzip it; make sure you can access all the .wav files in the *train* folder.
* Preprocess the audio and mel spectrograms:
`python synthesizer_preprocess_audio.py <datasets_root>`

* Preprocess the embeddings:
`python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer`

* Train the synthesizer:
`python synthesizer_train.py mandarin <datasets_root>/SV2TTS/synthesizer`

* Go to the next step once the attention line appears and the loss meets your needs; check the training folder *synthesizer/saved_models/*.
> For reference, my attention line appeared after 18k steps and the loss dropped below 0.4 after 50k steps.

### 3. Launch the Toolbox
You can then try the toolbox:
`python demo_toolbox.py -d <datasets_root>`

## TODO
- Add a demo video
- Add support for more datasets
- Upload pretrained models
- 🙏 Contributions welcome

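As a small illustration of how the pretrained encoder is reused (see the Features section above), the sketch below computes a speaker embedding from a reference recording. It mirrors the calls made by the command-line demo script elsewhere in this commit; the checkpoint path is that demo's default, and `my_voice.wav` is a placeholder file name:

```python
from pathlib import Path
from encoder import inference as encoder

# Load the pretrained speaker encoder (default checkpoint path from the CLI demo).
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))

# "my_voice.wav" is a placeholder; use any short recording of the target voice.
wav = encoder.preprocess_wav(Path("my_voice.wav"))

# The result is an L2-normalized speaker embedding (a NumPy vector).
embed = encoder.embed_utterance(wav)
print(embed.shape)
```
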
@@ -0,0 +1,53 @@
![WechatIMG2968](https://user-images.githubusercontent.com/7423248/128490653-f55fefa8-f944-4617-96b8-5cc94f14f8f6.png)

[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg?style=flat)](http://choosealicense.com/licenses/mit/)
> This repository is forked from [Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning), which only supports English.
> English | [中文](README-CN.md)
## Features
🌍 **Chinese**: supports Mandarin, tested with the aidatatang_200zh dataset

🤩 **PyTorch**: works with PyTorch, tested on version 1.9.0 (latest as of August 2021) with Tesla T4 and RTX 2060 GPUs

🌍 **Windows + Linux**: tested on both Windows and Linux after minor fixes

🤩 **Easy & Awesome**: good results with only a newly trained synthesizer, reusing the pretrained encoder/vocoder

## Quick Start

### 1. Install Requirements
> Follow the original repository to check that your environment is ready.
**Python 3.7 or higher** is needed to run the toolbox.

* Install [PyTorch](https://pytorch.org/get-started/locally/).
* Install [ffmpeg](https://ffmpeg.org/download.html#get-packages).
* Run `pip install -r requirements.txt` to install the remaining necessary packages.

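Optionally, you can confirm that PyTorch is installed correctly and sees your GPU before moving on. The snippet below is a minimal sanity check adapted from the startup test in the command-line demo script included in this commit; it is not required, and CPU-only setups still work, just more slowly:

```python
import torch

# Minimal environment check (adapted from the CLI demo's configuration test).
print("PyTorch version:", torch.__version__)
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    print("Found GPU: %s with %.1f GB memory" % (props.name, props.total_memory / 1e9))
else:
    print("No GPU detected; models will run on CPU.")
```
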
### 2. Train the synthesizer with aidatatang_200zh
* Download the aidatatang_200zh dataset and unzip it; make sure you can access all the .wav files in the *train* folder.
* Preprocess the audio and mel spectrograms:
`python synthesizer_preprocess_audio.py <datasets_root>`

* Preprocess the embeddings:
`python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer`

* Train the synthesizer (see the sketch below for a quick check that preprocessing produced output):
`python synthesizer_train.py mandarin <datasets_root>/SV2TTS/synthesizer`

* Go to the next step once the attention line appears and the loss meets your needs; check the training folder *synthesizer/saved_models/*.
> FYI, my attention came after 18k steps and loss became lower than 0.4 after 50k steps.
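Before launching a long training run, it can be worth verifying that the two preprocessing steps actually wrote output under `<datasets_root>/SV2TTS/synthesizer`, the directory the training command reads from. A minimal, hypothetical check (replace the placeholder path with your real datasets_root):

```python
from pathlib import Path

# Placeholder path: substitute your actual datasets_root.
syn_dir = Path("<datasets_root>") / "SV2TTS" / "synthesizer"

# The training command reads from this directory, so it should exist and contain files.
if not syn_dir.exists():
    raise SystemExit(f"{syn_dir} not found; re-run the preprocessing scripts.")
n_files = sum(1 for p in syn_dir.rglob("*") if p.is_file())
print(f"{syn_dir} contains {n_files} files.")
```
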
### 3. Launch the Toolbox
You can then try the toolbox:

`python demo_toolbox.py -d <datasets_root>`
or
`python demo_toolbox.py`

## TODO
- Add a demo video
- Add support for more datasets
- Upload pretrained models
- 🙏 Contributions welcome
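For reference, here is a minimal sketch of driving the three models from a script instead of the toolbox. It strings together the same calls as the command-line demo included in this commit; the checkpoint paths are that demo's defaults, and `reference.wav` / `cloned.wav` are placeholder file names:

```python
from pathlib import Path
import numpy as np
import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Default checkpoint locations from the CLI demo; adjust to your trained models.
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/pretrained/pretrained.pt"))
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

# 1. Embed a reference recording of the target voice ("reference.wav" is a placeholder).
wav = encoder.preprocess_wav(Path("reference.wav"))
embed = encoder.embed_utterance(wav)

# 2. Synthesize a mel spectrogram for the text (the synthesizer works on batches).
specs = synthesizer.synthesize_spectrograms(["欢迎使用语音克隆工具"], [embed])

# 3. Vocode the spectrogram into a waveform and save it.
generated_wav = vocoder.infer_waveform(specs[0])
generated_wav = np.pad(generated_wav, (0, synthesizer.sample_rate), mode="constant")
sf.write("cloned.wav", generated_wav.astype(np.float32), synthesizer.sample_rate)
```
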
@@ -0,0 +1,225 @@
from encoder.params_model import model_embedding_size as speaker_embedding_size
from utils.argutils import print_args
from utils.modelutils import check_model_paths
from synthesizer.inference import Synthesizer
from encoder import inference as encoder
from vocoder import inference as vocoder
from pathlib import Path
import numpy as np
import soundfile as sf
import librosa
import argparse
import torch
import sys
import os
from audioread.exceptions import NoBackendError

if __name__ == '__main__':
    ## Info & args
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument("-e", "--enc_model_fpath", type=Path,
                        default="encoder/saved_models/pretrained.pt",
                        help="Path to a saved encoder")
    parser.add_argument("-s", "--syn_model_fpath", type=Path,
                        default="synthesizer/saved_models/pretrained/pretrained.pt",
                        help="Path to a saved synthesizer")
    parser.add_argument("-v", "--voc_model_fpath", type=Path,
                        default="vocoder/saved_models/pretrained/pretrained.pt",
                        help="Path to a saved vocoder")
    parser.add_argument("--cpu", action="store_true", help=\
        "If True, processing is done on CPU, even when a GPU is available.")
    parser.add_argument("--no_sound", action="store_true", help=\
        "If True, audio won't be played.")
    parser.add_argument("--seed", type=int, default=None, help=\
        "Optional random number seed value to make toolbox deterministic.")
    parser.add_argument("--no_mp3_support", action="store_true", help=\
        "If True, disallows loading mp3 files to prevent audioread errors when ffmpeg is not installed.")
    args = parser.parse_args()
    print_args(args, parser)
    if not args.no_sound:
        import sounddevice as sd

    if args.cpu:
        # Hide GPUs from Pytorch to force CPU processing
        os.environ["CUDA_VISIBLE_DEVICES"] = ""

    if not args.no_mp3_support:
        try:
            librosa.load("samples/1320_00000.mp3")
        except NoBackendError:
            print("Librosa will be unable to open mp3 files if additional software is not installed.\n"
                  "Please install ffmpeg or add the '--no_mp3_support' option to proceed without support for mp3 files.")
            exit(-1)

    print("Running a test of your configuration...\n")

    if torch.cuda.is_available():
        device_id = torch.cuda.current_device()
        gpu_properties = torch.cuda.get_device_properties(device_id)
        ## Print some environment information (for debugging purposes)
        print("Found %d GPUs available. Using GPU %d (%s) of compute capability %d.%d with "
              "%.1fGb total memory.\n" %
              (torch.cuda.device_count(),
               device_id,
               gpu_properties.name,
               gpu_properties.major,
               gpu_properties.minor,
               gpu_properties.total_memory / 1e9))
    else:
        print("Using CPU for inference.\n")

    ## Remind the user to download pretrained models if needed
    check_model_paths(encoder_path=args.enc_model_fpath,
                      synthesizer_path=args.syn_model_fpath,
                      vocoder_path=args.voc_model_fpath)

    ## Load the models one by one.
    print("Preparing the encoder, the synthesizer and the vocoder...")
    encoder.load_model(args.enc_model_fpath)
    synthesizer = Synthesizer(args.syn_model_fpath)
    vocoder.load_model(args.voc_model_fpath)

    ## Run a test
    print("Testing your configuration with small inputs.")
    # Forward an audio waveform of zeroes that lasts 1 second. Notice how we can get the encoder's
    # sampling rate, which may differ.
    # If you're unfamiliar with digital audio, know that it is encoded as an array of floats
    # (or sometimes integers, but mostly floats in this project) ranging from -1 to 1.
    # The sampling rate is the number of values (samples) recorded per second, it is set to
    # 16000 for the encoder. Creating an array of length <sampling_rate> will always correspond
    # to an audio of 1 second.
    print("\tTesting the encoder...")
    encoder.embed_utterance(np.zeros(encoder.sampling_rate))

    # Create a dummy embedding. You would normally use the embedding that encoder.embed_utterance
    # returns, but here we're going to make one ourselves just for the sake of showing that it's
    # possible.
    embed = np.random.rand(speaker_embedding_size)
    # Embeddings are L2-normalized (this isn't important here, but if you want to make your own
    # embeddings it will be).
    embed /= np.linalg.norm(embed)
    # The synthesizer can handle multiple inputs with batching. Let's create another embedding to
    # illustrate that
    embeds = [embed, np.zeros(speaker_embedding_size)]
    texts = ["test 1", "test 2"]
    print("\tTesting the synthesizer... (loading the model will output a lot of text)")
    mels = synthesizer.synthesize_spectrograms(texts, embeds)

    # The vocoder synthesizes one waveform at a time, but it's more efficient for long ones. We
    # can concatenate the mel spectrograms to a single one.
    mel = np.concatenate(mels, axis=1)
    # The vocoder can take a callback function to display the generation. More on that later. For
    # now we'll simply hide it like this:
    no_action = lambda *args: None
    print("\tTesting the vocoder...")
    # For the sake of making this test short, we'll pass a short target length. The target length
    # is the length of the wav segments that are processed in parallel. E.g. for audio sampled
    # at 16000 Hertz, a target length of 8000 means that the target audio will be cut in chunks of
    # 0.5 seconds which will all be generated together. The parameters here are absurdly short, and
    # that has a detrimental effect on the quality of the audio. The default parameters are
    # recommended in general.
    vocoder.infer_waveform(mel, target=200, overlap=50, progress_callback=no_action)

    print("All tests passed! You can now synthesize speech.\n\n")

    ## Interactive speech generation
    print("This is a GUI-less example of interface to SV2TTS. The purpose of this script is to "
          "show how you can interface this project easily with your own. See the source code for "
          "an explanation of what is happening.\n")

    print("Interactive generation loop")
    num_generated = 0
    while True:
        try:
            # Get the reference audio filepath
            message = "Reference voice: enter an audio filepath of a voice to be cloned (mp3, " \
                      "wav, m4a, flac, ...):\n"
            in_fpath = Path(input(message).replace("\"", "").replace("\'", ""))

            if in_fpath.suffix.lower() == ".mp3" and args.no_mp3_support:
                print("Can't use mp3 files, please try again:")
                continue
            ## Computing the embedding
            # First, we load the wav using the function that the speaker encoder provides. This is
            # important: there is preprocessing that must be applied.

            # The following two methods are equivalent:
            # - Directly load from the filepath:
            preprocessed_wav = encoder.preprocess_wav(in_fpath)
            # - If the wav is already loaded:
            original_wav, sampling_rate = librosa.load(str(in_fpath))
            preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
            print("Loaded file successfully")

            # Then we derive the embedding. There are many functions and parameters that the
            # speaker encoder interfaces. These are mostly for in-depth research. You will typically
            # only use this function (with its default parameters):
            embed = encoder.embed_utterance(preprocessed_wav)
            print("Created the embedding")

            ## Generating the spectrogram
            text = input("Write a sentence (+-20 words) to be synthesized:\n")

            # If seed is specified, reset torch seed and force synthesizer reload
            if args.seed is not None:
                torch.manual_seed(args.seed)
                synthesizer = Synthesizer(args.syn_model_fpath)

            # The synthesizer works in batch, so you need to put your data in a list or numpy array
            texts = [text]
            embeds = [embed]
            # If you know what the attention layer alignments are, you can retrieve them here by
            # passing return_alignments=True
            specs = synthesizer.synthesize_spectrograms(texts, embeds)
            spec = specs[0]
            print("Created the mel spectrogram")

            ## Generating the waveform
            print("Synthesizing the waveform:")

            # If seed is specified, reset torch seed and reload vocoder
            if args.seed is not None:
                torch.manual_seed(args.seed)
                vocoder.load_model(args.voc_model_fpath)

            # Synthesizing the waveform is fairly straightforward. Remember that the longer the
            # spectrogram, the more time-efficient the vocoder.
            generated_wav = vocoder.infer_waveform(spec)

            ## Post-generation
            # There's a bug with sounddevice that makes the audio cut one second earlier, so we
            # pad it.
            generated_wav = np.pad(generated_wav, (0, synthesizer.sample_rate), mode="constant")

            # Trim excess silences to compensate for gaps in spectrograms (issue #53)
            generated_wav = encoder.preprocess_wav(generated_wav)

            # Play the audio (non-blocking)
            if not args.no_sound:
                try:
                    sd.stop()
                    sd.play(generated_wav, synthesizer.sample_rate)
                except sd.PortAudioError as e:
                    print("\nCaught exception: %s" % repr(e))
                    print("Continuing without audio playback. Suppress this message with the \"--no_sound\" flag.\n")
                except:
                    raise

            # Save it on the disk
            filename = "demo_output_%02d.wav" % num_generated
            print(generated_wav.dtype)
            sf.write(filename, generated_wav.astype(np.float32), synthesizer.sample_rate)
            num_generated += 1
            print("\nSaved output as %s\n\n" % filename)

        except Exception as e:
            print("Caught exception: %s" % repr(e))
            print("Restarting\n")