Yu Zhang, Ziyue Jiang, Ruiqi Li, Changhao Pan, Jinzheng He, Rongjie Huang, Chuxin Wang, Zhou Zhao | Zhejiang University
PyTorch Implementation of TCSinger (EMNLP 2024): Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control.
We provide our implementation and pre-trained models in this repository.
Visit our demo page for audio samples.
- 2024.12: We released the checkpoints of TCSinger!
- 2024.11: We released the code of TCSinger!
- 2024.09: We released the full dataset of GTSinger!
- 2024.09: TCSinger is accepted by EMNLP 2024!
- We present TCSinger, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control. TCSinger excels in personalized and controllable SVS tasks.
- We introduce the clustering style encoder to extract styles, and the Style and Duration Language Model (S&D-LM) to predict both style information and phoneme duration, addressing style modeling, transfer, and control.
- We propose the style adaptive decoder to generate intricately detailed songs using a novel mel-style adaptive normalization method.
- Experimental results show that TCSinger surpasses baseline models in synthesis quality, singer similarity, and style controllability across various tasks: zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer.
We provide an example of how you can generate high-fidelity samples using TCSinger.
To try it on your own dataset or on GTSinger, simply clone this repo to a local machine with an NVIDIA GPU and CUDA/cuDNN, and follow the instructions below.
You can use all of the pre-trained models we provide here. Note that this TCSinger checkpoint only supports Chinese and English; for multilingual style transfer and control, you should train your own model on GTSinger. Details of each folder are as follows:
| Model | Description |
| --- | --- |
| TCSinger | Acoustic model (config) |
| SAD | Style adaptive decoder (config) |
| SDLM | Style & duration language model (config) |
| HIFI-GAN | Neural vocoder |
A suitable conda environment named `tcsinger` can be created and activated with:

```bash
conda create -n tcsinger python=3.10
conda activate tcsinger
conda install --yes --file requirements.txt
```
By default, this implementation uses as many GPUs in parallel as returned by `torch.cuda.device_count()`. You can specify which GPUs to use by setting the `CUDA_VISIBLE_DEVICES` environment variable before running the training module.
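For example, a minimal Python check (the same restriction that `CUDA_VISIBLE_DEVICES=$GPU` applies in the commands below) that the variable controls which devices PyTorch sees:

```python
import os

# Restrict PyTorch to GPUs 0 and 1; set this before CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch

# Reports 2 on a machine with at least two GPUs.
print(torch.cuda.device_count())
```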
Here we provide a singing voice synthesis pipeline using TCSinger.
- Prepare TCSinger, SAD, SDLM: download the checkpoints and put them at `checkpoints/TCSinger`, `checkpoints/SAD`, and `checkpoints/SDLM`.
- Prepare HIFI-GAN: download the checkpoint and put it at `checkpoints/hifigan`.
- Prepare the prompt information: provide a prompt audio (48 kHz) and fill in the target ph, the target note for each ph, the target note_dur for each ph, and the target note_type for each ph (rest: 1, lyric: 2, slur: 3), as well as the prompt audio path, prompt ph, prompt note, note_dur, and note_type. Enter this information in `inference/style_transfer.py` (a sketch of the expected layout follows the command below). Note: if you want to use Chinese or English data from GTSinger with this checkpoint, refer to phone_set; you have to delete `_zh` or `_en` from each ph of GTSinger!
- Infer with TCSinger (style transfer):

```bash
CUDA_VISIBLE_DEVICES=$GPU python inference/style_transfer.py --config egs/sdlm.yaml --exp_name checkpoints/SDLM
```
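As a purely hypothetical sketch (the actual variable names and format are defined in `inference/style_transfer.py`; every field and value below is an illustrative placeholder), the target and prompt information looks roughly like this:

```python
# Hypothetical layout of the information filled in for style transfer.
# Field names and values are placeholders, not the script's real variables.
target = {
    "ph": ["h", "ao", "t", "ian"],         # target phoneme sequence
    "note": [0, 62, 64, 64],               # target note (MIDI pitch) for each ph; 0 for rest
    "note_dur": [0.40, 0.35, 0.35, 0.50],  # target note duration (seconds) for each ph
    "note_type": [1, 2, 2, 3],             # 1 = rest, 2 = lyric, 3 = slur
}
prompt = {
    "audio_path": "/abs/path/to/prompt_48k.wav",  # 48 kHz prompt audio
    "ph": ["n", "i", "h", "ao"],                  # prompt phonemes
    "note": [60, 60, 62, 62],
    "note_dur": [0.30, 0.30, 0.45, 0.45],
    "note_type": [2, 2, 2, 2],
}
```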
Or, for multi-level style control:

- Prepare the prompt information: provide a prompt audio (48 kHz) and fill in the target ph, the target note for each ph, the target note_dur for each ph, the target note_type for each ph (rest: 1, lyric: 2, slur: 3), and the desired style information (see the sketch after the command below). Enter this information in `inference/style_control.py`. Note: if you want to use Chinese or English data from GTSinger with this checkpoint, refer to phone_set; you have to delete `_zh` or `_en` from each ph of GTSinger!
- Infer with TCSinger (style control). Note that the effectiveness of the style_control feature is suboptimal for certain timbres due to the inclusion of speech and unannotated data; we recommend fine-tuning on GTSinger or another dataset before style control inference:

```bash
CUDA_VISIBLE_DEVICES=$GPU python inference/style_control.py --config egs/sdlm.yaml --exp_name checkpoints/SDLM
```
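Again as a hypothetical sketch only (the accepted control fields and values are defined in `inference/style_control.py`), the multi-level style information could, for instance, combine global labels with per-phoneme technique tags:

```python
# Hypothetical style specification; names and values are illustrative placeholders.
style = {
    "global": {"singing_method": "pop", "emotion": "happy"},   # global style labels
    "technique": ["none", "mixed_voice", "falsetto", "none"],  # one technique tag per target ph
}
```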
Generated wav files are saved in `infer_out` by default.
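For example, to quickly list what was generated (assuming the `soundfile` package is available; it is not necessarily part of this repo's requirements):

```python
import glob

import soundfile as sf

# Print the duration and sample rate of each synthesized wav in infer_out/.
for path in sorted(glob.glob("infer_out/*.wav")):
    audio, sr = sf.read(path)
    print(f"{path}: {len(audio) / sr:.2f} s at {sr} Hz")
```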
- Prepare your own singing dataset or download GTSinger.
- Put `metadata.json` (including ph, word, item_name, ph_durs, wav_fn, singer, ep_pitches, ep_notedurs, and ep_types for each singing voice) and `phone_set.json` (all phonemes of your dictionary) in `data/processed/tc`. (Note: we provide `metadata.json` and `phone_set.json` in GTSinger, but you need to change the wav_fn of each wav in `metadata.json` to your own absolute path.) A sketch of a single metadata entry is shown after this list.
- Set `processed_data_dir` (`data/processed/tc`), `binary_data_dir`, `valid_prefixes` (a list of parts of item names, like `["Chinese#ZH-Alto-1#Mixed_Voice_and_Falsetto#一次就好"]`), and `test_prefixes` in the config.
- Download the global emotion encoder to `emotion_encoder_path` (trained on Chinese only), or train your own global emotion encoder by referring to Emotion Encoder, based on the emotion annotations in GTSinger.
- Preprocess the dataset:

```bash
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config egs/TCSinger.yaml
```
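For reference, one entry of `metadata.json` carries the fields listed above. The concrete values below are hypothetical placeholders; the files shipped with GTSinger are the authoritative format:

```python
import json

# Hypothetical single entry of metadata.json; all values are illustrative placeholders.
entry = {
    "item_name": "Chinese#ZH-Alto-1#Mixed_Voice_and_Falsetto#一次就好#0000",
    "wav_fn": "/abs/path/to/your/wavs/0000.wav",  # must point to your own absolute path
    "singer": "ZH-Alto-1",
    "word": ["一", "次"],
    "ph": ["i", "c", "ir"],
    "ph_durs": [0.42, 0.18, 0.36],      # phoneme durations (seconds)
    "ep_pitches": [64, 66, 66],         # note pitch per phoneme (MIDI)
    "ep_notedurs": [0.42, 0.54, 0.54],  # note duration per phoneme (seconds)
    "ep_types": [2, 2, 2],              # 1 = rest, 2 = lyric, 3 = slur
}

# metadata.json typically holds a list of such entries.
print(json.dumps([entry], ensure_ascii=False, indent=2))
```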
- Train the main model:

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/tcsinger.yaml --exp_name TCSinger --reset
```

- Train SAD:

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/sad.yaml --exp_name SAD --reset
```

- Train SDLM:

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/sdlm.yaml --exp_name SDLM --reset
```

- Run inference with the trained model:

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/sdlm.yaml --exp_name SDLM --infer
```
This implementation uses parts of the code from the following GitHub repos: NATSpeech and TCSinger, as described in our code.
If you find this code useful in your research, please cite our work:
```bib
@article{zhang2024tcsinger,
  title={TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control},
  author={Zhang, Yu and Jiang, Ziyue and Li, Ruiqi and Pan, Changhao and He, Jinzheng and Huang, Rongjie and Wang, Chuxin and Zhao, Zhou},
  journal={arXiv preprint arXiv:2409.15977},
  year={2024}
}
```
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's singing without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this requirement, you could be in violation of copyright laws.