
Voice Consistency Working Pretty Well -- Plus Zero-Shot Cloning! #139

Open
apresence opened this issue Sep 21, 2024 · 27 comments

Comments

@apresence
Contributor

apresence commented Sep 21, 2024

I've managed to get a POC with voice consistency working pretty well. Along the way, I've figured out how to do ok-ish zero-shot voice cloning, too. It took drawing on tidbits spread across several issues posted here, the HF repos, the various GitHub sources linked here and there, and about two weeks of experimentation on my part to get going.

Here is an example of zero-shot voice cloning. For each sentence, I alternate the ground truth and Parler-TTS audio between the left and right channels. I also lead the ground truth audio with an upward tone, and the Parler audio with a downward tone. I did this primarily for my own purposes so I could compare them more closely.

The ground truth audio is from a YouTube interview found here.

Only a 5-second snippet of ground truth audio was required to do the clone. Each sentence in the audio sample is a new Parler-TTS generation using text from the audio transcript. As you can hear, the consistency is pretty good. It's even better for voices in the training dataset.

For comparison, here is an example of cloning vs. non-cloning generation. All the settings are the same between the two; the only difference is whether the cloning feature is on or off.

Code, credits and further details forthcoming -- I have to clean up things and get rid of some bugs first for fear that the code-shamers will eat me alive. 😅

@apresence
Contributor Author

apresence commented Sep 21, 2024

Here is an example with the new voice steering feature on, and one with it off.

It's a simple on/off setting. Other than that, you'd use Parler-TTS just like you normally would. I've also added the ability to save voices you like so you can reuse them later, even between program executions.

Again, each sentence is a separate generation. With steering on, the voice consistency is pretty good. With it off, it varies considerably.

The model, voice description, seed, etc. are all the same between the two examples, only the new steering feature was turned on or off.

@apresence
Contributor Author

apresence commented Sep 21, 2024

Here's an updated voice clone example. I had used mini before because its output is more consistent. Although it doesn't sound as good, I was able to one-shot it.

Large takes a lot of wrangling to get it to behave, so it took a few passes. It could be that my source audio is not good enough (background hum, mic pops, echoes).

Anyway, this is pretty good for a quick POC!

@apresence
Contributor Author

apresence commented Sep 22, 2024

I've got Parler-TTS doing zero-shot crying now. Check it out here. 100% of this audio was generated by Parler-TTS, with some light editing in Audacity.

@apresence
Contributor Author

apresence commented Sep 23, 2024

I'm just having way too much fun with Parler. I did a radio DJ voice for a fake podcast I call Under the Covers with ImcE™. Check it out here. ImcE's voice was generated by Parler-TTS, even the parts where he fumbles his speech and does his vocal warm-up. The singing was generated with RVC. Audio clips came from the same YouTube interview I mentioned earlier.

@apresence
Contributor Author

PR #141 submitted. This is in preparation for the voice steering feature.

@suman819

Could you provide information about how to implement voice consistency in audio?

@apresence
Contributor Author

suman819
Could you provide information about how to implement voice consistency in audio?

I'm working on code to do that. In the meantime, I submitted a PR that is required.

When I'm done there will be a working example to start from.

Soon!

@suman819

suman819 commented Sep 25, 2024

suman819
Could you provide information about how to implement voice consistency in audio?

I'm working on code to do that. In the meantime, I submitted a PR that is required.

When I'm done there will be a working example to start from.

Soon!

Thank you for the support and update!

@Guppy16

Guppy16 commented Sep 27, 2024

This is a working snippet to continue audio in the same style as a speaker.
In the snippet, we require:

  • init_audio_file: path to an audio file (init_audio.wav) containing the speaker's voice
  • init_prompt: what the speaker has said

Currently, this is working with either of the following PRs:

Credit to @ylacombe (from #110 (comment)) for fleshing out this snippet - I've just manipulated it a bit for simplicity.

import soundfile as sf
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoTokenizer, set_seed

from parler_tts import ParlerTTSForConditionalGeneration

# TODO: Adapt the following as per your requirements
init_audio_file = "path/to/init_audio.wav"
init_prompt = "Here, write the transcript of the init audio"
description = (
    "A man speaker speaks quickly with a low-pitched voice. "
    "The recording is of very high quality, with the speaker's voice sounding clear and very close up."
)
prompt = "Is it really working ?"

# Load the Models
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "parler-tts/parler-tts-mini-v1"
model = ParlerTTSForConditionalGeneration.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

SAMPLING_RATE = model.config.sampling_rate

# Load the init audio
init_audio, init_sr = torchaudio.load(init_audio_file)
init_audio = torchaudio.functional.resample(init_audio, init_sr, SAMPLING_RATE)
init_audio = init_audio.mean(0)  # Take the mean across the channel dim
# Encode the init audio using the feature extractor
input_values = feature_extractor(init_audio, sampling_rate=SAMPLING_RATE, return_tensors="pt").input_values.to(device)


input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
# NOTE: concatenate the init_prompt and prompt when passing into the model
prompt_input_ids = tokenizer(init_prompt + " " + prompt, return_tensors="pt").input_ids.to(device)
set_seed(2)

# Generate the audio
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, input_values=input_values)

# Save the audio
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, SAMPLING_RATE)

@MiniXC

MiniXC commented Oct 2, 2024

Thanks for sharing! Weirdly it was working for me with LibriSpeech dev-clean recordings (16k) but not my own (44k). But somehow, downsampling to 16k and then up again to 44k fixes this! Strange behaviour though, maybe something to do with how data was preprocessed during training...

Effectively I added

init_audio = torchaudio.functional.resample(init_audio, init_sr, 16_000)
init_audio = torchaudio.functional.resample(init_audio, 16_000, SAMPLING_RATE)

to the snippet.

(Edit: by "not working", I mean generating <1s of audio without any speech, just a random sound effectively)
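
In context (assuming the two lines go right after torchaudio.load and replace the single resample call in the snippet above), the loading block would look something like this:

init_audio, init_sr = torchaudio.load(init_audio_file)
# Down to 16 kHz first, then back up to the model's sampling rate
init_audio = torchaudio.functional.resample(init_audio, init_sr, 16_000)
init_audio = torchaudio.functional.resample(init_audio, 16_000, SAMPLING_RATE)
init_audio = init_audio.mean(0)  # mono, as in the original snippet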

@apresence
Contributor Author

I think I'm entering the final lap on releasing this code.

One issue I'm seeing is that I get the following from time to time when using voice cloning. I haven't had a chance to look into it yet as I'm focusing on getting the rest of it shored up. Any ideas?

2024-10-03 04:46:17,708 [Thread-11 (_] [ERROR] Exception during generation request f14d6f81-7b14-4736-b7ca-55471ebfc923 with {'prompt_input_ids': tensor([[ 6185, 13830,  1423,    24,    48,    19,   125,  3808, 20253,     7,
           103,    12,   151,     6,   902,   887,     5,   101,   174,    12,
           214,   230,    55,     1,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]],
       device='cuda:0'), 'prompt_attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]], device='cuda:0'), 'input_ids': tensor([[   71,  2335, 12192,    44,    46,  1348,  4974,    28,    46, 16822,
          1929,    16,     3,     9,   182,     3, 24092,  1345,    53,  1164,
            28,   964,  2931,   463,     5,     1,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]], device='cuda:0'), 'streamer': <parler_tts.streamer.ParlerTTSStreamer object at 0x783572757fd0>, 'min_new_tokens': 10, 'input_values': tensor([[[0.0008, 0.0030, 0.0032,  ..., 0.0028, 0.0019, 0.0011]]],
       device='cuda:0')}: 'Traceback (most recent call last):\n  File "/app/parts/cli/parcls.py", line 1594, in _generation_thread_fn\n    _ = self.model_inst.generate(**gt.generation_kwargs)\n        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/home/appuser/miniconda3/envs/parts/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context\n    return func(*args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^\n  File "/app/parts/repo/parler_tts/modeling_parler_tts.py", line 3500, in generate\n    output_ids = output_ids[mask].reshape(batch_size, self.decoder.num_codebooks, -1)\n                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nRuntimeError: shape \'[1, 9, -1]\' is invalid for input of size 4715\n'

@lukaLLM

lukaLLM commented Oct 3, 2024

This is a working snippet to continue audio in the same style as a speaker. [...]

Hi, I tried to use it for voice consistency and ran it like this:

import soundfile as sf
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoTokenizer, set_seed

from parler_tts import ParlerTTSForConditionalGeneration

# TODO: Adapt the following as per your requirements
init_audio_file = "patient_response_good_emotion.wav"
init_prompt = "Here, write the transcript of the init audio"
description = (
    "A man speaker speaks quickly with a low-pitched voice. "
    "The recording is of very high quality, with the speaker's voice sounding clear and very close up."
)
prompt = "Is it really working ?"

# Load the Models
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "parler-tts/parler-tts-mini-v1"
model = ParlerTTSForConditionalGeneration.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

SAMPLING_RATE = model.config.sampling_rate

# Load the init audio
init_audio, init_sr = torchaudio.load(init_audio_file)
init_audio = torchaudio.functional.resample(init_audio, init_sr, SAMPLING_RATE)
init_audio = init_audio.mean(0)  # Take the mean across the channel dim
# Encode the init audio using the feature extractor
input_values = feature_extractor(init_audio, sampling_rate=SAMPLING_RATE, return_tensors="pt").input_values.to(device)


input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
# NOTE: concatenate the init_prompt and prompt when passing into the model
prompt_input_ids = tokenizer(init_prompt + " " + prompt, return_tensors="pt").input_ids.to(device)
set_seed(2)

# Generate the audio
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, input_values=input_values)

# Save the audio
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, SAMPLING_RATE)

but got error TypeError: DACModel.encode() got an unexpected keyword argument 'input_ids'

@Guppy16

Guppy16 commented Oct 3, 2024

@lukaLLM which branch are you on? You need to be on the aforementioned branch to be able to run this code.

@lukaLLM

lukaLLM commented Oct 3, 2024

@lukaLLM which branch are you on? You need to be on the aforementioned branch to be able to run this code.

Okay, my bad, there was a dependency issue I didn't see and it wasn't updated properly, so I need to change my program. @Guppy16 I updated using pip install git+https://github.com/huggingface/parler-tts.git but still get the same error. Should I do it differently? When I check parler_tts.version I get version 0.2.

@apresence
Contributor Author

Thanks for sharing! Weirdly it was working for me with LibriSpeech dev-clean recordings (16k) but not my own (44k). But somehow, downsampling to 16k and then up again to 44k fixes this! Strange behaviour though, maybe something to do with how data was preprocessed during training...

Effectively I added

init_audio = torchaudio.functional.resample(init_audio, init_sr, 16_000)
init_audio = torchaudio.functional.resample(init_audio, 16_000, SAMPLING_RATE)

to the snippet.

(Edit: by "not working", I mean generating <1s of audio without any speech, just a random sound effectively)

I can confirm this. It still goes wonky sometimes, but certainly the unwanted artifacts/blank audio utterances are much less frequent. Of course, the audio quality isn't as good since you're getting 16/32kHz, just resampled to 44.1kHz.

FWIW, I tried 32kHz with similar results. I wonder if it's due to MusicGen (which Parler is based on) being designed for 32kHz? Or perhaps there were 16/32kHz samples in the dataset, thus there is a more diverse pool for the model to pull from. Maybe @ylacombe can comment on this?

@apresence
Contributor Author

@eustlb @ylacombe et al. --

Another little wrinkle with voice steering/cloning. It defeats compilation because the input_values don't support padding or an attention mask as-is. I'm sure it could be done, but I'm trying to focus on getting my code finished.

As an example that we've already covered, let's say I have padding set to 50 for text tokenization. Normally that would result in a guard failure because the cache_position is 1 the first pass, then 51 the second. I know to expect that now.

However, when we have the 50 padding and pass input_values, even more cache entries are created (189 in the following example). This results in another guard failure and another recompilation:

V1004 07:07:29.471000 124948412483136 torch/_dynamo/guards.py:2611] [0/2] [__recompiles] Recompiling function forward in /app/parts/repo/parler_tts/modeling_parler_tts.py:2576
V1004 07:07:29.471000 124948412483136 torch/_dynamo/guards.py:2611] [0/2] [__recompiles]     triggered by the following guard failure(s):
V1004 07:07:29.471000 124948412483136 torch/_dynamo/guards.py:2611] [0/2] [__recompiles]     - tensor 'L['cache_position']' size mismatch at index 0. expected 1, actual 189
V1004 07:07:29.471000 124948412483136 torch/_dynamo/guards.py:2611] [0/2] [__recompiles]     - tensor 'L['cache_position']' size mismatch at index 0. expected 51, actual 189

This is tolerable as long as you are using the same input_values you compiled with, but as soon as you use different values, the cache size changes, resulting in a guard failure and another recompilation.

In other words, anyone using this would have to wait for compilation (which can take several minutes) any time they'd use a different voice with steering/cloning.

Any ideas?

@apresence
Contributor Author

Another little wrinkle with voice steering/cloning. It defeats compilation because the input_values don't support padding or an attention mask as-is. [...] Any ideas?

OK, I retract my statement. It does not seem to trigger a recompile.

Thanks!

@apresence
Contributor Author

Another little wrinkle with voice steering/cloning. [...]

OK, I retract my statement. It does not seem to trigger a recompile. [...]

Scratch my scratch.

It recompiles when the length of input_values is longer than the previous length.
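
One untested idea for keeping the compiled shapes static (just a sketch, not something I've verified): pad or trim the reference audio to a fixed duration before it goes through the feature extractor, so input_values always has the same length. Variable names follow the snippet earlier in this thread.

import torch.nn.functional as F

# Hypothetical workaround, untested: force every reference clip to the same length so
# the encoded input_values (and therefore the cache positions) never grow between voices.
FIXED_REF_SECONDS = 10
target_len = FIXED_REF_SECONDS * SAMPLING_RATE  # SAMPLING_RATE as in the earlier snippet

def pad_or_trim(wav, target_len):
    # wav: 1-D mono waveform, e.g. the init_audio.mean(0) result from the snippet
    if wav.shape[-1] >= target_len:
        return wav[..., :target_len]
    return F.pad(wav, (0, target_len - wav.shape[-1]))  # right-pad with silence

init_audio = pad_or_trim(init_audio, target_len)
input_values = feature_extractor(init_audio, sampling_rate=SAMPLING_RATE, return_tensors="pt").input_values.to(device)

Whether the trailing silence hurts the cloned voice would need testing.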

@lukaLLM

lukaLLM commented Oct 4, 2024

@lukaLLM which branch are you on? You need to be on the aforementioned branch to be able to run this code.

Thanks @Guppy16, my bad, you were right. I got it running without errors, but it works a bit differently than I expected. I tested both branches and created a fresh environment; basically it generates the same audio as in init_audio_file, and the text is the same as in the recorded audio (written out by sf.write("parler_tts_out.wav", audio_arr, SAMPLING_RATE)). So it doesn't use the prompt? I played with it in my code and generated different prompts, but then the voice was different every time, though a bit closer than before. Maybe I am doing something wrong; could somebody explain? I just want a consistent voice for a continuous conversation.

@apresence
Contributor Author

apresence commented Oct 4, 2024

One issue I'm seeing is that I get the following from time to time when using voice cloning. [...] RuntimeError: shape '[1, 9, -1]' is invalid for input of size 4715

I figured it out. The output_ids look something like this:

tensor([[1025,  438,  438,  ..., 1024, 1024, 1024],
        [1025, 1025,  254,  ..., 1024, 1024, 1024],
        [1025, 1025, 1025,  ..., 1024, 1024, 1024],
        ...,
        [1025, 1025, 1025,  ..., 1024, 1024, 1024],
        [1025, 1025, 1025,  ...,  417, 1024, 1024],
        [1025, 1025, 1025,  ...,  947,  720, 1024]], device='cuda:0')
Shape: [9, 1210]

What you're seeing there is the delay pattern mask. Look at the docstring for the function build_delay_pattern_mask for how that works. Anyway, when the mask is removed, which basically just removes the 1025 (BOS) tokens on the left and 1024 (PAD) tokens on the right, you get a 1d tensor. Something like this:

tensor([438, 438, 698,  ..., 741, 947, 720], device='cuda:0') 
Shape: [10800]

Now, the next thing to do is to break it up into codebooks. As far as I can tell, num_codebooks is always 9. To do that, this code is executed:

output_ids = output_ids.reshape(batch_size, num_codebooks, -1)

And you end up with something like this:

tensor([[[438, 438, 698,  ..., 698, 698, 438],
         [254, 459, 954,  ..., 232, 875, 937],
         [689, 475, 106,  ..., 612, 640,  30],
         ...,
         [426, 522, 639,  ..., 825, 116, 721],
         [364, 520, 257,  ..., 895, 236, 417],
         [702, 223, 462,  ..., 741, 947, 720]]], device='cuda:0')

That works all fine and dandy as long as the length of the 1d tensor is a multiple of 9. If it's not, you get that error about shape [1, 9, -1] being invalid. The 1 is the batch size, 9 is num_codebooks, and -1 picks up the length of the source tensor.
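
A minimal standalone illustration of that constraint (plain PyTorch, not Parler-TTS code):

import torch

num_codebooks = 9

flat_ok = torch.zeros(10800, dtype=torch.long)       # 10800 % 9 == 0
print(flat_ok.reshape(1, num_codebooks, -1).shape)   # torch.Size([1, 9, 1200])

flat_bad = torch.zeros(4715, dtype=torch.long)       # 4715 % 9 != 0
try:
    flat_bad.reshape(1, num_codebooks, -1)
except RuntimeError as e:
    print(e)  # shape '[1, 9, -1]' is invalid for input of size 4715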

So here's my proposed "fix", to be run after reverting the mask and before reshape:

        num_codebooks = self.decoder.num_codebooks
        rem_len = output_ids.size(0) % num_codebooks
        if rem_len != 0:
            # Calc how many pad tokens needed, then append them to the output_ids
            pad_len = num_codebooks - rem_len
            # Experimenting with different options for padding here, including just repeating the last token
            pad_tok = output_ids[-1] # ... or generation_config.pad_token_id
            output_ids = torch.cat((output_ids, pad_tok.expand(pad_len)), dim=0)

With that the error goes away, but the audio is always blank in the testing I've been doing. So there must be something else deeper going on ...

@apresence
Contributor Author

For all those folks who keep commenting about issues with voice steering/cloning: I can assure you, there are lots of little gotchas. I am working on code that takes care of all of them, and I will release it when the kinks are out. It makes a lot more sense for one person to figure it out and share than for 100 people to run into the same problems ;).

@lukaLLM

lukaLLM commented Oct 4, 2024

@apresence thanks for your good work. I like the project a lot, so I am rooting for you. It would be nice to contribute once I've learned a bit more.

@Guppy16

Guppy16 commented Oct 7, 2024

@lukaLLM which branch are you on? You need to be on the aforementioned branch to be able to run this code.

Thanks @Guppy16, my bad, you were right. I got it running without errors, but it works a bit differently than I expected. I tested both branches and created a fresh environment; basically it generates the same audio as in init_audio_file, and the text is the same as in the recorded audio (written out by sf.write("parler_tts_out.wav", audio_arr, SAMPLING_RATE)). So it doesn't use the prompt? I played with it in my code and generated different prompts, but then the voice was different every time, though a bit closer than before. Maybe I am doing something wrong; could somebody explain? I just want a consistent voice for a continuous conversation.

Hey @lukaLLM, great to see that you have it working! Unfortunately, voice cloning is very difficult to get right with a "new" voice; this feature is better used to continue a voice that has already been generated. Here are some things you can try:

  • use a voice the model was already trained on (e.g. Jenny or the default Parler-TTS voices)
  • try a much longer enrolment
  • keep the seed the same, e.g. by calling set_seed(42) before every generation (see the sketch below)
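
A minimal sketch of the seed-pinning tip, appended to the snippet from earlier in this thread (it reuses model, tokenizer, input_ids, input_values, init_prompt, etc.; the sentences are placeholders):

from transformers import set_seed

sentences = ["First test sentence.", "And a second one to check consistency."]
for i, sentence in enumerate(sentences):
    set_seed(42)  # same seed before each generation
    prompt_ids = tokenizer(init_prompt + " " + sentence, return_tensors="pt").input_ids.to(device)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids, input_values=input_values)
    sf.write(f"parler_tts_out_{i}.wav", generation.cpu().numpy().squeeze(), SAMPLING_RATE)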

Hope this helps

@MiniXC

MiniXC commented Oct 8, 2024

FWIW, I tried 32kHz with similar results. I wonder if it's due to MusicGen (which Parler is based on) being designed for 32kHz? Or perhaps there were 16/32kHz samples in the dataset, thus there is a more diverse pool for the model to pull from.

@apresence I looked a bit more into this, and the training data seems to be 24kHz (LibriTTS-R) and 48kHz (MLS) so theoretically either of those should be fine. When training the model, both are resampled to 44kHz - so maybe the model just struggles when there is no resampling? I will have to try 48kHz input next.

@apresence
Contributor Author

Just wanted to update those who have been waiting: I have continued to actively work on this and hope to finish soon.

Thanks!

@ylacombe
Collaborator

Many thanks to @Guppy16 and @apresence for your work on this!

If you're interested in working towards the next step, it'd be great to:

  1. document how to do voice consistency, as proposed here, in the README and the INFERENCE.md
  2. create a demo or enrich the current demo. The way I see it, a user could provide an optional input audio and an optional transcript. If there's no transcript, we can automatically use Whisper turbo to transcribe the audio (see the sketch below).

Would anyone be interested in contributing?
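
Not a full demo, but a rough sketch of the optional-transcript fallback in point 2 (the Whisper checkpoint and helper name here are just assumptions):

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")

def resolve_init_prompt(init_audio_file, transcript=None):
    # Use the user-provided transcript if there is one, otherwise transcribe the reference audio
    if transcript:
        return transcript
    return asr(init_audio_file)["text"].strip()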

@apresence
Contributor Author

apresence commented Oct 14, 2024 via email
