Suggestion: support GPT-SoVITS as TTS (fast voice cloning, so users can talk to their favorite voice rather than a generic AI voice). #92

Open
insufficient-will opened this issue Sep 7, 2024 · 5 comments

Comments

@insufficient-will

Congratulations and many thanks first! I think the project has great potential to become a popular foundation.
If you deem it appropriate, would you support GPT-SoVITS as well?

I know there is already a lot of TTS support, but GPT-SoVITS offers something different: it lets users clone their favorite voice very efficiently.

Talking to AI is inspiring, but hearing the response in a particular voice is what intrigues people, and it could be one of the ultimate goals for people who are willing to talk to a machine. GPT-SoVITS can do a decent voice clone from a few clips in a few minutes, which makes it an ideal addition to the existing TTS solutions.

Best wishes!

@rs545837
Collaborator

rs545837 commented Sep 7, 2024

Did you ever take a look at StyleTTS2?

@insufficient-will
Author

insufficient-will commented Sep 8, 2024

It looks promising. I am in dire need of voice cloning and multilanguage support. Here is a supplement to the issue.

Use scenario
I am making AI-voiced audiobooks and RAG. My audience is a group of third-person-shooter gacha gamers (Snowbreak). I will clone characters' voices, which I will use either to voice a book or to respond to questions.

The TTS has to excel at voice cloning. A pre-trained voice won't do, because the audience doesn't want that voice; each listener wants their own particular favorite.

The TTS should also support multilanguage scenarios, especially Chinese, English, Italian (the game has a popular character with an Italian background) and, if possible, Hindi (for an AI bot; I don't know why a bot is popular in a gacha game, but it happens).

To expand on this topic a bit: for professional use cases, like medical consulting, a pre-trained voice will do, because the key is not the voice but the accuracy of the content. For everyday use cases, though, emotional engagement comes in. This is not limited to gacha games.

Limitation
Amount of voice-clone training data.
Training hardware requirements and time consumption.
For all of these, the less the better.

Current Solution
GPT-SoVITS. It can do a decent clone with 10-50 clips, 3-10 seconds each, in 10 minutes (RTX 3090). It is not perfect yet, as explained below.
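
As a rough back-of-the-envelope check on the figures above, the clip counts and lengths imply somewhere between half a minute and a bit over eight minutes of reference audio in total. The short Python sketch below only multiplies the quoted bounds; the constant and function names are illustrative and are not part of GPT-SoVITS.

```python
# Bounds taken from the comment above: 10-50 clips, 3-10 seconds each.
# All names here are illustrative; nothing below calls GPT-SoVITS.
MIN_CLIPS, MAX_CLIPS = 10, 50
MIN_SECONDS, MAX_SECONDS = 3, 10


def total_audio_seconds(clips: int, seconds_per_clip: int) -> int:
    """Total reference audio for a given clip count and clip length."""
    return clips * seconds_per_clip


low = total_audio_seconds(MIN_CLIPS, MIN_SECONDS)
high = total_audio_seconds(MAX_CLIPS, MAX_SECONDS)
print(f"{low} s to {high} s of audio (at most about {high / 60:.1f} min)")
```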

Current options
Voice clone quality: everyone claims theirs is the best. I won't judge, but I've tried some of the available methods, and they don't come close to my current solution.
CN support: ChatTTS, Melo, and GPT-SoVITS are OK. Parler is not OK.
EN support: Of course all are OK.
Italian and Hindi: Of course none is OK.
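
The comparison above can be written down as a small lookup table, which makes it easy to check which engines cover a given set of required languages. This is only a sketch of the support reported in this comment, not an independently verified matrix; the language codes and the `candidates` helper are hypothetical names chosen for illustration.

```python
# Language support as reported in this comment (not independently verified).
# The language codes are illustrative labels chosen for this sketch.
SUPPORT = {
    "ChatTTS": {"zh", "en"},
    "Melo": {"zh", "en"},
    "GPT-SoVITS": {"zh", "en"},
    "Parler": {"en"},
}


def candidates(required: set) -> list:
    """Return the engines whose reported languages include every required one."""
    return sorted(name for name, langs in SUPPORT.items() if required <= langs)


print(candidates({"zh", "en"}))              # engines covering Chinese and English
print(candidates({"zh", "en", "it", "hi"}))  # engines covering all four languages
```

With the table as reported, no engine in the list covers all four languages, which is the gap described above.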

It looks like StyleTTS2 could be my savior after all.

@andimarafioti
Member

Hey, I would be more than OK with adding support for this TTS. If you want to do it, I think it would be cool, and I would review it 👍

We are still discussing a bit where to take this library next, thank you for sharing your ideas!

@insufficient-will
Author

Right now I am using SillyTavern, Kobold, and GPT-SoVITS to do a kind of speech-to-speech (with the voice I cloned). But it's slow even on a 3090; maybe a 4090 can do better? I have tried this HF speech-to-speech on a Mac, and it is a much better experience. Wherever you are heading, may fortune favor your path.

@PaParaZz1

Thanks for this awesome project. Based on a similar pipeline, we have released a Chinese speech-to-speech project named CleanS2S, which supports more interesting, streaming interactions.

Here is a snapshot of this project:
[Image: 20241008-173750]

Looking forward to more advice and feedback!
