coqui engine is unusable #237
"Coqui engine is unusable" sounds a bit harsh. Your hardware should be more than enough to synthesize a sentence in a few seconds. My guess? You've installed CUDA but didn’t configure PyTorch to actually use it. Check the instructions here: Run this and let me know what it says: import torch
print("CUDA is available!" if torch.cuda.is_available() else "CUDA is not available.") If CUDA is installed properly, try enabling DeepSpeed for a speed boost (almost 2x faster): pip install torch==2.1.2+cu121 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install https://github.com/daswer123/deepspeed-windows-wheels/releases/download/11.2/deepspeed-0.11.2+cuda121-cp310-cp310-win_amd64.whl Here’s a quick test script with extended logging: if __name__ == "__main__":
from RealtimeTTS import TextToAudioStream, CoquiEngine
import time
def dummy_generator():
yield "Hey guys! These here are realtime spoken sentences based on local text synthesis. "
yield "With a local, neuronal, cloned voice. So every spoken sentence sounds unique."
import logging
logging.basicConfig(level=logging.INFO)
engine = CoquiEngine(level=logging.INFO, use_deepspeed=True)
stream = TextToAudioStream(engine, muted=True)
print("Starting to play stream")
start_time = time.time()
stream.feed(dummy_generator()).play(log_synthesized_text=True, muted=True, output_wavfile=stream.engine.engine_name + "_output.wav")
end_time = time.time()
print(f"Time taken for play command: {end_time - start_time:.2f} seconds")
engine.shutdown() You should see something like this in the output:
For comparison, on my 4090, I get:
That’s for a 16-second generated audio file, which translates to a real-time factor of 0.22625. Your RTX 3060 should easily manage a real-time factor below 1. So yeah, the engine is definitely not "unusable." A project like OpenInterpreter 01, which has 5,000+ GitHub stars, wouldn’t rely on it if that were the case. Let’s figure this out. 😊
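For reference, the real-time factor is just synthesis time divided by the duration of the audio produced. A minimal sketch of that calculation, where the 3.62 s synthesis time is back-calculated from the quoted 0.22625 RTF and 16-second file, not a fresh measurement:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.

    Values below 1.0 mean faster-than-realtime synthesis.
    """
    return synthesis_seconds / audio_seconds

# Illustrative numbers: ~3.62 s to synthesize a 16 s file on a 4090.
print(f"{real_time_factor(3.62, 16.0):.5f}")  # 0.22625
```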
I'm gonna try this too. The output currently speaks only a few words, pauses, plays a few more, then pauses again, even with your Coqui example script. I installed everything per the instructions in a new folder and virtual env. I'll get back with results.
It's working now. Apparently my torch install didn't have CUDA. I was able to install torch with CUDA 12.1 from https://pytorch.org/get-started/locally/, then DeepSpeed 0.13.1 for Python 3.11 from the daswer123 repo. I ran the Coqui example again and it's buttery smooth.
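A quick way to spot this problem: CUDA-enabled PyTorch wheels carry a local version tag like `+cu121`, while CPU-only wheels carry `+cpu` or no tag at all. A minimal sketch of that check (the version strings below are illustrative examples, not output from any machine in this thread):

```python
def is_cuda_build(torch_version: str) -> bool:
    """Return True if a torch version string carries a CUDA local tag like '+cu121'."""
    _, _, local_tag = torch_version.partition("+")
    return local_tag.startswith("cu")

print(is_cuda_build("2.1.2+cu121"))  # True  -> CUDA wheel
print(is_cuda_build("2.1.2+cpu"))    # False -> CPU-only wheel
print(is_cuda_build("2.1.2"))        # False -> no local tag
```

In practice you would pass in `torch.__version__`; `torch.cuda.is_available()` remains the authoritative check.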
I've got a problem I've been trying to figure out for 2 weeks now and cannot get it to work.
The Coqui engine is basically unusable: synthesis takes more than 30 seconds per sentence.
I've got all the dependencies installed, as well as CUDA.
```python
engine = CoquiEngine(
    device="cuda",
    language="de",
    level=logging.INFO,
    local_models_path=r"C:\Users\Fuat\Desktop\Realtime SST\cacheCustom",
)
engine.set_voice("Damien Black")
```
Also, I've tried switching to a different model, but the engine throws an error as soon as model_name or specific_model is set to anything other than xtts2...
My PC specs are:
CPU: AMD Ryzen 5 5600X 6-Core Processor 3.70 GHz
GPU: RTX 3060 12GB
RAM: 32 GB
I'm pretty lost on this, so any help would be appreciated!