
deepspeed for XTTS #569

Open
hslr4 opened this issue Jul 2, 2024 · 3 comments

hslr4 commented Jul 2, 2024

Since the coqui docs recommend the use of deepspeed to speed up their XTTS model, I wanted to give it a try.

To make it work I did the following (a quick sanity check for the resulting stack is sketched below):

  • I had to rebuild pytorch with USE_NCCL=1 because deepspeed requires NCCL.
  • I also installed libaio-dev, since a warning during the installation of deepspeed recommends it (though I am not sure whether it is actually required).
  • I also had to downgrade setuptools to 69.5.1.
  • Since I built torch 2.1 (I don't know why 😅) I also had to rebuild torchaudio, which might not have been necessary had I rebuilt the same version of torch that was installed in the first place.
  • Finally, I installed deepspeed with TORCH_CUDA_ARCH_LIST="7.2;8.7" pip3 install deepspeed.
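For reference, this is roughly how I verified the rebuilt stack afterwards (just a sketch, not part of the original steps):

```python
# sanity check for the rebuilt stack: confirms the NCCL-enabled torch build
# that deepspeed needs, plus the matching torchaudio/deepspeed installs
import torch
import torchaudio
import deepspeed

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("NCCL available:", torch.distributed.is_nccl_available())  # True only with USE_NCCL=1
print("torchaudio:", torchaudio.__version__)
print("deepspeed:", deepspeed.__version__)
```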

Indeed, inference with XTTS is about twice as fast as without deepspeed (though still slower than other TTS models that do not support voice cloning or multilinguality).
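For anyone wanting to reproduce this: enabling deepspeed is a single flag in the standard coqui loading code (a sketch following their docs; the checkpoint paths and reference clip are placeholders):

```python
# loading XTTS with deepspeed enabled, following the coqui API;
# checkpoint paths and the speaker reference clip are placeholders
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("/path/to/xtts/config.json")

model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)
model.cuda()

outputs = model.synthesize(
    "This sentence is synthesized with the deepspeed-accelerated GPT.",
    config,
    speaker_wav="/path/to/reference.wav",  # voice cloning reference
    gpt_cond_len=3,
    language="en",
)
```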

Would it make sense to provide pre-built pytorch versions with NCCL for such use cases?

dusty-nv (Owner) commented Jul 2, 2024

Thanks @hslr4, those are interesting findings! I will try rebuilding PyTorch here with USE_NCCL=1 and perhaps make a container component for DeepSpeed. Do you know if it is faster than the use_tensorrt mode I added to XTTS? https://github.com/dusty-nv/TTS/commits/dev/

dusty-nv (Owner) commented Jul 2, 2024

OK, pytorch passed with USE_NCCL=1! The pytorch 2.2 wheel for JP6 / Python 3.10 is up here: http://jetson.webredirect.org/jp6/cu122/+f/3a2/9b5771b21e4e1/torch-2.2.0-cp310-cp310-linux_aarch64.whl

I don't believe you should need to recompile the torchvision or torchaudio wheels.

hslr4 (Author) commented Jul 3, 2024

Thank you for the quick reply and help, @dusty-nv!

When I wanted to do some testing regarding inference speed, I noticed that tensorrt is currently only used in inference_stream but not in inference (https://github.com/dusty-nv/TTS/blob/b452e628316e0fe33b8842e5d6ec1eb01fc46ef3/TTS/tts/models/xtts.py#L584).
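The addition I made was roughly the following (a sketch; hifigan_decoder_trt is the TensorRT-wrapped decoder attribute from the dev branch, and the fallback call matches upstream xtts.py — the exact integration in the fork may differ):

```python
# inside Xtts.inference(), mirroring what inference_stream already does:
# use the TensorRT decoder when it is available, otherwise fall back to
# the regular hifigan decoder
if getattr(self, "hifigan_decoder_trt", None) is not None:
    wav = self.hifigan_decoder_trt(gpt_latents, g=speaker_embedding)
else:
    wav = self.hifigan_decoder(gpt_latents, g=speaker_embedding)
```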

After I added it to inference as well, I used a (customized) tts server to generate a set of sentences of different lengths (45 sentences, 65 characters on average, 2,927 characters in total). These are the mean times for generating the speech for these sentences:

| mean time (s) | trt   | deepspeed |
| ------------- | ----- | --------- |
| 5.04          | false | false     |
| 4.94          | fp16  | false     |
| 2.11          | false | true      |
| 1.94          | fp16  | true      |

Seems like using deepspeed on xtts's gpt is more helpful than tensorrt on the hifigan_decoder. Actually, the speedup from trt appears comparatively low, so I'm still not sure whether I'm actually using it right 🙈 Did you test how effective the hifigan_decoder_trt is before?
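For reference, the measurement was essentially equivalent to the following (a sketch; it times model.synthesize directly rather than going through the server, and reuses the model and config objects from the loading snippet above):

```python
# rough per-sentence timing loop; `model` and `config` come from the
# loading snippet above, and `sentences` stands in for the 45 test sentences
import time
import statistics
import torch

sentences = ["..."]  # placeholder for the 45 test sentences

times = []
for text in sentences:
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.synthesize(text, config, speaker_wav="/path/to/reference.wav", language="en")
    torch.cuda.synchronize()
    times.append(time.perf_counter() - start)

print(f"mean time: {statistics.mean(times):.2f} s over {len(times)} sentences")
```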
