Starting first-stage training gives errors. I have 96 GB RAM, 3 P40 (24 GB) and 1 T4 (16 GB) #178
Comments
You're running out of memory, quite possibly because of your config settings. On a T4 GPU you won't be able to train much, since it has very limited VRAM (16 GB). The best I could do with a T4 on Google Colab was to fine-tune the LJSpeech model with my own set of 1 to 1.25 second long WAV files and settings. You can check out the https://github.com/yl4579/StyleTTS2/blob/main/Colab/StyleTTS2_Finetune_Demo.ipynb Colab notebook for an example of how to fine-tune on a T4, or the other Colab notebooks ( https://github.com/yl4579/StyleTTS2/tree/main/Colab ) for inspiration.
Thanks.
No, this is not yet possible due to issue #7, and because the fine-tuning script is built on top of the phase 2 training script, which suffers from this issue. The best you can get is accelerated fine-tuning on a single processor, which is marginally faster and uses a little less memory, but it's not a lot:
Thanks. One more question: I started the first stage and it ran for about 10 hours, then the process was killed in the terminal. I see it stopped at 7/200. Is there a way to continue after a stop, instead of starting from zero every time?
Yes, the training process saves checkpoint files - as many of them as you set in config via the To do so, just set the If you wanted to resume 2nd stage training, you'll need to provide 2nd stage checkpoint file and set |
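To pick the file to resume from, a small helper can find the newest checkpoint in the log directory. This is only a sketch: the `epoch_1st_00007.pth` style of file name, the `CHECKPOINT_PATTERN` regex, and the `latest_checkpoint` helper are assumptions for illustration and are not taken from the training scripts, so check what your run actually wrote to disk before relying on it.

```python
import re
from pathlib import Path

# Assumed (hypothetical) naming scheme: epoch_1st_00007.pth, epoch_2nd_00003.pth, ...
CHECKPOINT_PATTERN = re.compile(r"epoch_(1st|2nd)_(\d+)\.pth$")


def latest_checkpoint(log_dir, stage="1st"):
    """Return the Path of the highest-numbered checkpoint for a stage, or None."""
    best = None  # (epoch, path) of the newest checkpoint seen so far
    for path in Path(log_dir).glob("*.pth"):
        match = CHECKPOINT_PATTERN.search(path.name)
        if match and match.group(1) == stage:
            epoch = int(match.group(2))
            if best is None or epoch > best[0]:
                best = (epoch, path)
    return best[1] if best else None
```

The returned path is what you would point the config's checkpoint setting at before restarting the run; the helper itself does not load or modify anything.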
Thank you very much, Martin. Have a nice year.
Sorry for the late reply. I was quite busy recently. Fine-tuning should use all GPUs. I have tested the fine-tuning script on 4 NVIDIA A100s and it worked perfectly well. Have you checked using
Then is the following statement from the README incorrect or did I simply misunderstand what you meant there?
Fine-tuning is a modified second-stage training, so it might work, but I don't think so. Could the author create a post showing where the second-stage training fails? Once that's posted, I can take some time to examine it and see if I can debug the issue.
I started it without the accelerator and it runs.
Thanks for the replies!
LJSpeech-1.1/wavs/LJ043-0101.wav 22050
LJSpeech-1.1/wavs/LJ028-0188.wav 22050
LJSpeech-1.1/wavs/LJ018-0108.wav 22050
LJSpeech-1.1/wavs/LJ032-0211.wav 22050
LJSpeech-1.1/wavs/LJ034-0082.wav 22050
LJSpeech-1.1/wavs/LJ016-0002.wav 22050
LJSpeech-1.1/wavs/LJ013-0165.wav 22050
LJSpeech-1.1/wavs/LJ046-0247.wav 22050
LJSpeech-1.1/wavs/LJ017-0130.wav 22050
LJSpeech-1.1/wavs/LJ013-0176.wav 22050
LJSpeech-1.1/wavs/LJ042-0162.wav 22050
LJSpeech-1.1/wavs/LJ029-0201.wav 22050
LJSpeech-1.1/wavs/LJ016-0139.wav 22050
LJSpeech-1.1/wavs/LJ017-0258.wav 22050
LJSpeech-1.1/wavs/LJ004-0135.wav 22050
LJSpeech-1.1/wavs/LJ016-0149.wav 22050
LJSpeech-1.1/wavs/LJ024-0108.wav 22050
LJSpeech-1.1/wavs/LJ007-0078.wav 22050
LJSpeech-1.1/wavs/LJ014-0157.wav 22050
LJSpeech-1.1/wavs/LJ047-0208.wav 22050
LJSpeech-1.1/wavs/LJ013-0240.wav 22050
LJSpeech-1.1/wavs/LJ028-0059.wav 22050
Traceback (most recent call last):
  File "/home/koce/StyleTTS/train_first.py", line 393, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/koce/StyleTTS/train_first.py", line 214, in main
    loss_reg = r1_reg(out, gt)
  File "/home/koce/StyleTTS/utils.py", line 60, in r1_reg
    grad_dout = torch.autograd.grad(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 394, in grad
    result = Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 78.00 MiB. GPU 0 has a total capacty of 23.87 GiB of which 25.62 MiB is free. Process 1178071 has 7.50 GiB memory in use. Process 1178069 has 10.88 GiB memory in use. Process 1178070 has 5.46 GiB memory in use. Of the allocated memory 10.44 GiB is allocated by PyTorch, and 249.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/koce/StyleTTS/train_first.py", line 393, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/koce/StyleTTS/train_first.py", line 207, in main
    mel_rec = model.decoder(en, F0_real, real_norm, s)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/koce/StyleTTS/models.py", line 461, in forward
    x = block(x, s)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/koce/StyleTTS/models.py", line 407, in forward
    out = self._residual(x, s)
  File "/home/koce/StyleTTS/models.py", line 403, in _residual
    x = self.conv2(self.dropout(x))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1557, in _call_impl
    args_result = hook(self, args)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py", line 67, in __call__
    setattr(module, self.name, self.compute_weight(module))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py", line 26, in compute_weight
    return _weight_norm(v, g, self.dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 23.87 GiB of which 9.62 MiB is free. Process 1178071 has 7.50 GiB memory in use. Process 1178069 has 10.89 GiB memory in use. Process 1178070 has 5.46 GiB memory in use. Of the allocated memory 7.08 GiB is allocated by PyTorch, and 254.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
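The last error line above already says a lot about where the memory went: it lists three separate processes holding memory on GPU 0. A quick way to see this is to parse the message and add up the per-process figures. The sketch below does only that; the `summarize_oom` helper and its regular expressions are illustrative assumptions written for this message format, not part of PyTorch or the training scripts.

```python
import re

# The relevant sentences of the last error line above, copied verbatim
# (including PyTorch's own "capacty" spelling).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 20.00 MiB. "
    "GPU 0 has a total capacty of 23.87 GiB of which 9.62 MiB is free. "
    "Process 1178071 has 7.50 GiB memory in use. "
    "Process 1178069 has 10.89 GiB memory in use. "
    "Process 1178070 has 5.46 GiB memory in use."
)


def summarize_oom(message):
    """Return (total_gib, {pid: used_gib}) parsed from a CUDA OOM message."""
    # "capaci?ty" accepts both the misspelled and the corrected wording.
    total = float(re.search(r"total capaci?ty of ([\d.]+) GiB", message).group(1))
    per_process = {
        int(pid): float(gib)
        for pid, gib in re.findall(r"Process (\d+) has ([\d.]+) GiB memory in use", message)
    }
    return total, per_process


total_gib, usage = summarize_oom(OOM_MESSAGE)
used_gib = sum(usage.values())
print(f"{len(usage)} processes use {used_gib:.2f} of {total_gib:.2f} GiB on GPU 0")
```

The three processes together hold about 23.85 GiB of the card's 23.87 GiB, so GPU 0 is full. One possible reading, which is an assumption and not confirmed in this thread, is that all training processes were placed on the same GPU instead of one per card. The roughly 250 MiB that PyTorch reports as reserved but unallocated is small by comparison, so tuning `max_split_size_mb` alone seems unlikely to help here.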