
Training problem #11
Open · chenyinlin1 opened this issue Jun 23, 2024 · 9 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers)

Comments

@chenyinlin1 commented Jun 23, 2024

Hello, and thank you for your outstanding work. However, I am having trouble reproducing the training portion of the code and am not getting the expected results. With your original code, all of the losses come out as NaN, as shown below.
[screenshot]

I tried modifying the loss function slightly; the losses are no longer NaN, but there appears to be no backpropagation.
[screenshot]

All parameters use the default training values, except that batch_size was changed from 36 to 24.

@theEricMa (Owner)

Hi, thanks for your interest in our work. Which dataset are you working with? From the first training log, I can see that none of the training loss items are NaN, so their sum shouldn't be NaN either. This is quite unusual.

@chenyinlin1 (Author)

Thanks for the reply. I'm using the Vocaset dataset for training.

@zhongshijun

I have the same problem. All parameters are the author's default settings, but the final loss does not converge.

@zhongshijun

2024-07-22 18:56:45,895 Epoch 8993: Train_vertice_recon 3.705e-07 Train_vertice_reconv 2.486e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:57:02,721 Epoch 8994: Train_vertice_recon 3.779e-07 Train_vertice_reconv 2.526e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:57:19,895 Epoch 8995: Train_vertice_recon 3.547e-07 Train_vertice_reconv 2.375e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 51.0%
2024-07-22 18:57:36,028 Epoch 8996: Train_vertice_recon 3.612e-07 Train_vertice_reconv 2.399e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:57:51,865 Epoch 8997: Train_vertice_recon 3.704e-07 Train_vertice_reconv 2.469e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:58:07,573 Epoch 8998: Train_vertice_recon 3.607e-07 Train_vertice_reconv 2.420e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:58:22,780 Epoch 8999: Train_vertice_recon 3.760e-07 Train_vertice_reconv 2.518e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:58:23,212 Training done

@zhongshijun

2024-07-21 18:10:19,878 Training started
2024-07-21 18:10:32,082 Epoch 0: Train_vertice_recon 3.591e-07 Train_vertice_reconv 2.394e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 39.7%
2024-07-21 18:10:41,208 Epoch 1: Train_vertice_recon 3.626e-07 Train_vertice_reconv 2.420e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 41.3%
2024-07-21 18:10:52,553 Epoch 2: Train_vertice_recon 3.684e-07 Train_vertice_reconv 2.463e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 42.3%
2024-07-21 18:11:01,868 Epoch 3: Train_vertice_recon 3.645e-07 Train_vertice_reconv 2.435e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 42.3%
2024-07-21 18:11:10,572 Epoch 4: Train_vertice_recon 3.666e-07 Train_vertice_reconv 2.449e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 42.6%

@xopclabs

Same issue for me! I used prepare_data_voca.py from the FaceFormer repo to unpack the Vocaset data and ran the training script with default parameters.

@theEricMa added the "bug" label on Jul 31, 2024
@yangyifan18

The NaN loss is caused by the None returned from the update() function in DIFFUSION_BIAS. When you override update() on a torchmetrics Metric, no return value is expected.
[screenshot]

So you should rewrite the loss computation in allsplit_step instead; following this link worked for me: #5 (comment)
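To illustrate this failure mode, here is a minimal, self-contained sketch in plain Python. The Metric wrapper and the DiffusionBiasLoss class below are simplified stand-ins for torchmetrics and DiffSpeaker's DIFFUSION_BIAS metric, not the actual code:

```python
import functools

class Metric:
    """Stand-in for torchmetrics.Metric: the constructor wraps the
    user-defined update() so that its return value is discarded."""
    def __init__(self):
        self.update = self._wrap_update(self.update)

    def _wrap_update(self, update):
        @functools.wraps(update)
        def wrapped(*args, **kwargs):
            update(*args, **kwargs)  # return value silently dropped
            return None
        return wrapped

class DiffusionBiasLoss(Metric):
    """Illustrative metric that (incorrectly) tries to return the loss."""
    def __init__(self):
        super().__init__()
        self.total = 0.0

    def update(self, recon_loss):
        self.total += recon_loss
        return recon_loss  # discarded by the wrapper!

m = DiffusionBiasLoss()
loss = m.update(0.5)
print(loss)     # None: nothing reaches the training loop
print(m.total)  # 0.5: the internal state is still updated
```

Because the wrapper discards update()'s return value, any training step that does `loss = metric.update(...)` and then hands `loss` to the optimizer is operating on None, which surfaces as NaN loss totals and no backpropagation.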

@theEricMa added the "good first issue" label on Oct 9, 2024
@yangyifan18

yangyifan18 commented Dec 3, 2024 via email

update() does still need to be called, so that the recon loss and recon velocity loss are recorded. However, all the losses should be aggregated by rewriting allsplit_step: sum them there and return the result, so that update() only serves as a bookkeeping step.

@Echo-jyt commented Dec 3, 2024

Thanks for the reply!
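The pattern recommended in this thread (update() kept purely for bookkeeping, allsplit_step aggregating and returning the loss) can be sketched as follows. Plain Python floats stand in for tensors, and the names (allsplit_step, the loss keys) follow the discussion; the real DiffSpeaker signatures may differ:

```python
class RecordingMetric:
    """Stand-in for the torchmetrics-based logger: update() records values
    for later reporting and deliberately returns nothing."""
    def __init__(self):
        self.history = []

    def update(self, **losses):
        self.history.append(dict(losses))  # record only; no return value

class Model:
    """Hypothetical Lightning-style module; floats stand in for tensors."""
    def __init__(self):
        self.metric = RecordingMetric()

    def allsplit_step(self, split, losses):
        self.metric.update(**losses)  # logging side effect only
        return sum(losses.values())   # the real loss, returned for backprop

m = Model()
total = m.allsplit_step(
    "train", {"vertice_recon": 3.6e-07, "vertice_reconv": 2.4e-08}
)
print(total is None)  # False: an actual value reaches the optimizer
```

The key point is that the value handed back from the step function is the summed loss itself, never the (None) result of a Metric.update() call.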

6 participants