
Training problem #11
Open · chenyinlin1 opened this issue Jun 23, 2024 · 9 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers)

Comments

@chenyinlin1 commented Jun 23, 2024

Hello, and thank you for your outstanding work. However, I am having trouble reproducing the training portion of the code and am not getting the expected results. With your original code, all of the losses come out as NaN, as shown below.
[screenshot]

I tried modifying the loss function slightly; the losses are no longer NaN, but there appears to be no backpropagation.
[screenshot]

All parameters use the default training values, except that batch_size was changed from 36 to 24.

@theEricMa (Owner)

Hi, thanks for your interest in our work. Which dataset are you working with? From the first training log, I can see that none of the training loss items are NaN, so their sum shouldn't be NaN either. This is quite unusual.

@chenyinlin1 (Author)

Thanks for the reply. I'm using the Vocaset dataset for training.

@zhongshijun

I have the same problem. All parameters are the author's default settings, but the final loss does not converge.

@zhongshijun

2024-07-22 18:56:45,895 Epoch 8993: Train_vertice_recon 3.705e-07 Train_vertice_reconv 2.486e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:57:02,721 Epoch 8994: Train_vertice_recon 3.779e-07 Train_vertice_reconv 2.526e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:57:19,895 Epoch 8995: Train_vertice_recon 3.547e-07 Train_vertice_reconv 2.375e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 51.0%
2024-07-22 18:57:36,028 Epoch 8996: Train_vertice_recon 3.612e-07 Train_vertice_reconv 2.399e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:57:51,865 Epoch 8997: Train_vertice_recon 3.704e-07 Train_vertice_reconv 2.469e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:58:07,573 Epoch 8998: Train_vertice_recon 3.607e-07 Train_vertice_reconv 2.420e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:58:22,780 Epoch 8999: Train_vertice_recon 3.760e-07 Train_vertice_reconv 2.518e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 50.9%
2024-07-22 18:58:23,212 Training done

@zhongshijun

2024-07-21 18:10:19,878 Training started
2024-07-21 18:10:32,082 Epoch 0: Train_vertice_recon 3.591e-07 Train_vertice_reconv 2.394e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 39.7%
2024-07-21 18:10:41,208 Epoch 1: Train_vertice_recon 3.626e-07 Train_vertice_reconv 2.420e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 41.3%
2024-07-21 18:10:52,553 Epoch 2: Train_vertice_recon 3.684e-07 Train_vertice_reconv 2.463e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 42.3%
2024-07-21 18:11:01,868 Epoch 3: Train_vertice_recon 3.645e-07 Train_vertice_reconv 2.435e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 42.3%
2024-07-21 18:11:10,572 Epoch 4: Train_vertice_recon 3.666e-07 Train_vertice_reconv 2.449e-08 Train_lip_recon 0.000e+00 Train_lip_reconv 0.000e+00 Val_vertice_recon 5.470e-07 Val_vertice_reconv 3.962e-08 Val_lip_recon 0.000e+00 Val_lip_reconv 0.000e+00 Memory 42.6%

@xopclabs

Same issue for me! I used prepare_data_voca.py from the FaceFormer repo to unpack the Vocaset data and ran the training script with default parameters.

@theEricMa added the "bug" label on Jul 31, 2024
@yangyifan18

The NaN loss is caused by the None returned from the update() function in DIFFUSION_BIAS. When you override update() on a torchmetrics Metric, no return value is expected.
[screenshot]

So you should rewrite the loss computation in allsplit_step instead; following this link worked for me: #5 (comment)
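To illustrate this failure mode, here is a minimal, self-contained sketch in plain Python. The Metric wrapper and the DiffusionBiasLoss class below are simplified stand-ins for torchmetrics and DiffSpeaker's DIFFUSION_BIAS metric, not the actual code:

```python
import functools

class Metric:
    """Stand-in for torchmetrics.Metric: the constructor wraps the
    user-defined update() so that its return value is discarded."""
    def __init__(self):
        self.update = self._wrap_update(self.update)

    def _wrap_update(self, update):
        @functools.wraps(update)
        def wrapped(*args, **kwargs):
            update(*args, **kwargs)  # return value silently dropped
            return None
        return wrapped

class DiffusionBiasLoss(Metric):
    """Illustrative metric that (incorrectly) tries to return the loss."""
    def __init__(self):
        super().__init__()
        self.total = 0.0

    def update(self, recon_loss):
        self.total += recon_loss
        return recon_loss  # discarded by the wrapper!

m = DiffusionBiasLoss()
loss = m.update(0.5)
print(loss)     # None: nothing reaches the training loop
print(m.total)  # 0.5: the internal state is still updated
```

Because the wrapper discards update()'s return value, any training step that does `loss = metric.update(...)` and then hands `loss` to the optimizer is operating on None, which surfaces as NaN loss totals and no backpropagation.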

@theEricMa added the "good first issue" label on Oct 9, 2024
@yangyifan18

yangyifan18 commented Dec 3, 2024 via email

update() does still need to be called, so that the recon loss and recon velocity loss are recorded. However, all the losses should be aggregated by rewriting allsplit_step: sum them there and return the result, so that update() only serves as a bookkeeping step.

@Echo-jyt commented Dec 3, 2024

Thanks for the reply!
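The pattern recommended in this thread (update() kept purely for bookkeeping, allsplit_step aggregating and returning the loss) can be sketched as follows. Plain Python floats stand in for tensors, and the names (allsplit_step, the loss keys) follow the discussion; the real DiffSpeaker signatures may differ:

```python
class RecordingMetric:
    """Stand-in for the torchmetrics-based logger: update() records values
    for later reporting and deliberately returns nothing."""
    def __init__(self):
        self.history = []

    def update(self, **losses):
        self.history.append(dict(losses))  # record only; no return value

class Model:
    """Hypothetical Lightning-style module; floats stand in for tensors."""
    def __init__(self):
        self.metric = RecordingMetric()

    def allsplit_step(self, split, losses):
        self.metric.update(**losses)  # logging side effect only
        return sum(losses.values())   # the real loss, returned for backprop

m = Model()
total = m.allsplit_step(
    "train", {"vertice_recon": 3.6e-07, "vertice_reconv": 2.4e-08}
)
print(total is None)  # False: an actual value reaches the optimizer
```

The key point is that the value handed back from the step function is the summed loss itself, never the (None) result of a Metric.update() call.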

6 participants