Hi there, many thanks for your excellent work and code!

When I followed your code to train a nasty ResNet50 on CIFAR-100, training fails from the very beginning: the loss keeps increasing and becomes NaN within the first several batches. Following your instructions, I first trained an original ResNet50 and then trained the nasty teacher with the commands below, without modifying any params.json file.

python .\train_scratch.py --save_path .\experiments\CIFAR100\baseline\resnet50 --gpu_id 0
python .\train_nasty.py --save_path .\experiments\CIFAR100\kd_nasty_resnet50\nasty_resnet50 --gpu_id 0

Here are the training logs for the two models:
baseline_resnet50_training.log
nasty-resnet50-training.log

I checked the hyper-parameters in the nasty ResNet50 json file and they do match the paper. It also seems strange that the other nasty networks trained on CIFAR-100 (ResNet18, ResNeXt-29) do not show this issue.

Have you encountered this kind of problem, and would you have any suggestions? Also, would it be possible to share your training log of nasty ResNet50 on CIFAR-100? Thank you so much for your time in advance, and I look forward to your reply!
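For anyone hitting the same divergence, a small guard like the one below (purely illustrative, not part of the original training scripts) can be dropped into the training loop so the run stops as soon as the loss stops being finite, which makes it easier to inspect the offending batch and the logit range before everything turns to NaN.

```python
# Illustrative debugging guard -- not part of the original repo's scripts.
import torch

def assert_finite(loss, logits, step):
    # Stop as soon as the loss or the teacher logits stop being finite, so the
    # offending batch can be inspected instead of silently producing NaNs.
    if not (torch.isfinite(loss) and torch.isfinite(logits).all()):
        raise RuntimeError(
            f"step {step}: loss={loss.item()}, "
            f"logit range=[{logits.min().item():.2f}, {logits.max().item():.2f}]"
        )
```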
Hi, unfortunately, I have graduated and no longer have access to the original training logs for ResNet50 + CIFAR-100. I posted a similar log for ResNeXt-29 + CIFAR-100 before (#14), and you can take it as a reference. Besides, one potential solution is to reduce the weight of the nasty loss during training, which may help with the collapse.
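To make that suggestion concrete, here is a minimal sketch of a self-undermining KD loss with the adversarial weight exposed as `omega`. It is not the repo's actual train_nasty.py implementation; the names `omega` and `tau` and their defaults are placeholders for whatever the corresponding entries in params.json are called. Reducing `omega` weakens the term that pushes the nasty teacher away from the pre-trained adversarial network, which is the term most likely to blow up in the first batches.

```python
# Illustrative sketch only -- not the repo's actual implementation. The
# hyper-parameter names (omega, tau) and defaults are assumptions; map them to
# the corresponding keys in your params.json.
import torch
import torch.nn.functional as F

def nasty_teacher_loss(teacher_logits, adversarial_logits, targets,
                       omega=0.01, tau=4.0):
    # Cross-entropy on the true labels keeps the nasty teacher accurate.
    ce = F.cross_entropy(teacher_logits, targets)
    # KL divergence between the softened outputs of the (frozen) adversarial
    # network and the nasty teacher; this term is maximized (note the minus
    # sign below), which is what makes the teacher hard to distill from.
    kl = F.kl_div(
        F.log_softmax(teacher_logits / tau, dim=1),
        F.softmax(adversarial_logits.detach() / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)
    # A smaller omega weakens the self-undermining term, which can keep the
    # total loss from running away during the first batches.
    return ce - omega * kl
```

Since the divergence reportedly happens within the first few batches, halving the weight relative to the value in params.json, or warming it up from a small value over the first epoch, would be a reasonable first thing to try.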