Hello, thanks for your awesome work! @kevinlin311tw
I noticed in your official multi-GPU training tutorial that for 2 GPUs you set args.learning_rate = 3e-4 and args.backbone_coef_lr = 0.05, which means the backbone learning rate reaches 1.5e-5 after the warm-up epoch.
However, the official tensorboard_log extracted from msrvtt-table1 comes from a run on 16 GPUs, and TensorBoard shows the learning rate there also reached 1.5e-5 after warm-up, the same value as in the 2-GPU setup above.
This seems like a problem to me: shouldn't the learning rate be scaled according to the world size? In my opinion, the learning rate should be larger for a larger world size, but I haven't found any such scaling in your code.
Looking forward to your reply!
Thank you for the question. In our experiments, we use 16 GPUs for training. If you use a different number of GPUs, the parameters should be adjusted manually.
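Since the repository does not scale the learning rate by world size automatically, one common manual adjustment is the linear scaling rule: multiply the base learning rate by `world_size / reference_world_size`. The sketch below illustrates this idea; the function name `scaled_lrs` and the reference world size of 16 are assumptions for illustration, not part of the repository's code.

```python
# Hypothetical helper illustrating the linear scaling rule.
# Assumption: the paper's reported settings correspond to 16 GPUs,
# so 16 is used as the reference world size here.

def scaled_lrs(base_lr: float, backbone_coef: float,
               world_size: int, reference_world_size: int = 16):
    """Return (head_lr, backbone_lr) linearly scaled by GPU count."""
    scale = world_size / reference_world_size
    head_lr = base_lr * scale          # learning rate for the head
    backbone_lr = head_lr * backbone_coef  # backbone uses a smaller coefficient
    return head_lr, backbone_lr

# Example: 2 GPUs with the tutorial's settings (3e-4, coefficient 0.05).
head_lr, backbone_lr = scaled_lrs(3e-4, 0.05, world_size=2)
print(head_lr, backbone_lr)  # 3.75e-05 1.875e-06
```

Whether linear scaling is appropriate for this model is an empirical question; as noted above, the maintainers only say the parameters should be adjusted manually when the GPU count changes.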