Describe the bug
We do two identical runs with the same data, global batch size, number of GPUs, and number of nodes. The only difference is that one run uses micro_batch_size=8 and the other micro_batch_size=4; there is no other difference in configuration or settings. Looking at the loss curves for both the training and validation datasets, the two runs gradually diverge, and the gap becomes significant after 5-10k steps.
Steps/Code to reproduce bug
We are using MegatronGPTSFTModel, and other training settings include megatron_amp_O2=false, bf16-mixed precision, and tensor_model_parallel_size=2.
We couldn't find any existing issues or bug fixes that would explain this. Are there suggestions for settings to check, or other potential causes of the problem?
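For concreteness, the intended difference between the two runs boils down to a single override. The sketch below is illustrative only: the batch-size, device, and node counts are placeholders, and the keys follow the usual NeMo megatron_gpt_*_config.yaml layout rather than our exact launch command.

```python
# Illustrative sketch of the shared config and the single differing key
# (placeholder values; not our exact data paths, batch size, or node count).
from omegaconf import OmegaConf

common = OmegaConf.create({
    "trainer": {"devices": 8, "num_nodes": 2, "precision": "bf16-mixed"},
    "model": {
        "global_batch_size": 128,          # identical in both runs
        "tensor_model_parallel_size": 2,
        "megatron_amp_O2": False,
    },
})

run_a = OmegaConf.merge(common, {"model": {"micro_batch_size": 8}})
run_b = OmegaConf.merge(common, {"model": {"micro_batch_size": 4}})
# Data, seed, optimizer, and scheduler settings are all shared between the runs.
```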
Expected behavior
Since the global batch size is the same, we expect the models to train very similarly.
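The reasoning behind that expectation: with the global batch size fixed, micro_batch_size only changes how many gradient-accumulation steps are folded into each optimizer step, and in exact arithmetic the averaged gradient per optimizer step is identical. A minimal sketch of that equivalence in plain PyTorch (a simplification that ignores Megatron's distributed/fused accumulation and runs in fp32):

```python
# Minimal sketch: accumulating micro-batch gradients with different micro
# batch sizes over the same global batch yields (numerically) the same gradient.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 1)
data = torch.randn(32, 16)           # one "global batch" of 32 samples
target = torch.randn(32, 1)
loss_fn = torch.nn.MSELoss()

def accumulated_grad(micro_batch_size):
    model.zero_grad()
    num_micro = data.shape[0] // micro_batch_size
    for i in range(num_micro):
        mb = slice(i * micro_batch_size, (i + 1) * micro_batch_size)
        loss = loss_fn(model(data[mb]), target[mb]) / num_micro
        loss.backward()
    return model.weight.grad.clone()

g8, g4 = accumulated_grad(8), accumulated_grad(4)
print(torch.allclose(g8, g4, atol=1e-6))  # expected: True, up to float rounding
```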
Environment overview (please complete the following information)
NeMo provided via the NVIDIA NeMo 24.07 container image
Environment details
OS: Debian GNU/Linux 12 (bookworm)
PyTorch: 2.3.0
NeMo: 2.0.0
Additional context