
MegatronGPTModel trains much worse when reducing micro_batch_size #11939

Open
m-harmonic opened this issue Jan 23, 2025 · 0 comments
Labels
bug (Something isn't working)

m-harmonic commented Jan 23, 2025

Describe the bug

We did two otherwise identical runs with the same data, global batch size, number of GPUs, and number of nodes; the only difference is micro_batch_size=8 vs micro_batch_size=4, with no other change to configuration or settings. Looking at the loss curves for both the training and validation datasets, the two runs gradually diverge, and the gap becomes significant after 5-10k steps.

Steps/Code to reproduce bug

We are using MegatronGPTSFTModel; other training settings include megatron_amp_O2=false, bf16-mixed precision, and tensor_model_parallel_size=2.
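
For clarity, here is a minimal sketch of how the two runs differ. The key names below follow the usual NeMo Hydra override style and the numeric values are placeholders rather than our exact configuration; only micro_batch_size changes between runs:

```python
# Minimal sketch of the two run configurations (placeholder values; key names
# follow the usual NeMo Hydra overrides, not our exact config).
common = {
    "model.global_batch_size": 128,            # placeholder; identical in both runs
    "model.tensor_model_parallel_size": 2,
    "model.megatron_amp_O2": False,
    "trainer.precision": "bf16-mixed",
}

run_a = {**common, "model.micro_batch_size": 8}
run_b = {**common, "model.micro_batch_size": 4}

# Confirm the only key whose value differs between the two runs is micro_batch_size.
assert {k for k in run_a if run_a[k] != run_b[k]} == {"model.micro_batch_size"}
```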

We couldn't find any existing issues or bug fixes that would explain this. Are there suggested settings to check, or other potential causes of the problem?

Expected behavior

Since the global batch size is the same, we expect the models to train very similarly.
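
As a sanity check on that expectation: with the standard Megatron relation global_batch_size = micro_batch_size × data_parallel_size × accumulation_steps, halving the micro batch size should only double the number of gradient-accumulation micro-steps per optimizer step. The sketch below illustrates this with placeholder sizes (not our exact values):

```python
# Sanity check with placeholder sizes: with global_batch_size fixed, halving
# micro_batch_size doubles the gradient-accumulation steps, leaving the
# effective batch per optimizer step unchanged.
def accumulation_steps(global_batch: int, micro_batch: int, data_parallel_size: int) -> int:
    assert global_batch % (micro_batch * data_parallel_size) == 0
    return global_batch // (micro_batch * data_parallel_size)

data_parallel_size = 4   # placeholder: world_size // tensor_model_parallel_size
global_batch = 128       # placeholder; identical in both runs

for micro_batch in (8, 4):
    steps = accumulation_steps(global_batch, micro_batch, data_parallel_size)
    print(f"micro_batch_size={micro_batch}: {steps} accumulation steps, "
          f"effective batch={micro_batch * data_parallel_size * steps}")
```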

Environment overview

NeMo provided via the NVIDIA NeMo 24.07 container image

Environment details

OS: Debian GNU/Linux 12 (bookworm)
PyTorch: 2.3.0
NeMo: 2.0.0

Additional context

m-harmonic added the bug label on Jan 23, 2025