
MegatronGPTModel trains much worse when reducing micro_batch_size #11939

Open
m-harmonic opened this issue Jan 23, 2025 · 0 comments
Labels
bug (Something isn't working)

m-harmonic commented Jan 23, 2025

Describe the bug

We did two otherwise identical runs with the same data, global batch size, number of GPUs, and number of nodes; the only difference is micro_batch_size=8 vs micro_batch_size=4, with no other change to configuration or settings. Looking at the loss curves for both the training and validation datasets, the two runs gradually diverge, and the gap becomes significant after 5-10k steps.

Steps/Code to reproduce bug

We are using MegatronGPTSFTModel; other training settings include megatron_amp_O2=false, bf16-mixed precision, and tensor_model_parallel_size=2.
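
For clarity, here is a minimal sketch of how the two runs differ. The key names below follow the usual NeMo Hydra override style and the numeric values are placeholders rather than our exact configuration; only micro_batch_size changes between runs:

```python
# Minimal sketch of the two run configurations (placeholder values; key names
# follow the usual NeMo Hydra overrides, not our exact config).
common = {
    "model.global_batch_size": 128,            # placeholder; identical in both runs
    "model.tensor_model_parallel_size": 2,
    "model.megatron_amp_O2": False,
    "trainer.precision": "bf16-mixed",
}

run_a = {**common, "model.micro_batch_size": 8}
run_b = {**common, "model.micro_batch_size": 4}

# Confirm the only key whose value differs between the two runs is micro_batch_size.
assert {k for k in run_a if run_a[k] != run_b[k]} == {"model.micro_batch_size"}
```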

We couldn't find any existing issues or bug fixes that would explain this. Are there suggested settings to check, or other potential causes of the problem?

Expected behavior

Since the global batch size is the same, we expect the models to train very similarly.
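
As a sanity check on that expectation: with the standard Megatron relation global_batch_size = micro_batch_size × data_parallel_size × accumulation_steps, halving the micro batch size should only double the number of gradient-accumulation micro-steps per optimizer step. The sketch below illustrates this with placeholder sizes (not our exact values):

```python
# Sanity check with placeholder sizes: with global_batch_size fixed, halving
# micro_batch_size doubles the gradient-accumulation steps, leaving the
# effective batch per optimizer step unchanged.
def accumulation_steps(global_batch: int, micro_batch: int, data_parallel_size: int) -> int:
    assert global_batch % (micro_batch * data_parallel_size) == 0
    return global_batch // (micro_batch * data_parallel_size)

data_parallel_size = 4   # placeholder: world_size // tensor_model_parallel_size
global_batch = 128       # placeholder; identical in both runs

for micro_batch in (8, 4):
    steps = accumulation_steps(global_batch, micro_batch, data_parallel_size)
    print(f"micro_batch_size={micro_batch}: {steps} accumulation steps, "
          f"effective batch={micro_batch * data_parallel_size * steps}")
```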

Environment overview

NeMo provided via the NVIDIA NeMo 24.07 container image

Environment details

OS: Debian GNU/Linux 12 (bookworm)
PyTorch: 2.3.0
NeMo: 2.0.0

Additional context

m-harmonic added the bug label on Jan 23, 2025