If `WORLD_SIZE` is not exported in `run_pretrain_gpt_fugaku.sh`, `--tensor-model-parallel-size` and `--pipeline-model-parallel-size` have no effect. This also causes problems when restarting from checkpoints, and causes OOM on Fugaku when running validation or using larger batch sizes.
One example is `DeepSpeedFugaku/megatron/arguments.py`, lines 76 to 78 in 9b42cdb.
Megatron takes the minimum of `tensor_model_parallel_size` and `world_size`.
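To illustrate why the flags are silently ignored, here is a minimal sketch of the clamping behaviour described above. It is not the code from `arguments.py`: the function name `clamp_parallel_sizes` is hypothetical, and two details are assumptions not stated in this issue, namely that an unset `WORLD_SIZE` falls back to 1 and that the pipeline size is clamped in a similar way.

```python
import os


def clamp_parallel_sizes(tensor_mp, pipeline_mp, env=None):
    """Hypothetical helper mimicking the clamping described in this issue."""
    env = os.environ if env is None else env
    # Assumption: an unset WORLD_SIZE falls back to 1.
    world_size = int(env.get("WORLD_SIZE", "1"))
    # Stated in the issue: the tensor-parallel size is capped by world_size.
    tensor_mp = min(tensor_mp, world_size)
    # Assumption: the pipeline-parallel size is capped by the remaining ranks.
    pipeline_mp = min(pipeline_mp, world_size // tensor_mp)
    return world_size, tensor_mp, pipeline_mp


# WORLD_SIZE not exported: both requested sizes collapse to 1.
print(clamp_parallel_sizes(4, 2, env={}))                   # (1, 1, 1)
# WORLD_SIZE exported: the requested sizes are kept.
print(clamp_parallel_sizes(4, 2, env={"WORLD_SIZE": "8"}))  # (8, 4, 2)
```

Under these assumptions, a run launched without `WORLD_SIZE` in the environment ends up with no model parallelism at all, which would be consistent with the checkpoint-restart problems and the OOM reported here.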