Export WORLD_SIZE #1

Mittagskogel · 2023-11-14T07:13:27Z

If WORLD_SIZE is not exported in run_pretrain_gpt_fugaku.sh, --tensor-model-parallel-size and --pipeline-model-parallel-size have no effect. This also breaks restarts from checkpoints and causes out-of-memory (OOM) errors on Fugaku when validation is enabled or larger batch sizes are used.

One example is DeepSpeedFugaku/megatron/arguments.py, lines 76 to 78 in 9b42cdb:

    args.pipeline_model_parallel_size = min(
        args.pipeline_model_parallel_size,
        (args.world_size // args.tensor_model_parallel_size))

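Megatron clamps the parallel sizes against world_size: the lines above take the minimum of pipeline_model_parallel_size and world_size // tensor_model_parallel_size. If world_size does not reflect the real number of processes because WORLD_SIZE was never exported, the values passed on the command line are silently reduced.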
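To make the effect concrete, here is a minimal sketch of the clamping pattern. It is not the actual code from arguments.py: the helper `clamp_parallel_sizes`, the `env` parameter, the fallback of 1 when WORLD_SIZE is missing, and the clamping of the tensor size are all assumptions made for illustration. Only the pipeline clamp is taken from the lines quoted above.

```python
import os
from types import SimpleNamespace


def clamp_parallel_sizes(args, env=None):
    """Hypothetical helper that mimics the clamping pattern quoted above."""
    env = os.environ if env is None else env
    # Assumption: the world size comes from WORLD_SIZE and falls back to 1.
    args.world_size = int(env.get("WORLD_SIZE", "1"))
    # Assumption: the tensor-parallel size is clamped to the world size.
    args.tensor_model_parallel_size = min(
        args.tensor_model_parallel_size, args.world_size)
    # Same expression as the quoted lines from arguments.py.
    args.pipeline_model_parallel_size = min(
        args.pipeline_model_parallel_size,
        args.world_size // args.tensor_model_parallel_size)
    return args


def requested():
    # Values a user might pass via --tensor-model-parallel-size 4
    # and --pipeline-model-parallel-size 2.
    return SimpleNamespace(tensor_model_parallel_size=4,
                           pipeline_model_parallel_size=2)


unset = clamp_parallel_sizes(requested(), env={})
exported = clamp_parallel_sizes(requested(), env={"WORLD_SIZE": "8"})

print(unset.tensor_model_parallel_size, unset.pipeline_model_parallel_size)        # 1 1
print(exported.tensor_model_parallel_size, exported.pipeline_model_parallel_size)  # 4 2
```

Under these assumptions, a missing WORLD_SIZE collapses both requested sizes to 1, whereas exporting WORLD_SIZE=8 leaves the requested 4 and 2 intact.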
