Export WORLD_SIZE #1

Open
Mittagskogel opened this issue Nov 14, 2023 · 0 comments


Mittagskogel commented Nov 14, 2023

If WORLD_SIZE is not exported in run_pretrain_gpt_fugaku.sh, --tensor-model-parallel-size and --pipeline-model-parallel-size have no effect. This also causes problems when restarting from checkpoints, and it leads to OOM on Fugaku when validation is enabled or larger batch sizes are used.

One example is

    args.pipeline_model_parallel_size = min(
        args.pipeline_model_parallel_size,
        (args.world_size // args.tensor_model_parallel_size))

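Here Megatron takes the minimum of pipeline_model_parallel_size and world_size // tensor_model_parallel_size; likewise, it takes the minimum of tensor_model_parallel_size and world_size. If world_size falls back to 1 because WORLD_SIZE is not exported, both values are clamped to 1 regardless of the command-line flags.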
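To illustrate the effect, here is a minimal sketch of the clamping. It is not Megatron's actual code path: `clamp_parallel_sizes` is a hypothetical helper, and the fallback of `world_size` to 1 when WORLD_SIZE is unset is an assumption that matches the behavior described in this issue.

```python
import os
from types import SimpleNamespace


def clamp_parallel_sizes(tp, pp, env=None):
    """Hypothetical helper that mimics the clamping quoted in this issue."""
    env = os.environ if env is None else env
    # Assumption: world_size falls back to 1 when WORLD_SIZE is not exported.
    world_size = int(env.get("WORLD_SIZE", "1"))
    args = SimpleNamespace(
        world_size=world_size,
        tensor_model_parallel_size=tp,
        pipeline_model_parallel_size=pp,
    )
    # Tensor parallel size is capped at world_size.
    args.tensor_model_parallel_size = min(
        args.tensor_model_parallel_size, args.world_size)
    # Pipeline parallel size is capped at world_size // tensor parallel size.
    args.pipeline_model_parallel_size = min(
        args.pipeline_model_parallel_size,
        (args.world_size // args.tensor_model_parallel_size))
    return args.tensor_model_parallel_size, args.pipeline_model_parallel_size


# Requested sizes: tensor parallel 2, pipeline parallel 4.
print(clamp_parallel_sizes(2, 4, env={}))                   # (1, 1)
print(clamp_parallel_sizes(2, 4, env={"WORLD_SIZE": "8"}))  # (2, 4)
```

With WORLD_SIZE missing, the requested sizes are silently reduced to 1; with WORLD_SIZE=8 exported, they are kept as requested.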
