[BUG] FSDP requires torch optimizer, not transformer_engine or apex #1322

Open
prrathi opened this issue Dec 15, 2024 · 0 comments
prrathi commented Dec 15, 2024

Describe the bug
Enabling FSDP (--use-torch-fsdp2) doesn't work with transformer_engine.pytorch.optimizers.FusedAdam or apex.optimizers.FusedAdam; it requires torch.optim.AdamW, which isn't the default selected in megatron/core/optimizer/__init__.py.

To Reproduce
Run on 2 A40 GPUs from branch core_r0.10.0:

/usr/local/bin/torchrun --max_restarts 1 --nproc_per_node 2 --nnodes 1 --node_rank 0 \
  --master_addr {} --master_port {} --start_method spawn --rdzv_backend static \
  --rdzv_endpoint {} --rdzv_conf 'distributed_backend=nccl' pretrain_gpt.py \
  --num-layers 16 --hidden-size 2048 --ffn-hidden-size 8192 --num-attention-heads 32 \
  --seq-length 8192 --max-position-embeddings 8192 --swiglu --train-iters 20 --eval-iters 1 \
  --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --context-parallel-size 1 \
  --use-torch-fsdp2 --no-gradient-accumulation-fusion --micro-batch-size 1 --global-batch-size 2 \
  --save-interval 21 --log-interval 1 --log-throughput --logging-level 10 \
  --lr 0.003 --lr-decay-iters 320000 --lr-decay-style cosine --min-lr 1.0e-5 --clip-grad 0.0 \
  --lr-warmup-fraction .01 --weight-decay 0.1 --vocab-size 128256 --bf16 --use-flash-attn \
  --use-mcore-models --untie-embeddings-and-output-weights --position-embedding-type rope \
  --normalization LayerNorm --disable-bias-linear

Expected behavior
The transformer_engine and apex fused optimizers should be disabled (or fall back to torch.optim) when FSDP is enabled, instead of crashing inside the optimizer step.
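
A minimal sketch of the kind of early check I mean, assuming it runs during argument validation (the function name is illustrative, not Megatron's actual code; use_torch_fsdp2 / optimizer correspond to the --use-torch-fsdp2 / --optimizer flags):

def validate_optimizer_choice(args):
    # Hypothetical guard: fail fast with a clear message instead of an illegal
    # memory access inside the fused multi-tensor Adam kernel.
    if getattr(args, "use_torch_fsdp2", False) and args.optimizer == "adam":
        raise ValueError(
            "--use-torch-fsdp2 is incompatible with the transformer_engine/apex "
            "FusedAdam kernels; use the plain torch.optim.AdamW path instead."
        )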

Stack trace/logs

For apex FusedAdam:

[rank0]: Traceback (most recent call last):
[rank0]:   File "Megatron-LM/pretrain_gpt.py", line 284, in <module>
[rank0]:     pretrain(
[rank0]:   File "Megatron-LM/megatron/training/training.py", line 376, in pretrain
[rank0]:     iteration, num_floating_point_operations_so_far = train(
[rank0]:   File "Megatron-LM/megatron/training/training.py", line 1431, in train
[rank0]:     train_step(forward_step_func,
[rank0]:   File "Megatron-LM/megatron/training/training.py", line 775, in train_step
[rank0]:     update_successful, grad_norm, num_zeros_in_grad = optimizer.step()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "Megatron-LM/megatron/core/optimizer/optimizer.py", line 473, in step
[rank0]:     success = self.step_with_ready_grads()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "Megatron-LM/megatron/core/optimizer/optimizer.py", line 430, in step_with_ready_grads
[rank0]:     self.optimizer.step()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 478, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py", line 293, in step
[rank0]:     multi_tensor_applier(self.multi_tensor_adam,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
[rank0]:     return op(self.chunk_size,
[rank0]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Environment (please complete the following information):

  • Megatron-LM commit ID: 25a4125
  • PyTorch version: 2.5.0
  • CUDA version: 12.6
  • NCCL version: 2.22.3

Proposed fix
Vanilla torch.optim.AdamW worked for me, so consider making it the default optimizer when FSDP is enabled.
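
Roughly, the optimizer selection in megatron/core/optimizer/__init__.py could prefer the torch implementation whenever FSDP2 is active. This is a sketch of the idea only; the function and parameter names below are illustrative, not the actual Megatron code:

import torch

def build_adam_optimizer(param_groups, lr, weight_decay, betas, eps, use_torch_fsdp2):
    # Illustrative sketch, not the real code path in megatron/core/optimizer/__init__.py.
    if use_torch_fsdp2:
        # The fused multi-tensor kernels from transformer_engine/apex hit an illegal
        # memory access on FSDP2-sharded parameters, so fall back to plain AdamW.
        return torch.optim.AdamW(param_groups, lr=lr, betas=betas, eps=eps,
                                 weight_decay=weight_decay)
    # Existing behavior: prefer the transformer_engine fused optimizer, then apex.
    try:
        from transformer_engine.pytorch.optimizers import FusedAdam
    except ImportError:
        from apex.optimizers import FusedAdam
    return FusedAdam(param_groups, lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)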

Additional context
N/A

prrathi changed the title from "[BUG]" to "[BUG] FSDP requires torch optimizer, not transformer_engine or apex" on Dec 15, 2024