[BUG] FSDP requires torch optimizer, not transformer_engine or apex #1322

Open
prrathi opened this issue Dec 15, 2024 · 0 comments
prrathi commented Dec 15, 2024

Describe the bug
Enabling FSDP (--use-torch-fsdp2) doesn't work with transformer_engine.pytorch.optimizers.FusedAdam or apex.optimizers.FusedAdam; it requires torch.optim.AdamW, which isn't the default selected in megatron/core/optimizer/__init__.py.

To Reproduce
Run on 2 A40 GPUs from branch core_r0.10.0:

/usr/local/bin/torchrun --max_restarts 1 --nproc_per_node 2 --nnodes 1 --node_rank 0 \
  --master_addr {} --master_port {} --start_method spawn --rdzv_backend static \
  --rdzv_endpoint {} --rdzv_conf 'distributed_backend=nccl' pretrain_gpt.py \
  --num-layers 16 --hidden-size 2048 --ffn-hidden-size 8192 --num-attention-heads 32 \
  --seq-length 8192 --max-position-embeddings 8192 --swiglu --train-iters 20 --eval-iters 1 \
  --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --context-parallel-size 1 \
  --use-torch-fsdp2 --no-gradient-accumulation-fusion --micro-batch-size 1 --global-batch-size 2 \
  --save-interval 21 --log-interval 1 --log-throughput --logging-level 10 \
  --lr 0.003 --lr-decay-iters 320000 --lr-decay-style cosine --min-lr 1.0e-5 --clip-grad 0.0 \
  --lr-warmup-fraction .01 --weight-decay 0.1 --vocab-size 128256 --bf16 --use-flash-attn \
  --use-mcore-models --untie-embeddings-and-output-weights --position-embedding-type rope \
  --normalization LayerNorm --disable-bias-linear

Expected behavior
The transformer_engine and apex fused optimizers should be disabled (or fall back to torch.optim) when FSDP is enabled, instead of crashing inside the optimizer step.
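
A minimal sketch of the kind of early check I mean, assuming it runs during argument validation (the function name is illustrative, not Megatron's actual code; use_torch_fsdp2 / optimizer correspond to the --use-torch-fsdp2 / --optimizer flags):

def validate_optimizer_choice(args):
    # Hypothetical guard: fail fast with a clear message instead of an illegal
    # memory access inside the fused multi-tensor Adam kernel.
    if getattr(args, "use_torch_fsdp2", False) and args.optimizer == "adam":
        raise ValueError(
            "--use-torch-fsdp2 is incompatible with the transformer_engine/apex "
            "FusedAdam kernels; use the plain torch.optim.AdamW path instead."
        )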

Stack trace/logs

For apex FusedAdam:

[rank0]: Traceback (most recent call last):
[rank0]:   File "Megatron-LM/pretrain_gpt.py", line 284, in <module>
[rank0]:     pretrain(
[rank0]:   File "Megatron-LM/megatron/training/training.py", line 376, in pretrain
[rank0]:     iteration, num_floating_point_operations_so_far = train(
[rank0]:   File "Megatron-LM/megatron/training/training.py", line 1431, in train
[rank0]:     train_step(forward_step_func,
[rank0]:   File "Megatron-LM/megatron/training/training.py", line 775, in train_step
[rank0]:     update_successful, grad_norm, num_zeros_in_grad = optimizer.step()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "Megatron-LM/megatron/core/optimizer/optimizer.py", line 473, in step
[rank0]:     success = self.step_with_ready_grads()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "Megatron-LM/megatron/core/optimizer/optimizer.py", line 430, in step_with_ready_grads
[rank0]:     self.optimizer.step()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 478, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py", line 293, in step
[rank0]:     multi_tensor_applier(self.multi_tensor_adam,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
[rank0]:     return op(self.chunk_size,
[rank0]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Environment (please complete the following information):

  • Megatron-LM commit ID: 25a4125
  • PyTorch version: 2.5.0
  • CUDA version: 12.6
  • NCCL version: 2.22.3

Proposed fix
Vanilla torch.optim.AdamW worked for me, so consider making it the default optimizer when FSDP is enabled.
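
Roughly, the optimizer selection in megatron/core/optimizer/__init__.py could prefer the torch implementation whenever FSDP2 is active. This is a sketch of the idea only; the function and parameter names below are illustrative, not the actual Megatron code:

import torch

def build_adam_optimizer(param_groups, lr, weight_decay, betas, eps, use_torch_fsdp2):
    # Illustrative sketch, not the real code path in megatron/core/optimizer/__init__.py.
    if use_torch_fsdp2:
        # The fused multi-tensor kernels from transformer_engine/apex hit an illegal
        # memory access on FSDP2-sharded parameters, so fall back to plain AdamW.
        return torch.optim.AdamW(param_groups, lr=lr, betas=betas, eps=eps,
                                 weight_decay=weight_decay)
    # Existing behavior: prefer the transformer_engine fused optimizer, then apex.
    try:
        from transformer_engine.pytorch.optimizers import FusedAdam
    except ImportError:
        from apex.optimizers import FusedAdam
    return FusedAdam(param_groups, lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)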

Additional context
N/A

prrathi changed the title from "[BUG]" to "[BUG] FSDP requires torch optimizer, not transformer_engine or apex" on Dec 15, 2024