Error in saving nemo checkpoint with Llama-3.1-70B SFT. /opt/NeMo/nemo/utils/callbacks/nemo_model_checkpoint.py #12157

songwang41 commented Feb 12, 2025

Describe the bug

Traceback (most recent call last):
  File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 173, in main
    sft_trainer.fit()
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 249, in fit
    self.save(metrics, is_train_end=is_train_end)
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 269, in save
    self.ckpt_callback.custom_save(monitor_candidates=monitor_candidates, is_train_end=is_train_end)
  File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 73, in custom_save_ckpt_func
    super(NeMoModelCheckpoint, self)._save_last_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 696, in _save_last_checkpoint
    self._save_checkpoint(trainer, filepath)
  File "/opt/NeMo/nemo/utils/callbacks/nemo_model_checkpoint.py", line 544, in _save_checkpoint
    trainer.save_checkpoint(filepath, self.save_weights_only, storage_options=storage_options)
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 1364, in save_checkpoint
    checkpoint = self._checkpoint_connector.dump_checkpoint(weights_only)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 447, in dump_checkpoint
    optimizer_state = trainer.strategy.optimizer_state(optimizer)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/strategy.py", line 189, in optimizer_state
    return optimizer.state_dict()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 2839, in state_dict
    state_dict = self._state_dict_v2()
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 3247, in _state_dict_v2
    start_all_gather(bucket_id, shard)
  File "/usr/local/lib/python3.12/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 3199, in start_all_gather
    all_gather_into_tensor(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3721, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor, opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:328, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Failed to CUDA calloc async 4 bytes
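
For context, the train-end save goes NeMoModelCheckpoint._save_checkpoint -> trainer.save_checkpoint -> DistributedFusedAdam.state_dict(), and it is the all-gather of the optimizer shards that dies with "Failed to CUDA calloc async 4 bytes", which usually points at the GPUs running out of memory during the gather. A minimal diagnostic rerun (NCCL_DEBUG=INFO is what the error message itself asks for; the nvidia-smi logging line and the output file name are just my own suggestion for watching per-GPU memory while the checkpoint is written):

export NCCL_DEBUG=INFO
# Log per-GPU memory every 5 s so any usage spike during the checkpoint save is visible.
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total --format=csv -l 5 \
    > gpu_mem_rank${RANK}.csv &
# ...then launch the same torchrun command from the reproduction script below.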

Steps/Code to reproduce bug

RANK=$1 #0,1,2,3
MASTER_ADDR=<IP>
MASTER_PORT=29500
WORLD_SIZE=4
GPU_NUM=8

max_seq_length=4096
Checkpoint_PATH=
TRAINING_PATH=train.jsonl
VALID_PATH=val_10k.jsonl 
OUTPUT_PATH=./test_runs_2502_70b/output3_val_${max_seq_length} 

cd /opt/NeMo-Aligner
torchrun  \
--nnodes  $WORLD_SIZE \
--nproc_per_node=$GPU_NUM   \
--node_rank $RANK   \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT    \
/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py    \
trainer.precision=bf16    \
trainer.num_nodes=$WORLD_SIZE    \
trainer.devices=$GPU_NUM    \
trainer.sft.max_steps=-1   \
trainer.sft.max_epochs=2 \
trainer.sft.val_check_interval=4000000    \
trainer.sft.limit_val_batches=20 \
model.megatron_amp_O2=True    \
model.restore_from_path=$Checkpoint_PATH   \
model.tensor_model_parallel_size=8    \
model.pipeline_model_parallel_size=$WORLD_SIZE    \
model.sequence_parallel=False    \
model.encoder_seq_length=${max_seq_length}    \
model.optim.lr=2e-6    \
model.answer_only_loss=True    \
model.data.num_workers=0    \
model.data.train_ds.micro_batch_size=1    \
model.data.train_ds.global_batch_size=128    \
model.data.train_ds.file_path=$TRAINING_PATH    \
model.data.validation_ds.micro_batch_size=1    \
model.data.validation_ds.global_batch_size=128    \
model.data.validation_ds.file_path=${VALID_PATH}    \
exp_manager.create_wandb_logger=False    \
exp_manager.explicit_log_dir=${OUTPUT_PATH}    \
exp_manager.wandb_logger_kwargs.project=sft_run_instruct_data    \
exp_manager.wandb_logger_kwargs.name=sft_run_instruct_data    \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True    \
++exp_manager.checkpoint_callback_params.always_save_nemo=True \
exp_manager.resume_if_exists=True    \
exp_manager.resume_ignore_no_checkpoint=True    \
exp_manager.create_checkpoint_callback=True    \
exp_manager.checkpoint_callback_params.monitor=validation_loss 2>&1 | tee $OUTPUT_PATH/log1.txt
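
One experiment that may help isolate the failure (an assumption on my side, not a confirmed fix): the traceback shows self.save_weights_only being forwarded to trainer.save_checkpoint, and with weights_only=True Lightning's dump_checkpoint skips collecting optimizer state, i.e. it avoids the DistributedFusedAdam all-gather that crashes. If the exp_manager config accepts this key, the run above can be repeated with the extra override below; the resulting checkpoint cannot resume the optimizer, so this only serves to confirm that gathering the optimizer state is what exhausts GPU memory.

# Same torchrun invocation as above, with one additional (hypothetical) override
# added alongside the other exp_manager.checkpoint_callback_params overrides:
++exp_manager.checkpoint_callback_params.save_weights_only=True \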

Expected behavior

The checkpoint is saved successfully at the end of SFT training instead of failing with an NCCL/CUDA error.
