Error in saving nemo checkpoint with Llama-3.1-70B SFT. /opt/NeMo/nemo/utils/callbacks/nemo_model_checkpoint.py #12157

songwang41 commented Feb 12, 2025

Describe the bug

Traceback (most recent call last):
  File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 173, in main
    sft_trainer.fit()
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 249, in fit
    self.save(metrics, is_train_end=is_train_end)
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 269, in save
    self.ckpt_callback.custom_save(monitor_candidates=monitor_candidates, is_train_end=is_train_end)
  File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 73, in custom_save_ckpt_func
    super(NeMoModelCheckpoint, self)._save_last_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 696, in _save_last_checkpoint
    self._save_checkpoint(trainer, filepath)
  File "/opt/NeMo/nemo/utils/callbacks/nemo_model_checkpoint.py", line 544, in _save_checkpoint
    trainer.save_checkpoint(filepath, self.save_weights_only, storage_options=storage_options)
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 1364, in save_checkpoint
    checkpoint = self._checkpoint_connector.dump_checkpoint(weights_only)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 447, in dump_checkpoint
    optimizer_state = trainer.strategy.optimizer_state(optimizer)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/strategy.py", line 189, in optimizer_state
    return optimizer.state_dict()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 2839, in state_dict
    state_dict = self._state_dict_v2()
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 3247, in _state_dict_v2
    start_all_gather(bucket_id, shard)
  File "/usr/local/lib/python3.12/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 3199, in start_all_gather
    all_gather_into_tensor(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3721, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor, opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:328, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Failed to CUDA calloc async 4 bytes
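
For context, the train-end save goes NeMoModelCheckpoint._save_checkpoint -> trainer.save_checkpoint -> DistributedFusedAdam.state_dict(), and it is the all-gather of the optimizer shards that dies with "Failed to CUDA calloc async 4 bytes", which usually points at the GPUs running out of memory during the gather. A minimal diagnostic rerun (NCCL_DEBUG=INFO is what the error message itself asks for; the nvidia-smi logging line and the output file name are just my own suggestion for watching per-GPU memory while the checkpoint is written):

export NCCL_DEBUG=INFO
# Log per-GPU memory every 5 s so any usage spike during the checkpoint save is visible.
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total --format=csv -l 5 \
    > gpu_mem_rank${RANK}.csv &
# ...then launch the same torchrun command from the reproduction script below.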

Steps/Code to reproduce bug

RANK=$1 #0,1,2,3
MASTER_ADDR=<IP>
MASTER_PORT=29500
WORLD_SIZE=4
GPU_NUM=8

max_seq_length=4096
Checkpoint_PATH=
TRAINING_PATH=train.jsonl
VALID_PATH=val_10k.jsonl 
OUTPUT_PATH=./test_runs_2502_70b/output3_val_${max_seq_length} 

cd /opt/NeMo-Aligner
torchrun  \
--nnodes  $WORLD_SIZE \
--nproc_per_node=$GPU_NUM   \
--node_rank $RANK   \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT    \
/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py    \
trainer.precision=bf16    \
trainer.num_nodes=$WORLD_SIZE    \
trainer.devices=$GPU_NUM    \
trainer.sft.max_steps=-1   \
trainer.sft.max_epochs=2 \
trainer.sft.val_check_interval=4000000    \
trainer.sft.limit_val_batches=20 \
model.megatron_amp_O2=True    \
model.restore_from_path=$Checkpoint_PATH   \
model.tensor_model_parallel_size=8    \
model.pipeline_model_parallel_size=$WORLD_SIZE    \
model.sequence_parallel=False    \
model.encoder_seq_length=${max_seq_length}    \
model.optim.lr=2e-6    \
model.answer_only_loss=True    \
model.data.num_workers=0    \
model.data.train_ds.micro_batch_size=1    \
model.data.train_ds.global_batch_size=128    \
model.data.train_ds.file_path=$TRAINING_PATH    \
model.data.validation_ds.micro_batch_size=1    \
model.data.validation_ds.global_batch_size=128    \
model.data.validation_ds.file_path=${VALID_PATH}    \
exp_manager.create_wandb_logger=False    \
exp_manager.explicit_log_dir=${OUTPUT_PATH}    \
exp_manager.wandb_logger_kwargs.project=sft_run_instruct_data    \
exp_manager.wandb_logger_kwargs.name=sft_run_instruct_data    \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True    \
++exp_manager.checkpoint_callback_params.always_save_nemo=True \
exp_manager.resume_if_exists=True    \
exp_manager.resume_ignore_no_checkpoint=True    \
exp_manager.create_checkpoint_callback=True    \
exp_manager.checkpoint_callback_params.monitor=validation_loss 2>&1 | tee $OUTPUT_PATH/log1.txt
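
One experiment that may help isolate the failure (an assumption on my side, not a confirmed fix): the traceback shows self.save_weights_only being forwarded to trainer.save_checkpoint, and with weights_only=True Lightning's dump_checkpoint skips collecting optimizer state, i.e. it avoids the DistributedFusedAdam all-gather that crashes. If the exp_manager config accepts this key, the run above can be repeated with the extra override below; the resulting checkpoint cannot resume the optimizer, so this only serves to confirm that gathering the optimizer state is what exhausts GPU memory.

# Same torchrun invocation as above, with one additional (hypothetical) override
# added alongside the other exp_manager.checkpoint_callback_params overrides:
++exp_manager.checkpoint_callback_params.save_weights_only=True \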

Expected behavior

The checkpoint is saved successfully at the end of SFT training instead of failing with an NCCL/CUDA error.
