```
Traceback (most recent call last):
  File "/opt/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 173, in main
    sft_trainer.fit()
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 249, in fit
    self.save(metrics, is_train_end=is_train_end)
  File "/opt/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 269, in save
    self.ckpt_callback.custom_save(monitor_candidates=monitor_candidates, is_train_end=is_train_end)
  File "/opt/NeMo-Aligner/nemo_aligner/utils/utils.py", line 73, in custom_save_ckpt_func
    super(NeMoModelCheckpoint, self)._save_last_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 696, in _save_last_checkpoint
    self._save_checkpoint(trainer, filepath)
  File "/opt/NeMo/nemo/utils/callbacks/nemo_model_checkpoint.py", line 544, in _save_checkpoint
    trainer.save_checkpoint(filepath, self.save_weights_only, storage_options=storage_options)
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 1364, in save_checkpoint
    checkpoint = self._checkpoint_connector.dump_checkpoint(weights_only)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 447, in dump_checkpoint
    optimizer_state = trainer.strategy.optimizer_state(optimizer)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/strategy.py", line 189, in optimizer_state
    return optimizer.state_dict()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 2839, in state_dict
    state_dict = self._state_dict_v2()
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 3247, in _state_dict_v2
    start_all_gather(bucket_id, shard)
  File "/usr/local/lib/python3.12/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 3199, in start_all_gather
    all_gather_into_tensor(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3721, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor, opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:328, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.25.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Failed to CUDA calloc async 4 bytes
```
**Describe the bug**
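The frames above show the end-of-training checkpoint save all-gathering the distributed optimizer state (apex `DistributedFusedAdam._state_dict_v2`), at which point NCCL cannot complete even a 4-byte asynchronous CUDA allocation, suggesting the GPUs are out of memory (or a prior asynchronous CUDA error is surfacing) while the checkpoint is written. The error text itself recommends re-running with `NCCL_DEBUG=INFO`; a minimal sketch of enabling that logging from inside the training script, assuming it runs before any `torch.distributed`/NCCL initialization (the actual launch command is not shown in this report):

```python
# Minimal sketch: enable the NCCL logging suggested by the error message.
# NCCL_DEBUG is a standard NCCL environment variable; it must be set on every
# rank before the NCCL communicator is created (e.g. at the very top of
# train_gpt_sft.py, or exported in the launch environment instead).
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")
```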
**Steps/Code to reproduce bug**

**Expected behavior**