
[HELP] Ran into the NaN grad problem while going through the example in the official document with fp16 #12134

Open
twotwoiscute opened this issue Feb 11, 2025 · 1 comment
Labels
bug Something isn't working

Comments


twotwoiscute commented Feb 11, 2025

Describe the bug

I tried to go through the example provided by the official document. Since the V100 does not support bfloat16, I changed trainer.precision to 16-mixed. At the very first iteration, I ran into a gradient NaN issue.
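For context, here is a minimal sketch (plain PyTorch, outside of NeMo) of the dynamic-range difference I suspect is involved: fp16 tops out around 65504 and overflows to inf for values that bf16, which keeps the fp32 exponent range, still represents, and the resulting inf/NaN then propagates into the gradients.

import torch

# fp16 has a much smaller representable range than bf16
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38, same exponent range as fp32

# a value that stays finite in bf16 overflows to inf in fp16
x = torch.tensor([70000.0])
print(x.to(torch.bfloat16))   # finite bf16 value
print(x.to(torch.float16))    # inf -> later ops turn this into NaN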

Steps/Code to reproduce bug

  1. Follow the steps shown in the official document to format the data.
  2. Change trainer.precision to 16-mixed as shown below:
python examples/nlp/gpt/train_gpt_sft.py \
   trainer.precision=16-mixed \
   trainer.num_nodes=1 \
   trainer.devices=8 \
   trainer.sft.max_steps=-1 \
   trainer.sft.limit_val_batches=40 \
   trainer.sft.val_check_interval=1000 \
   model.megatron_amp_O2=True \
   model.restore_from_path=/path/to/your/mcore_gpt.nemo \
   model.optim.lr=5e-6 \
   model.answer_only_loss=True \
   model.data.num_workers=0 \
   model.data.train_ds.micro_batch_size=1 \
   model.data.train_ds.global_batch_size=128 \
   model.data.train_ds.file_path=/path/to/databricks-dolly-15k-output.jsonl \
   model.data.validation_ds.micro_batch_size=1 \
   model.data.validation_ds.global_batch_size=128 \
   model.data.validation_ds.file_path=/path/to/databricks-dolly-15k-output.jsonl \
   exp_manager.create_wandb_logger=True \
   exp_manager.explicit_log_dir=/results \
   exp_manager.wandb_logger_kwargs.project=sft_run \
   exp_manager.wandb_logger_kwargs.name=dolly_sft_run \
   exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
   exp_manager.resume_if_exists=True \
   exp_manager.resume_ignore_no_checkpoint=True \
   exp_manager.create_checkpoint_callback=True \
   exp_manager.checkpoint_callback_params.monitor=val_loss
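For comparison, the behaviour I would expect from plain PyTorch fp16 mixed precision is that the dynamic loss scaler catches inf/NaN in the scaled gradients, skips that optimizer step, and lowers the scale, rather than letting NaN reach the weights. A minimal standalone sketch (toy model, not the NeMo/Megatron training loop):

import torch

model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for fp16

x = torch.randn(8, 16, device="cuda")
for step in range(3):
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.float16):
        loss = model(x).square().mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # the step is skipped if scaled grads contain inf/NaN
    scaler.update()          # the scale is reduced after a skipped step
    print(step, scaler.get_scale())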

Environment overview (please complete the following information)
The official Docker image is used: nvcr.io/nvidia/nemo:24.12

What should I do to solve this problem? Thanks.

twotwoiscute added the bug label Feb 11, 2025
twotwoiscute (Author) commented

A similar issue is reported here.
