
[HELP] Ran into the NaN grad problem while going through the example in the official document with fp16 #12134

Open
twotwoiscute opened this issue Feb 11, 2025 · 1 comment
Labels
bug Something isn't working

Comments


twotwoiscute commented Feb 11, 2025

Describe the bug

I tried to go through the example provided by the official document. Since the V100 does not support bfloat16, I changed trainer.precision to 16-mixed. At the very first iteration, I ran into a gradient NaN issue.
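For context, here is a minimal sketch (plain PyTorch, outside of NeMo) of the dynamic-range difference I suspect is involved: fp16 tops out around 65504 and overflows to inf for values that bf16, which keeps the fp32 exponent range, still represents, and the resulting inf/NaN then propagates into the gradients.

import torch

# fp16 has a much smaller representable range than bf16
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38, same exponent range as fp32

# a value that stays finite in bf16 overflows to inf in fp16
x = torch.tensor([70000.0])
print(x.to(torch.bfloat16))   # finite bf16 value
print(x.to(torch.float16))    # inf -> later ops turn this into NaN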

Steps/Code to reproduce bug

  1. Follow the steps shown in the official document to format the data.
  2. Change trainer.precision to 16-mixed as shown below:
python examples/nlp/gpt/train_gpt_sft.py \
   trainer.precision=16-mixed \
   trainer.num_nodes=1 \
   trainer.devices=8 \
   trainer.sft.max_steps=-1 \
   trainer.sft.limit_val_batches=40 \
   trainer.sft.val_check_interval=1000 \
   model.megatron_amp_O2=True \
   model.restore_from_path=/path/to/your/mcore_gpt.nemo \
   model.optim.lr=5e-6 \
   model.answer_only_loss=True \
   model.data.num_workers=0 \
   model.data.train_ds.micro_batch_size=1 \
   model.data.train_ds.global_batch_size=128 \
   model.data.train_ds.file_path=/path/to/databricks-dolly-15k-output.jsonl \
   model.data.validation_ds.micro_batch_size=1 \
   model.data.validation_ds.global_batch_size=128 \
   model.data.validation_ds.file_path=/path/to/databricks-dolly-15k-output.jsonl \
   exp_manager.create_wandb_logger=True \
   exp_manager.explicit_log_dir=/results \
   exp_manager.wandb_logger_kwargs.project=sft_run \
   exp_manager.wandb_logger_kwargs.name=dolly_sft_run \
   exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
   exp_manager.resume_if_exists=True \
   exp_manager.resume_ignore_no_checkpoint=True \
   exp_manager.create_checkpoint_callback=True \
   exp_manager.checkpoint_callback_params.monitor=val_loss
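For comparison, the behaviour I would expect from plain PyTorch fp16 mixed precision is that the dynamic loss scaler catches inf/NaN in the scaled gradients, skips that optimizer step, and lowers the scale, rather than letting NaN reach the weights. A minimal standalone sketch (toy model, not the NeMo/Megatron training loop):

import torch

model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for fp16

x = torch.randn(8, 16, device="cuda")
for step in range(3):
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.float16):
        loss = model(x).square().mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # the step is skipped if scaled grads contain inf/NaN
    scaler.update()          # the scale is reduced after a skipped step
    print(step, scaler.get_scale())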

Environment overview (please complete the following information)
The official Docker image is used: nvcr.io/nvidia/nemo:24.12

What should I do to solve this problem? Thanks.

twotwoiscute added the bug label Feb 11, 2025
twotwoiscute (Author) commented

A similar issue is reported here.
