Description
I wrote a finetuning script based on the nemo2-sft.ipynb provided by this repository, running on 8 × V100 (32GB). The original script uses bf16 precision, which the V100 does not support, so it eventually falls back to fp32. When I switched the precision to fp16, the loss failed to converge and the gradient norm reported on wandb was extremely small (on the order of 1e-9). In contrast, the default setting (precision=bf16, which falls back to fp32) took much longer to train, but the loss decreased and the gradient norm was significantly larger.
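For context, the precision-related part of my script looks roughly like the sketch below (simplified from the notebook; it assumes the nemo.lightning API, and the model/data/optimizer setup and exact argument values are omitted or illustrative):

```python
import nemo.lightning as nl

# Precision plugin as in the notebook. "bf16-mixed" is the default; on V100 the
# bf16 path effectively runs in fp32, since the hardware has no bf16 support.
# Switching to "16-mixed" is what produces the tiny grad_norm described above.
precision = nl.MegatronMixedPrecision(precision="bf16-mixed")  # or precision="16-mixed"

trainer = nl.Trainer(
    devices=8,
    accelerator="gpu",
    strategy=nl.MegatronStrategy(),  # parallelism settings omitted for brevity
    plugins=precision,
)
```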
Environment
Docker image: nvcr.io/nvidia/nemo:24.12
PyTorch: 2.5.0a0+e000cf0ad9.nv24.10
Driver version: 535.183.01
CUDA toolkit (nvcc): release 12.6, V12.6.77 (build cuda_12.6.r12.6/compiler.34841621_0)
Report from wandb
grad_norm for fp16 and bf16: the grad_norm for fp16 is nearly zero.
Loss for fp16 and bf16: the loss with bf16 decreases steadily and ends up much lower than with fp16, while the loss with fp16 shows no sign that the model is learning.
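For what it's worth, the near-zero grad_norm looks like classic fp16 underflow rather than the model genuinely having flat gradients. A standalone PyTorch toy example, unrelated to NeMo, purely to illustrate the mechanism:

```python
import torch

# Values below ~6e-8 (the smallest fp16 subnormal) flush to zero when cast to
# half precision. This is exactly what loss scaling is meant to prevent.
g = torch.tensor(1e-8)
print(g.half())                  # tensor(0., dtype=torch.float16) -> underflow

scale = 2.0 ** 16
scaled = (g * scale).half()      # ~6.55e-4, comfortably representable in fp16
print(scaled.float() / scale)    # ~1e-8 again after unscaling in fp32
```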
Question
I believe that mixed-precision techniques such as PyTorch's Automatic Mixed Precision (AMP) or NVIDIA's Apex apply loss scaling to prevent underflow in FP16 gradients. However, it appears that no such scaling takes place when I train with fp16 here. What is the recommended workaround? Thanks.
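If it helps, the behaviour I expected is the usual dynamic loss scaling that Megatron applies in fp16 mode. Below is a hypothetical workaround sketch, assuming NeMo 2.0 forwards the fp16/loss-scale fields of Megatron-Core's OptimizerConfig to its grad scaler; I have not verified that this is how the notebook's recipe wires the optimizer, and the field values are illustrative:

```python
from megatron.core.optimizer import OptimizerConfig
import nemo.lightning as nl

# Explicitly request fp16 with dynamic loss scaling on the Megatron optimizer:
# the loss is multiplied by the scale before backward (so small gradients stay
# above the fp16 underflow threshold) and gradients are unscaled before the step.
opt_config = OptimizerConfig(
    optimizer="adam",
    lr=5e-6,
    bf16=False,
    fp16=True,
    loss_scale=None,          # None -> dynamic loss scaling instead of a fixed scale
    initial_loss_scale=2**32,
    loss_scale_window=1000,
    min_loss_scale=1.0,
    hysteresis=2,
)

optim = nl.MegatronOptimizerModule(config=opt_config)
# optim would then replace the recipe's default optimizer when launching finetuning.
```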