Hello there. I've noticed something weird during my pre-training process.
Setup:
TransformerEngine 1.3.0 + Megatron-LM
TP=1 PP=1, only using DP
Structure: LLaMA-like LayerNormMLP (swiglu activation and RMSNorm)
When I train with one host with 8 H100s, the training grad norm looks okay, shrinking at the expected pace.
However, when I add more hosts (each with 8 H100s), the grad norm starts to blow up: it stays normal (~3.4) over the first 50 steps, starts growing during steps 50-100 and 100-200, and finally blows up to over 1000.
I've tried setting fp8_wgrad=True and increasing the DelayedScaling amax_history_len from 1024 to 2048, which somewhat mitigated the issue (a sketch of the change is below), but my questions are:
Is grad norm explosion in multi-host training a known issue that requires expanding amax_history and setting fp8_wgrad to True?
Should I expect any performance degradation from setting fp8_wgrad=True and amax_history_len to 2048?
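For reference, the TE side of the change looks roughly like the sketch below (a minimal standalone example, not my actual Megatron-LM integration; the hidden/FFN sizes are placeholders, and fp8_wgrad is toggled in the Megatron-LM config rather than in the recipe):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# FP8 recipe with a longer amax history (2048 instead of the default 1024)
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,   # E4M3 for forward, E5M2 for backward
    amax_history_len=2048,
    amax_compute_algo="max",
)

# LLaMA-like MLP block: RMSNorm + swiglu fused in TE's LayerNormMLP
# (hidden/FFN sizes here are placeholders, not my real model config)
mlp = te.LayerNormMLP(
    hidden_size=4096,
    ffn_hidden_size=11008,
    normalization="RMSNorm",
    activation="swiglu",
    bias=False,
    params_dtype=torch.bfloat16,
).cuda()

x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16, requires_grad=True)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = mlp(x)
y.sum().backward()
```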
Thanks a lot!
tylaar changed the title to "[Pytorch] LayerNormMLP seems to be causing grad norm explosion under multi-node" on Mar 10, 2024
Hi @tylaar. No, that is definitely not expected, and you should not have to change those parameter values (fp8_wgrad should actually be True by default). Could you provide us with information on how to reproduce the issue you are seeing?
Hi @ptrendx, sorry for taking so long to reply in this thread. I found that using the TE implementation for all components (such as the swiglu activation and RMSNorm), together with fp8_wgrad=True and fp8_amax_history_len=1024, resolved my production issue.
One takeaway from my side: even though there are some bitwise-alignment issues like the ones mentioned in #717, the TE implementation of swiglu and RMSNorm is still more reliable and less problematic than our in-house implementation.
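For anyone who wants to spot-check this themselves, a quick comparison of TE's RMSNorm against a plain PyTorch reference is sketched below (my own ad-hoc check, not something from TE's test suite; the eps value and the focus on max absolute difference are assumptions):

```python
import torch
import transformer_engine.pytorch as te

torch.manual_seed(0)
hidden = 4096
x = torch.randn(16, hidden, device="cuda", dtype=torch.float32)

# Plain-PyTorch reference RMSNorm: x / sqrt(mean(x^2) + eps) * gamma
def rmsnorm_ref(x, weight, eps=1e-5):
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight

te_norm = te.RMSNorm(hidden, eps=1e-5).cuda()
ref_out = rmsnorm_ref(x, te_norm.weight)
te_out = te_norm(x)

# The outputs are not guaranteed to match bitwise (see #717);
# what matters in practice is that the difference stays tiny.
print("max abs diff:", (te_out - ref_out).abs().max().item())
print("bitwise equal:", torch.equal(te_out, ref_out))
```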
I will close this now, and if I run into anything else I will file another issue. Thanks a lot!