Add layer norm weight plus 1 #378
base: main
Conversation
Hi @tjruwase, please kindly review. This PR fixes the layernorm accuracy issue on non-CUDA accelerators. Thanks!
```diff
@@ -6,7 +6,7 @@
     from apex.normalization import MixedFusedRMSNorm as RMSNorm
 else:
     from .rmsnorm import RMSNorm
-    from torch.nn import LayerNorm
+    from .layer_norm_p1 import LayerNorm1P as LayerNorm
```
Rather than replacing `torch.nn.LayerNorm` and attempting to maintain the logic inside `LayerNorm1P`, I think a better and more maintainable approach is to instantiate `self.input_layernorm` (and others) as either `torch.nn.LayerNorm` or `LayerNorm1P` based on `args.apply_layernorm_1p`. This would simplify `LayerNorm1P` so it no longer has to handle the `apply_layernorm_1p=False` case.
Specifically, I am curious whether something like the following could work. Thanks!
```python
actualLayerNorm = LayerNorm1P if args.apply_layernorm_1p else torch.nn.LayerNorm
...
self.input_layernorm = actualLayerNorm(...
...
```
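For completeness, a self-contained sketch of that selection pattern, assuming `LayerNorm1P` accepts the same constructor arguments as `torch.nn.LayerNorm`; the helper name and its signature below are illustrative, not code from this PR:

```python
import torch
from .layer_norm_p1 import LayerNorm1P  # module added by this PR


def build_layernorm(args, hidden_size, eps=1e-5):
    # Choose the implementation once, based on the flag, so LayerNorm1P
    # never needs to handle the apply_layernorm_1p=False case itself.
    layer_norm_cls = LayerNorm1P if args.apply_layernorm_1p else torch.nn.LayerNorm
    return layer_norm_cls(hidden_size, eps=eps)


# e.g. inside the transformer layer's __init__ (argument names assumed):
# self.input_layernorm = build_layernorm(args, args.hidden_size, args.layernorm_epsilon)
```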
Hi @tjruwase, that approach could work. We would need to replace all uses of `LayerNorm` with `actualLayerNorm` if we adopt this logic; we can prepare another PR for it.
This PR follows the CUDA logic in adding `LayerNorm1P`:
```python
if get_accelerator().device_name() == 'cuda':
    from .fused_layer_norm import MixedFusedLayerNorm as LayerNorm
```
`LayerNorm1P` returns the vanilla layernorm when `apply_layernorm_1p=False` (the default path).
This PR aims to apply the `apply_layernorm_1p` flag: when it is set to `True`, the layernorm should compute with `layernorm.weight + 1`.
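For reference, here is a minimal sketch of what a weight-plus-one layernorm could look like; it keeps the flag so the default path falls back to vanilla layernorm, but the constructor signature and the zero-initialization of the weight are assumptions rather than the exact code in this PR:

```python
import torch
import torch.nn.functional as F


class LayerNorm1P(torch.nn.LayerNorm):
    """LayerNorm that stores its scale as an offset from 1 (sketch, not the PR's exact code).

    With apply_layernorm_1p=True the weight is initialized to zero and
    (weight + 1) is used in the forward pass, so the effective scale
    starts at 1. With apply_layernorm_1p=False it behaves exactly like
    torch.nn.LayerNorm (the default path described above).
    """

    def __init__(self, normalized_shape, eps=1e-5, apply_layernorm_1p=False):
        super().__init__(normalized_shape, eps=eps)
        self.apply_layernorm_1p = apply_layernorm_1p
        if self.apply_layernorm_1p:
            torch.nn.init.zeros_(self.weight)  # effective scale = weight + 1 = 1 at init

    def forward(self, x):
        weight = self.weight + 1 if self.apply_layernorm_1p else self.weight
        return F.layer_norm(x, self.normalized_shape, weight, self.bias, self.eps)
```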