[ENHANCEMENT] Add support for Apex RMSNorm for use in qk-norm #1261
Open
wdevazelhes wants to merge 1 commit into NVIDIA:main from wdevazelhes:qkln
wdevazelhes: I saw that in a few other places in the code (here, here and here), TENorm is still used for qk-normalization, even though, according to the comment above and this commit, using TENorm for qk-layernorm is unstable. Let me know if I should also modify these other places 👍
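For context, here is a minimal plain-PyTorch sketch of what qk-norm does: an RMSNorm is applied to the per-head query and key tensors before attention scores are computed. The `head_dim` value and tensor shapes below are illustrative only; the actual Megatron-LM wiring goes through its layer spec (TENorm today, Apex's RMSNorm with this PR) rather than this standalone module.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Plain RMSNorm: x * rsqrt(mean(x^2) + eps), with a learned scale."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize over the last (head) dimension.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


head_dim = 64  # illustrative value, not taken from the PR
q_norm = RMSNorm(head_dim)
k_norm = RMSNorm(head_dim)

# q, k: [batch, heads, seq, head_dim]
q = torch.randn(2, 8, 16, head_dim)
k = torch.randn(2, 8, 16, head_dim)
q, k = q_norm(q), k_norm(k)  # qk-norm is applied before computing attention scores
```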
SeunghyunSEO: In my case, I applied exactly the same patch to my own Megatron fork, so the changes in this PR look good to me. But I think we should clarify why this happens: as you said, some people still use TENorm for qk-norm and their models converge.
wdevazelhes: @SeunghyunSEO thanks!
Regarding the clarification on why this is happening, do you mean that we should check why the TE implementation diverges? (I didn't try it myself; I just assumed it does, based on your PR and on the comment in this commit.)
SeunghyunSEO: I mean that when an additional feature is added, we should at least know whether it is necessary. Do any Megatron or TE maintainers know why TENorm for qk-norm sometimes diverges? I'm cc'ing @deepakn94 because he is the only maintainer I've communicated with (sorry for the possibly wrong tag; please loop in whoever is the expert on numerical-precision issues).
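As one illustration of the kind of numerical-precision question being raised (an assumption for discussion, not a claim about what TENorm actually does internally), the snippet below compares an RMSNorm scale factor computed with the mean-square reduction upcast to fp32 against one kept entirely in bf16. Whether such a difference explains the reported divergence would still need to be confirmed by the maintainers.

```python
import torch

# Hypothetical illustration only: compare the RMSNorm scale factor when the
# mean-square reduction is upcast to fp32 versus kept in bf16. This does not
# claim to reproduce TENorm's internals.
x = torch.randn(4096) * 30.0
x_bf16 = x.to(torch.bfloat16)

eps = 1e-6
scale_fp32 = x_bf16.float().pow(2).mean().add(eps).rsqrt()  # reduce in fp32
scale_bf16 = x_bf16.pow(2).mean().add(eps).rsqrt().float()  # reduce in bf16

print(f"fp32-reduced scale: {scale_fp32.item():.6f}")
print(f"bf16-reduced scale: {scale_bf16.item():.6f}")
```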
wdevazelhes: I see, that makes sense, I agree 👍 Thanks for tagging @deepakn94 🙏
Also, Mike Chrzanowski and Shanmugam Ramasamy could be tagged if any NVIDIA folks know how to reach them (I couldn't find their GitHub handles): they created the commit that prevented the use of TENorm, and Mike Chrzanowski also wrote a paper using qk-layernorm 👍