-
Notifications
You must be signed in to change notification settings - Fork 352
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FP8 Support for MCore MoE #648
Conversation
93f15d2
to
cb01bbf
Compare
9163446
to
50d180c
Compare
9b520c6
to
8df68fc
Compare
42f28d3
to
8e03976
Compare
I don't like the fact that the layers need to know that they are experts. Can't it be abstracted in some way using the options that we already have or add options that are more generic? |
I see. Good advice. I remove the |
Hi @ptrendx, could you please continue the review and share your comments. To make sure this feature can be included in MCore v0.6, I think it's better to merge this MR this week. |
So, to be honest, I don't quite understand why we need that communication flag at all. MCore should be able to just call te.Linear without setting row or column parallelism to the same effect, no? And then we would not need any special flag on the TE side? Also, you added this rng tracker name option, but did not document it. Handling of the zero token case I think is fine. |
Added documentation. Thanks. |
b3adfa5
to
8dc92d1
Compare
/te-ci pytorch |
8dc92d1
to
c862ac0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM. @Victarry, please add a test that runs the Linear layer with empty input to shw that it works and then we will be able to merge.
Hi @ptrendx, I have added the new unittest for Linear layer with empty input. Since I didn't find an appropriate file to place the new testcase, I created a new test file named |
I would put it in test_sanity.py. It looks good, but please also add some check in it - like checking that the batch size of the output of the linear layer is the same as the input (so if 0 gets passed as input, 0 is also provided as the output). |
Done! Thanks for you advice. 👍🏻 |
Hi, @ptrendx, can this MR be merged now? |
/te-ci pytorch |
Hi, @ptrendx, I just fixed the UT error in CI, could you please trigger the ci again? |
/te-ci pytorch |
Hi, @ptrendx, the CI pipeline is passed, could you please merge this MR? |
Hi @Victarry, we are trying to minimize the changes going into 1.6 release so will merge that PR after 1.6 branch is created. |
Signed-off-by: Dennis Liu <[email protected]>
Signed-off-by: Dennis Liu <[email protected]>
Signed-off-by: Dennis Liu <[email protected]>
Hi @Victarry, now that the 1.6 branch is created, could you resolve conflicts in your PR? Then we will be able to merge it. |
c779207
to
883178e
Compare
Hi, @ptrendx, I just resolved the conflicts, please merge this PR. Thanks! |
/te-ci pytorch |
Hi @ptrendx, I guess the CI failure is due to other code change in main branch. Could you please trigger the pytorch CI again? |
/te-ci pytorch |
Hi @Victarry, I'm wondering when the mcore related changes will be available on the public mcore repository. Or if it's already available, could you point me to the relevant changes or PR? Thanks! |
Hi, @viclzhu, the mcore related change is planed to be published before the end of May. |
Hi @ptrendx, I found that the UT only failed on L40, but I'm not sure why does this happen. Do you have any insights? |
Hi Victarry - I checked and this failure is unrelated to this PR, so I believe it is safe to merge. |
* Add support for MoE with FP8. Signed-off-by: Dennis Liu <[email protected]> * Fix unittest. Signed-off-by: Dennis Liu <[email protected]> * Fix error in linear backward. Signed-off-by: Dennis Liu <[email protected]> --------- Signed-off-by: Dennis Liu <[email protected]> Co-authored-by: Przemyslaw Tredak <[email protected]>
* Add support for MoE with FP8. Signed-off-by: Dennis Liu <[email protected]> * Fix unittest. Signed-off-by: Dennis Liu <[email protected]> * Fix error in linear backward. Signed-off-by: Dennis Liu <[email protected]> --------- Signed-off-by: Dennis Liu <[email protected]> Co-authored-by: Przemyslaw Tredak <[email protected]> Signed-off-by: Pawel Gadzinski <[email protected]>
* Add support for MoE with FP8. Signed-off-by: Dennis Liu <[email protected]> * Fix unittest. Signed-off-by: Dennis Liu <[email protected]> * Fix error in linear backward. Signed-off-by: Dennis Liu <[email protected]> --------- Signed-off-by: Dennis Liu <[email protected]> Co-authored-by: Przemyslaw Tredak <[email protected]> Signed-off-by: Pawel Gadzinski <[email protected]>
* Add support for MoE with FP8. Signed-off-by: Dennis Liu <[email protected]> * Fix unittest. Signed-off-by: Dennis Liu <[email protected]> * Fix error in linear backward. Signed-off-by: Dennis Liu <[email protected]> --------- Signed-off-by: Dennis Liu <[email protected]> Co-authored-by: Przemyslaw Tredak <[email protected]> Signed-off-by: Pawel Gadzinski <[email protected]>
@@ -335,6 +338,8 @@ at::Tensor fp8_transpose(at::Tensor input, | |||
|
|||
size_t M = static_cast<size_t>(input.size(0)); | |||
size_t N = static_cast<size_t>(input.size(1)); | |||
if (M == 0 || N == 0) | |||
return input; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Victarry This will cause shape mismatch error between wgrad and weight when gradient accumulation fusion is disabled.
Add FP8 support for MoE in MCore.
Related MR in MCore:
https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/merge_requests/1089
Implementation details:
rng_tracker_name
for initialize with EP