FP8 Support for MCore MoE #648

Victarry · 2024-01-31T07:44:52Z

Add FP8 support for MoE in MCore.

Related MR in MCore:
https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/merge_requests/1089

Implementation details:

Add rng_tracker_name for initialize with EP
Add if statement in kernel to handle zeros tokens passed to an expert.

ptrendx · 2024-03-07T18:48:19Z

I don't like the fact that the layers need to know that they are experts. Can't it be abstracted in some way using the options that we already have or add options that are more generic?

Victarry · 2024-03-08T07:55:18Z

I see. Good advice.

I remove the is_expert flag and use explicit_parallel_comm to indicate the communications are handled outside the te.Linear.

Victarry · 2024-03-12T00:26:53Z

Hi @ptrendx, could you please continue the review and share your comments.

To make sure this feature can be included in MCore v0.6, I think it's better to merge this MR this week.

ptrendx · 2024-03-13T16:58:48Z

So, to be honest, I don't quite understand why we need that communication flag at all. MCore should be able to just call te.Linear without setting row or column parallelism to the same effect, no? And then we would not need any special flag on the TE side?

Also, you added this rng tracker name option, but did not document it.

Handling of the zero token case I think is fine.

Victarry · 2024-03-14T06:01:03Z

Added documentation. Thanks.

ptrendx · 2024-03-18T16:52:24Z

/te-ci pytorch

ptrendx

Changes LGTM. @Victarry, please add a test that runs the Linear layer with empty input to shw that it works and then we will be able to merge.

Victarry · 2024-03-25T10:58:32Z

Hi @ptrendx, I have added the new unittest for Linear layer with empty input.

Since I didn't find an appropriate file to place the new testcase, I created a new test file named test_linear_layer.py.
Please tell me if you have any other suggentions.

ptrendx · 2024-03-25T16:31:44Z

I would put it in test_sanity.py. It looks good, but please also add some check in it - like checking that the batch size of the output of the linear layer is the same as the input (so if 0 gets passed as input, 0 is also provided as the output).

Victarry · 2024-03-26T03:24:32Z

Done! Thanks for you advice. 👍🏻

Victarry · 2024-04-07T01:38:01Z

Hi, @ptrendx, can this MR be merged now?

ptrendx · 2024-04-07T01:48:28Z

/te-ci pytorch

Victarry · 2024-04-07T07:28:13Z

Hi, @ptrendx, I just fixed the UT error in CI, could you please trigger the ci again?

ptrendx · 2024-04-07T15:15:34Z

/te-ci pytorch

Victarry · 2024-04-08T02:04:09Z

Hi, @ptrendx, the CI pipeline is passed, could you please merge this MR?
Thanks a lot!

ptrendx · 2024-04-08T17:28:41Z

Hi @Victarry, we are trying to minimize the changes going into 1.6 release so will merge that PR after 1.6 branch is created.

Signed-off-by: Dennis Liu <[email protected]>

ptrendx · 2024-04-16T17:29:49Z

Hi @Victarry, now that the 1.6 branch is created, could you resolve conflicts in your PR? Then we will be able to merge it.

Victarry · 2024-04-17T02:15:02Z

Hi, @ptrendx, I just resolved the conflicts, please merge this PR. Thanks!

ptrendx · 2024-04-17T02:44:12Z

/te-ci pytorch

Victarry · 2024-04-24T03:51:28Z

Hi @ptrendx, I guess the CI failure is due to other code change in main branch. Could you please trigger the pytorch CI again?

ptrendx · 2024-04-25T19:00:11Z

/te-ci pytorch

viclzhu · 2024-04-25T21:57:18Z

Hi @Victarry, I'm wondering when the mcore related changes will be available on the public mcore repository. Or if it's already available, could you point me to the relevant changes or PR?
It sounds like it will likely be available on mcore-0.7, but I couldn't seem to find the changes yet on the beta-0.7 branch.

Thanks!

Victarry · 2024-04-29T03:25:16Z

Hi, @viclzhu, the mcore related change is planed to be published before the end of May.
Thanks.

Victarry · 2024-04-29T03:26:38Z

Hi @ptrendx, I found that the UT only failed on L40, but I'm not sure why does this happen.

Do you have any insights?

ptrendx · 2024-04-29T20:10:32Z

Hi Victarry - I checked and this failure is unrelated to this PR, so I believe it is safe to merge.

* Add support for MoE with FP8. Signed-off-by: Dennis Liu <[email protected]> * Fix unittest. Signed-off-by: Dennis Liu <[email protected]> * Fix error in linear backward. Signed-off-by: Dennis Liu <[email protected]> --------- Signed-off-by: Dennis Liu <[email protected]> Co-authored-by: Przemyslaw Tredak <[email protected]>

* Add support for MoE with FP8. Signed-off-by: Dennis Liu <[email protected]> * Fix unittest. Signed-off-by: Dennis Liu <[email protected]> * Fix error in linear backward. Signed-off-by: Dennis Liu <[email protected]> --------- Signed-off-by: Dennis Liu <[email protected]> Co-authored-by: Przemyslaw Tredak <[email protected]> Signed-off-by: Pawel Gadzinski <[email protected]>

Victarry · 2024-07-04T04:53:48Z

transformer_engine/pytorch/csrc/extensions/transpose.cu

@@ -335,6 +338,8 @@ at::Tensor fp8_transpose(at::Tensor input,

  size_t M = static_cast<size_t>(input.size(0));
  size_t N = static_cast<size_t>(input.size(1));
+  if (M == 0 || N == 0)
+    return input;


@Victarry This will cause shape mismatch error between wgrad and weight when gradient accumulation fusion is disabled.

Victarry force-pushed the denliu/moe_fp8 branch from 93f15d2 to cb01bbf Compare January 31, 2024 09:39

Victarry force-pushed the denliu/moe_fp8 branch from 9163446 to 50d180c Compare February 22, 2024 18:48

Victarry force-pushed the denliu/moe_fp8 branch 2 times, most recently from 9b520c6 to 8df68fc Compare March 6, 2024 05:36

Victarry marked this pull request as ready for review March 6, 2024 08:10

Victarry force-pushed the denliu/moe_fp8 branch 2 times, most recently from 42f28d3 to 8e03976 Compare March 6, 2024 16:10

Victarry force-pushed the denliu/moe_fp8 branch from b3adfa5 to 8dc92d1 Compare March 15, 2024 17:05

Victarry force-pushed the denliu/moe_fp8 branch from 8dc92d1 to c862ac0 Compare March 22, 2024 01:55

ptrendx approved these changes Mar 22, 2024

View reviewed changes

Victarry force-pushed the denliu/moe_fp8 branch from 7ed9c5d to 28ec889 Compare April 7, 2024 04:30

Victarry added 3 commits April 16, 2024 06:26

Add support for MoE with FP8.

fde3bae

Signed-off-by: Dennis Liu <[email protected]>

Fix unittest.

d79d661

Signed-off-by: Dennis Liu <[email protected]>

Fix error in linear backward.

883178e

Signed-off-by: Dennis Liu <[email protected]>

Victarry force-pushed the denliu/moe_fp8 branch from c779207 to 883178e Compare April 17, 2024 02:10

Merge branch 'main' into denliu/moe_fp8

edba411

ptrendx merged commit 32d1eb1 into NVIDIA:main Apr 29, 2024
19 of 20 checks passed

Victarry commented Jul 4, 2024

View reviewed changes

timmoon10 mentioned this pull request Aug 16, 2024

Update FP8 scale-inverse in kernels with FP8 output #1083

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FP8 Support for MCore MoE #648

FP8 Support for MCore MoE #648

Victarry commented Jan 31, 2024 •

edited

Loading

ptrendx commented Mar 7, 2024

Victarry commented Mar 8, 2024

Victarry commented Mar 12, 2024

ptrendx commented Mar 13, 2024

Victarry commented Mar 14, 2024

ptrendx commented Mar 18, 2024

ptrendx left a comment

Victarry commented Mar 25, 2024

ptrendx commented Mar 25, 2024

Victarry commented Mar 26, 2024

Victarry commented Apr 7, 2024

ptrendx commented Apr 7, 2024

Victarry commented Apr 7, 2024

ptrendx commented Apr 7, 2024

Victarry commented Apr 8, 2024

ptrendx commented Apr 8, 2024

ptrendx commented Apr 16, 2024

Victarry commented Apr 17, 2024

ptrendx commented Apr 17, 2024

Victarry commented Apr 24, 2024

ptrendx commented Apr 25, 2024

viclzhu commented Apr 25, 2024

Victarry commented Apr 29, 2024

Victarry commented Apr 29, 2024

ptrendx commented Apr 29, 2024

Victarry Jul 4, 2024

FP8 Support for MCore MoE #648

FP8 Support for MCore MoE #648

Conversation

Victarry commented Jan 31, 2024 • edited Loading

ptrendx commented Mar 7, 2024

Victarry commented Mar 8, 2024

Victarry commented Mar 12, 2024

ptrendx commented Mar 13, 2024

Victarry commented Mar 14, 2024

ptrendx commented Mar 18, 2024

ptrendx left a comment

Choose a reason for hiding this comment

Victarry commented Mar 25, 2024

ptrendx commented Mar 25, 2024

Victarry commented Mar 26, 2024

Victarry commented Apr 7, 2024

ptrendx commented Apr 7, 2024

Victarry commented Apr 7, 2024

ptrendx commented Apr 7, 2024

Victarry commented Apr 8, 2024

ptrendx commented Apr 8, 2024

ptrendx commented Apr 16, 2024

Victarry commented Apr 17, 2024

ptrendx commented Apr 17, 2024

Victarry commented Apr 24, 2024

ptrendx commented Apr 25, 2024

viclzhu commented Apr 25, 2024

Victarry commented Apr 29, 2024

Victarry commented Apr 29, 2024

ptrendx commented Apr 29, 2024

Victarry Jul 4, 2024

Choose a reason for hiding this comment

Victarry commented Jan 31, 2024 •

edited

Loading