Translate MatmulPattern to MmaOp using OptInDispatch #3593

Merged
jacobhinkle merged 57 commits into main from mma_translation_dispatch on Dec 17, 2024

Conversation

jacobhinkle
Collaborator

@jacobhinkle commented Dec 16, 2024

This PR should not introduce any behavior change. I am refactoring `MatmulPattern::translateToMmaOp` to use `OptInDispatch` to handle the different types of patterns, such as mul+sum, `LinearOp`, and `MatmulOp`.

`OptInDispatch` differs from the more common `OptOutDispatch` in that it throws an error when it encounters an unhandled statement. That is appropriate here because we have a fixed set of supported patterns and do not want to fall through to some default translation when we encounter an unsupported pattern; it is better to throw an error.
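For readers less familiar with the dispatch classes, here is a minimal sketch of the opt-in idea in plain C++; the class and member names below are illustrative stand-ins, not the actual nvFuser `OptInDispatch` interface.

```cpp
#include <stdexcept>
#include <string>

// Illustrative expression hierarchy standing in for Fusion IR nodes.
struct Expr {
  virtual ~Expr() = default;
  virtual std::string name() const = 0;
};
struct MatmulOp : Expr { std::string name() const override { return "MatmulOp"; } };
struct LinearOp : Expr { std::string name() const override { return "LinearOp"; } };

// Opt-in style dispatcher: only explicitly handled node types are translated;
// anything else is a hard error instead of silently falling through.
class PatternTranslator {
 public:
  void dispatch(Expr* e) {
    if (auto* m = dynamic_cast<MatmulOp*>(e)) { handle(m); return; }
    if (auto* l = dynamic_cast<LinearOp*>(e)) { handle(l); return; }
    unhandled(e);
  }

 private:
  void handle(MatmulOp* /*op*/) { /* translate the MatmulOp pattern to an MmaOp */ }
  void handle(LinearOp* /*op*/) { /* translate the LinearOp pattern to an MmaOp */ }
  void unhandled(Expr* e) {
    throw std::runtime_error("Unsupported matmul pattern: " + e->name());
  }
};
```

The design point is that the unhandled case throws rather than silently ignoring the node, which is exactly the behavior we want for a closed set of matmul patterns.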

Original suggestion to use separate methods for each pattern: #3440 (review).

jacobhinkle and others added 30 commits November 14, 2024 13:14
This makes it clear that we're performing the exact same steps as we do
on the mma_result
This PR refactors `autotune_pointwise.py` to support the 2D pointwise
scheduler.

* Created `autotune_utils.py` to hold common utilities such as
`ScriptConfiguration`, `collect_data`, `separate_data`,
`test_model_rmse`, and `test_model`
* Added support for `Gelu-Bias`, `Silu-Mul`, `Bcast-Add`, and `Mul` fusions
* Used `ensemble.RandomForestRegressor` instead of
`tree.DecisionTreeRegressor`

Simplified the script flow into 5 steps:
1. Set up the script configuration by creating the `ScriptConfiguration` and
`Autotune` classes.
2. Run the experiments by calling `collect_data`.
3. Separate the data into training and validation sets using `separate_data`.
4. Train the regression model using the sklearn random forest ensemble.
5. Test the regression model using `test_model_rmse` and `test_model`.
naoyam and others added 15 commits December 16, 2024 09:58
Just patching ComputeAtMap to exclude dead expressions and vals.
This PR implements the `scheduleSplitKSum` function to support split-K GEMM
with the Hopper matmul schedule.

- It supports all operand formats: TT, NT, TN, and NN.
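For background, here is a conceptual sketch of the split-K idea in plain C++ (not the actual scheduler code): the K dimension is partitioned into chunks, each chunk produces a partial result, and a final sum combines the partials.

```cpp
#include <algorithm>
#include <vector>

// Conceptual split-K dot product: partition the K dimension into `splits`
// chunks, accumulate a partial result per chunk (each chunk would map to its
// own piece of the grid in a real kernel), then sum the partial results.
float splitKDot(const std::vector<float>& a, const std::vector<float>& b, int splits) {
  const int k = static_cast<int>(a.size());
  const int chunk = (k + splits - 1) / splits;
  std::vector<float> partial(splits, 0.0f);
  for (int s = 0; s < splits; ++s) {
    const int begin = s * chunk;
    const int end = std::min(k, begin + chunk);
    for (int i = begin; i < end; ++i) {
      partial[s] += a[i] * b[i];
    }
  }
  float result = 0.0f;  // the split-K sum: combine the per-split partials
  for (float p : partial) {
    result += p;
  }
  return result;
}
```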
This PR renames `unroll_factor` to `iteration_unroll_factor` and adds
`reduction_unroll_factor`, which adds an unroll factor on top of the
vectorization factor for the inner reduction domain.
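To illustrate what unrolling on top of vectorization means for the inner reduction domain, here is a rough sketch in plain C++ (not the scheduler's actual transforms); `kReductionUnroll` and `kVec` are stand-ins for `reduction_unroll_factor` and the vectorization factor.

```cpp
// The reduction extent is consumed in steps of reduction_unroll_factor *
// vectorization_factor: each unrolled iteration processes one vectorized chunk.
constexpr int kVec = 4;              // stands in for the vectorization factor
constexpr int kReductionUnroll = 2;  // stands in for reduction_unroll_factor

float reduceInnerDomain(const float* x, int extent) {
  float sum = 0.0f;
  const int step = kReductionUnroll * kVec;
  for (int i = 0; i + step <= extent; i += step) {
    for (int u = 0; u < kReductionUnroll; ++u) {  // unrolled reduction iterations
      for (int v = 0; v < kVec; ++v) {            // stands in for a vectorized access
        sum += x[i + u * kVec + v];
      }
    }
  }
  // Remainder handling omitted for brevity.
  return sum;
}
```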
…bolicSizes) (#3578)

Stacked on #3585 

`StmtSort::getStmtsTo` may not grab all active iter domains if IDs are
connected in an unconventional way. For example, we can set the loop
domain of a tensor as a producer of its logical domain, but due to the
nature of `IterVisitor`, such ID dependency patterns are not supported,
meaning `StmtSort::getStmtsTo` would fail to grab all valid IDs and
their exprs.

I only recently noticed this issue while working on #3556; specifically,
it was exposed as an inconsistent replacement of extent vals. I've been
experimenting with such domain patterns, but I hadn't seen this before,
likely because I was using just static-shape tensors for convenience.

To fix the issue, I added a variation of `StmtSort::getStmtsTo`, which
traverses a fusion as usual but stops at each TensorView. For each
TensorView, instead of using `IterVisitor`, it uses
`TensorDomain::getAllStatements()`, which combines both
`TensorDomain::allIDs()` and `TensorDomain::allExprs()`, and traverses
the IDs and exprs in the returned order.

It's a somewhat naive implementation, but I think it is good enough for
now, and I don't have any other immediate ideas to try.

I changed `ValReplacementMutator` to use the new interface. That's the
only use for now.
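As a very rough sketch of the traversal described above (the types and fields below are hypothetical stand-ins, not the real `StmtSort`/`TensorDomain` interfaces):

```cpp
#include <vector>

// Hypothetical stand-in for a statement in the IR; not the nvFuser classes.
struct Stmt {
  bool is_tensor_view = false;
  std::vector<Stmt*> producers;          // edges found by dependency analysis
  std::vector<Stmt*> domain_statements;  // stand-in for TensorDomain::getAllStatements()
};

// Traverse as usual, but at a TensorView switch from dependency analysis to the
// domain's own enumeration of IDs and exprs, so unconventional ID graphs (e.g. a
// loop domain acting as a producer of the logical domain) are still covered.
// Visited-set bookkeeping is omitted for brevity.
void collectStmts(Stmt* stmt, std::vector<Stmt*>& ordered) {
  const auto& next = stmt->is_tensor_view ? stmt->domain_statements : stmt->producers;
  for (Stmt* producer : next) {
    collectStmts(producer, ordered);
  }
  ordered.push_back(stmt);
}
```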

---------

Co-authored-by: Jacob Hinkle <[email protected]>
When we do not have an epilogue (not even a cast), it might be the case
that the original `MmaOp` has an output that is a Fusion output. In this
case the cached output, which we often call `dc`, is actually an
`mma_result`. Currently this causes us to schedule that tensor once in
`scheduleMmaResults` and then again in `scheduleEpilogue`, leading to an
esoteric error (see the included test). This PR simply skips scheduling
those tensors in `scheduleEpilogue` if they are already known to be mma
results.
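A hypothetical sketch of the guard this describes (illustrative only, not the actual scheduler code):

```cpp
#include <unordered_set>
#include <vector>

struct TensorView {};  // stand-in type for illustration

// Skip epilogue scheduling for cached outputs that were already scheduled as
// mma results, so the same tensor is not scheduled twice.
void scheduleEpilogue(
    const std::vector<TensorView*>& cached_outputs,
    const std::unordered_set<TensorView*>& mma_results) {
  for (TensorView* dc : cached_outputs) {
    if (mma_results.count(dc)) {
      continue;  // already handled in scheduleMmaResults
    }
    // ... schedule dc as an epilogue tensor here ...
  }
}
```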
…on (#3392)

Ring-based decomposition for Allgather+GEMM overlap ATen implementation
Follow-up to #3567. I just found that the Loop option is also necessary. With
this option, the inlining analysis uses IdModel to understand if
inlining is possible. The loop generation lowering pass also uses
IdModel loop promotion to figure out which iter domains to use for each
`ForLoop` node. The latter is not necessary for the resize scheduler at
this moment, though.
This PR updates the comments for the `__truediv__` operator defined in
the Python bindings. The current comment does not reflect what the code
actually does.

Reference:
#2837 (review)
…tput - smem epilogue not supported. (#3580)

This adds support for scheduling the epilogue for the Hopper matmul
scheduler. We don't support the smem epilogue yet.

We also don't yet honor the vectorization_factor for the store to the
output. That'll be covered in a separate PR.
# What
Refactor Host Ir lowering. https://jirasw.nvidia.com/browse/NVFUSER-105
Dependent on:
- [x] #3524

# Why
- Cleaner code
- Prepare for integrating the lowering of sharded matmuls with pipelined
algorithms, https://jirasw.nvidia.com/browse/NVFUSER-106

# How
- Move the `multidevice/lower_communication` files to `host_ir/lower`
- Move most of `MultiDeviceExecutor`'s constructor logic to
`host_ir/lower`

I divided the work into incremental commits to ease the review. The
relevant commits start at [move HostIr lowering to host ir folder and
special class](25f09e9); the commits before that belong to #3524.

---------

Co-authored-by: Jingyue Wu <[email protected]>
@jacobhinkle
Collaborator Author

!test

Collaborator

@rdspring1 left a comment


OptInDispatch is fancier than what I expected for this. 🎉

csrc/scheduler/mma_utils.cpp (resolved)
@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

!build

@jacobhinkle merged commit 1136753 into main on Dec 17, 2024
17 checks passed
@jacobhinkle deleted the mma_translation_dispatch branch on December 17, 2024 at 14:21