Translate MatmulPattern to MmaOp using OptInDispatch #3593

Merged
jacobhinkle merged 57 commits into main from mma_translation_dispatch on Dec 17, 2024

Conversation

jacobhinkle
Collaborator

@jacobhinkle commented Dec 16, 2024

This PR should not introduce any behavior change. I am refactoring `MatmulPattern::translateToMmaOp` to use `OptInDispatch` to handle the different types of patterns, such as mul+sum, `LinearOp`, and `MatmulOp`.

`OptInDispatch` differs from the more common `OptOutDispatch` in that it throws an error when it encounters an unhandled statement. That is appropriate here because we have a fixed set of supported patterns and do not want to fall through to some default translation when we encounter an unsupported pattern; it is better to throw an error.
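For readers less familiar with the dispatch classes, here is a minimal sketch of the opt-in idea in plain C++; the class and member names below are illustrative stand-ins, not the actual nvFuser `OptInDispatch` interface.

```cpp
#include <stdexcept>
#include <string>

// Illustrative expression hierarchy standing in for Fusion IR nodes.
struct Expr {
  virtual ~Expr() = default;
  virtual std::string name() const = 0;
};
struct MatmulOp : Expr { std::string name() const override { return "MatmulOp"; } };
struct LinearOp : Expr { std::string name() const override { return "LinearOp"; } };

// Opt-in style dispatcher: only explicitly handled node types are translated;
// anything else is a hard error instead of silently falling through.
class PatternTranslator {
 public:
  void dispatch(Expr* e) {
    if (auto* m = dynamic_cast<MatmulOp*>(e)) { handle(m); return; }
    if (auto* l = dynamic_cast<LinearOp*>(e)) { handle(l); return; }
    unhandled(e);
  }

 private:
  void handle(MatmulOp* /*op*/) { /* translate the MatmulOp pattern to an MmaOp */ }
  void handle(LinearOp* /*op*/) { /* translate the LinearOp pattern to an MmaOp */ }
  void unhandled(Expr* e) {
    throw std::runtime_error("Unsupported matmul pattern: " + e->name());
  }
};
```

The design point is that the unhandled case throws rather than silently ignoring the node, which is exactly the behavior we want for a closed set of matmul patterns.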

Original suggestion to use separate methods for each pattern: #3440 (review).

jacobhinkle and others added 30 commits November 14, 2024 13:14
This makes it clear that we're performing the exact same steps as we do
on the mma_result
This PR refactors `autotune_pointwise.py` to support the 2D pointwise
scheduler.

* Created `autotune_utils.py` to hold common utilities such as
`ScriptConfiguration`, `collect_data`, `separate_data`,
`test_model_rmse`, and `test_model`
* Added support for `Gelu-Bias`, `Silu-Mul`, `Bcast-Add`, and `Mul` fusions
* Used `ensemble.RandomForestRegressor` instead of
`tree.DecisionTreeRegressor`

Simplified the script flow into 5 steps:
1. Set up the script configuration by creating the `ScriptConfiguration` and
`Autotune` classes.
2. Run the experiments by calling `collect_data`.
3. Separate the data into training and validation sets using `separate_data`.
4. Train the regression model using the sklearn random forest ensemble.
5. Test the regression model using `test_model_rmse` and `test_model`.
naoyam and others added 15 commits December 16, 2024 09:58
Just patching ComputeAtMap to exclude dead expressions and vals.
This PR implements the `scheduleSplitKSum` function to support split-K GEMM
with the Hopper matmul schedule.

- It supports all operand formats: TT, NT, TN, and NN.
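For background, here is a conceptual sketch of the split-K idea in plain C++ (not the actual scheduler code): the K dimension is partitioned into chunks, each chunk produces a partial result, and a final sum combines the partials.

```cpp
#include <algorithm>
#include <vector>

// Conceptual split-K dot product: partition the K dimension into `splits`
// chunks, accumulate a partial result per chunk (each chunk would map to its
// own piece of the grid in a real kernel), then sum the partial results.
float splitKDot(const std::vector<float>& a, const std::vector<float>& b, int splits) {
  const int k = static_cast<int>(a.size());
  const int chunk = (k + splits - 1) / splits;
  std::vector<float> partial(splits, 0.0f);
  for (int s = 0; s < splits; ++s) {
    const int begin = s * chunk;
    const int end = std::min(k, begin + chunk);
    for (int i = begin; i < end; ++i) {
      partial[s] += a[i] * b[i];
    }
  }
  float result = 0.0f;  // the split-K sum: combine the per-split partials
  for (float p : partial) {
    result += p;
  }
  return result;
}
```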
This PR renames `unroll_factor` to `iteration_unroll_factor` and adds
`reduction_unroll_factor`, which adds an unroll factor on top of the
vectorization factor for the inner reduction domain.
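To illustrate what unrolling on top of vectorization means for the inner reduction domain, here is a rough sketch in plain C++ (not the scheduler's actual transforms); `kReductionUnroll` and `kVec` are stand-ins for `reduction_unroll_factor` and the vectorization factor.

```cpp
// The reduction extent is consumed in steps of reduction_unroll_factor *
// vectorization_factor: each unrolled iteration processes one vectorized chunk.
constexpr int kVec = 4;              // stands in for the vectorization factor
constexpr int kReductionUnroll = 2;  // stands in for reduction_unroll_factor

float reduceInnerDomain(const float* x, int extent) {
  float sum = 0.0f;
  const int step = kReductionUnroll * kVec;
  for (int i = 0; i + step <= extent; i += step) {
    for (int u = 0; u < kReductionUnroll; ++u) {  // unrolled reduction iterations
      for (int v = 0; v < kVec; ++v) {            // stands in for a vectorized access
        sum += x[i + u * kVec + v];
      }
    }
  }
  // Remainder handling omitted for brevity.
  return sum;
}
```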
…bolicSizes) (#3578)

Stacked on #3585 

`StmtSort::getStmtsTo` may not grab all active iter domains if IDs are
connected in an unconventional way. For example, we can set the loop
domain of a tensor as a producer of its logical domain, but due to the
nature of `IterVisitor`, such ID dependency patterns are not supported,
meaning `StmtSort::getStmtsTo` would fail to grab all valid IDs and
their exprs.

I only recently noticed this issue while working on #3556; specifically,
it was exposed as an inconsistent replacement of extent vals. I've been
experimenting with such domain patterns, but I hadn't seen this before,
likely because I was using just static-shape tensors for convenience.

To fix the issue, I added a variation of `StmtSort::getStmtsTo`, which
traverses a fusion as usual but stops at each TensorView. For each
TensorView, instead of using `IterVisitor`, it uses
`TensorDomain::getAllStatements()`, which combines both
`TensorDomain::allIDs()` and `TensorDomain::allExprs()`, and traverses
the IDs and exprs in the returned order.

It's a somewhat naive implementation, but I think it is good enough for
now, and I don't have any other immediate ideas to try.

I changed `ValReplacementMutator` to use the new interface. That's the
only use for now.
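As a very rough sketch of the traversal described above (the types and fields below are hypothetical stand-ins, not the real `StmtSort`/`TensorDomain` interfaces):

```cpp
#include <vector>

// Hypothetical stand-in for a statement in the IR; not the nvFuser classes.
struct Stmt {
  bool is_tensor_view = false;
  std::vector<Stmt*> producers;          // edges found by dependency analysis
  std::vector<Stmt*> domain_statements;  // stand-in for TensorDomain::getAllStatements()
};

// Traverse as usual, but at a TensorView switch from dependency analysis to the
// domain's own enumeration of IDs and exprs, so unconventional ID graphs (e.g. a
// loop domain acting as a producer of the logical domain) are still covered.
// Visited-set bookkeeping is omitted for brevity.
void collectStmts(Stmt* stmt, std::vector<Stmt*>& ordered) {
  const auto& next = stmt->is_tensor_view ? stmt->domain_statements : stmt->producers;
  for (Stmt* producer : next) {
    collectStmts(producer, ordered);
  }
  ordered.push_back(stmt);
}
```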

---------

Co-authored-by: Jacob Hinkle <[email protected]>
When we do not have an epilogue (not even a cast), it might be the case
that the original `MmaOp` has an output that is a Fusion output. In this
case the cached output, which we often call `dc`, is actually an
`mma_result`. Currently this causes us to schedule that tensor once in
`scheduleMmaResults` and then again in `scheduleEpilogue`, leading to an
esoteric error (see the included test). This PR simply skips scheduling
those tensors in `scheduleEpilogue` if they are already known to be mma
results.
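A hypothetical sketch of the guard this describes (illustrative only, not the actual scheduler code):

```cpp
#include <unordered_set>
#include <vector>

struct TensorView {};  // stand-in type for illustration

// Skip epilogue scheduling for cached outputs that were already scheduled as
// mma results, so the same tensor is not scheduled twice.
void scheduleEpilogue(
    const std::vector<TensorView*>& cached_outputs,
    const std::unordered_set<TensorView*>& mma_results) {
  for (TensorView* dc : cached_outputs) {
    if (mma_results.count(dc)) {
      continue;  // already handled in scheduleMmaResults
    }
    // ... schedule dc as an epilogue tensor here ...
  }
}
```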
…on (#3392)

Ring-based decomposition for Allgather+GEMM overlap ATen implementation
Follow-up to #3567. I just found that the Loop option is also necessary. With
this option, the inlining analysis uses IdModel to understand if
inlining is possible. The loop generation lowering pass also uses
IdModel loop promotion to figure out which iter domains to use for each
`ForLoop` node. The latter is not necessary for the resize scheduler at
this moment, though.
This PR updates the comments for the `__truediv__` operator defined in
the Python bindings. The current comment does not reflect what the code
actually does.

Reference:
#2837 (review)
…tput - smem epilogue not supported. (#3580)

This adds support for scheduling the epilogue for the Hopper matmul
scheduler. We don't support the smem epilogue yet.

We also don't yet honor the vectorization_factor for the store to the
output. That'll be covered in a separate PR.
# What
Refactor Host Ir lowering. https://jirasw.nvidia.com/browse/NVFUSER-105
Dependent on:
- [x] #3524

# Why
- Cleaner code
- Prepare for integrating the lowering of sharded matmuls with pipelined
algorithms, https://jirasw.nvidia.com/browse/NVFUSER-106

# How
- Move the `multidevice/lower_communication` files to `host_ir/lower`
- Move most of `MultiDeviceExecutor`'s constructor logic to
`host_ir/lower`

I divided the work into incremental commits to ease the review. The
relevant commits start at [move HostIr lowering to host ir folder and
special class](25f09e9); the commits before that belong to #3524.

---------

Co-authored-by: Jingyue Wu <[email protected]>
@jacobhinkle
Collaborator Author

!test

Collaborator

@rdspring1 left a comment


OptInDispatch is fancier than what I expected for this. 🎉

csrc/scheduler/mma_utils.cpp (resolved)
@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

!build

@jacobhinkle
Collaborator Author

!build

@jacobhinkle merged commit 1136753 into main on Dec 17, 2024
17 checks passed
@jacobhinkle deleted the mma_translation_dispatch branch on December 17, 2024 at 14:21