Translate MatmulPattern to MmaOp using OptInDispatch #3593
Merged
Conversation
This makes it clear that we're performing the exact same steps as we do on the mma_result
…er_mma_no_broadcast
Co-authored-by: Ryan Spring <[email protected]>
This PR refactors `autotune_pointwise.py` to support the 2D pointwise scheduler.

* Create `autotune_utils.py` to hold common utilities such as `ScriptConfiguration`, `collect_data`, `separate_data`, `test_model_rmse`, and `test_model`
* Added support for `Gelu-Bias`, `Silu-Mul`, `Bcast-Add`, and `Mul` fusions
* Use `ensemble.RandomForestRegressor` instead of `tree.DecisionTreeRegressor`

Simplified the script flow into 5 steps:
1. Set up the script configuration by creating the `ScriptConfiguration` and `Autotune` classes.
2. Run experiments by calling `collect_data`.
3. Separate the data into training and validation sets using `separate_data`.
4. Train the regression model using the sklearn random forest ensemble.
5. Test the regression model using `test_model_rmse` and `test_model`.
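For context, a minimal sketch of steps 3–5 using plain sklearn and numpy; synthetic data stands in for the measurements that `collect_data` would gather, and the script's own helpers (`separate_data`, `test_model_rmse`) are replaced here by sklearn equivalents, so this is an illustration rather than the script itself:

```python
# Minimal sketch of steps 3-5 above, with synthetic data standing in for the
# measurements that collect_data would gather.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.uniform(size=(512, 4))  # e.g. problem sizes and scheduler parameters
runtimes = features @ np.array([3.0, 1.0, 0.5, 2.0]) + rng.normal(scale=0.05, size=512)

# Step 3: separate data into training and validation sets.
x_train, x_val, y_train, y_val = train_test_split(
    features, runtimes, test_size=0.2, random_state=0
)

# Step 4: train the regression model with a random forest ensemble.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(x_train, y_train)

# Step 5: evaluate the model, e.g. with RMSE on the validation set.
rmse = np.sqrt(mean_squared_error(y_val, model.predict(x_val)))
print(f"validation RMSE: {rmse:.4f}")
```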
Just patching ComputeAtMap to exclude dead expressions and vals.
This PR implements the `scheduleSplitKSum` function to support split-K GEMM with the hopper matmul schedule. It supports all operand formats: TT, NT, TN, and NN.
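To illustrate the split-K idea itself (the scheduler code is C++ inside nvFuser; this is only a numpy illustration of the math): the K dimension is partitioned, each partition produces a partial product, and the partials are summed at the end.

```python
# Illustration of split-K: partition the K dimension, compute partial GEMMs,
# then reduce the partial results. Conceptual only; not the scheduler code.
import numpy as np

M, N, K, splitk = 64, 32, 128, 4
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)

# Each of the `splitk` slices of K contributes a partial [M, N] product.
partials = [
    A[:, k0:k0 + K // splitk] @ B[k0:k0 + K // splitk, :]
    for k0 in range(0, K, K // splitk)
]

# The split-K sum combines the partial products into the final result.
C = np.sum(partials, axis=0)
assert np.allclose(C, A @ B, atol=1e-3)
```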
Just a minor fix
This PR renames `unroll_factor` to `iteration_unroll_factor` and adds `reduction_unroll_factor`, which applies an unroll factor on top of the vectorization factor for the inner reduction domain.
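As a rough illustration of how the two factors compose (an assumed tiling for exposition, not the scheduler's exact splits):

```python
# Assumed tiling for illustration only: elements of the inner reduction
# domain covered per serial iteration of the reduction loop.
vectorization_factor = 4     # elements per vectorized load/store
reduction_unroll_factor = 2  # unroll applied on top of vectorization
iteration_unroll_factor = 2  # unroll of the iteration (non-reduction) domain

elements_per_reduction_step = vectorization_factor * reduction_unroll_factor
print(elements_per_reduction_step)  # 8 reduction elements per serial step
```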
…bolicSizes) (#3578) Stacked on #3585

`StmtSort::getStmtsTo` may not grab all active iter domains if IDs are connected in an unconventional way. For example, we can set the loop domain of a tensor as a producer of its logical domain, but due to the nature of `IterVisitor`, such ID dependency patterns are not supported, meaning `StmtSort::getStmtsTo` would fail to grab all valid IDs and their exprs.

I just recently noticed this issue while working on #3556; specifically, it was exposed as an inconsistent replacement of extent vals. I've been experimenting with such patterns of domains, but I hadn't seen this before, likely because I was using just static-shape tensors for convenience.

To fix the issue, I added a variation of `StmtSort::getStmtsTo` that traverses a fusion as usual but stops at TensorView. For each TensorView, instead of using `IterVisitor`, it uses `TensorDomain::getAllStatements()`, which combines both `TensorDomain::allIDs()` and `TensorDomain::allExprs()`, and traverses the IDs and exprs in the returned order. It's a somewhat naive implementation, but I think it is good enough for now, and I don't have any other immediate idea to try.

I changed `ValReplacementMutator` to use the new interface. That's the only use for now.

---------

Co-authored-by: Jacob Hinkle <[email protected]>
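A language-agnostic sketch in Python of the idea (the classes are made up for illustration and are not nvFuser's): traverse the graph as usual, but when a tensor-view-like node is reached, enumerate its domain's statements directly instead of relying on dependency traversal, so unconventionally connected IDs are not missed.

```python
# Conceptual sketch only: the classes below are stand-ins, not nvFuser's.
class Node:
    def __init__(self, name, producers=()):
        self.name = name
        self.producers = list(producers)


class TensorViewLike(Node):
    def __init__(self, name, producers=(), domain_statements=()):
        super().__init__(name, producers)
        # Stand-in for TensorDomain::getAllStatements(): every ID and expr of
        # the domain in a fixed order, regardless of how they are connected.
        self.domain_statements = list(domain_statements)


def get_statements_to(outputs):
    seen, ordered = set(), []

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for producer in node.producers:  # fusion-level traversal as usual
            visit(producer)
        if isinstance(node, TensorViewLike):
            # Do not rely on dependency traversal for the iter domains; take
            # the domain's own exhaustive statement list instead.
            ordered.extend(s for s in node.domain_statements if s not in ordered)
        ordered.append(node)

    for out in outputs:
        visit(out)
    return ordered


# A tensor whose loop domain is wired as a producer of its logical domain
# still has all of its domain statements collected.
tv = TensorViewLike("tv0", domain_statements=["loop_id0", "split", "logical_id0"])
print([getattr(s, "name", s) for s in get_statements_to([tv])])
```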
When we do not have an epilogue (not even a cast), it might be the case that the original `MmaOp` has an output which is a Fusion output. In this case the cached output, which we often call `dc`, is actually an `mma_result`. Currently this causes us to schedule that tensor once in `scheduleMmaResults` and then again in `scheduleEpilogue`, leading to an esoteric error (see included test). This PR simply skips scheduling those tensors directly if they are already known to be mma results.
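A hypothetical sketch of the resulting control flow (the names are illustrative, not the actual scheduler code):

```python
# Hypothetical sketch of the fix's control flow; names are illustrative.
mma_results = {"mma_result"}                        # already scheduled in scheduleMmaResults
epilogue_tensors = ["mma_result", "output_cache"]   # cached outputs ("dc") to schedule

for tv in epilogue_tensors:
    if tv in mma_results:
        # Without an epilogue, the cached output *is* the mma result and has
        # already been scheduled; scheduling it a second time here is what
        # triggered the error, so skip it.
        continue
    print(f"scheduling epilogue tensor: {tv}")
```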
…on (#3392) Ring-based decomposition for Allgather+GEMM overlap ATen implementation
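Conceptually, the ring-based decomposition interleaves communication and compute: at each of the `world_size` steps, a rank multiplies the shard it currently holds while the next shard travels around the ring. A hedged, single-process numpy sketch of the math only (no real communication or overlap):

```python
# Single-process sketch of the ring decomposition's math: at step s, each rank
# would multiply the A-shard it currently holds while "receiving" the next one.
# Shards are just rotated in a list here; no actual communication.
import numpy as np

world_size, m_per_rank, K, N = 4, 8, 16, 8
a_shards = [np.random.rand(m_per_rank, K) for _ in range(world_size)]  # one shard per rank
B = np.random.rand(K, N)

# Rank 0's view: start with its own shard and, step by step, work on the other
# ranks' shards as they arrive around the ring.
held = 0
partial_outputs = [None] * world_size
for step in range(world_size):
    partial_outputs[held] = a_shards[held] @ B   # compute on the shard currently held
    held = (held - 1) % world_size               # "receive" the next shard from the ring

C = np.concatenate(partial_outputs, axis=0)      # row block of allgathered A times B
assert np.allclose(C, np.concatenate(a_shards, axis=0) @ B)
```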
Follow-up to #3567. I just found that the Loop option is also necessary. With this option, the inlining analysis uses IdModel to understand if inlining is possible. The loop generation lowering pass also uses IdModel loop promotion to figure out which iter domains to use for each `ForLoop` node. The latter is not necessary for the resize scheduler at this moment, though.
This PR updates the comments for `__truediv__` operator defined in python bindings. The current comment does not reflect what the code actually does. Reference: #2837 (review)
…tput - smem epilogue not supported. (#3580) This adds support for scheduling the epilogue in the hopper matmul scheduler. We don't support smem epilogue yet, and we don't yet honor the vectorization_factor for the store to output; that will be covered in a separate PR.
# What
Refactor Host IR lowering. https://jirasw.nvidia.com/browse/NVFUSER-105

Dependent on:
- [x] #3524

# Why
- Cleaner code
- Prepare for integrating the lowering of sharded matmul with pipelined algorithms, https://jirasw.nvidia.com/browse/NVFUSER-106

# How
- Move the `multidevice/lower_communication` files to `host_ir/lower`
- Move most of `MultiDeviceExecutor`'s constructor logic to `host_ir/lower`

I divided the work into incremental commits to ease the review. The relevant commits start at [move HostIr lowering to host ir folder and special class](25f09e9); the commits before that belong to #3524.

---------

Co-authored-by: Jingyue Wu <[email protected]>
!test
rdspring1 approved these changes on Dec 16, 2024
`OptInDispatch` is fancier than what I expected for this. 🎉
Co-authored-by: Ryan Spring <[email protected]>
!build
!build
Needed to re-run lintrunner init
!build
!build
This PR should not introduce any behavior change. I am refactoring `MatmulPattern::translateToMmaOp` to use `OptInDispatch` to handle the different pattern types, such as mul+sum, `LinearOp`, and `MatmulOp`.

`OptInDispatch` differs from the more common `OptOutDispatch` by throwing an error when we encounter unhandled statements. This is appropriate here because we have a fixed set of supported patterns and do not wish to fall through to some default translation when we encounter an unsupported pattern; it is better to throw an error.

Original suggestion to use separate methods for each pattern: #3440 (review).
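To make the opt-in/opt-out distinction concrete, here is a small, language-agnostic sketch in Python; the class and method names are illustrative stand-ins, not nvFuser's C++ `OptInDispatch`/`OptOutDispatch` API. An opt-in dispatcher errors on any statement type without an explicit handler, whereas an opt-out dispatcher silently falls through to a default.

```python
# Illustrative sketch of opt-in vs. opt-out dispatch; stand-in classes only.
class MatmulOp: ...
class LinearOp: ...
class MulSumPattern: ...
class SomeOtherOp: ...


class OptOutStyleTranslator:
    """Opt-out: unhandled statement types fall through to a default (no-op)."""
    def dispatch(self, stmt):
        handler = getattr(self, f"handle_{type(stmt).__name__}", self.handle_default)
        return handler(stmt)

    def handle_default(self, stmt):
        pass  # silently ignore anything without a dedicated handler


class OptInStyleTranslator:
    """Opt-in: unhandled statement types are an error, which is what we want here."""
    def dispatch(self, stmt):
        handler = getattr(self, f"handle_{type(stmt).__name__}", None)
        if handler is None:
            raise RuntimeError(f"Unsupported pattern: {type(stmt).__name__}")
        return handler(stmt)

    # Only the fixed set of supported patterns gets a handler.
    def handle_MatmulOp(self, stmt):
        return "translate MatmulOp to MmaOp"

    def handle_LinearOp(self, stmt):
        return "translate LinearOp to MmaOp"

    def handle_MulSumPattern(self, stmt):
        return "translate mul+sum to MmaOp"


translator = OptInStyleTranslator()
print(translator.dispatch(MatmulOp()))   # handled pattern
try:
    translator.dispatch(SomeOtherOp())   # unsupported -> loud error, not a silent fallback
except RuntimeError as e:
    print(e)
```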