WAR for supporting the rotation + residual pattern #3555
Merged
!test |
!build |
jacobhinkle approved these changes Dec 10, 2024
LGTM. Only comment is minor about naming of the param `update_mode`.
!build |
naoyam added a commit that referenced this pull request Dec 17, 2024
This is a very preliminary version of a new scheduler mainly targeted for RoPE. I will incrementally extend this scheduler to be more flexible and performant, but for now it only handles a fusion that has pointwise ops and a single `Resize`-based tensor op such as `SliceOp` and `PadOp`. The scheduling strategy is currently pretty naive too and is manually demonstrated at #3549 and #3555, but the main point is that resize-based tensor ops like `SliceOp` or `PadOp` no longer need to have their inputs as fusion inputs.

The new scheduler is currently placed after the reduction scheduler and before the transpose and pointwise schedulers:

```
SchedulerType::ExprEval,
SchedulerType::NoOp,
SchedulerType::Matmul,
SchedulerType::Reduction,
SchedulerType::Resize, <-- New
SchedulerType::Transpose,
SchedulerType::PointWise,
SchedulerType::InnerPersistent,
SchedulerType::OuterPersistent,
SchedulerType::InnerOuterPersistent};
```

https://github.com/NVIDIA/Fuser/pull/3556/files#diff-c0d261d44c61935fa2d5398f0ac52bd6ea077c6892fb5629c03a425a55fc32f2R64-R74

There are several small changes to some of the existing tests, mainly those on segmentation and alias support, since this new scheduler may change how a fusion is segmented when resize is used.

There's one thing I haven't addressed (#3556 (comment)), which I'm tracking with a separate issue.
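To illustrate why the position in this list matters, here is a minimal sketch, assuming the list is walked in order and the first scheduler that accepts a fusion wins. The enum values are copied from the commit message; `kPriorityOrder`, `pickScheduler`, and the `can_schedule` predicate are hypothetical names for this sketch and are not nvFuser's actual API.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <vector>

// Scheduler kinds as listed in the commit message.
enum class SchedulerType {
  ExprEval, NoOp, Matmul, Reduction, Resize,
  Transpose, PointWise, InnerPersistent, OuterPersistent,
  InnerOuterPersistent
};

// Priority order quoted above; Resize sits between Reduction and Transpose.
const std::vector<SchedulerType> kPriorityOrder = {
    SchedulerType::ExprEval,        SchedulerType::NoOp,
    SchedulerType::Matmul,          SchedulerType::Reduction,
    SchedulerType::Resize,          SchedulerType::Transpose,
    SchedulerType::PointWise,       SchedulerType::InnerPersistent,
    SchedulerType::OuterPersistent, SchedulerType::InnerOuterPersistent};

// Hypothetical helper: return the first scheduler, in priority order,
// for which the caller-supplied predicate reports that it can schedule.
std::optional<SchedulerType> pickScheduler(
    const std::function<bool(SchedulerType)>& can_schedule) {
  for (SchedulerType type : kPriorityOrder) {
    if (can_schedule(type)) {
      return type;
    }
  }
  return std::nullopt;  // no scheduler accepted the fusion
}
```

Under that assumption, a fusion that both the resize and pointwise schedulers could accept would go to the resize scheduler, which is one way the new scheduler can change how existing tests are segmented.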
Stacked on top of #3549. This is also a WAR for #3455 and necessary to schedule RoPE-like rotation patterns.
Because of the issue, a tensor may have two IDs that are exactly mapped. For example, when an ID is sliced to half and then padded back to the same size, and the final output ID is used with the initial input ID, the initial input and the final output IDs get mapped together. This can make it difficult to use `scheduleLoopDomainsLike`. For example, if a reference has a split that is done with the final output ID, and we want to replay the split on other tensors, it becomes ambiguous whether the split is done with the initial input or the final output, since both are exactly mapped.

To avoid this ambiguity, this PR adds a flag to indicate that we just want to update the current loop domain with a reference domain. As seen in the added tests, this flag is used to propagate the scheduling of a reference tensor once all resize ops are propagated to inputs. Specifically, the overall scheduling follows this pattern:
`scheduleLoopDomainsLike` with the flag is used at step 3. For that step, we know that we don't need to schedule each tensor with a complex replay path, such as some backward ops followed by some forward ops. We just need to update the current loop domain by replaying the diff with the reference domain.
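The ambiguity described above can be sketched with a toy model. This is not nvFuser code: `IdName`, `ExactGroups`, `ToyTensor`, and `findExactMatches` are hypothetical stand-ins, and the exact-map graph is reduced to a map from ID name to group number. The sketch only shows why searching every ID of a tensor for an exact match of the reference ID can return two candidates, while searching just the current loop domain returns one.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy stand-ins for illustration only; these are not nvFuser types.
using IdName = std::string;
using ExactGroups = std::map<IdName, int>;  // ID name -> exact-map group

struct ToyTensor {
  std::vector<IdName> all_ids;      // every ID known for the tensor
  std::vector<IdName> loop_domain;  // the tensor's current loop domain
};

// Return every ID in `candidates` that shares an exact group with `ref_id`.
std::vector<IdName> findExactMatches(const std::vector<IdName>& candidates,
                                     const ExactGroups& groups,
                                     const IdName& ref_id) {
  std::vector<IdName> matches;
  const int ref_group = groups.at(ref_id);
  for (const IdName& id : candidates) {
    if (groups.at(id) == ref_group) {
      matches.push_back(id);
    }
  }
  return matches;
}

// i0 is sliced to i1 (half size), then padded back to i2 (full size).
// Because i2 is used together with i0, both land in the same group.
ExactGroups makeRotationGroups() {
  return {{"i0", 0}, {"i1", 1}, {"i2", 0}};
}

ToyTensor makeRotationTensor() {
  ToyTensor tensor;
  tensor.all_ids = {"i0", "i1", "i2"};
  tensor.loop_domain = {"i2"};
  return tensor;
}
```

In this toy, looking up the reference split's input `i2` among `all_ids` yields both `i0` and `i2`, which is the ambiguous case. Restricting the lookup to `loop_domain` yields only `i2`, which mirrors the idea of just updating the current loop domain.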