WAR for supporting the rotation + residual pattern #3555

Merged
merged 6 commits into from
Dec 10, 2024

Conversation

naoyam
@naoyam naoyam commented Dec 10, 2024

Stacked on top of #3549. This is also a WAR for #3455 and necessary to schedule RoPE-like rotation patterns.

Because of that issue, a tensor may have two IDs that are exactly mapped to each other. For example, when an ID is sliced in half and then padded back to its original size, and the final output ID is used together with the initial input ID, the initial input ID and the final output ID get mapped together. This makes it difficult to use scheduleLoopDomainsLike. For example, if a reference has a split applied to the final output ID and we want to replay that split on other tensors, it is ambiguous whether the split should be applied to the initial input ID or the final output ID, since both are exactly mapped.
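
The ambiguity can be illustrated with a toy model. The sketch below is not nvFuser code: `ExactGroups`, `mappedCandidates` and `rotationResidualCandidates` are hypothetical names, IDs are plain integers, and the exact mapping is approximated by a small union-find. It only shows that a lookup by exact group returns two candidates within the same tensor.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Toy stand-in for an exact-mapping graph: a union-find over integer ID
// handles. None of these names are nvFuser APIs.
struct ExactGroups {
  std::vector<int> parent;

  explicit ExactGroups(int n) : parent(n) {
    assert(n > 0);
    std::iota(parent.begin(), parent.end(), 0);
  }

  // Returns the representative of the group containing x (path halving).
  int find(int x) {
    while (parent[x] != x) {
      parent[x] = parent[parent[x]];
      x = parent[x];
    }
    return x;
  }

  // Records that a and b are exactly mapped.
  void map(int a, int b) { parent[find(a)] = find(b); }
};

// Returns every ID among `ids` that is exactly mapped to `target`.
std::vector<int> mappedCandidates(
    ExactGroups& groups, const std::vector<int>& ids, int target) {
  std::vector<int> candidates;
  for (int id : ids) {
    if (groups.find(id) == groups.find(target)) {
      candidates.push_back(id);
    }
  }
  return candidates;
}

// Models the rotation + residual pattern on a single tensor:
//   id 0: initial input ID
//   id 1: id 0 sliced in half
//   id 2: id 1 padded back to the original size
// Using the padded output together with the initial input maps ids 0 and 2.
std::vector<int> rotationResidualCandidates() {
  ExactGroups groups(3);
  groups.map(0, 2);
  const std::vector<int> tensor_ids = {0, 1, 2};
  // The reference split was done on id 2. Which IDs could it be replayed on?
  return mappedCandidates(groups, tensor_ids, 2);
}
```

In this model `rotationResidualCandidates()` returns both id 0 and id 2, so a replay keyed only on the exact group cannot tell which of the two the split was meant for.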

To avoid this ambiguity, this PR adds a flag to indicate that we just want to update the current loop domain with a reference domain. As seen in the added tests, this flag is used to propagate the scheduling of a reference tensor once all resize ops are propagated to inputs. Specifically, the overall scheduling follows this pattern:

  1. Propagate all slice and pad ops to fusion inputs
  2. Pick and schedule a reference tensor
  3. Propagate the scheduling of the reference tensor to the other tensors

scheduleLoopDomainsLike with the flag is used at step 3. At that step, we know that no tensor needs a complex replay path, such as some backward ops followed by some forward ops; we only need to update the current loop domain by replaying its diff with the reference domain.
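
As a rough illustration of the update-only idea at step 3, the sketch below replays one split from a reference onto a tensor by looking only at the tensor's current loop IDs. The types `Id` and `Split` and the function `updateLoopDomain` are hypothetical simplifications, not the actual scheduleLoopDomainsLike implementation or its signature. The sketch also assumes that exactly one current loop ID matches the split input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified types; not nvFuser's IterDomain or Split.
struct Id {
  int handle;       // unique identifier of this ID
  int exact_group;  // IDs with equal exact_group count as exactly mapped
};

struct Split {
  int input_group;  // exact group of the ID the reference split consumed
  Id outer;         // outer output of the split
  Id inner;         // inner output of the split
};

// Update-only replay: search for the split input among the CURRENT loop IDs
// only and replace it with the split outputs. Other IDs of the tensor, such
// as an initial input ID that happens to share the same exact group, are
// never consulted, so there is nothing to disambiguate.
std::vector<Id> updateLoopDomain(
    const std::vector<Id>& loop, const Split& split) {
  std::vector<Id> updated;
  std::size_t matches = 0;
  for (const Id& id : loop) {
    if (id.exact_group == split.input_group) {
      ++matches;
      updated.push_back(split.outer);
      updated.push_back(split.inner);
    } else {
      updated.push_back(id);
    }
  }
  // Assumption of this sketch: exactly one loop ID matches the split input.
  assert(matches == 1);
  return updated;
}
```

For a loop domain with two IDs where the second one is in the split's input group, the result has three IDs: the untouched first ID followed by the outer and inner outputs of the split.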

naoyam commented Dec 10, 2024

!test

@naoyam naoyam marked this pull request as ready for review December 10, 2024 05:25
naoyam added a commit that referenced this pull request Dec 10, 2024
Added a scheduler utility function to schedule a fusion with resize-based ops
such as slice, pad and concat. This propagates resize ops to producers
so that all tensors have exact-mapped loop domains.

Part of #3425. Extracted so that it can be individually tested.

(There's a follow-up PR: #3555)
Base automatically changed from propagate_resize_to_inputs to main December 10, 2024 17:28

naoyam commented Dec 10, 2024

!build

@naoyam naoyam requested a review from jacobhinkle December 10, 2024 19:49

@jacobhinkle jacobhinkle left a comment

LGTM. My only comment is a minor one about the naming of the `update_mode` parameter.

csrc/scheduler/tools/loop_domain_scheduler.h (outdated)
csrc/scheduler/tools/loop_domain_scheduler.cpp (outdated)

naoyam commented Dec 10, 2024

!build

@naoyam naoyam merged commit 89c47f6 into main Dec 10, 2024
10 of 11 checks passed
@naoyam naoyam deleted the rotation_residual_support branch December 10, 2024 22:46
@naoyam naoyam mentioned this pull request Dec 11, 2024
naoyam added a commit that referenced this pull request Dec 17, 2024
This is a very preliminary version of a new scheduler mainly targeted
for RoPE. I will incrementally extend this scheduler to be more flexible
and performant, but for now it only handles a fusion that has pointwise
ops and a single `Resize`-based tensor op such as `SliceOp` and `PadOp`.
The scheduling strategy is currently pretty naive too and is manually
demonstrated at #3549 and #3555, but the main point is that resize-based
tensor ops like `SliceOp` or `PadOp` no longer need to have fusion inputs
as their inputs.

The new scheduler is currently placed after the reduction scheduler and
before the transpose and pointwise schedulers:

```
    SchedulerType::ExprEval,
    SchedulerType::NoOp,
    SchedulerType::Matmul,
    SchedulerType::Reduction,
    SchedulerType::Resize, // <-- New
    SchedulerType::Transpose,
    SchedulerType::PointWise,
    SchedulerType::InnerPersistent,
    SchedulerType::OuterPersistent,
    SchedulerType::InnerOuterPersistent};
```

https://github.com/NVIDIA/Fuser/pull/3556/files#diff-c0d261d44c61935fa2d5398f0ac52bd6ea077c6892fb5629c03a425a55fc32f2R64-R74
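
To see why the position in this list matters, here is a minimal sketch assuming the list is consulted in order and the first scheduler that accepts the fusion wins. `FusionTraits`, `Candidate`, `pickScheduler` and the acceptance predicates are all hypothetical and much simpler than the real checks; only the scheduler names come from the list above.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical description of a fusion; not an nvFuser type.
struct FusionTraits {
  bool has_reduction;
  bool has_resize;
};

// A scheduler name paired with a made-up acceptance predicate.
struct Candidate {
  std::string name;
  std::function<bool(const FusionTraits&)> can_schedule;
};

// Walks the candidates in priority order and returns the first that accepts.
std::string pickScheduler(const FusionTraits& fusion) {
  static const std::vector<Candidate> priority_order = {
      {"Reduction", [](const FusionTraits& f) { return f.has_reduction; }},
      {"Resize",
       [](const FusionTraits& f) { return f.has_resize && !f.has_reduction; }},
      {"PointWise", [](const FusionTraits& f) { return !f.has_reduction; }},
  };
  for (const Candidate& candidate : priority_order) {
    if (candidate.can_schedule(fusion)) {
      return candidate.name;
    }
  }
  return "None";
}
```

Under these assumptions, a pointwise fusion that contains a resize op is claimed by the resize entry before the pointwise entry is ever asked, which is the kind of change in scheduler choice that placing the new scheduler ahead of the pointwise one can cause.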

There are several small changes to some of the existing tests, mainly
those on segmentation and alias support, since this new scheduler may
change how a fusion is segmented when resize is used. There's one thing
I haven't addressed
(#3556 (comment)),
which I'm tracking with a separate issue.