Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for scheduling the epilogue computation when smem_epilogue parameter is true #3581

Merged
merged 16 commits into from
Dec 18, 2024

Conversation

protonu
Copy link
Collaborator

@protonu protonu commented Dec 12, 2024

Refactoring some code and adding some support for smem_epilogue
TODO: add support for smem_epilogue when the output of the mma op is not cast down to half precision.

Copy link
Collaborator

@rdspring1 rdspring1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my transformation helper. It can be used in scheduleMmaResult. Just trying to avoid some merge conflicts.

void HopperMultipleMatmulScheduler::transformLikeMmaOutput(
    TensorView* tv,
    bool is_mma_result) {
  // TODO Add constraints

  auto apply_k_dim_offset = [is_mma_result](int64_t idx) constexpr {
    return (is_mma_result) ? idx - 1 : idx;
  };

  // Original: [..., Mo, No, Mi, Ni]
  tv->split(apply_k_dim_offset(-2), getM(params_->mma_macro));
  tv->split(apply_k_dim_offset(-1), getN(params_->mma_macro));
  // After Split: [..., Mo, No, Mio, Mii, Nio, Nii]
  tv->reorder({{apply_k_dim_offset(-3), apply_k_dim_offset(-2)}});
  // After Reorder: [..., Mo, No, Mio, Nio, Mii, Nii]
  tv->merge(apply_k_dim_offset(-4));
  // After Merge: [..., Mo, No, Mio * Nio, Mii, Nii]
  tv->axis(apply_k_dim_offset(-3))->parallelize(ParallelType::TIDy);
  // After Parallelize: [..., Mo, No, Mio * Nio (TIDy), Mii, Nii]
}

@protonu protonu force-pushed the pbasu_mma_epilogue_hopper_no_smem_epilogue branch from 94f39a1 to be5f18b Compare December 13, 2024 18:41
@protonu protonu force-pushed the pbasu_mma_epilogue_hopper_smem_epilogue branch from b9df449 to aa86b8b Compare December 13, 2024 20:59
@protonu protonu requested a review from rdspring1 December 13, 2024 20:59
@protonu protonu marked this pull request as ready for review December 13, 2024 21:00
@protonu
Copy link
Collaborator Author

protonu commented Dec 13, 2024

!build

@protonu protonu force-pushed the pbasu_mma_epilogue_hopper_no_smem_epilogue branch from 895df0f to 09cb407 Compare December 14, 2024 01:42
@protonu protonu force-pushed the pbasu_mma_epilogue_hopper_smem_epilogue branch 2 times, most recently from c0cfa4e to af126a1 Compare December 14, 2024 03:55
Base automatically changed from pbasu_mma_epilogue_hopper_no_smem_epilogue to main December 14, 2024 14:46
@protonu protonu force-pushed the pbasu_mma_epilogue_hopper_smem_epilogue branch from af126a1 to 8066a54 Compare December 16, 2024 16:04
@protonu protonu force-pushed the pbasu_mma_epilogue_hopper_smem_epilogue branch from 8066a54 to 2c19191 Compare December 16, 2024 16:27
@protonu
Copy link
Collaborator Author

protonu commented Dec 16, 2024

!test

Comment on lines 524 to 539
blockTileTensors({d});
parallelizeBlocks({d});
transformLikeMmaOutput(d, /*is_mma_result=*/false);

// Apply mma common transformation
for (auto tv : tvs_to_schedule) {
transformLikeMmaOutput(tv, /*is_mma_result=*/false);
}
scheduler_utils::BoundedDirectionalTransformPropagator::backward(
d,
-1,
{dc},
scheduler_utils::BoundedDirectionalTransformPropagator::Options()
.propagateParallelType()
.propagateToBoundary());

// Schedule register cache; Output from epilogue
{
auto s = mma_utils::MmaSwizzler::scheduleMmaOutputAllocation(
dc->getLoopDomain());
dc->setLoopDomain(s.as<IterDomain*>());
dc->setAllocationDomain(s.as<IterDomain*>(), true);
}
auto s = mma_utils::MmaSwizzler::scheduleMmaOutputAllocation(
dc->getLoopDomain());
dc->setLoopDomain(s.as<IterDomain*>());
dc->setAllocationDomain(s.as<IterDomain*>(), true);
Copy link
Collaborator

@jacobhinkle jacobhinkle Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why propagate from d to dc then schedule dc directly like this instead of just directly scheduling dc? What does the propagation provide us? Could use this in a code comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at the comment.

@protonu protonu requested a review from jacobhinkle December 16, 2024 17:28
@protonu protonu requested a review from jacobhinkle December 16, 2024 18:00
@protonu protonu requested a review from jacobhinkle December 16, 2024 19:13
Copy link
Collaborator

@jacobhinkle jacobhinkle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Hopefully bypassing or manipulating stmatrix for float outputs will be straightforward.

csrc/scheduler/hopper_multi_matmul.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@rdspring1 rdspring1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any bank conflicts in the epilogue fusions?

I'd run ncu and report anything interesting.

ncu -k regex:nvfuser --metrics 'regex:l1tex__data_bank_conflicts_pipe_lsu_mem_shared.*sum$' ./build/test_matmul --gtest_filter='*HopperMatmulScheduler*

@protonu
Copy link
Collaborator Author

protonu commented Dec 17, 2024

Are there any bank conflicts in the epilogue fusions?

I'd run ncu and report anything interesting.

ncu -k regex:nvfuser --metrics 'regex:l1tex__data_bank_conflicts_pipe_lsu_mem_shared.*sum$' ./build/test_matmul --gtest_filter='*HopperMatmulScheduler*

For the test HopperMatmulSchedulerTest*FusedMultiplySumBiasNeg* there are no bank conflicts reported.

@jacobhinkle
Copy link
Collaborator

!build

@protonu
Copy link
Collaborator Author

protonu commented Dec 18, 2024

!test

@protonu protonu merged commit c31b919 into main Dec 18, 2024
23 of 24 checks passed
@protonu protonu deleted the pbasu_mma_epilogue_hopper_smem_epilogue branch December 18, 2024 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants