Skip to content

Conversation

@HydraQYH
Copy link
Contributor

@HydraQYH HydraQYH commented Dec 9, 2025

Follow: #2719. Some SM120 kernels will use SM90 Cooperative/Pingpong Mainloop with SM100 TileScheduler. SM100 TileScheduler do not have is_last_tile function. This can cause compilation errors when we remove unnecessary conditional compilation.
In this PR, If SM120 blockscaled is detected at compile time, then bypass launch_dependent_grids.

@HydraQYH
Copy link
Contributor Author

HydraQYH commented Dec 9, 2025

@Junkai-Wu I believe the current solution is only temporary, and in the future we should add is_last_tile to the SM100 TileScheduler. cc @hwu36 @IonThruster @d-k-b

@Junkai-Wu
Copy link
Collaborator

@Junkai-Wu I believe the current solution is only temporary, and in the future we should add is_last_tile to the SM100 TileScheduler. cc @hwu36 @IonThruster @d-k-b

Agreed. @HydraQYH Do you think if you can do the corresponding changes in this PR? I think just an empty const function with false return should be OK.

@HydraQYH
Copy link
Contributor Author

@Junkai-Wu I believe the current solution is only temporary, and in the future we should add is_last_tile to the SM100 TileScheduler. cc @hwu36 @IonThruster @d-k-b

Agreed. @HydraQYH Do you think if you can do the corresponding changes in this PR? I think just an empty const function with false return should be OK.

@Junkai-Wu Thank you for your suggestion, I will do it as soon as possible.

@HydraQYH HydraQYH force-pushed the dev_fix_sm120_with_pdl_support branch from 42d17b3 to 4ece099 Compare December 10, 2025 14:43
@HydraQYH
Copy link
Contributor Author

HydraQYH commented Dec 10, 2025

@Junkai-Wu I rebase code and just add is_last_tile for SM100 TileScheduler. After testing with following command:

cmake .. -DCUTLASS_NVCC_ARCHS=120a -DCUTLASS_BUILD_FOR_PROFILER_REGRESSIONS=ON
make VERBOSE=1 cutlass_profiler -j16

There are no compilation errors anymore. And this PR is ready for review.

@Junkai-Wu
Copy link
Collaborator

@HydraQYH The changes look much cleaner now. Thanks for the quick action. Leave minor comments above.

@Junkai-Wu
Copy link
Collaborator

@HydraQYH I ran the internal pipeline with these changes and got a timeout issue of this unit test: https://github.com/NVIDIA/cutlass/blob/main/test/unit/gemm/device/sm120_tensorop_gemm/CMakeLists.txt#L51

The issue disappeared when I added the macro back on the is_last_tile function. I've asked corresponding developer to help identify the issue. You can also investigate it if possible.

@HydraQYH
Copy link
Contributor Author

HydraQYH commented Dec 13, 2025

@Junkai-Wu Sorry, I don't have an SM120 device, so all I can do is compile. However, I noticed that removing this macro in Pingpong causes the program to return immediately sometime:

#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
// It is possible to have work tiles start off invalid,
// so we have to check that first.
if (not work_tile_info.is_valid()) {
// Hint on an early release of global memory resources.
// The timing of calling this function only influences performance,
// not functional correctness.
cutlass::arch::launch_dependent_grids();
return;
}
#endif

I think this radical approach may be risky, and may result in some semaphores not being released yet. In contrast, other changes will not affect the program's execution. I've reverted the changes to the early stop part. Could you try this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants