Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[do not merge] Register stealing #3566

Draft
wants to merge 1 commit into
base: elect-sync-out
Choose a base branch
from

Conversation

zasdfgbnm
Copy link
Collaborator

@zasdfgbnm zasdfgbnm commented Dec 11, 2024

This PR is based on #3564

In this PR, I added register stealing (thanks @rdspring1 for the comment #3511 (comment)), perf improved.

On H200:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name

 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     32.6           124702          1  124702.0  124702.0    124702    124702          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     22.9            87711          1   87711.0   87711.0     87711     87711          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

nvFuser/cuBLAS = 70.3%

@zasdfgbnm zasdfgbnm changed the title Register stealing [do not merge] Register stealing Dec 11, 2024
@@ -11368,11 +11369,13 @@ __global__ void __cluster_dims__(2, 1, 1) nvfuser_none_f0_c0_r0_g0(Tensor<__half
}
}
}
return;
Copy link
Collaborator Author

@zasdfgbnm zasdfgbnm Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This return is super important, without this, our perf will be 4% of cuBLAS.

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name

 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     89.5          2213806          1  2213806.0  2213806.0   2213806   2213806          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
      3.6            87935          1    87935.0    87935.0     87935     87935          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

@rdspring1
Copy link
Collaborator

We need to set __launch_bounds__ for register sharing.

The setmaxnreg instruction requires that the kernel has been launched with a valid value of maximum number of per-thread registers specified via the appropriate compilation via the appropriate compile-time option or the appropriate performance tuning directive. Otherwise, the setmaxnreg instruction may have no effect.

https://docs.nvidia.com/cuda/parallel-thread-execution/#miscellaneous-instructions-setmaxnreg

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants