[do not merge] Register stealing #3566

zasdfgbnm · 2024-12-11T01:05:56Z

This PR is based on #3564

This PR adds register stealing (thanks to @rdspring1 for the suggestion in #3511 (comment)), which improves performance.

On H200:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     32.6           124702          1  124702.0  124702.0    124702    124702          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     22.9            87711          1   87711.0   87711.0     87711     87711          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

nvFuser/cuBLAS = 70.3% (nvFuser reaches 70.3% of cuBLAS performance: 87711 ns / 124702 ns)

zasdfgbnm · 2024-12-11T01:38:07Z

__tmp_kernel_none_f0_c0_r0_g0.cu

@@ -11368,11 +11369,13 @@ __global__ void __cluster_dims__(2, 1, 1) nvfuser_none_f0_c0_r0_g0(Tensor<__half
        }
      }
    }
+    return;


Note: This return is critical. Without it, our perf drops to about 4% of cuBLAS (87935 ns / 2213806 ns):

 Time (%)  Total Time (ns)  Instances   Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     89.5          2213806          1  2213806.0  2213806.0   2213806   2213806          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
      3.6            87935          1    87935.0    87935.0     87935     87935          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

rdspring1 · 2024-12-23T17:43:17Z

We need to set __launch_bounds__ for register sharing.

The setmaxnreg instruction requires that the kernel has been launched with a valid value of maximum number of per-thread registers specified via the appropriate compile-time option or the appropriate performance tuning directive. Otherwise, the setmaxnreg instruction may have no effect.

https://docs.nvidia.com/cuda/parallel-thread-execution/#miscellaneous-instructions-setmaxnreg
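To make the point about __launch_bounds__ concrete, here is a minimal sketch of how a warp-specialized kernel might combine __launch_bounds__ with setmaxnreg. It is not the code nvFuser generates. The kernel name, the helper names, the three-warpgroup layout, and the register counts (40 for the producer, 232 for the consumers) are all assumptions chosen for illustration. The sketch assumes compilation with `nvcc -arch=sm_90a`; on other targets the helpers compile to nothing.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helpers wrapping the PTX setmaxnreg instruction.
// The immediate must be a multiple of 8 in the range [24, 256], and the
// instruction must be executed by every warp of a warpgroup.
template <uint32_t RegCount>
__device__ __forceinline__ void decreaseRegisters() {
#if defined(__CUDA_ARCH_FEAT_SM90_ALL)
  asm volatile("setmaxnreg.dec.sync.aligned.u32 %0;\n" : : "n"(RegCount));
#endif
}

template <uint32_t RegCount>
__device__ __forceinline__ void increaseRegisters() {
#if defined(__CUDA_ARCH_FEAT_SM90_ALL)
  asm volatile("setmaxnreg.inc.sync.aligned.u32 %0;\n" : : "n"(RegCount));
#endif
}

constexpr int kWarpGroupSize = 128;  // 4 warps
constexpr int kNumWarpGroups = 3;    // assumed: 1 producer + 2 consumers
constexpr int kBlockSize = kWarpGroupSize * kNumWarpGroups;

// __launch_bounds__(384, 1) gives the compiler a per-thread register limit:
// 65536 registers / 384 threads, rounded down to a multiple of 8, is 168.
// Without a limit like this, setmaxnreg may have no effect.
__global__ void __launch_bounds__(kBlockSize, 1)
registerStealingSketch(float* out) {
  const int warp_group = threadIdx.x / kWarpGroupSize;
  if (warp_group == 0) {
    // Producer warpgroup: release registers it does not need.
    decreaseRegisters<40>();
    // ... issue asynchronous loads here (omitted) ...
  } else {
    // Consumer warpgroups: claim the registers the producer released.
    // Budget check: 40 + 232 + 232 = 504 = 3 * 168.
    increaseRegisters<232>();
    out[blockIdx.x * kBlockSize + threadIdx.x] =
        static_cast<float>(warp_group);
  }
}

int main() {
  float* out = nullptr;
  cudaMalloc(&out, kBlockSize * sizeof(float));
  registerStealingSketch<<<1, kBlockSize>>>(out);
  cudaError_t err = cudaDeviceSynchronize();
  std::printf("status: %s\n", cudaGetErrorString(err));
  cudaFree(out);
  return 0;
}
```

The register counts are picked so that the total across the three warpgroups stays within the per-block budget implied by the launch bounds. A different block size or warpgroup split would need different values.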

Register stealing

f6ab0c3

zasdfgbnm changed the title ~~Register stealing~~ [do not merge] Register stealing Dec 11, 2024

rdspring1 added the Matmuls label Dec 11, 2024

zasdfgbnm mentioned this pull request Dec 12, 2024

Implement register stealing #3584

Open
