Fix matmul incorrect results when k dim for CTA tile is a multiple of 16 #3616

rdspring1 · 2024-12-19T01:41:15Z

This PR fixes incorrect results when the k dimension of the CTA tile is a multiple of getK(mma_macro).

Why?

In scheduleMmaResults, we need to split the k reduction by getK(mma_macro). A serial reduction then adds the wgmma results along the k dimension.
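The split described above can be illustrated with a minimal host-side sketch. This is not nvFuser code: `mmaPartial` and `ctaTileDot` are hypothetical names, `mmaPartial` stands in for a single wgmma instruction that consumes `mma_k` elements along k, and a scalar dot product stands in for the full tile computation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one MMA instruction: consumes exactly `mma_k`
// elements along k and returns their partial dot product.
float mmaPartial(const float* a, const float* b, std::size_t mma_k) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < mma_k; ++i) {
    acc += a[i] * b[i];
  }
  return acc;
}

// Reduce one CTA tile along k. The tile's k extent (cta_k) is split by
// mma_k, and the per-instruction partial results are accumulated serially.
float ctaTileDot(
    const std::vector<float>& a,
    const std::vector<float>& b,
    std::size_t cta_k,
    std::size_t mma_k) {
  assert(a.size() == cta_k && b.size() == cta_k);
  assert(mma_k > 0 && cta_k % mma_k == 0);
  float acc = 0.0f;
  for (std::size_t ko = 0; ko < cta_k / mma_k; ++ko) {
    // Serial reduction: each outer iteration adds one MMA-sized partial sum.
    acc += mmaPartial(a.data() + ko * mma_k, b.data() + ko * mma_k, mma_k);
  }
  return acc;
}
```

With `cta_k = 32` and `mma_k = 16`, the outer loop runs twice. If only a single `mma_k`-sized chunk were accumulated, the result would cover half of the tile's k extent, which is the kind of wrong answer the outer serial reduction is there to prevent.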

Details

Modified transformLikeMmaOutput so that it is no longer used in scheduleMmaResults.
Added the HSH_TN_UseScheduler test.

Performance Results

HSH_TN_UseScheduler - 1.038 (nvjet kernel time / nvFuser kernel time) - Basically a tie

CTA tile is M=64, N=256, K=32

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     27.5           466137          1  466137.0  466137.0    466137    466137          0.0  nvjet_hsh_128x256_64x4_2x1_v_bz_coopA_TNN                                                           
     26.5           448985          1  448985.0  448985.0    448985    448985          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…

HSH_NT_UseScheduler - 0.9155 (nvjet kernel time / nvFuser kernel time)

CTA tile is M=64, N=256, K=32

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     27.7           468281          1  468281.0  468281.0    468281    468281          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     25.4           428729          1  428729.0  428729.0    428729    428729          0.0  nvjet_hsh_128x256_64x4_2x1_v_bz_coopA_NTN

rdspring1 · 2024-12-19T03:42:27Z

!test

jacobhinkle · 2024-12-19T17:50:01Z

tests/cpp/test_matmul.cpp

+  // NOTE Certain combinations of cta k dimension and circular buffer
+  // prefetching can get incorrect results.


jacobhinkle · 2024-12-19T17:51:11Z

tests/cpp/test_matmul_scheduler.cpp


-    mparams.supported_vec_size = {8, 8, 4};
+    mparams.supported_vec_size = {8, 8, 8};


Good catch, although I think it currently has no effect until we start handling epilogue inputs with supported vec size.

rdspring1 · 2024-12-20T06:35:04Z

!test

rdspring1 · 2024-12-22T20:26:18Z

!test

rdspring1 added the Matmuls label Dec 19, 2024

jacobhinkle reviewed Dec 19, 2024

rdspring1 force-pushed the hopper_matmul_cta_k_fix branch 3 times, most recently from 952cb0a to 0e213d6 Compare December 20, 2024 23:36

rdspring1 added 8 commits December 20, 2024 15:39

change transformLikeMmaOutput

89736a5

fix mma_result

b0682cd

update HopperMatmulSchedulerTest

5703071

update diagnosis

1659496

Use block sync to sync main loop in pipeline tma

88ceaf1

add default warp specialization

fca0086

create MaxNReg and Return kir nodes

99f492a

make blocksync compatible with warp specialization

0c784e0

rdspring1 force-pushed the hopper_matmul_cta_k_fix branch from d366352 to 0c784e0 Compare December 20, 2024 23:56

rdspring1 added 2 commits December 22, 2024 10:57

Initial warp specialization

9288bac

add todo

202e1bf
