[QST] Copy Accumulator to GMEM directly? #1920

osayamenja · 2024-11-05T23:47:32Z

What is your question?
Hello! How do you copy the accumulator registers to global memory directly?

For example, in ampere_conv_kernel.h, how would we copy accum directly to gC, skipping the copy to sC?

Thanks!

The text was updated successfully, but these errors were encountered:

osayamenja · 2024-11-05T23:51:51Z

I have tried something like the below but it fails to compile with "Copy_Traits: src failed to vectorize into registers. Layout is incompatible with this CopyOp."

// ... 
copy(gmem_tiled_copy_C, accum, tDgC);

thakkarV · 2024-11-05T23:53:44Z

please see #1905

osayamenja · 2024-11-06T00:04:07Z

@thakkarV Thanks for responding! My mistake for not giving enough information. My use case is actually different from that, as there is no vectorization.

Here is the changed tiled copy.

auto gmem_tiled_copy_C = cute::make_tiled_copy(
        cute::Copy_Atom<cute::UniversalCopy<float>, float>{},
        cute::Layout<cute::Shape<cute::_16, cute::_8>>{},
        cute::Layout<cute::Shape<cute::_1, cute::_1>>{}); // 1x1 per thread, is this the problem?

Below are the layouts.

((_2,_2),_4,_4):((_1,_2),_4,_16) //accum
--------------------------------------------
TiledCopy // gmem_tiled_copy_C
  Tiler_MN:       (_16,_8)
  TiledLayout_TV: (_128,_1):(_1,_0)
Copy_Atom
  ThrID:        _1:_0
  ValLayoutSrc: (_1,_1):(_0,_1)
  ValLayoutDst: (_1,_1):(_0,_1)
  ValLayoutRef: (_1,_1):(_0,_1)
  ValueType:    32b
--------------------------------------------
gmem_ptr[32b](0x420001800) o ((_1,_1),_8,_8):((_0,_0),16,1024) // tDgC

thakkarV · 2024-11-06T00:16:20Z

if you don't care about vectorization, just drop the tiled copy. Partition the gmem tensor with the tiled mma and then just call copy on the partitioned rmem tensor.

auto tCrC = thr_mma.partition_fragment_C(TileShapeMN{});
auto tCgC = thr_mma.partition_C(tiled_gmem_tensor_C);
copy(tCrC, tCgC);

osayamenja · 2024-11-06T00:30:00Z

@thakkarV Life saver, thanks a ton! It compiles now!

Honestly, I would rather use vectorization, but I am following gemm_tn from sgemm_80.cu which uses 1x1 val layout.

I know I can vectorize that layout by changing TA to uint128_t and layout to Layout<_1, _4>. I will experiment and see what happens, thanks again!

thakkarV · 2024-11-06T00:31:33Z

For vectorization "how to" you can follow the other issue I linked

osayamenja added ? - Needs Triage question Question labels Nov 5, 2024

osayamenja closed this as completed Nov 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[QST] Copy Accumulator to GMEM directly? #1920

[QST] Copy Accumulator to GMEM directly? #1920

osayamenja commented Nov 5, 2024

osayamenja commented Nov 5, 2024

thakkarV commented Nov 5, 2024

osayamenja commented Nov 6, 2024 •

edited

Loading

thakkarV commented Nov 6, 2024

osayamenja commented Nov 6, 2024

thakkarV commented Nov 6, 2024

[QST] Copy Accumulator to GMEM directly? #1920

[QST] Copy Accumulator to GMEM directly? #1920

Comments

osayamenja commented Nov 5, 2024

osayamenja commented Nov 5, 2024

thakkarV commented Nov 5, 2024

osayamenja commented Nov 6, 2024 • edited Loading

thakkarV commented Nov 6, 2024

osayamenja commented Nov 6, 2024

thakkarV commented Nov 6, 2024

osayamenja commented Nov 6, 2024 •

edited

Loading