Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QST] Copy Accumulator to GMEM directly? #1920

Closed
osayamenja opened this issue Nov 5, 2024 · 6 comments
Closed

[QST] Copy Accumulator to GMEM directly? #1920

osayamenja opened this issue Nov 5, 2024 · 6 comments

Comments

@osayamenja
Copy link

What is your question?
Hello! How do you copy the accumulator registers to global memory directly?

For example, in ampere_conv_kernel.h, how would we copy accum directly to gC, skipping the copy to sC?

Thanks!

@osayamenja
Copy link
Author

I have tried something like the below but it fails to compile with "Copy_Traits: src failed to vectorize into registers. Layout is incompatible with this CopyOp."

// ... 
copy(gmem_tiled_copy_C, accum, tDgC);

@thakkarV
Copy link
Collaborator

thakkarV commented Nov 5, 2024

please see #1905

@osayamenja
Copy link
Author

osayamenja commented Nov 6, 2024

@thakkarV Thanks for responding! My mistake for not giving enough information. My use case is actually different from that, as there is no vectorization.

Here is the changed tiled copy.

auto gmem_tiled_copy_C = cute::make_tiled_copy(
        cute::Copy_Atom<cute::UniversalCopy<float>, float>{},
        cute::Layout<cute::Shape<cute::_16, cute::_8>>{},
        cute::Layout<cute::Shape<cute::_1, cute::_1>>{}); // 1x1 per thread, is this the problem?

Below are the layouts.

((_2,_2),_4,_4):((_1,_2),_4,_16) //accum
--------------------------------------------
TiledCopy // gmem_tiled_copy_C
  Tiler_MN:       (_16,_8)
  TiledLayout_TV: (_128,_1):(_1,_0)
Copy_Atom
  ThrID:        _1:_0
  ValLayoutSrc: (_1,_1):(_0,_1)
  ValLayoutDst: (_1,_1):(_0,_1)
  ValLayoutRef: (_1,_1):(_0,_1)
  ValueType:    32b
--------------------------------------------
gmem_ptr[32b](0x420001800) o ((_1,_1),_8,_8):((_0,_0),16,1024) // tDgC

@thakkarV
Copy link
Collaborator

thakkarV commented Nov 6, 2024

if you don't care about vectorization, just drop the tiled copy. Partition the gmem tensor with the tiled mma and then just call copy on the partitioned rmem tensor.

auto tCrC = thr_mma.partition_fragment_C(TileShapeMN{});
auto tCgC = thr_mma.partition_C(tiled_gmem_tensor_C);
copy(tCrC, tCgC);

@osayamenja
Copy link
Author

@thakkarV Life saver, thanks a ton! It compiles now!

Honestly, I would rather use vectorization, but I am following gemm_tn from sgemm_80.cu which uses 1x1 val layout.

I know I can vectorize that layout by changing TA to uint128_t and layout to Layout<_1, _4>. I will experiment and see what happens, thanks again!

@thakkarV
Copy link
Collaborator

thakkarV commented Nov 6, 2024

For vectorization "how to" you can follow the other issue I linked

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants