Support TMA with Int64 indexing #3595

Open · jacobhinkle opened this issue Dec 16, 2024 · 3 comments · May be fixed by #3599

jacobhinkle (Collaborator) commented Dec 16, 2024

Currently, if we try to compile a matmul kernel that uses TMA loads for a large problem, we hit the following error:

INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/runtime/executor.cpp":378, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Compilation with int64 is requested but int32 is required because of TMA operations.

This check is there because the box coordinates argument of the cp.async.bulk.tensor instruction must be 32-bit: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk

This issue is basically just a question: do we really need to restrict ourselves to 32-bit indexing in order to use TMA? If so, this is a severe limitation that we should try to work around.
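
To make the constraint concrete, here is a minimal sketch (plain C++, not nvfuser code; the helper name is hypothetical) of the check the PTX restriction implies: each box coordinate must individually fit into a signed 32-bit integer, regardless of whether the tensor's linear index does.

```cpp
// Sketch only: the per-coordinate 32-bit constraint that cp.async.bulk.tensor
// imposes. boxCoordsFitInt32 is a hypothetical helper, not an nvfuser API.
#include <cstdint>
#include <limits>
#include <vector>

bool boxCoordsFitInt32(const std::vector<int64_t>& coords) {
  for (int64_t c : coords) {
    // Each coordinate passed to the TMA instruction must be representable
    // as int32_t; the overall linear size of the tensor is irrelevant here.
    if (c < 0 || c > std::numeric_limits<int32_t>::max()) {
      return false;
    }
  }
  return true;
}
```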

jacobhinkle (Collaborator, Author) commented:

Note that disabling the check leads to a kernel compile error:

CUDA NVRTC compile error: __tmp_kernel_none_f0_c0_r0_g0.cu(11356): error: no suitable conversion function from "<unnamed>::Array<<unnamed>::nvfuser_index_t, 2, 1>" (aka "<unnamed>::Array<long long, 2, 1>") to "<unnamed>::int32_t" (aka "int") exists            
          Hopper::cpAsyncBulkTensorTileG2S((Hopper::CpAsyncBulkTensorTileG2SIndex<2>{ ptr4, (Array<nvfuser_index_t, 2, 1>{(i5 + (64LL * i29)), i27}), toSmem((&T7[i26])) }), (i28 + (4096LL * i29)));

I think the issue is that we have tied the type of the descriptor's coordinates to nvfuser_index_t, which we could maybe just hard-code as Int32? This is probably naive, and there may be further consequences of doing that...
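
As a rough illustration of that hard-coding idea (the struct and helper below are hypothetical sketches, not the actual nvfuser runtime), the generated index struct would carry int32_t coordinates, and the 64-bit index expressions would be narrowed, with a guard, at the call site:

```cpp
// Sketch only: narrowing 64-bit index expressions to the 32-bit coordinates
// that the TMA instruction expects. Names here are hypothetical, not nvfuser's.
#include <cassert>
#include <cstdint>
#include <limits>

template <int kDims>
struct TmaCoords {
  int32_t coord[kDims];
};

template <int kDims>
TmaCoords<kDims> narrowCoords(const int64_t (&in)[kDims]) {
  TmaCoords<kDims> out;
  for (int i = 0; i < kDims; ++i) {
    // Safe only if the scheduler has already guaranteed that every box
    // coordinate fits into 32 bits.
    assert(in[i] >= 0 && in[i] <= std::numeric_limits<int32_t>::max());
    out.coord[i] = static_cast<int32_t>(in[i]);
  }
  return out;
}
```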

jacobhinkle (Collaborator, Author) commented:

Here is my current understanding:

  1. TMA does require that each box coordinate fits into 32 bits. This is a slightly weaker requirement than requiring that each dimension of the global array fits into 32 bits.
  2. We compute the index type assuming linear indexing (i.e. not the per-dimension coordinates that TMA requires), so, for example, a 50k by 50k matrix will trigger 64-bit indexing.
  3. Matmul problems that currently trigger 64-bit indexing are not rare: for example, M=30k, K=40k. However, if we use TMA for both loads and stores, we could use 32-bit indexing instead.
  4. There do exist matmul problems for which we cannot use TMA because of this restriction, for example M=2 billion, K=8. Since the boundary is about 1 billion, these should be extremely rare.

Due to point 4, I think it could make sense for us to disable TMA only for those extremely large single-dimension problems, and otherwise hard-code 32-bit indexing when TMA is used. If TMA loads are specified without use_smem_epilogue, then #3599 solves the loading part, but we could potentially still use 32-bit indexing by checking only the epilogue tensor sizes. In other words, maybe we just need to be able to special-case the index type computation based on the scheduler.
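
To make points 2 and 3 concrete, here is a small standalone check (numbers taken from this comment, nothing nvfuser-specific) showing why the linear-index heuristic picks 64-bit indexing even though every TMA coordinate fits in 32 bits:

```cpp
// Sketch: the 50k x 50k case from points 2-3. The linear element count
// overflows int32, but each per-dimension coordinate fits comfortably.
#include <cstdint>
#include <iostream>
#include <limits>

int main() {
  constexpr int64_t kInt32Max = std::numeric_limits<int32_t>::max(); // 2147483647

  const int64_t M = 50000;
  const int64_t N = 50000;

  // Linear indexing considers the whole tensor: 2.5e9 elements > int32 max.
  std::cout << std::boolalpha
            << "linear index needs int64: " << (M * N > kInt32Max) << "\n";

  // TMA only needs each coordinate, which is at most ~50k here.
  std::cout << "per-dim coords fit int32: "
            << (M <= kInt32Max && N <= kInt32Max) << "\n";
  return 0;
}
```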

jacobhinkle (Collaborator, Author) commented:

An alternative, more powerful solution is proposed in #3601.
