Support TMA with Int64 indexing #3595

Open · jacobhinkle opened this issue Dec 16, 2024 · 3 comments · May be fixed by #3599

jacobhinkle (Collaborator) commented Dec 16, 2024

Currently, if we try to compile a matmul kernel that uses TMA loads for a large problem, we hit the following error:

INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/runtime/executor.cpp":378, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Compilation with int64 is requested but int32 is required because of TMA operations.

This check is there because the box coordinates argument of the cp.async.bulk.tensor instruction must be 32-bit: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk

This issue is basically just a question: do we really need to restrict ourselves to 32-bit indexing in order to use TMA? If so, this is a severe limitation that we should try to work around.
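
To make the constraint concrete, here is a minimal sketch (plain C++, not nvfuser code; the helper name is hypothetical) of the check the PTX restriction implies: each box coordinate must individually fit into a signed 32-bit integer, regardless of whether the tensor's linear index does.

```cpp
// Sketch only: the per-coordinate 32-bit constraint that cp.async.bulk.tensor
// imposes. boxCoordsFitInt32 is a hypothetical helper, not an nvfuser API.
#include <cstdint>
#include <limits>
#include <vector>

bool boxCoordsFitInt32(const std::vector<int64_t>& coords) {
  for (int64_t c : coords) {
    // Each coordinate passed to the TMA instruction must be representable
    // as int32_t; the overall linear size of the tensor is irrelevant here.
    if (c < 0 || c > std::numeric_limits<int32_t>::max()) {
      return false;
    }
  }
  return true;
}
```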

jacobhinkle (Collaborator, Author) commented:

Note that disabling the check leads to a kernel compile error:

CUDA NVRTC compile error: __tmp_kernel_none_f0_c0_r0_g0.cu(11356): error: no suitable conversion function from "<unnamed>::Array<<unnamed>::nvfuser_index_t, 2, 1>" (aka "<unnamed>::Array<long long, 2, 1>") to "<unnamed>::int32_t" (aka "int") exists            
          Hopper::cpAsyncBulkTensorTileG2S((Hopper::CpAsyncBulkTensorTileG2SIndex<2>{ ptr4, (Array<nvfuser_index_t, 2, 1>{(i5 + (64LL * i29)), i27}), toSmem((&T7[i26])) }), (i28 + (4096LL * i29)));

I think the issue is that we have tied the type of the descriptor's coordinates to nvfuser_index_t, which we could maybe just hard-code as Int32? This is probably naive, and there may be further consequences of doing that...
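
As a rough illustration of that hard-coding idea (the struct and helper below are hypothetical sketches, not the actual nvfuser runtime), the generated index struct would carry int32_t coordinates, and the 64-bit index expressions would be narrowed, with a guard, at the call site:

```cpp
// Sketch only: narrowing 64-bit index expressions to the 32-bit coordinates
// that the TMA instruction expects. Names here are hypothetical, not nvfuser's.
#include <cassert>
#include <cstdint>
#include <limits>

template <int kDims>
struct TmaCoords {
  int32_t coord[kDims];
};

template <int kDims>
TmaCoords<kDims> narrowCoords(const int64_t (&in)[kDims]) {
  TmaCoords<kDims> out;
  for (int i = 0; i < kDims; ++i) {
    // Safe only if the scheduler has already guaranteed that every box
    // coordinate fits into 32 bits.
    assert(in[i] >= 0 && in[i] <= std::numeric_limits<int32_t>::max());
    out.coord[i] = static_cast<int32_t>(in[i]);
  }
  return out;
}
```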

jacobhinkle (Collaborator, Author) commented:

Here is my current understanding:

  1. TMA does require that each box coordinate fits into 32 bits. This is a slightly weaker requirement than requiring that each dimension of the global array fits into 32 bits.
  2. We compute the index type assuming linear indexing (i.e. not the per-dimension coordinates that TMA requires), so, for example, a 50k by 50k matrix will trigger 64-bit indexing.
  3. Matmul problems that currently trigger 64-bit indexing are not rare: for example, M=30k, K=40k. However, if we use TMA for both loads and stores, we could use 32-bit indexing instead.
  4. There do exist matmul problems for which we cannot use TMA because of this restriction, for example M=2 billion, K=8. Since the boundary is about 1 billion, these should be extremely rare.

Due to point 4, I think it could make sense for us to disable TMA only for those extremely large single-dimension problems, and otherwise hard-code 32-bit indexing when TMA is used. If TMA loads are specified without use_smem_epilogue, then #3599 solves the loading part, but we could potentially still use 32-bit indexing by checking only the epilogue tensor sizes. In other words, maybe we just need to be able to special-case the index type computation based on the scheduler.
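
To make points 2 and 3 concrete, here is a small standalone check (numbers taken from this comment, nothing nvfuser-specific) showing why the linear-index heuristic picks 64-bit indexing even though every TMA coordinate fits in 32 bits:

```cpp
// Sketch: the 50k x 50k case from points 2-3. The linear element count
// overflows int32, but each per-dimension coordinate fits comfortably.
#include <cstdint>
#include <iostream>
#include <limits>

int main() {
  constexpr int64_t kInt32Max = std::numeric_limits<int32_t>::max(); // 2147483647

  const int64_t M = 50000;
  const int64_t N = 50000;

  // Linear indexing considers the whole tensor: 2.5e9 elements > int32 max.
  std::cout << std::boolalpha
            << "linear index needs int64: " << (M * N > kInt32Max) << "\n";

  // TMA only needs each coordinate, which is at most ~50k here.
  std::cout << "per-dim coords fit int32: "
            << (M <= kInt32Max && N <= kInt32Max) << "\n";
  return 0;
}
```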

jacobhinkle (Collaborator, Author) commented:

An alternative, more powerful solution is proposed in #3601.
