Support TMA with Int64 indexing #3595
Comments
Note that disabling the check leads to a kernel compile error:
I think the issue is that we have linked the type of the descriptor to |
Here is my current understanding:
Due to 4, I think it could make sense for us to disable TMA for only those extremely large single dimension problems, and otherwise hardcode 32-bit indexing if TMA is used. If TMA loads are specified without |
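To make the proposal above concrete, here is a minimal sketch of the decision it describes: fall back to non-TMA loads only when a single dimension is too large, and otherwise use TMA with 32-bit indexing. All names (`LoadPlan`, `chooseLoadPlan`) are hypothetical and not part of the project's API. The limit used here (the largest coordinate in each dimension must fit in a signed 32-bit integer) is an assumption based on the 32-bit box coords requirement mentioned in the issue description.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical names for illustration only; not the project's actual API.
enum class LoadPlan { TmaWithInt32Index, NonTmaWithInt64Index };

// Assumption: TMA stays usable as long as the largest coordinate in every
// dimension (extent - 1) fits in a signed 32-bit integer.
inline LoadPlan chooseLoadPlan(const std::vector<int64_t>& extents) {
  constexpr int64_t kMaxCoord = std::numeric_limits<int32_t>::max();
  for (int64_t extent : extents) {
    assert(extent > 0 && "extents must be positive");
    if (extent - 1 > kMaxCoord) {
      // A single dimension is too large for a 32-bit coordinate:
      // disable TMA for this problem and keep 64-bit indexing.
      return LoadPlan::NonTmaWithInt64Index;
    }
  }
  // Every coordinate fits: use TMA and hardcode 32-bit indexing.
  return LoadPlan::TmaWithInt32Index;
}
```

With this sketch, a 65536 x 65536 operand would still select TMA, while a 1-D tensor with 2^33 elements would not.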
An alternative, more powerful solution is proposed in #3601.
Currently if we try and compile a matmul kernel using TMA loads for a large problem we hit the following error:
This check is there because the box coords argument of the cp.async.bulk.tensor instruction must be 32-bit. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk
This issue is basically just a question: do we really need to restrict to 32-bit indexing to use TMA? If so this is a severe limitation that we should try and work around.
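As a rough illustration of why the question matters, the sketch below separates two widths that are easy to conflate: the width needed for a linear element offset into the whole tensor, and the width needed for each per-dimension coordinate. The helper names are hypothetical, and the code assumes a dense 2-D tensor whose `rows * cols` product does not overflow `int64_t`. Whether the kernel's index type can actually be decoupled from the TMA coordinate type is the open question here, not something this sketch settles.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helpers for illustration; assumes a dense rows x cols tensor
// and that rows * cols does not overflow int64_t.

// True if the largest linear element offset does not fit in int32.
inline bool needsInt64LinearIndex(int64_t rows, int64_t cols) {
  assert(rows > 0 && cols > 0);
  return rows * cols - 1 > std::numeric_limits<int32_t>::max();
}

// True if the largest coordinate in each dimension fits in int32.
inline bool coordsFitInt32(int64_t rows, int64_t cols) {
  assert(rows > 0 && cols > 0);
  constexpr int64_t kMaxCoord = std::numeric_limits<int32_t>::max();
  return rows - 1 <= kMaxCoord && cols - 1 <= kMaxCoord;
}
```

For example, a 65536 x 65536 matmul operand has 2^32 elements, so a linear offset needs 64 bits, yet its row and column coordinates are both far below the 32-bit limit.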