[WIP] Add CUDAArray type and implementation with addresspace information #236
base: main
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

/ok to test

/ok to test
gmarkall
left a comment
I haven't fully reviewed this yet but want to discuss thoughts so far - some of them are marked on the diff.
I think the main design concern I have at the moment is with trying to ensure that all array types in kernels are a CUDAArray type instead of an Array type - I think this might impact launch latency and have a lot of edge cases we need to find. Is an alternative path to keep Array types coexisting with CUDAArray types in kernels, but treat Array types as being in the generic address space? The idea here is to leave the decorator and dispatcher logic unchanged so we don't have to create CUDAArray types in the critical path of a launch.
```python
type_name = "readonly " + type_name
if not self.aligned:
    type_name = "unaligned " + type_name
self.name = "%s(%s, %sd, %s, addrspace(%d))" % (
```
As the address spaces in nvvm.py are just integers, might it be worth converting them to an enum class so that it's easier to get the array type to print like `array(int64, 1, 'C', SHARED)` instead of `array(int64, 1, 'C', addrspace(3))`?
It might make interactive debug / development a little easier without having to mentally translate the address space numbers to names - there aren't too many uses of the address spaces, so I would hope that updating the uses (if necessary) wouldn't be too burdensome.
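As a rough sketch of what that enum could look like (the specific numeric values follow the NVVM address space numbering; the class and member names here are illustrative, not the PR's actual code):

```python
from enum import IntEnum

class AddressSpace(IntEnum):
    """NVVM address space numbers (per the NVVM IR specification)."""
    GENERIC = 0
    GLOBAL = 1
    SHARED = 3
    CONSTANT = 4
    LOCAL = 5

# An IntEnum stays interchangeable with the raw integers already used
# throughout nvvm.py, while printing a readable name:
assert AddressSpace.SHARED == 3
print("array(int64, 1, 'C', %s)" % AddressSpace(3).name)
# prints: array(int64, 1, 'C', SHARED)
```

Because `IntEnum` members compare equal to plain ints, existing call sites that pass or compare raw address space numbers would keep working unchanged.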
```python
# dispatcher type in future.

class CUDAArray(types.Array):
```
Might the mangling_args property need implementing as well? Two methods that differ only in the address space of an array could end up mangling to the same name and potentially creating a symbol clash.
For example:

```python
from numba.core.itanium_mangler import mangle_type
from numba.cuda.types import CUDAArray
from numba import types

shared_array = CUDAArray(types.int64, 1, 'C', addrspace=3)
generic_array = CUDAArray(types.int64, 1, 'C', addrspace=0)

shared_mangled = mangle_type(shared_array)
generic_mangled = mangle_type(generic_array)

print(shared_mangled)
print(generic_mangled)
assert shared_mangled != generic_mangled
```

gives

```
9CUDAArrayIxLi1E1C7mutable7alignedE
9CUDAArrayIxLi1E1C7mutable7alignedE
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/issues/numba-cuda-236/mangle_test.py", line 13, in <module>
    assert shared_mangled != generic_mangled
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
```
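One Numba-free way to picture the suggested fix (the class and the exact tuple shape are illustrative assumptions, not the PR's implementation): include the address space among the mangling arguments, so two types that differ only in address space produce distinct symbols.

```python
class FakeCUDAArray:
    """Stand-in for CUDAArray carrying only what the example needs."""
    def __init__(self, dtype, ndim, layout, addrspace):
        self.dtype = dtype
        self.ndim = ndim
        self.layout = layout
        self.addrspace = addrspace

    @property
    def mangling_args(self):
        # Hypothetical fix: append the address space so two arrays that
        # differ only in addrspace no longer mangle identically.
        return (type(self).__name__,
                (self.dtype, self.ndim, self.layout,
                 "addrspace%d" % self.addrspace))

shared = FakeCUDAArray("int64", 1, "C", addrspace=3)
generic = FakeCUDAArray("int64", 1, "C", addrspace=0)
assert shared.mangling_args != generic.mangling_args
```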
Similarly, what about the unification (the unify() method) and conversion (can_convert_to())? If unify() is not implemented, then all CUDA arrays will end up unifying to Array types instead, even if the set of types to unify were all in the same address space.
Conversions will also lose address space information, or perhaps even allow invalid conversions - I think we should not allow conversion from shared to local address space, for example, but conversions to the generic address space should always be OK.
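A minimal sketch of the conversion rule being proposed (the constants and helper name are illustrative, not the PR's API): a conversion is acceptable when the address space is unchanged or the target is the generic address space.

```python
# NVVM address space numbers as used elsewhere in this discussion.
GENERIC, SHARED, LOCAL = 0, 3, 5

def can_convert_addrspace(src, dst):
    """Allow no-op conversions and decay to the generic address space;
    reject crossing between two distinct specific address spaces."""
    return src == dst or dst == GENERIC

assert can_convert_addrspace(SHARED, GENERIC)    # shared -> generic: OK
assert not can_convert_addrspace(SHARED, LOCAL)  # shared -> local: rejected
```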
```diff
  # the CUDA Array Interface.
  try:
-     return typeof(val, Purpose.argument)
+     tp = typeof(val, Purpose.argument)
```
I'm concerned this could have a non-trivial impact on kernel launch time. Can you do a microbenchmark to check how much this impacts the latency of launches with various numbers of array arguments?
I think this will break user code too - if we can find a way to avoid doing that I'd strongly prefer to.
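A minimal harness for the kind of microbenchmark being requested (the workload here is a placeholder; a real measurement would pass a closure that launches a compiled `cuda.jit` kernel with varying numbers of pre-allocated array arguments):

```python
import time

def bench_launch(launch, iterations=1000):
    """Average wall-clock seconds per call of `launch`.

    In a real run, `launch` would be a closure such as
    `lambda: kernel[1, 1](a, b, c)`; the warm-up call excludes lazy
    compilation from the timed region.
    """
    launch()  # warm-up (triggers any one-time compilation)
    start = time.perf_counter()
    for _ in range(iterations):
        launch()
    return (time.perf_counter() - start) / iterations

# Placeholder workload standing in for a kernel launch:
per_call = bench_launch(lambda: None)
assert per_call >= 0.0
```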
Yes, that's what I'm worried about. Still thinking about potential ways to avoid it.
This change adds the "dwarfAddressSpace" attribute to debug metadata for CUDA shared memory pointers, enabling debuggers to correctly identify the memory location of variables. I chose to add address space tracking in the lowering phase, rather than modifying the underlying typing infrastructure (ArrayModel, PointerModel), for the following reasons: 1) there is an ongoing effort to decouple from Numba's typing system, but the default behavior still redirects to Numba; 2) there is a WIP [PR#236](#236) introducing a CUDAArray type and implementation with address space information. When either of the above is completed, there will be a cleaner approach to updating this patch. So in this change:
1) Detection is added in CUDALower for Numba ir.Call nodes to find cuda.shared.array() calls; a flag is set for the subsequent storevar() to record the name / address space mapping, which is later referenced when emitting debug info.
2) A mapping from NVVM address space to DWARF address class is added in order to emit "dwarfAddressSpace" on the DIDerivedType for the pointer member "data" of the CUDA array descriptor.
3) A new test is added to make sure shared arrays and regular local arrays are distinguished.
This fixes nvbug#5643016.

Co-authored-by: Graham Markall <[email protected]>
Replace internal usage of `types.Array` with the CUDA-specific array type `CUDAArray`, which can handle address spaces. The idea is to help the NVVM compiler recognize the address space of memory load/store operations. The issue is that in complex workloads the compiler loses track of the address space and produces general-purpose instructions instead of memory-space-specific ones. One such example is a device GEMM: this PR results in the shared-memory-specific `LDS` instruction being generated instead of the general-purpose `LD.E`.

List of changes after transitioning to `CUDAArray`:
- A `CUDAArray` type and model that support address space pointers;
- Kernels compiled with `cuda.jit` now use `CUDAArray` instead of `types.Array`. That breaks some APIs, such as requesting an implementation with a general-purpose array signature, which is why many tests had to be updated.

Future improvements that this unblocks:
TODO:
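A Numba-free sketch of the core idea in this PR (all names and the `key` mechanism are illustrative; Numba's real type-identity machinery differs): an array type that carries its address space and includes it in its identity, so arrays differing only in address space are distinct types and can be lowered to address-space-specific instructions.

```python
class Array:
    """Minimal stand-in for numba.types.Array."""
    def __init__(self, dtype, ndim, layout):
        self.dtype = dtype
        self.ndim = ndim
        self.layout = layout

    @property
    def key(self):
        # Attributes that define the type's identity.
        return (self.dtype, self.ndim, self.layout)

class CUDAArray(Array):
    """Array type that additionally tracks an NVVM address space."""
    def __init__(self, dtype, ndim, layout, addrspace=0):
        super().__init__(dtype, ndim, layout)
        self.addrspace = addrspace

    @property
    def key(self):
        return super().key + (self.addrspace,)

# Two arrays that differ only in address space are distinct types, so
# e.g. a shared-memory array cannot silently stand in for a local one.
assert CUDAArray("int64", 1, "C", addrspace=3).key != \
       CUDAArray("int64", 1, "C", addrspace=0).key
```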