
Conversation

@ZzEeKkAa (Contributor) commented May 1, 2025

Replace internal usage of types.Array with a CUDA-specific array type, CUDAArray, that can handle an address space. The idea is to help the NVVM compiler recognize the address space of memory load/store operations. The issue is that in complex workloads the compiler loses track of the address space and produces general-purpose instructions instead of memory-space-specific ones. One such example is a device GEMM: with this PR, the shared-memory-specific LDS instruction is generated instead of the general-purpose LD.E.
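For illustration, a minimal kernel of this shape (not the PR's GEMM) is one place where the address space matters:

import numpy as np
from numba import cuda, float32

@cuda.jit
def smem_copy(out):
    # buf lives in shared memory, i.e. NVVM addrspace(3)
    buf = cuda.shared.array(32, dtype=float32)
    i = cuda.threadIdx.x
    buf[i] = float32(i)
    cuda.syncthreads()
    # With the address space tracked, this load can be emitted as the
    # shared-memory LDS instruction instead of the generic LD.E.
    out[i] = buf[i]

out = cuda.to_device(np.zeros(32, dtype=np.float32))
smem_copy[1, 32](out)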

List of changes that happened after transitioning to CUDAArray:

  • introduce the CUDAArray type and model, which support an address space on the data pointer;
  • all arrays inside cuda.jit now use CUDAArray instead of types.Array. That breaks some APIs, such as requesting an implementation with a general-purpose array signature, which is why many tests had to be updated;
  • atomics were updated to use the NVVM intrinsics specific to the memory address space (https://docs.nvidia.com/cuda/nvvm-ir-spec/#atomic); see the sketch after this list.
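As a hedged illustration of the kind of code this affects (not from the PR's test suite), an atomic on a shared-memory array can now lower to the shared-address-space atomic intrinsic rather than the generic one:

import numpy as np
from numba import cuda, int32

@cuda.jit
def block_count(out):
    counter = cuda.shared.array(1, dtype=int32)
    i = cuda.threadIdx.x
    if i == 0:
        counter[0] = 0
    cuda.syncthreads()
    # counter is in addrspace(3); with address space tracking this can
    # use the shared-memory NVVM atomic instead of the generic one.
    cuda.atomic.add(counter, 0, 1)
    cuda.syncthreads()
    if i == 0:
        out[0] = counter[0]

out = cuda.to_device(np.zeros(1, dtype=np.int32))
block_count[1, 64](out)
print(out.copy_to_host())  # [64]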

Future improvements that this unblocks:

  • get rid of the meminfo data field for CUDA arrays. Its purpose was to act like a smart pointer when going back and forth between Python and compiled code; kernels run only compiled code, so the field is not needed;
  • implement address space validation;
  • implement static-size arrays, so that the size can be resolved at compile time.

TODO:

  • Fix the rest of the tests.
  • Add automatic type casting when passing an array with a specific address space to a function that accepts an array in the generic address space.

@ZzEeKkAa changed the title from "Add CUDAArray type and implementation" to "Add CUDAArray type and implementation with addresspace information" May 1, 2025
@gmarkall added the "2 - In Progress" label May 2, 2025
@copy-pr-bot (bot) commented Jul 22, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@ZzEeKkAa (Contributor, Author)

/ok to test

@ZzEeKkAa changed the title from "Add CUDAArray type and implementation with addresspace information" to "[WIP] Add CUDAArray type and implementation with addresspace information" Jul 25, 2025
@ZzEeKkAa requested review from atmnp and gmarkall July 25, 2025 18:06
@ZzEeKkAa added the "improvement" and "breaking" labels Jul 25, 2025
@gmarkall (Contributor)

/ok to test

@gmarkall left a comment

I haven't fully reviewed this yet, but I want to discuss my thoughts so far - some of them are marked on the diff.

I think the main design concern I have at the moment is with trying to ensure that all array types in kernels are a CUDAArray type instead of an Array type - I think this might impact launch latency and have a lot of edge cases we need to find. Is an alternative path to keep Array types coexisting with CUDAArray types in kernels, but to treat Array types as being in the generic address space? The idea here is to leave the decorator and dispatcher logic unchanged, so we don't have to put CUDAArray types in the critical path of a launch.

type_name = "readonly " + type_name
if not self.aligned:
type_name = "unaligned " + type_name
self.name = "%s(%s, %sd, %s, addrspace(%d))" % (

As the address spaces in nvvm.py are just integers, might it be worth converting them to an enum class so that it's easier to get the array type to print like

array(int64, 1, 'C', SHARED)

instead of

array(int64, 1, 'C', addrspace(3))

?

It might make interactive debugging / development a little easier without having to mentally translate the address space numbers to names - there aren't too many uses of the address spaces, so I would hope that updating the uses (if necessary) wouldn't be too burdensome.
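A minimal sketch of what that enum could look like, assuming the NVVM IR address space numbering (0 generic, 1 global, 3 shared, 4 constant, 5 local); the class and member names here are hypothetical:

from enum import IntEnum

class AddressSpace(IntEnum):
    GENERIC = 0
    GLOBAL = 1
    SHARED = 3
    CONSTANT = 4
    LOCAL = 5

# IntEnum keeps existing integer comparisons working, while printing
# can use the symbolic name:
space = AddressSpace(3)
print("array(int64, 1, 'C', %s)" % space.name)  # array(int64, 1, 'C', SHARED)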

# dispatcher type in future.


class CUDAArray(types.Array):

Might the mangling_args property need implementing as well? Two methods that differ only in the address space of an array could end up mangling to the same name, potentially creating a symbol clash.

For example:

from numba.core.itanium_mangler import mangle_type
from numba.cuda.types import CUDAArray
from numba import types

shared_array = CUDAArray(types.int64, 1, 'C', addrspace=3)
generic_array = CUDAArray(types.int64, 1, 'C', addrspace=0)

shared_mangled = mangle_type(shared_array)
generic_mangled = mangle_type(generic_array)

print(shared_mangled)
print(generic_mangled)
assert shared_mangled != generic_mangled

gives

9CUDAArrayIxLi1E1C7mutable7alignedE
9CUDAArrayIxLi1E1C7mutable7alignedE
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/issues/numba-cuda-236/mangle_test.py", line 13, in <module>
    assert shared_mangled != generic_mangled
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
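A possible direction, sketched under the assumption that the parent class keeps the (basename, args) shape of mangling_args implied by the output above (illustrative, not the PR's code):

from numba import types

class CUDAArray(types.Array):
    def __init__(self, dtype, ndim, layout, addrspace=0):
        self.addrspace = addrspace
        super().__init__(dtype, ndim, layout)

    @property
    def mangling_args(self):
        base, args = super().mangling_args
        # Append the address space so that types differing only in
        # address space mangle to distinct symbols.
        return base, (*args, "addrspace%d" % self.addrspace)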


Similarly, what about unification (the unify() method) and conversion (can_convert_to())? If unify() is not implemented, then all CUDA arrays will end up unifying to Array types instead, even if the types being unified are all in the same address space.

Conversions will also lose address space information, or perhaps even allow invalid conversions - I think we should not allow conversion from the shared to the local address space, for example, but conversions to the generic address space should always be OK.
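Again as an illustrative sketch (assuming the constructor shown above, NVVM's generic address space numbered 0, and numba's key / unify / can_convert_to hooks on array types), address-space-aware unification and conversion might look like:

from numba import types

GENERIC = 0

class CUDAArray(types.Array):
    def __init__(self, dtype, ndim, layout, addrspace=GENERIC):
        self.addrspace = addrspace
        super().__init__(dtype, ndim, layout)

    @property
    def key(self):
        # Distinct address spaces must compare as distinct types.
        return super().key, self.addrspace

    def unify(self, typingctx, other):
        if (isinstance(other, CUDAArray)
                and other.dtype == self.dtype
                and other.ndim == self.ndim):
            layout = self.layout if self.layout == other.layout else "A"
            # When both sides share an address space it is preserved;
            # otherwise fall back to the generic one.
            addrspace = (self.addrspace
                         if self.addrspace == other.addrspace
                         else GENERIC)
            return CUDAArray(self.dtype, self.ndim, layout,
                             addrspace=addrspace)

    def can_convert_to(self, typingctx, other):
        # Allow conversion only within the same address space or into
        # the generic one; e.g. shared -> local is rejected.
        if isinstance(other, CUDAArray) and other.addrspace not in (
                self.addrspace, GENERIC):
            return None
        return super().can_convert_to(typingctx, other)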

  # the CUDA Array Interface.
  try:
-     return typeof(val, Purpose.argument)
+     tp = typeof(val, Purpose.argument)

I'm concerned this could have a non-trivial impact on kernel launch time. Can you do a microbenchmark to check how much this impacts the latency of launches with various numbers of array arguments?
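A hypothetical shape for such a microbenchmark (kernel bodies, argument counts, and iteration counts are illustrative):

import time
import numpy as np
from numba import cuda

def make_kernel(nargs):
    # Build a no-op kernel taking nargs array arguments.
    params = ", ".join("a%d" % i for i in range(nargs))
    ns = {}
    exec("def k(%s):\n    pass" % params, ns)
    return cuda.jit(ns["k"])

for nargs in (1, 2, 4, 8):
    kernel = make_kernel(nargs)
    arrays = [cuda.to_device(np.zeros(16)) for _ in range(nargs)]
    kernel[1, 1](*arrays)  # warm-up: trigger compilation
    start = time.perf_counter()
    for _ in range(1000):
        kernel[1, 1](*arrays)
    cuda.synchronize()
    elapsed = time.perf_counter() - start
    # elapsed / 1000 launches, reported in microseconds per launch
    print("%d array args: %.1f us per launch" % (nargs, elapsed * 1e3))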

@gmarkall (Contributor)

all arrays inside cuda.jit now use CUDAArray instead of types.Array. That breaks some APIs, such as requesting an implementation with a general-purpose array signature, which is why many tests had to be updated;

I think this will break user code too - if we can find a way to avoid doing that I'd strongly prefer to.

@ZzEeKkAa (Contributor, Author)

all arrays inside cuda.jit now use CUDAArray instead of types.Array. That breaks some APIs, such as requesting an implementation with a general-purpose array signature, which is why many tests had to be updated;

I think this will break user code too - if we can find a way to avoid doing that I'd strongly prefer to.

Yes, that's what I'm worried about. I'm still thinking about potential ways to avoid it.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

gmarkall added a commit that referenced this pull request Dec 2, 2025
This change adds the "dwarfAddressSpace" attribute to debug metadata for
CUDA shared memory pointers, enabling debuggers to correctly identify
the memory location of variables.

I chose to add the address space tracking in the lowering phase, rather
than modifying the underlying typing infrastructure (ArrayModel,
PointerModel), for the following reasons:
1) There is an ongoing effort to decouple from Numba's typing system,
but the default behavior still redirects to Numba;
2) There is a WIP
[PR#236](#236) introducing a
CUDAArray type and implementation with address space information.

When either of the above is completed, there will be a cleaner approach
to updating this patch.

So, in this change:
1) Detection is added in CUDALower for Numba ir.Call nodes to find
cuda.shared.array() calls; a flag is set so that the subsequent
storevar() records the name-to-address-space mapping, which is later
consulted when emitting debug info.
2) A mapping from NVVM address spaces to DWARF address classes is added
in order to emit "dwarfAddressSpace" on the DIDerivedType for the
pointer member "data" of the CUDA array descriptor.
3) A new test is added to make sure a shared array and a regular local
array are distinguished (a sketch of such a kernel follows below).

This fixes nvbug#5643016.
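For illustration, here is the kind of kernel such a test would need to distinguish - a sketch using the public numba-cuda API, not the actual test:

import numpy as np
from numba import cuda, float32

@cuda.jit(debug=True, opt=False)
def kernel(out):
    # shared lives in NVVM addrspace(3); its data pointer's debug
    # metadata should carry dwarfAddressSpace, unlike the local array's.
    shared = cuda.shared.array(16, dtype=float32)
    local = cuda.local.array(16, dtype=float32)
    shared[0] = 1.0
    local[0] = 2.0
    out[0] = shared[0] + local[0]

out = cuda.to_device(np.zeros(1, dtype=np.float32))
kernel[1, 1](out)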

---------

Co-authored-by: Graham Markall <[email protected]>
