feat: swap out internal device array usage with StridedMemoryView
#703
Conversation
f5a1c5c to c62d013 (Compare)
/ok to test
This PR can't be merged until the next release of cuda-core, because I depend on some unreleased features there. However, it's still worth reviewing.
I managed to recover a good amount of the devicearray performance by avoiding the SMV conversion entirely and spoofing the interface.

However, there is still a slowdown of ~60%, but only in the many-args case (it's about 15% in the single-argument case). This is much better than the previous commit, which was upwards of 2.5x.
Greptile Overview
Greptile Summary
Refactored kernel argument handling to use StridedMemoryView internally, enabling direct __dlpack__ protocol support and improving CuPy interoperability (~3x speedup).
Key Changes
- Replaced `auto_device()` calls with `_to_strided_memory_view()` for unified array handling
- Added LRU caching to type inference functions (`typeof`, `from_dtype`, `strides_from_shape`) to reduce overhead (see the caching sketch below)
- Converted several properties to `@functools.cached_property` for performance
- Refactored the `Out`/`InOut` classes to use an inheritance pattern with a `copy_input` class variable
- Changed the `strides_from_shape()` API from `order="C"`/`"F"` to boolean flags `c_contiguous`/`f_contiguous`
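For reference, a minimal illustration of the caching pattern the list above refers to; the helper name and key layout here are assumptions, not the project's actual signatures. The idea is that type inference is memoized on hashable inputs, so repeated launches with the same dtype/ndim/layout become dictionary lookups:

```python
import functools

@functools.lru_cache(maxsize=None)
def _cached_array_type(dtype_str: str, ndim: int, layout: str) -> str:
    # Stand-in for the real typeof/from_dtype machinery: build (and remember)
    # a type description for a given dtype/ndim/layout combination.
    return f"array({dtype_str}, {ndim}d, {layout})"

# The first call computes the result; identical later calls hit the cache.
_cached_array_type("float32", 2, "C")
_cached_array_type("float32", 2, "C")
print(_cached_array_type.cache_info())  # hits=1, misses=1
```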
Issues Found
- Logic bug in `strides_from_shape()`: when both `c_contiguous` and `f_contiguous` are False, the function produces incorrect strides (it computes F-contiguous strides and then reverses them, which is neither a C nor an F layout; illustrated below)
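To make the flagged behavior concrete, here is a small self-contained illustration (not the project's `strides_from_shape` implementation) of C- and F-order stride computation, showing that reversing F-order strides yields neither layout:

```python
import numpy as np

def c_strides(shape, itemsize):
    # Row-major: the last axis varies fastest.
    strides, acc = [], itemsize
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))

def f_strides(shape, itemsize):
    # Column-major: the first axis varies fastest.
    strides, acc = [], itemsize
    for dim in shape:
        strides.append(acc)
        acc *= dim
    return tuple(strides)

shape, itemsize = (2, 3), 8
assert c_strides(shape, itemsize) == np.zeros(shape).strides   # (24, 8)
print(f_strides(shape, itemsize))                              # (8, 16)
print(tuple(reversed(f_strides(shape, itemsize))))             # (16, 8): neither C (24, 8) nor F (8, 16)
```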
Performance Trade-offs
The PR documents a ~2.5x regression for legacy device_array() in exchange for ~3x improvement for CuPy arrays. This aligns with the project's strategic direction toward ecosystem integration.
Confidence Score: 4/5
- This PR is safe to merge with one logic issue that needs fixing
- Score reflects well-structured refactoring with proper caching optimizations, but one critical logic bug in `strides_from_shape()` when both contiguity flags are False needs resolution before merge
- numba_cuda/numba/cuda/np/numpy_support.py: fix the `strides_from_shape()` logic for handling non-contiguous arrays
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| numba_cuda/numba/cuda/np/numpy_support.py | 3/5 | Added LRU caching to strides_from_shape and from_dtype; changed API from order parameter to c_contiguous/f_contiguous flags. Logic issue: when both flags are False, function computes F-contiguous strides then reverses them unexpectedly. |
| numba_cuda/numba/cuda/cudadrv/devicearray.py | 4/5 | Added _to_strided_memory_view and _make_strided_memory_view helper functions to support DLPack protocol; converted nbytes and added _strided_memory_view_shim to cached properties. Implementation looks solid. |
| numba_cuda/numba/cuda/args.py | 4/5 | Refactored Out and InOut classes to use StridedMemoryView; changed _numba_type_ to cached property. Clean refactor with proper class inheritance. |
| numba_cuda/numba/cuda/dispatcher.py | 4/5 | Updated kernel argument marshaling to work with StridedMemoryView objects instead of DeviceNDArray. Uses fallback to strides_from_shape when strides not available. |
| numba_cuda/numba/cuda/typing/typeof.py | 5/5 | Added LRU caching to _typeof_cuda_array_interface by extracting logic into cached helper functions. All parameters are hashable, caching is safe and should improve performance. |
| numba_cuda/numba/cuda/np/arrayobj.py | 5/5 | Updated call to strides_from_shape to use new keyword-only argument API with c_contiguous=True. Minimal, straightforward change. |
6 files reviewed, 1 comment
9ff51b9 to 1032275 (Compare)
rparolin left a comment
Generally looks good to me. I'm a bit on the fence about shipping a known performance regression to a deprecated type. I'd feel better if we removed it first instead of regressing on performance. All that being said, the regression has improved from the initially reported 2.5x.
I'd still wait to merge until @gmarkall gives the final 👍
So where we are is that using CuPy arrays has ~3x less latency, but using device arrays or torch tensors has ~60% more latency? On the torch front, NVIDIA/cuda-python#1439 may help in bypassing the slow ...
Almost. Will post new numbers in a bit.
Passing ...
1032275 to 739cb5b (Compare)
Greptile Overview
Greptile Summary
Refactors internal kernel argument handling to use StridedMemoryView from cuda-python, enabling direct __dlpack__ protocol support for external arrays like CuPy. Replaces __cuda_array_interface__ handling with the unified StridedMemoryView API and adds LRU caching to type inference paths to reduce overhead. Performance measurements show ~3x improvement for CuPy arrays but ~2.5x regression for legacy device_array() objects, which the PR justifies as an acceptable trade-off favoring ecosystem integration over deprecated APIs.
Confidence Score: 1/5
- Critical logic bug in strides fallback for 0-dimensional arrays will cause incorrect behavior
- The PR contains a critical logic error in dispatcher.py line 558 where the strides fallback uses
oroperator with potentially empty tuples. For 0-dimensional arrays,strides_in_bytesis legitimately(), but empty tuples are falsy in Python, triggering unnecessary fallback computation. While the fallback should also return(), this indicates a misunderstanding of the truthiness semantics that could mask other issues. Additionally, there are multiple stream handling edge cases around stream=0 that should be verified for correctness. - numba_cuda/numba/cuda/dispatcher.py requires immediate attention for the strides fallback bug; numba_cuda/numba/cuda/args.py needs verification of stream_ptr=0 handling semantics
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| numba_cuda/numba/cuda/dispatcher.py | 1/5 | Refactored kernel argument marshaling to use StridedMemoryView; critical bug in strides fallback logic |
| numba_cuda/numba/cuda/cudadrv/devicearray.py | 2/5 | Added _to_strided_memory_view and _make_strided_memory_view functions for dlpack/CAI conversion; changed nbytes to cached_property |
| numba_cuda/numba/cuda/args.py | 3/5 | Refactored Out and InOut classes to use _to_strided_memory_view; InOut now inherits from Out with copy_input=True |
```python
(layout := devary._layout).strides_in_bytes
or numpy_support.strides_from_shape(
    shape=shape,
    itemsize=itemsize,
    c_contiguous=layout.is_contiguous_c,
    f_contiguous=layout.is_contiguous_f,
)
```
The strides_from_shape fallback is invoked when strides_in_bytes is falsy (empty tuple () or None). For 0-dimensional arrays, strides_in_bytes is legitimately an empty tuple (), but empty tuples are falsy in Python. This means 0-D arrays will incorrectly trigger the fallback computation even when they already have valid (empty) strides.
Bug scenario:

```python
# 0-D array (scalar)
arr = np.array(42)
arr.strides  # returns ()
```

When `devary._layout.strides_in_bytes` is `()`, the `or` operator treats it as falsy and calls `strides_from_shape`, which should also return `()` but wastes computation.
Fix:

```diff
-(layout := devary._layout).strides_in_bytes
-or numpy_support.strides_from_shape(
-    shape=shape,
-    itemsize=itemsize,
-    c_contiguous=layout.is_contiguous_c,
-    f_contiguous=layout.is_contiguous_f,
-)
+kernelargs.extend(
+    strides_in_bytes := (layout := devary._layout).strides_in_bytes
+) if strides_in_bytes is not None else kernelargs.extend(
+    numpy_support.strides_from_shape(
+        shape=shape,
+        itemsize=itemsize,
+        c_contiguous=layout.is_contiguous_c,
+        f_contiguous=layout.is_contiguous_f,
+    )
+)
```
Alternatively, check explicitly for None:

```python
strides = layout.strides_in_bytes if layout.strides_in_bytes is not None else numpy_support.strides_from_shape(...)
kernelargs.extend(strides)
```
grumbles ... fine.
Actually, the fallback computation would also return an empty tuple, right?
Yes, if strides_in_bytes is the empty tuple, then shape will be the empty tuple, and the fallback computation would return the empty tuple.
Other than wasting a function call, why is that a "critical" error?
I also added a benchmark demonstrating the additional overhead. Maybe there's some way that we can reduce that further, but I haven't looked into it.
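For reference, a sketch of what one of these dispatch-latency benchmarks could look like with pytest-benchmark; the kernel body, sizes, and fixture usage here are placeholders rather than the actual suite, which also parametrizes CuPy/torch/device_array inputs and a pre-specialized "signature" variant:

```python
import numpy as np
from numba import cuda

@cuda.jit
def noop(x):
    pass

def test_one_arg(benchmark):
    arr = cuda.device_array(16, dtype=np.float32)
    noop[1, 1](arr)                      # warm-up so compilation stays out of the timing
    benchmark(lambda: noop[1, 1](arr))   # measure argument handling + launch overhead
```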
The devicearray regressions are somewhat concerning, but given we are actively working towards deprecating it, I think it would still be worth it. Do we have a sense of the follow-up work from here that would help ameliorate the performance overheads related to torch?

At least some of the remaining overhead is related to stream synchronization, but that may be justified/useful in some cases, I'm guessing. After that, I'm not sure; it will require more investigation. Just to make sure we're on the same page: our expectation is that if an array is on device, then the kernel launch overhead should amount to a collection of relatively cheap attribute accesses. Is that correct?
My 2c: numba-cuda shouldn't be in the business of handling stream synchronization; if someone is passing an array on a different stream through dlpack / CAI, it becomes their responsibility to launch the kernel on a stream that is synchronized with respect to the passed stream. This is likely a breaking change that would need to be clearly and loudly deprecated and subsequently removed.
Yes. Kernel launch latency is quite important; we should aim for less than 1us of overhead.
Got it. Yeah, I don't really know enough about how this functionality is used or assumed to be used to have an informed opinion (yet!), but simply removing sync (in the torch case by passing -1 as the stream pointer to ...
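For context, a sketch of the opt-out being described. The DLPack protocol reserves stream=-1 for "do not synchronize", and the `from_dlpack`/`stream_ptr` usage below mirrors what this PR does against an unreleased cuda-core, so the import path and signature should be treated as assumptions:

```python
import torch
from cuda.core.utils import StridedMemoryView  # may live under cuda.core.experimental.utils in released versions

t = torch.arange(8, device="cuda")
consumer_stream = torch.cuda.current_stream().cuda_stream  # raw stream handle as an int

# Current behavior: hand the producer our stream so it can synchronize against it.
view_sync = StridedMemoryView.from_dlpack(t, stream_ptr=consumer_stream)

# Proposed opt-out: -1 means "no synchronization"; ordering becomes the caller's job.
view_nosync = StridedMemoryView.from_dlpack(t, stream_ptr=-1)
```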
Roger that. I think we can get there, or if not, very close.
739cb5b to 891ccb7 (Compare)
/ok to test
2 files reviewed, 2 comments
```python
return StridedMemoryView.from_dlpack(
    obj, stream_ptr=getattr(stream, "handle", stream)
), False
elif (desc := getattr(obj, "__cuda_array_interface__", None)) is not None:
    smv = StridedMemoryView.from_cuda_array_interface(
        obj, stream_ptr=int(getattr(stream, "handle", stream))
```
Inconsistent handling of stream_ptr conversion: line 1020 uses `getattr(stream, "handle", stream)` without casting to int, while line 1024 uses `int(getattr(stream, "handle", stream))`.

This inconsistency could potentially cause type mismatches depending on what `StridedMemoryView.from_dlpack` and `StridedMemoryView.from_cuda_array_interface` expect.

For consistency and type safety, both should use the same pattern. Looking at line 1037 and line 58 in args.py, it appears the int conversion is sometimes used. Consider standardizing this across all usages.
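A minimal sketch of the standardization being suggested; the helper name is hypothetical:

```python
def _stream_ptr(stream) -> int:
    """Return a raw stream pointer as an int, accepting either a stream
    object exposing .handle or a bare integer handle."""
    return int(getattr(stream, "handle", stream))
```

Both the `from_dlpack` and `from_cuda_array_interface` call sites could then pass `stream_ptr=_stream_ptr(stream)`.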
An option available to us is, instead of actually synchronizing the stream (assuming we're calling ...

I tried rolling some C-API classes for the two ...
891ccb7 to e26196a (Compare)
8 files reviewed, 1 comment
I included the pure C classes here for reference, but I am going to revert those changes because they don't really improve anything.
```c
    Py_TYPE(op)->tp_free(op);
}

static int StridedMemoryViewShim_init(PyObject *op, PyObject *args, PyObject *kwds) {
```
Not sure if any of the seen overhead is coming from here, but with StridedMemoryView being a cython cdef class, all of this information is available via the underlying C struct as native types and we could avoid the whole boxing, unboxing, boxing, unboxing of Python objects that I suspect is happening here.
There's no boxing or unboxing here, these all remain as untouched pyobjects. The pure python version of this code isn't measurably slower, so I'm going to revert back to that.
I implemented this in Cython as well and construction is faster by 1.5x-2x, but attribute access is slower by 1.5x-2x.
Both operations are performed, but attribute access is more common. The constructor is only called once per instance because it's a functools.cached_property whereas these various attributes are accessed once per kernel launch call.
`__slots__` is apparently the best we can do in this particular situation.
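A hedged sketch (attribute names assumed) of the `__slots__` approach: instances carry no per-instance `__dict__`, so construction stays lightweight and the once-per-launch attribute reads are plain slot lookups.

```python
class StridedMemoryViewShim:
    __slots__ = ("ptr", "shape", "strides", "dtype", "is_device_accessible")

    def __init__(self, ptr, shape, strides, dtype, is_device_accessible=True):
        self.ptr = ptr
        self.shape = shape
        self.strides = strides
        self.dtype = dtype
        self.is_device_accessible = is_device_accessible
```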
Sorry, I meant that in the overall pipeline leading to kernel launch, we eventually need to lower these Python objects down to native types in order to pass them into the kernel via the numba-cuda ABI. If we were able to keep things as native types from the beginning, and never raise them back up to Python objects before the conversion down to the numba-cuda ABI for kernel launch, there would be a potential win. That's probably a mountain of work though.
If all the argument preparation can be done in cuda-core, down in Cython land, then yeah, that would be ideal.
It might be worth it to start planning that effort now if we think it will bear fruit.
Yes, 100%. We don't currently have StridedMemoryView plumbed into how we lower Python objects for kernel launch: https://github.com/NVIDIA/cuda-python/blob/main/cuda_core/cuda/core/_kernel_arg_handler.pyx but it absolutely could be added. I'm sure there are ABI convention things that need to be handled; that's the big thing.
e26196a to 8843b4a (Compare)
7 files reviewed, 1 comment
8843b4a to 5beaae5 (Compare)
The implementation generally looks good to me. Would be good to get a final round of benchmarks to evaluate where the improvements / regressions are.
96eeaa1 to 8713b8d (Compare)
/ok to test
Ran this through an LLM to get a nice summary of the final benchmark results. High-level takeaways (NOW vs baseline `0001_c91948d`):

test_many_args
| Variant | Baseline mean (ms) | NOW mean (ms) | Change |
|---|---|---|---|
| dispatch-cupy | 82.40 | 27.21 | -67% (3.0× faster) |
| dispatch-device_array | 6.77 | 10.48 | +55% (1.55× slower) |
| dispatch-torch | 70.60 | 64.31 | -9% (1.10× faster) |
| signature-cupy | 62.79 | 19.80 | -68% (3.17× faster) |
| signature-device_array | 6.91 | 10.78 | +56% (1.56× slower) |
| signature-torch | 49.94 | 54.77 | +10% (1.10× slower) |
test_one_arg
| Variant | Baseline mean (ms) | NOW mean (ms) | Change |
|---|---|---|---|
| dispatch-cupy | 5.91 | 2.90 | -51% (2.0× faster) |
| dispatch-device_array | 1.27 | 1.40 | +11% slower |
| dispatch-torch | 5.32 | 4.96 | -7% faster |
| signature-cupy | 4.38 | 2.27 | -48% (1.9× faster) |
| signature-device_array | 1.27 | 1.41 | +11% slower |
| signature-torch | 3.62 | 4.19 | +16% slower |
Current absolute performance (NOW, mean time)
- Many args: device_array (~10–11 ms) is still fastest, then CuPy (~20–27 ms), then Torch (~55–64 ms).
- One arg: device_array (~1.4 ms) fastest, then CuPy (~2.3–2.9 ms), then Torch (~4.2–5.0 ms).
Raw data

```text
------------------------------------------------------------------------------ benchmark 'test_many_args[dispatch-cupy]': 2 tests ------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-cupy] (NOW) 26.5499 (1.0) 28.4821 (1.0) 27.2123 (1.0) 0.5466 (1.0) 27.0107 (1.0) 0.7143 (1.0) 6;0 36.7481 (1.0) 22 1
test_many_args[dispatch-cupy] (0001_c91948d) 80.1594 (3.02) 84.9900 (2.98) 82.3980 (3.03) 1.5929 (2.91) 82.2436 (3.04) 2.3078 (3.23) 4;0 12.1362 (0.33) 10 1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------- benchmark 'test_many_args[dispatch-device_array]': 2 tests ------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-device_array] (0001_c91948d) 6.4306 (1.0) 7.0619 (1.0) 6.7681 (1.0) 0.1830 (1.06) 6.7565 (1.0) 0.3377 (1.82) 15;0 147.7515 (1.0) 30 1
test_many_args[dispatch-device_array] (NOW) 10.1953 (1.59) 10.9627 (1.55) 10.4819 (1.55) 0.1723 (1.0) 10.4611 (1.55) 0.1854 (1.0) 8;2 95.4026 (0.65) 33 1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------ benchmark 'test_many_args[dispatch-torch]': 2 tests ------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-torch] (NOW) 63.9991 (1.0) 65.4999 (1.0) 64.3109 (1.0) 0.3777 (1.0) 64.2809 (1.0) 0.2062 (1.0) 1;1 15.5495 (1.0) 13 1
test_many_args[dispatch-torch] (0001_c91948d) 68.3296 (1.07) 73.8380 (1.13) 70.5973 (1.10) 1.8827 (4.98) 70.1094 (1.09) 2.6950 (13.07) 5;0 14.1648 (0.91) 11 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------ benchmark 'test_many_args[signature-cupy]': 2 tests ------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-cupy] (NOW) 19.3784 (1.0) 21.0086 (1.0) 19.7990 (1.0) 0.4120 (1.0) 19.6576 (1.0) 0.4164 (1.0) 9;4 50.5075 (1.0) 51 1
test_many_args[signature-cupy] (0001_c91948d) 60.1597 (3.10) 67.1960 (3.20) 62.7940 (3.17) 2.0793 (5.05) 63.2831 (3.22) 3.3902 (8.14) 6;0 15.9251 (0.32) 17 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------ benchmark 'test_many_args[signature-device_array]': 2 tests -------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-device_array] (0001_c91948d) 6.6276 (1.0) 10.7366 (1.0) 6.9080 (1.0) 0.4227 (2.08) 6.7669 (1.0) 0.2620 (1.91) 16;15 144.7594 (1.0) 143 1
test_many_args[signature-device_array] (NOW) 10.5675 (1.59) 11.8660 (1.11) 10.7840 (1.56) 0.2029 (1.0) 10.7284 (1.59) 0.1375 (1.0) 10;6 92.7299 (0.64) 87 1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------ benchmark 'test_many_args[signature-torch]': 2 tests ------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-torch] (0001_c91948d) 48.2905 (1.0) 52.6185 (1.0) 49.9353 (1.0) 1.3081 (1.0) 49.6528 (1.0) 2.2560 (1.0) 8;0 20.0259 (1.0) 20 1
test_many_args[signature-torch] (NOW) 51.6698 (1.07) 57.1802 (1.09) 54.7654 (1.10) 1.8941 (1.45) 55.9016 (1.13) 3.7764 (1.67) 7;0 18.2597 (0.91) 19 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------- benchmark 'test_one_arg[dispatch-cupy]': 2 tests ----------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-cupy] (NOW) 2.8639 (1.0) 3.1819 (1.0) 2.9041 (1.0) 0.0440 (1.0) 2.8976 (1.0) 0.0236 (1.0) 2;4 344.3451 (1.0) 66 1
test_one_arg[dispatch-cupy] (0001_c91948d) 5.4198 (1.89) 6.7414 (2.12) 5.9068 (2.03) 0.3580 (8.14) 5.6831 (1.96) 0.5460 (23.13) 7;0 169.2971 (0.49) 43 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------- benchmark 'test_one_arg[dispatch-device_array]': 2 tests ----------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-device_array] (0001_c91948d) 1.2494 (1.0) 1.3038 (1.0) 1.2674 (1.0) 0.0213 (1.04) 1.2601 (1.0) 0.0212 (1.0) 1;0 789.0171 (1.0) 5 1
test_one_arg[dispatch-device_array] (NOW) 1.3719 (1.10) 1.4275 (1.09) 1.4005 (1.10) 0.0206 (1.0) 1.3975 (1.11) 0.0260 (1.23) 2;0 714.0520 (0.90) 5 1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------- benchmark 'test_one_arg[dispatch-torch]': 2 tests -----------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-torch] (NOW) 4.9318 (1.0) 4.9933 (1.0) 4.9609 (1.0) 0.0142 (1.0) 4.9585 (1.0) 0.0149 (1.0) 15;2 201.5761 (1.0) 55 1
test_one_arg[dispatch-torch] (0001_c91948d) 4.9826 (1.01) 5.4021 (1.08) 5.3169 (1.07) 0.0916 (6.43) 5.3375 (1.08) 0.0466 (3.12) 3;3 188.0811 (0.93) 46 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------- benchmark 'test_one_arg[signature-cupy]': 2 tests -----------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-cupy] (NOW) 2.0933 (1.0) 2.5494 (1.0) 2.2743 (1.0) 0.0755 (1.0) 2.3032 (1.0) 0.0573 (1.0) 104;85 439.6938 (1.0) 412 1
test_one_arg[signature-cupy] (0001_c91948d) 3.9292 (1.88) 5.7554 (2.26) 4.3756 (1.92) 0.4172 (5.53) 4.2204 (1.83) 0.5594 (9.77) 61;4 228.5381 (0.52) 235 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------- benchmark 'test_one_arg[signature-device_array]': 2 tests -----------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-device_array] (0001_c91948d) 1.2242 (1.0) 1.9068 (1.0) 1.2712 (1.0) 0.0389 (1.08) 1.2642 (1.0) 0.0191 (1.0) 36;35 786.6426 (1.0) 621 1
test_one_arg[signature-device_array] (NOW) 1.3621 (1.11) 2.0377 (1.07) 1.4072 (1.11) 0.0359 (1.0) 1.4019 (1.11) 0.0281 (1.47) 52;13 710.6116 (0.90) 597 1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------- benchmark 'test_one_arg[signature-torch]': 2 tests ----------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-torch] (0001_c91948d) 3.3064 (1.0) 5.4537 (1.14) 3.6175 (1.0) 0.2671 (2.00) 3.5855 (1.0) 0.1975 (1.0) 14;8 276.4351 (1.0) 196 1
test_one_arg[signature-torch] (NOW) 3.9787 (1.20) 4.7722 (1.0) 4.1866 (1.16) 0.1335 (1.0) 4.2446 (1.18) 0.2394 (1.21) 95;1 238.8596 (0.86) 227 1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
8713b8d to fc60b93 (Compare)
/ok to test


Summary

Refactor kernel argument handling to use `StridedMemoryView` internally, enabling direct support for `__dlpack__` objects and improving interoperability with libraries like CuPy.

Closes: #152
Tracking issue: #128

Key Changes

New capability: Kernel arguments now accept objects with the `__dlpack__` protocol directly (e.g., CuPy arrays).

Internals: Replaced array interface handling with `cuda.core.utils.StridedMemoryView` for:
- `__dlpack__` objects (new)
- `__cuda_array_interface__` objects
- internal device arrays (`DeviceNDArray`)

Performance:
- CuPy arrays: ~3x improvement
- `device_array()` arrays: ~2.5x regression (initial measurements)
- torch tensors: slow; previously torch was going through CAI, but its CAI version isn't supported by `StridedMemoryView`

Performance Trade-off Discussion

The 2.5x slowdown for `device_array()` is worth discussing (and perhaps the torch regression is as well).

Arguments for accepting this regression:
- `__dlpack__` libraries represent the primary ecosystem (or at least the end goal) for GPU computing in Python that we are prioritizing
- `device_array()` is primarily used in legacy code and tests and is deprecated

Why this might be worth merging despite the regression:
- It aligns with the project's direction
- The regression can be revisited if it proves important

Implementation Details

- Added `_to_strided_memory_view()` and `_make_strided_memory_view()` helper functions (numba_cuda/numba/cuda/cudadrv/devicearray.py:247-359)
- LRU-cached `typeof` for CAI objects to reduce type inference overhead (typing/typeof.py:315-365)

Testing

Existing test suite passes.

TL;DR: Adds `__dlpack__` support (~3x faster for CuPy), with ~2.5x regression on legacy `device_array()`. Trade-off favors ecosystem integration, but open to discussion.