feat: swap out internal device array usage with StridedMemoryView
#703
Conversation
f5a1c5c to c62d013 (Compare)
/ok to test
This PR can't be merged until the next release of cuda-core, because I depend on some unreleased features there. However, it's still worth reviewing.
I managed to recover a good amount of the devicearray performance by avoiding the SMV conversion entirely and spoofing the interface.

However, there is still a slowdown of ~60%, but only in the many-args case (it's about 15% in the single-argument case). This is much better than the previous commit, which was upwards of 2.5x.
Greptile Overview
Greptile Summary
Refactored kernel argument handling to use StridedMemoryView internally, enabling direct __dlpack__ protocol support and improving CuPy interoperability (~3x speedup).
Key Changes
- Replaced `auto_device()` calls with `_to_strided_memory_view()` for unified array handling
- Added LRU caching to type inference functions (`typeof`, `from_dtype`, `strides_from_shape`) to reduce overhead (see the caching sketch below)
- Converted several properties to `@functools.cached_property` for performance
- Refactored the `Out`/`InOut` classes to use an inheritance pattern with a `copy_input` class variable
- Changed the `strides_from_shape()` API from `order="C"`/`"F"` to boolean flags `c_contiguous`/`f_contiguous`
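For reference, a minimal illustration of the caching pattern the list above refers to; the helper name and key layout here are assumptions, not the project's actual signatures. The idea is that type inference is memoized on hashable inputs, so repeated launches with the same dtype/ndim/layout become dictionary lookups:

```python
import functools

@functools.lru_cache(maxsize=None)
def _cached_array_type(dtype_str: str, ndim: int, layout: str) -> str:
    # Stand-in for the real typeof/from_dtype machinery: build (and remember)
    # a type description for a given dtype/ndim/layout combination.
    return f"array({dtype_str}, {ndim}d, {layout})"

# The first call computes the result; identical later calls hit the cache.
_cached_array_type("float32", 2, "C")
_cached_array_type("float32", 2, "C")
print(_cached_array_type.cache_info())  # hits=1, misses=1
```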
Issues Found
- Logic bug in `strides_from_shape()`: when both `c_contiguous` and `f_contiguous` are False, the function produces incorrect strides (it computes F-contiguous strides and then reverses them, which is neither a C nor an F layout; illustrated below)
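To make the flagged behavior concrete, here is a small self-contained illustration (not the project's `strides_from_shape` implementation) of C- and F-order stride computation, showing that reversing F-order strides yields neither layout:

```python
import numpy as np

def c_strides(shape, itemsize):
    # Row-major: the last axis varies fastest.
    strides, acc = [], itemsize
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))

def f_strides(shape, itemsize):
    # Column-major: the first axis varies fastest.
    strides, acc = [], itemsize
    for dim in shape:
        strides.append(acc)
        acc *= dim
    return tuple(strides)

shape, itemsize = (2, 3), 8
assert c_strides(shape, itemsize) == np.zeros(shape).strides   # (24, 8)
print(f_strides(shape, itemsize))                              # (8, 16)
print(tuple(reversed(f_strides(shape, itemsize))))             # (16, 8): neither C (24, 8) nor F (8, 16)
```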
Performance Trade-offs
The PR documents a ~2.5x regression for legacy device_array() in exchange for ~3x improvement for CuPy arrays. This aligns with the project's strategic direction toward ecosystem integration.
Confidence Score: 4/5
- This PR is safe to merge with one logic issue that needs fixing
- Score reflects well-structured refactoring with proper caching optimizations, but one critical logic bug in `strides_from_shape()` when both contiguity flags are False needs resolution before merge
- numba_cuda/numba/cuda/np/numpy_support.py: fix the `strides_from_shape()` logic for handling non-contiguous arrays
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| numba_cuda/numba/cuda/np/numpy_support.py | 3/5 | Added LRU caching to strides_from_shape and from_dtype; changed API from order parameter to c_contiguous/f_contiguous flags. Logic issue: when both flags are False, function computes F-contiguous strides then reverses them unexpectedly. |
| numba_cuda/numba/cuda/cudadrv/devicearray.py | 4/5 | Added _to_strided_memory_view and _make_strided_memory_view helper functions to support DLPack protocol; converted nbytes and added _strided_memory_view_shim to cached properties. Implementation looks solid. |
| numba_cuda/numba/cuda/args.py | 4/5 | Refactored Out and InOut classes to use StridedMemoryView; changed _numba_type_ to cached property. Clean refactor with proper class inheritance. |
| numba_cuda/numba/cuda/dispatcher.py | 4/5 | Updated kernel argument marshaling to work with StridedMemoryView objects instead of DeviceNDArray. Uses fallback to strides_from_shape when strides not available. |
| numba_cuda/numba/cuda/typing/typeof.py | 5/5 | Added LRU caching to _typeof_cuda_array_interface by extracting logic into cached helper functions. All parameters are hashable, caching is safe and should improve performance. |
| numba_cuda/numba/cuda/np/arrayobj.py | 5/5 | Updated call to strides_from_shape to use new keyword-only argument API with c_contiguous=True. Minimal, straightforward change. |
6 files reviewed, 1 comment
9ff51b9 to 1032275 (Compare)
rparolin left a comment
Generally looks good to me. I'm a bit on the fence about shipping a known performance regression to a deprecated type. I'd feel better if we removed it first instead of regressing on performance. All that being said, the regression has improved from the initially reported 2.5x.
I'd still wait to merge until @gmarkall gives the final 👍
So where we are is that using CuPy arrays has ~3x less latency, but using device arrays or torch tensors has ~60% more latency? On the torch front, NVIDIA/cuda-python#1439 may help in bypassing the slow ...
Almost. Will post new numbers in a bit.
Passing ...
1032275 to 739cb5b (Compare)
Greptile Overview
Greptile Summary
Refactors internal kernel argument handling to use StridedMemoryView from cuda-python, enabling direct __dlpack__ protocol support for external arrays like CuPy. Replaces __cuda_array_interface__ handling with the unified StridedMemoryView API and adds LRU caching to type inference paths to reduce overhead. Performance measurements show ~3x improvement for CuPy arrays but ~2.5x regression for legacy device_array() objects, which the PR justifies as an acceptable trade-off favoring ecosystem integration over deprecated APIs.
Confidence Score: 1/5
- Critical logic bug in strides fallback for 0-dimensional arrays will cause incorrect behavior
- The PR contains a critical logic error in dispatcher.py line 558 where the strides fallback uses
oroperator with potentially empty tuples. For 0-dimensional arrays,strides_in_bytesis legitimately(), but empty tuples are falsy in Python, triggering unnecessary fallback computation. While the fallback should also return(), this indicates a misunderstanding of the truthiness semantics that could mask other issues. Additionally, there are multiple stream handling edge cases around stream=0 that should be verified for correctness. - numba_cuda/numba/cuda/dispatcher.py requires immediate attention for the strides fallback bug; numba_cuda/numba/cuda/args.py needs verification of stream_ptr=0 handling semantics
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| numba_cuda/numba/cuda/dispatcher.py | 1/5 | Refactored kernel argument marshaling to use StridedMemoryView; critical bug in strides fallback logic |
| numba_cuda/numba/cuda/cudadrv/devicearray.py | 2/5 | Added _to_strided_memory_view and _make_strided_memory_view functions for dlpack/CAI conversion; changed nbytes to cached_property |
| numba_cuda/numba/cuda/args.py | 3/5 | Refactored Out and InOut classes to use _to_strided_memory_view; InOut now inherits from Out with copy_input=True |
```python
(layout := devary._layout).strides_in_bytes
or numpy_support.strides_from_shape(
    shape=shape,
    itemsize=itemsize,
    c_contiguous=layout.is_contiguous_c,
    f_contiguous=layout.is_contiguous_f,
)
```
The strides_from_shape fallback is invoked when strides_in_bytes is falsy (empty tuple () or None). For 0-dimensional arrays, strides_in_bytes is legitimately an empty tuple (), but empty tuples are falsy in Python. This means 0-D arrays will incorrectly trigger the fallback computation even when they already have valid (empty) strides.
Bug scenario:

```python
# 0-D array (scalar)
arr = np.array(42)
arr.strides  # returns ()
```

When `devary._layout.strides_in_bytes` is `()`, the `or` operator treats it as falsy and calls `strides_from_shape`, which should also return `()` but wastes computation.
Fix:

```diff
-(layout := devary._layout).strides_in_bytes
-or numpy_support.strides_from_shape(
-    shape=shape,
-    itemsize=itemsize,
-    c_contiguous=layout.is_contiguous_c,
-    f_contiguous=layout.is_contiguous_f,
-)
+kernelargs.extend(
+    strides_in_bytes := (layout := devary._layout).strides_in_bytes
+) if strides_in_bytes is not None else kernelargs.extend(
+    numpy_support.strides_from_shape(
+        shape=shape,
+        itemsize=itemsize,
+        c_contiguous=layout.is_contiguous_c,
+        f_contiguous=layout.is_contiguous_f,
+    )
+)
```
Alternatively, check explicitly for None:

```python
strides = layout.strides_in_bytes if layout.strides_in_bytes is not None else numpy_support.strides_from_shape(...)
kernelargs.extend(strides)
```
grumbles ... fine.
Actually, the fallback computation would also return an empty tuple, right?
Yes, if strides_in_bytes is the empty tuple, then shape will be the empty tuple, and the fallback computation would return the empty tuple.
Other than wasting a function call, why is that a "critical" error?
I also added a benchmark demonstrating the additional overhead. Maybe there's some way that we can reduce that further, but I haven't looked into it.
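For reference, a sketch of what one of these dispatch-latency benchmarks could look like with pytest-benchmark; the kernel body, sizes, and fixture usage here are placeholders rather than the actual suite, which also parametrizes CuPy/torch/device_array inputs and a pre-specialized "signature" variant:

```python
import numpy as np
from numba import cuda

@cuda.jit
def noop(x):
    pass

def test_one_arg(benchmark):
    arr = cuda.device_array(16, dtype=np.float32)
    noop[1, 1](arr)                      # warm-up so compilation stays out of the timing
    benchmark(lambda: noop[1, 1](arr))   # measure argument handling + launch overhead
```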
The devicearray regressions are somewhat concerning, but given we are actively working towards deprecating it, I think it would still be worth it. Do we have a sense of the follow-up work from here that would help ameliorate the performance overheads related to torch?

At least some of the remaining overhead is related to stream synchronization, but that may be justified/useful in some cases, I'm guessing. After that, I'm not sure; it will require more investigation. Just to make sure we're on the same page: our expectation is that if an array is on device, then the kernel launch overhead should amount to a collection of relatively cheap attribute accesses. Is that correct?
My 2c: numba-cuda shouldn't be in the business of handling stream synchronization; if someone is passing an array on a different stream through dlpack / CAI, it becomes their responsibility to launch the kernel on a stream that is synchronized with respect to the passed stream. This is likely a breaking change that would need to be clearly and loudly deprecated and subsequently removed.
Yes. Kernel launch latency is quite important; we should aim for less than 1us of overhead.
Got it. Yeah, I don't really know enough about how this functionality is used or assumed to be used to have an informed opinion (yet!), but simply removing sync (in the torch case by passing -1 as the stream pointer to ...
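For context, a sketch of the opt-out being described. The DLPack protocol reserves stream=-1 for "do not synchronize", and the `from_dlpack`/`stream_ptr` usage below mirrors what this PR does against an unreleased cuda-core, so the import path and signature should be treated as assumptions:

```python
import torch
from cuda.core.utils import StridedMemoryView  # may live under cuda.core.experimental.utils in released versions

t = torch.arange(8, device="cuda")
consumer_stream = torch.cuda.current_stream().cuda_stream  # raw stream handle as an int

# Current behavior: hand the producer our stream so it can synchronize against it.
view_sync = StridedMemoryView.from_dlpack(t, stream_ptr=consumer_stream)

# Proposed opt-out: -1 means "no synchronization"; ordering becomes the caller's job.
view_nosync = StridedMemoryView.from_dlpack(t, stream_ptr=-1)
```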
Roger that. I think we can get there, or if not, very close.
739cb5b to 891ccb7 (Compare)
/ok to test
2 files reviewed, 2 comments
```python
return StridedMemoryView.from_dlpack(
    obj, stream_ptr=getattr(stream, "handle", stream)
), False
elif (desc := getattr(obj, "__cuda_array_interface__", None)) is not None:
    smv = StridedMemoryView.from_cuda_array_interface(
        obj, stream_ptr=int(getattr(stream, "handle", stream))
```
Inconsistent handling of stream_ptr conversion: line 1020 uses `getattr(stream, "handle", stream)` without casting to int, while line 1024 uses `int(getattr(stream, "handle", stream))`.

This inconsistency could potentially cause type mismatches depending on what `StridedMemoryView.from_dlpack` and `StridedMemoryView.from_cuda_array_interface` expect.

For consistency and type safety, both should use the same pattern. Looking at line 1037 and line 58 in args.py, it appears the int conversion is sometimes used. Consider standardizing this across all usages.
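A minimal sketch of the standardization being suggested; the helper name is hypothetical:

```python
def _stream_ptr(stream) -> int:
    """Return a raw stream pointer as an int, accepting either a stream
    object exposing .handle or a bare integer handle."""
    return int(getattr(stream, "handle", stream))
```

Both the `from_dlpack` and `from_cuda_array_interface` call sites could then pass `stream_ptr=_stream_ptr(stream)`.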
An option available to us is, instead of actually synchronizing the stream (assuming we're calling ...

I tried rolling some C-API classes for the two ...
891ccb7 to e26196a (Compare)
8 files reviewed, 1 comment
I included the pure C classes here for reference, but I am going to revert those changes because they don't really improve anything.
```c
    Py_TYPE(op)->tp_free(op);
}

static int StridedMemoryViewShim_init(PyObject *op, PyObject *args, PyObject *kwds) {
```
Not sure if any of the seen overhead is coming from here, but with StridedMemoryView being a cython cdef class, all of this information is available via the underlying C struct as native types and we could avoid the whole boxing, unboxing, boxing, unboxing of Python objects that I suspect is happening here.
There's no boxing or unboxing here, these all remain as untouched pyobjects. The pure python version of this code isn't measurably slower, so I'm going to revert back to that.
I implemented this in Cython as well and construction is faster by 1.5x-2x, but attribute access is slower by 1.5x-2x.
Both operations are performed, but attribute access is more common. The constructor is only called once per instance because it's a functools.cached_property whereas these various attributes are accessed once per kernel launch call.
`__slots__` is apparently the best we can do in this particular situation.
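A hedged sketch (attribute names assumed) of the `__slots__` approach: instances carry no per-instance `__dict__`, so construction stays lightweight and the once-per-launch attribute reads are plain slot lookups.

```python
class StridedMemoryViewShim:
    __slots__ = ("ptr", "shape", "strides", "dtype", "is_device_accessible")

    def __init__(self, ptr, shape, strides, dtype, is_device_accessible=True):
        self.ptr = ptr
        self.shape = shape
        self.strides = strides
        self.dtype = dtype
        self.is_device_accessible = is_device_accessible
```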
Sorry, I meant that in the overall pipeline leading to kernel launch, we eventually need to lower these Python objects down to native types in order to pass them into the kernel via the numba-cuda ABI. If we were able to keep things as native types from the beginning, and never raise them back up to Python objects before the conversion down to the numba-cuda ABI for kernel launch, there would be a potential win. That's probably a mountain of work though.
If all the argument preparation can be done in cuda-core, down in Cython land, then yeah, that would be ideal.
It might be worth it to start planning that effort now if we think it will bear fruit.
Yes, 100%. We don't currently have StridedMemoryView plumbed into how we lower Python objects for kernel launch: https://github.com/NVIDIA/cuda-python/blob/main/cuda_core/cuda/core/_kernel_arg_handler.pyx but it absolutely could be added. I'm sure there are ABI convention things that need to be handled; that's the big thing.
e26196a to 8843b4a (Compare)
7 files reviewed, 1 comment
8843b4a to 5beaae5 (Compare)
The implementation generally looks good to me. Would be good to get a final round of benchmarks to evaluate where the improvements / regressions are.
96eeaa1 to 8713b8d (Compare)
/ok to test
Ran this through an LLM to get a nice summary of the final benchmark results. High-level takeaways (NOW vs baseline `0001_c91948d`):

test_many_args
| Variant | Baseline mean (ms) | NOW mean (ms) | Change |
|---|---|---|---|
| dispatch-cupy | 82.40 | 27.21 | -67% (3.0× faster) |
| dispatch-device_array | 6.77 | 10.48 | +55% (1.55× slower) |
| dispatch-torch | 70.60 | 64.31 | -9% (1.10× faster) |
| signature-cupy | 62.79 | 19.80 | -68% (3.17× faster) |
| signature-device_array | 6.91 | 10.78 | +56% (1.56× slower) |
| signature-torch | 49.94 | 54.77 | +10% (1.10× slower) |
test_one_arg
| Variant | Baseline mean (ms) | NOW mean (ms) | Change |
|---|---|---|---|
| dispatch-cupy | 5.91 | 2.90 | -51% (2.0× faster) |
| dispatch-device_array | 1.27 | 1.40 | +11% slower |
| dispatch-torch | 5.32 | 4.96 | -7% faster |
| signature-cupy | 4.38 | 2.27 | -48% (1.9× faster) |
| signature-device_array | 1.27 | 1.41 | +11% slower |
| signature-torch | 3.62 | 4.19 | +16% slower |
Current absolute performance (NOW, mean time)
- Many args: device_array (~10–11 ms) is still fastest, then CuPy (~20–27 ms), then Torch (~55–64 ms).
- One arg: device_array (~1.4 ms) fastest, then CuPy (~2.3–2.9 ms), then Torch (~4.2–5.0 ms).
Raw data

```text
------------------------------------------------------------------------------ benchmark 'test_many_args[dispatch-cupy]': 2 tests ------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-cupy] (NOW) 26.5499 (1.0) 28.4821 (1.0) 27.2123 (1.0) 0.5466 (1.0) 27.0107 (1.0) 0.7143 (1.0) 6;0 36.7481 (1.0) 22 1
test_many_args[dispatch-cupy] (0001_c91948d) 80.1594 (3.02) 84.9900 (2.98) 82.3980 (3.03) 1.5929 (2.91) 82.2436 (3.04) 2.3078 (3.23) 4;0 12.1362 (0.33) 10 1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------- benchmark 'test_many_args[dispatch-device_array]': 2 tests ------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-device_array] (0001_c91948d) 6.4306 (1.0) 7.0619 (1.0) 6.7681 (1.0) 0.1830 (1.06) 6.7565 (1.0) 0.3377 (1.82) 15;0 147.7515 (1.0) 30 1
test_many_args[dispatch-device_array] (NOW) 10.1953 (1.59) 10.9627 (1.55) 10.4819 (1.55) 0.1723 (1.0) 10.4611 (1.55) 0.1854 (1.0) 8;2 95.4026 (0.65) 33 1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------ benchmark 'test_many_args[dispatch-torch]': 2 tests ------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[dispatch-torch] (NOW) 63.9991 (1.0) 65.4999 (1.0) 64.3109 (1.0) 0.3777 (1.0) 64.2809 (1.0) 0.2062 (1.0) 1;1 15.5495 (1.0) 13 1
test_many_args[dispatch-torch] (0001_c91948d) 68.3296 (1.07) 73.8380 (1.13) 70.5973 (1.10) 1.8827 (4.98) 70.1094 (1.09) 2.6950 (13.07) 5;0 14.1648 (0.91) 11 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------ benchmark 'test_many_args[signature-cupy]': 2 tests ------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-cupy] (NOW) 19.3784 (1.0) 21.0086 (1.0) 19.7990 (1.0) 0.4120 (1.0) 19.6576 (1.0) 0.4164 (1.0) 9;4 50.5075 (1.0) 51 1
test_many_args[signature-cupy] (0001_c91948d) 60.1597 (3.10) 67.1960 (3.20) 62.7940 (3.17) 2.0793 (5.05) 63.2831 (3.22) 3.3902 (8.14) 6;0 15.9251 (0.32) 17 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------ benchmark 'test_many_args[signature-device_array]': 2 tests -------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-device_array] (0001_c91948d) 6.6276 (1.0) 10.7366 (1.0) 6.9080 (1.0) 0.4227 (2.08) 6.7669 (1.0) 0.2620 (1.91) 16;15 144.7594 (1.0) 143 1
test_many_args[signature-device_array] (NOW) 10.5675 (1.59) 11.8660 (1.11) 10.7840 (1.56) 0.2029 (1.0) 10.7284 (1.59) 0.1375 (1.0) 10;6 92.7299 (0.64) 87 1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------ benchmark 'test_many_args[signature-torch]': 2 tests ------------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_many_args[signature-torch] (0001_c91948d) 48.2905 (1.0) 52.6185 (1.0) 49.9353 (1.0) 1.3081 (1.0) 49.6528 (1.0) 2.2560 (1.0) 8;0 20.0259 (1.0) 20 1
test_many_args[signature-torch] (NOW) 51.6698 (1.07) 57.1802 (1.09) 54.7654 (1.10) 1.8941 (1.45) 55.9016 (1.13) 3.7764 (1.67) 7;0 18.2597 (0.91) 19 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------- benchmark 'test_one_arg[dispatch-cupy]': 2 tests ----------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-cupy] (NOW) 2.8639 (1.0) 3.1819 (1.0) 2.9041 (1.0) 0.0440 (1.0) 2.8976 (1.0) 0.0236 (1.0) 2;4 344.3451 (1.0) 66 1
test_one_arg[dispatch-cupy] (0001_c91948d) 5.4198 (1.89) 6.7414 (2.12) 5.9068 (2.03) 0.3580 (8.14) 5.6831 (1.96) 0.5460 (23.13) 7;0 169.2971 (0.49) 43 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------- benchmark 'test_one_arg[dispatch-device_array]': 2 tests ----------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-device_array] (0001_c91948d) 1.2494 (1.0) 1.3038 (1.0) 1.2674 (1.0) 0.0213 (1.04) 1.2601 (1.0) 0.0212 (1.0) 1;0 789.0171 (1.0) 5 1
test_one_arg[dispatch-device_array] (NOW) 1.3719 (1.10) 1.4275 (1.09) 1.4005 (1.10) 0.0206 (1.0) 1.3975 (1.11) 0.0260 (1.23) 2;0 714.0520 (0.90) 5 1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------- benchmark 'test_one_arg[dispatch-torch]': 2 tests -----------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[dispatch-torch] (NOW) 4.9318 (1.0) 4.9933 (1.0) 4.9609 (1.0) 0.0142 (1.0) 4.9585 (1.0) 0.0149 (1.0) 15;2 201.5761 (1.0) 55 1
test_one_arg[dispatch-torch] (0001_c91948d) 4.9826 (1.01) 5.4021 (1.08) 5.3169 (1.07) 0.0916 (6.43) 5.3375 (1.08) 0.0466 (3.12) 3;3 188.0811 (0.93) 46 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------- benchmark 'test_one_arg[signature-cupy]': 2 tests -----------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-cupy] (NOW) 2.0933 (1.0) 2.5494 (1.0) 2.2743 (1.0) 0.0755 (1.0) 2.3032 (1.0) 0.0573 (1.0) 104;85 439.6938 (1.0) 412 1
test_one_arg[signature-cupy] (0001_c91948d) 3.9292 (1.88) 5.7554 (2.26) 4.3756 (1.92) 0.4172 (5.53) 4.2204 (1.83) 0.5594 (9.77) 61;4 228.5381 (0.52) 235 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------- benchmark 'test_one_arg[signature-device_array]': 2 tests -----------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-device_array] (0001_c91948d) 1.2242 (1.0) 1.9068 (1.0) 1.2712 (1.0) 0.0389 (1.08) 1.2642 (1.0) 0.0191 (1.0) 36;35 786.6426 (1.0) 621 1
test_one_arg[signature-device_array] (NOW) 1.3621 (1.11) 2.0377 (1.07) 1.4072 (1.11) 0.0359 (1.0) 1.4019 (1.11) 0.0281 (1.47) 52;13 710.6116 (0.90) 597 1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------- benchmark 'test_one_arg[signature-torch]': 2 tests ----------------------------------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_one_arg[signature-torch] (0001_c91948d) 3.3064 (1.0) 5.4537 (1.14) 3.6175 (1.0) 0.2671 (2.00) 3.5855 (1.0) 0.1975 (1.0) 14;8 276.4351 (1.0) 196 1
test_one_arg[signature-torch] (NOW) 3.9787 (1.20) 4.7722 (1.0) 4.1866 (1.16) 0.1335 (1.0) 4.2446 (1.18) 0.2394 (1.21) 95;1 238.8596 (0.86) 227 1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
8713b8d to fc60b93 (Compare)
/ok to test


Summary

Refactor kernel argument handling to use `StridedMemoryView` internally, enabling direct support for `__dlpack__` objects and improving interoperability with libraries like CuPy.

Closes: #152
Tracking issue: #128

Key Changes

New capability: Kernel arguments now accept objects with the `__dlpack__` protocol directly (e.g., CuPy arrays).

Internals: Replaced array interface handling with `cuda.core.utils.StridedMemoryView` for:
- `__dlpack__` objects (new)
- `__cuda_array_interface__` objects
- internal device arrays (`DeviceNDArray`)

Performance:
- CuPy arrays: ~3x improvement
- `device_array()` arrays: ~2.5x regression (initial measurements)
- torch tensors: slow; previously torch was going through CAI, but its CAI version isn't supported by `StridedMemoryView`

Performance Trade-off Discussion

The 2.5x slowdown for `device_array()` is worth discussing (and perhaps the torch regression is as well).

Arguments for accepting this regression:
- `__dlpack__` libraries represent the primary ecosystem (or at least the end goal) for GPU computing in Python that we are prioritizing
- `device_array()` is primarily used in legacy code and tests and is deprecated

Why this might be worth merging despite the regression:
- It aligns with the project's direction
- The regression can be revisited if it proves important

Implementation Details

- Added `_to_strided_memory_view()` and `_make_strided_memory_view()` helper functions (numba_cuda/numba/cuda/cudadrv/devicearray.py:247-359)
- LRU-cached `typeof` for CAI objects to reduce type inference overhead (typing/typeof.py:315-365)

Testing

Existing test suite passes.

TL;DR: Adds `__dlpack__` support (~3x faster for CuPy), with ~2.5x regression on legacy `device_array()`. Trade-off favors ecosystem integration, but open to discussion.