Conversation

@ZzEeKkAa
Contributor

@ZzEeKkAa ZzEeKkAa commented Jul 21, 2025

Vendor the NopythonTypeInference pass and modify it to allow returning array views from device functions, so that the original array can be mutated through the returned slice:

@cuda.jit(device=True, forceinline=True)
def slice_array(a, x_id, x_size, y_id, y_size):
    return a[
        x_id * x_size : (x_id + 1) * x_size : 1,
        y_id * y_size : (y_id + 1) * y_size : 1,
    ]
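
With this in place, a kernel can mutate the original array through the returned view. A hypothetical usage sketch (the grid and tile parameters are illustrative, not from this PR):

from numba import cuda

@cuda.jit
def kernel(a):
    # Each (block, thread) pair takes a distinct 2x2 tile view of `a` and
    # writes through it; the write lands in `a` itself because the
    # returned slice is a view, not a copy.
    tile = slice_array(a, cuda.blockIdx.x, 2, cuda.threadIdx.x, 2)
    tile[0, 0] = 1.0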

Fixes: #221

How it works

Instead of maintaining two separate lists of cast values and argument values, the pass populates a whitelist of variables that may be returned. Ideally this should be upstreamed to Numba, since exactly the same problem exists there; it only occurs in nopython mode with NRT disabled.
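
A minimal sketch of the whitelist idea (illustrative only, not the vendored code):

from numba.core import ir

def collect_returnable_vars(func_ir, arg_names):
    # Start from the function arguments; anything derived from them by
    # aliasing or slicing (a view) is also considered safe to return.
    whitelist = set(arg_names)
    for block in func_ir.blocks.values():
        for inst in block.body:
            if not isinstance(inst, ir.Assign):
                continue
            value = inst.value
            if isinstance(value, ir.Var) and value.name in whitelist:
                # b = a: a simple alias of a whitelisted variable
                whitelist.add(inst.target.name)
            elif (
                isinstance(value, ir.Expr)
                and value.op == "getitem"
                and value.value.name in whitelist
            ):
                # b = a[...]: a view of a whitelisted array
                whitelist.add(inst.target.name)
    return whitelist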

Why is it safe

We are practically just making a view of an array, not creating a new array, so no memory allocations or leaks are introduced.
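
The same view semantics are easy to demonstrate with NumPy on the host:

import numpy as np

a = np.arange(6, dtype=np.float32).reshape(2, 3)
v = a[0:1, 0:2]         # basic slicing returns a view, not a copy
assert v.base is a      # the view borrows the parent's buffer
v[0, 0] = 42.0
assert a[0, 0] == 42.0  # writes through the view land in the original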

@copy-pr-bot

copy-pr-bot bot commented Jul 21, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@gmarkall gmarkall added the 2 - In Progress Currently a work in progress label Jul 21, 2025
@gmarkall
Contributor

/ok to test

Comment on lines 315 to 316
if inst.value.value.name in whitelist_vars:
    whitelist_vars.add(inst.target.name)
Contributor

Is there a danger that this misses transitively allowing variables where the blocks aren't visited in the correct order? Does propagation of allowed variables need to iterate to a fixpoint instead? I'm thinking of a case like

if cond:
    b = a[:, 1]
c = b
return c

If the block after the if is traversed first, is there a risk that returning c is disallowed?
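
One way to make the propagation order-independent is to iterate to a fixpoint. A minimal sketch (derives_from_whitelist is a hypothetical predicate for "this value aliases or takes a view of a whitelisted variable"):

from numba.core import ir

def propagate_to_fixpoint(func_ir, whitelist):
    # Re-scan all blocks until no new variables are added, so the result
    # cannot depend on the order in which blocks are visited.
    changed = True
    while changed:
        changed = False
        for block in func_ir.blocks.values():
            for inst in block.body:
                if not isinstance(inst, ir.Assign):
                    continue
                if (
                    derives_from_whitelist(inst.value, whitelist)
                    and inst.target.name not in whitelist
                ):
                    whitelist.add(inst.target.name)
                    changed = True
    return whitelist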

Contributor Author

Yeah, that's a valid point. My thinking was that blocks are properly ordered, and a variable can only be referenced below its definition. Is it possible for blocks to be unordered or nested?

Contributor

I was a bit apprehensive because I'm not certain what the ordering is, or is guaranteed to be. I'm also wondering whether phi nodes will be a problem.

Contributor Author

I've updated it to use a forest of trees to eliminate any issues with block ordering; a sketch of the idea follows.
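
A minimal sketch of what such a derivation forest could look like (illustrative only, not the vendored pass):

from numba.core import ir

def build_derivation_forest(func_ir):
    # Map every assigned variable to the variable it derives from (alias
    # or view). Each tree in the forest is rooted at a variable with no
    # recorded parent, such as a function argument.
    parent = {}
    for block in func_ir.blocks.values():
        for inst in block.body:
            if not isinstance(inst, ir.Assign):
                continue
            value = inst.value
            if isinstance(value, ir.Var):
                parent[inst.target.name] = value.name
            elif isinstance(value, ir.Expr) and value.op == "getitem":
                parent[inst.target.name] = value.value.name
    return parent

def is_arg_backed(name, parent, arg_names):
    # Walk up to the root; block visit order is irrelevant because the
    # whole forest is built before any query is answered.
    while name in parent:
        name = parent[name]
    return name in arg_names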

Contributor

@gmarkall gmarkall left a comment

I think this is a good idea in principle. I have a couple of questions on the diff (and await the addition of tests).

@gmarkall gmarkall added 4 - Waiting on author Waiting for author to respond to review and removed 2 - In Progress Currently a work in progress labels Jul 22, 2025


def array_local(shape, dtype):
    return cuda.local.array(shape, dtype=dtype)
Contributor Author

@gmarkall I know this is conceptually wrong, since we are trying to return a pointer to stack memory. However, if we mark the function forceinline it should turn into valid code; as far as I know, though, it is against LLVM's design to generate invalid code that only becomes valid because of forced inlining. Do you have any idea how this could potentially be achieved? I have one use case in nvmath that would benefit from it.

@ZzEeKkAa ZzEeKkAa changed the title [WIP] Feature: allow return array Feature: allow return array Jul 22, 2025
@ZzEeKkAa ZzEeKkAa marked this pull request as ready for review July 22, 2025 14:23
@copy-pr-bot

copy-pr-bot bot commented Jul 22, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


@ZzEeKkAa ZzEeKkAa added 4 - Waiting on reviewer Waiting for reviewer to respond to author and removed 4 - Waiting on author Waiting for author to respond to review labels Jul 22, 2025
@ZzEeKkAa ZzEeKkAa self-assigned this Jul 22, 2025
@ZzEeKkAa
Contributor Author

/ok to test

@ZzEeKkAa ZzEeKkAa requested a review from gmarkall July 22, 2025 14:24
@ZzEeKkAa
Contributor Author

/ok to test

@ZzEeKkAa
Contributor Author

Is it me messing up the test, or is this out of scope for this PR:

Compilation is falling back to object mode WITHOUT looplifting enabled because Function "init_xoroshiro128p_states_cpu" failed type inference due to: Invalid use of type(CPUDispatcher(<function init_xoroshiro128p_state at 0x777b44828400>)) with parameters (array(Record(s0[type=uint64;offset=0],s1[type=uint64;offset=8];16;True), 1d, C), Literal[int](0), uint64)

@gmarkall
Contributor

Is it me messing up the test, or is this out of scope for this PR:

That's not you, the simulator always does that. It's a bit hard to fix and not really critical so it's never got to the top of the priority list.

@ZzEeKkAa
Contributor Author

/ok to test

@ZzEeKkAa
Contributor Author

/ok to test

@ZzEeKkAa ZzEeKkAa added 4 - Waiting on reviewer Waiting for reviewer to respond to author and removed 4 - Waiting on author Waiting for author to respond to review labels Jul 31, 2025
Contributor

@gmarkall gmarkall left a comment

Thanks for the fixes! I tried adding a few more test cases and I found that arguments aren't tracked through tuples - I've pushed these now, and the ones that use tuples to hold array arguments are the failing ones.

  • For "getitem" ops (presently handled, but they don't traverse tuples), I think it will be necessary to ensure that all of the tuple elements are an argument.
  • For "static_getitem" ops (presently not handled in the code), it should be sufficient to ensure that only the indexed item in the tuple is an argument.

@gmarkall
Contributor

gmarkall commented Aug 1, 2025

/ok to test

@gmarkall gmarkall added 4 - Waiting on author Waiting for author to respond to review and removed 4 - Waiting on reviewer Waiting for reviewer to respond to author labels Aug 1, 2025
@ZzEeKkAa
Contributor Author

ZzEeKkAa commented Aug 5, 2025

/ok to test

@ZzEeKkAa
Contributor Author

ZzEeKkAa commented Aug 5, 2025

/ok to test

@gmarkall gmarkall added 4 - Waiting on reviewer Waiting for reviewer to respond to author and removed 4 - Waiting on author Waiting for author to respond to review labels Aug 6, 2025
Contributor

@gmarkall gmarkall left a comment

<posting a review because I have a couple of pending comments and don't want to lose them if I push an update>

Comment on lines +234 to +253
@pytest.mark.xfail(reason="Returning local arrays is not yet supported")
@skip_on_cudasim("type inference is unsupported in the simulator")
def test_array_local(self):
    @cuda.jit
    def array_local_fp32(size):
        return cuda.local.array(size, dtype=np.float32)

    @cuda.jit
    def kernel(r):
        x = array_local_fp32(2)
        x[0], x[1] = 1.0, 2.0

        r[0] = x[0] + x[1]

    r = np.zeros(1, dtype=np.float32)

    kernel[1, 1](r)

    np.testing.assert_equal(r, [3.0])

Contributor

I don't think we should have this xfailing test - it contradicts the test_array_local_illegal test above. It's not clear to me how we could have a valid way to return a local array.

Suggested change
@pytest.mark.xfail(reason="Returning local arrays is not yet supported")
@skip_on_cudasim("type inference is unsupported in the simulator")
def test_array_local(self):
    @cuda.jit
    def array_local_fp32(size):
        return cuda.local.array(size, dtype=np.float32)

    @cuda.jit
    def kernel(r):
        x = array_local_fp32(2)
        x[0], x[1] = 1.0, 2.0
        r[0] = x[0] + x[1]

    r = np.zeros(1, dtype=np.float32)
    kernel[1, 1](r)
    np.testing.assert_equal(r, [3.0])

Contributor Author

In C++ it is possible with constexpr

return b

# c in the loop is a local array
# TODO: do we want to support local and shared arrays?
Contributor

I don't think we want to support local and shared arrays being returned from a device function that declares them. Local arrays seem like a case we shouldn't support, but I'm less sure about shared arrays - does CUDA C++ allow you to return a shared array that a device function created?

Contributor Author

In C++ it is possible with constexpr

@gmarkall
Contributor

/ok to test

@gmarkall gmarkall closed this Aug 18, 2025
@gmarkall gmarkall reopened this Aug 18, 2025
@NVIDIA NVIDIA deleted a comment from CLAassistant Sep 29, 2025