
[FEA]: Enable using custom data types with cuda.parallel #3135

Open
1 task done
shwina opened this issue Dec 12, 2024 · 2 comments
Assignees
Labels
feature request New feature or request.

Comments

@shwina
Contributor

shwina commented Dec 12, 2024

Is this a duplicate?

Area

cuda.parallel (Python)

Is your feature request related to a problem? Please describe.

In C++, Thrust enables using algorithms with custom data types:

Example (ChatGPT generated)
#include <iostream>
#include <vector>
#include <climits>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// Define the RGBA struct
struct RGBA {
    int R, G, B, A;

    // Constructor for initializing RGBA values
    __host__ __device__ RGBA(int r, int g, int b, int a) : R(r), G(g), B(b), A(a) {}

    // Default constructor for thrust::reduce to work
    __host__ __device__ RGBA() : R(INT_MIN), G(0), B(0), A(0) {}

    // Optionally, for debugging, we can print RGBA values
    void print() const {
        std::cout << "(" << R << ", " << G << ", " << B << ", " << A << ")\n";
    }
};

// Binary functor returning the RGBA value with the larger R component
struct compare_R {
    __host__ __device__ RGBA operator()(const RGBA& lhs, const RGBA& rhs) const {
        return (lhs.R > rhs.R) ? lhs : rhs;
    }
};

int main() {
    // Create a collection of RGBA colors
    std::vector<RGBA> colors = {
        RGBA(255, 0, 0, 255),
        RGBA(128, 255, 0, 255),
        RGBA(100, 100, 255, 255),
        RGBA(200, 50, 50, 255),
    };

    // Copy the data to the device
    thrust::device_vector<RGBA> d_colors = colors;

    // Use thrust::reduce with a custom functor to find the RGBA struct with the maximum R value
    RGBA max_R_color = thrust::reduce(d_colors.begin(), d_colors.end(), 
                                      RGBA(),  // Default constructor (INT_MIN for R, 0 for others)
                                      compare_R());  // Custom comparison functor

    // Print the RGBA struct with the maximum R value
    std::cout << "RGBA with maximum R value: ";
    max_R_color.print();

    return 0;
}
Output:

RGBA with maximum R value: (255, 0, 0, 255)

We'd like to support the same use-case from Python using cuda.parallel.

Describe the solution you'd like

  1. First, we should implement a POC that shows this is possible purely from the Python side. Likely this would look similar to this numba extension example, which defines a custom numba data type and passes it to a user-defined function; a rough sketch of that registration step is included after this list.

  2. Second, we should decide what the API should look like. We probably don't want users to have to define custom numba data types and the typing/lowering for them, so we should investigate what we can do on their behalf.
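For reference, here is a minimal sketch of what the registration step in (1) might look like using numba's generic extension API. The Pixel name, field layout, and choice of StructModel are illustrative assumptions, not a proposed cuda.parallel API:

from numba import types
from numba.extending import models, register_model, make_attribute_wrapper


# A numba type representing the struct...
class PixelType(types.Type):
    def __init__(self):
        super().__init__(name="Pixel")


pixel_type = PixelType()


# ...backed by a struct data model, so it lowers to an aggregate of three int32s.
@register_model(PixelType)
class PixelModel(models.StructModel):
    def __init__(self, dmm, fe_type):
        members = [("r", types.int32), ("g", types.int32), ("b", types.int32)]
        super().__init__(dmm, fe_type, members)


# Expose the fields so user-defined functions (e.g. a reduction operator)
# can read pixel.r / pixel.g / pixel.b inside jitted code.
for field in ("r", "g", "b"):
    make_attribute_wrapper(PixelType, field, field)

# Constructing Pixel(...) objects inside jitted code would additionally need
# typing and lowering (numba.extending.type_callable / lower_builtin), which
# is exactly the boilerplate that step (2) aims to hide from users.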

Issues around h_init for reduction

In a discussion with @gevtushenko, it came up that Thrust's reduce algorithm requires an initial value for the reduction to be passed as a host value. An open question is how we would pass an appropriate value from Python to the underlying C++ layer. We would either need to define a ctypes struct type corresponding to the numba type (a rough sketch of this option is below), or have the C++ layer accept a pointer to device memory for the h_init argument.
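For example, the ctypes option might look roughly like the following. This is a hypothetical sketch: ctypes_struct_from_dtype and the way the pointer/size would be handed to the C++ layer are assumptions, not existing cuda.parallel APIs:

import ctypes
import numpy as np

pixel_dtype = np.dtype([("r", "int32"), ("g", "int32"), ("b", "int32")])

def ctypes_struct_from_dtype(dtype):
    # Build a ctypes.Structure whose fields mirror the structured dtype.
    fields = [(name, np.ctypeslib.as_ctypes_type(dtype.fields[name][0]))
              for name in dtype.names]
    return type("HostStruct", (ctypes.Structure,), {"_fields_": fields})

PixelStruct = ctypes_struct_from_dtype(pixel_dtype)
h_init = PixelStruct(0, 0, 0)

# The C++ layer could then receive the initial value by address and size, e.g.:
init_ptr = ctypes.byref(h_init)
init_nbytes = ctypes.sizeof(h_init)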

Admittedly, I'm not yet sure what other issues may arise here; I'll update this issue as I explore and learn more.

Describe alternatives you've considered

No response

Additional context

No response

@shwina shwina added the feature request New feature or request. label Dec 12, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Dec 12, 2024
@shwina shwina self-assigned this Dec 12, 2024
@shwina
Contributor Author

shwina commented Dec 14, 2024

I hacked together a prototype for @gpudataclass, an extremely primitive and limited version of @jitclass for the CUDA target. It enables easily registering a custom struct-like data type with numba and operating on arrays-of-structs on the GPU:

Below is what it looks like to use @gpudataclass with cuda.parallel:

"""
Using cuda.parallel to operate on structs ("dataclasses") on the GPU.
"""

import numpy as np
import cupy as cp

import cuda.parallel.experimental as cudax
from gpudataclass import gpudataclass


# The @gpudataclass decorator registers `Pixel` as a user-defined
# numba type.
@gpudataclass
class Pixel:
    r: np.dtype("int32")
    g: np.dtype("int32")
    b: np.dtype("int32")

# This is the comparator we want to pass to `reduce`. It takes
# two Pixel objects as input and returns the one with the
# larger `g` component as output:
def max_g_value(x, y):
    return x if x.g > y.g else y
    
# Next, we need to initialize data on the device. We'll construct
# a CuPy array of size (10, 3) to represent 10 RGB values
# and view it as a structured dtype:
dtype = np.dtype([("r", "int32"), ("g", "int32"), ("b", "int32")])
d_rgb = cp.random.randint(0, 256, (10, 3), dtype=cp.int32).view(dtype)

# Create an empty array to store the output:
d_out = cp.zeros(1, dtype)

# The initial value is provided as a Pixel object:
h_init = Pixel(0, 0, 0)

# Now, we can perform the reduction:

# compute temp storage:
reducer = cudax.reduce_into(d_rgb, d_out, max_g_value, h_init)
temp_storage_bytes = reducer(None, d_rgb, d_out, len(d_rgb), h_init)

# do the reduction:
d_temp_storage = cp.zeros(temp_storage_bytes, dtype=np.uint8)
_ = reducer(d_temp_storage, d_rgb, d_out, len(d_rgb), h_init)

# results:
print()
print("Input RGB values:")
print("-----------------")
print(d_rgb.get())
print()
print("Value with largest g component:")
print("-------------------------------")
print(d_out.get())
print()

output:

Input RGB values:
-----------------
[[( 21,  73,  65)]
 [(216, 238,  89)]
 [(154, 113, 147)]
 [(167, 229,  60)]
 [( 53,  23, 222)]
 [( 92, 162, 231)]
 [(255,  87, 126)]
 [( 64, 254, 241)]
 [(207,  50,  79)]
 [( 41, 166, 206)]]

Value with largest g component:
-------------------------------
[(64, 254, 241)]

The code for the example above, along with the gpudataclass decorator, is here. Making it work also requires some changes to cuda.parallel: https://github.com/shwina/cccl/tree/cuda-parallel-struct-types-hack.


It would be great to get some feedback on whether this is generally a good direction for the API/implementation and what features we want to support in an MVP.

@rwgk
Contributor

rwgk commented Dec 15, 2024

Wow, this looks better than an MVP!

In your comment you wrote:

Below is what it looks like to use @gpujitclass with cuda.parallel:

Did you mean @gpudataclass ?

I think what you have is great. Only two things come to mind looking through the code you posted in the comment above, and one isn't even related to your work:

  • Automatically inferring the numpy dtype from @gpudataclass. — I guess that's on your todo list already?

  • For such demos, having a public helper function that hides the d_temp_storage boilerplate would be great, and it seems super easy.
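For illustration, such a helper could be as simple as the following sketch, layered on top of the cudax.reduce_into API used in the example above; it is hypothetical and not part of cuda.parallel:

import numpy as np
import cupy as cp
import cuda.parallel.experimental as cudax

def reduce(d_in, d_out, op, h_init):
    reducer = cudax.reduce_into(d_in, d_out, op, h_init)
    # Size query: calling with no temporary storage returns the required bytes.
    temp_storage_bytes = reducer(None, d_in, d_out, len(d_in), h_init)
    d_temp_storage = cp.zeros(temp_storage_bytes, dtype=np.uint8)
    # Actual reduction using the allocated temporary storage.
    reducer(d_temp_storage, d_in, d_out, len(d_in), h_init)
    return d_out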

@gevtushenko gevtushenko moved this from Todo to In Progress in CCCL Dec 18, 2024