
[FEA]: Enable using custom data types with cuda.parallel #3135

Open
1 task done
shwina opened this issue Dec 12, 2024 · 2 comments
Assignees
Labels
feature request New feature or request.

Comments

@shwina
Contributor

shwina commented Dec 12, 2024

Is this a duplicate?

Area

cuda.parallel (Python)

Is your feature request related to a problem? Please describe.

In C++, Thrust enables using algorithms with custom data types:

Example (ChatGPT generated)
#include <iostream>
#include <vector>
#include <climits>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// Define the RGBA struct
struct RGBA {
    int R, G, B, A;

    // Constructor for initializing RGBA values
    __host__ __device__ RGBA(int r, int g, int b, int a) : R(r), G(g), B(b), A(a) {}

    // Default constructor for thrust::reduce to work
    __host__ __device__ RGBA() : R(INT_MIN), G(0), B(0), A(0) {}

    // Optionally, for debugging, we can print RGBA values
    void print() const {
        std::cout << "(" << R << ", " << G << ", " << B << ", " << A << ")\n";
    }
};

// Binary functor returning the RGBA value with the larger R component
struct compare_R {
    __host__ __device__ RGBA operator()(const RGBA& lhs, const RGBA& rhs) const {
        return (lhs.R > rhs.R) ? lhs : rhs;
    }
};

int main() {
    // Create a collection of RGBA colors
    std::vector<RGBA> colors = {
        RGBA(255, 0, 0, 255),
        RGBA(128, 255, 0, 255),
        RGBA(100, 100, 255, 255),
        RGBA(200, 50, 50, 255),
    };

    // Copy the data to the device
    thrust::device_vector<RGBA> d_colors = colors;

    // Use thrust::reduce with a custom functor to find the RGBA struct with the maximum R value
    RGBA max_R_color = thrust::reduce(d_colors.begin(), d_colors.end(), 
                                      RGBA(),  // Default constructor (INT_MIN for R, 0 for others)
                                      compare_R());  // Custom comparison functor

    // Print the RGBA struct with the maximum R value
    std::cout << "RGBA with maximum R value: ";
    max_R_color.print();

    return 0;
}
Output:

RGBA with maximum R value: (255, 0, 0, 255)

We'd like to support the same use-case from Python using cuda.parallel.

Describe the solution you'd like

  1. First, we should implement a POC that shows this is possible purely from the Python side. Likely this would look similar to this numba extension example, which defines a custom numba data type and passes it to a user-defined function; a rough sketch of that registration step is included after this list.

  2. Second, we should decide what the API should look like. We probably don't want users to have to define custom numba data types and the typing/lowering for them, so we should investigate what we can do on their behalf.
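For reference, here is a minimal sketch of what the registration step in (1) might look like using numba's generic extension API. The Pixel name, field layout, and choice of StructModel are illustrative assumptions, not a proposed cuda.parallel API:

from numba import types
from numba.extending import models, register_model, make_attribute_wrapper


# A numba type representing the struct...
class PixelType(types.Type):
    def __init__(self):
        super().__init__(name="Pixel")


pixel_type = PixelType()


# ...backed by a struct data model, so it lowers to an aggregate of three int32s.
@register_model(PixelType)
class PixelModel(models.StructModel):
    def __init__(self, dmm, fe_type):
        members = [("r", types.int32), ("g", types.int32), ("b", types.int32)]
        super().__init__(dmm, fe_type, members)


# Expose the fields so user-defined functions (e.g. a reduction operator)
# can read pixel.r / pixel.g / pixel.b inside jitted code.
for field in ("r", "g", "b"):
    make_attribute_wrapper(PixelType, field, field)

# Constructing Pixel(...) objects inside jitted code would additionally need
# typing and lowering (numba.extending.type_callable / lower_builtin), which
# is exactly the boilerplate that step (2) aims to hide from users.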

Issues around h_init for reduction

In a discussion with @gevtushenko, it came up that Thrust's reduce algorithm requires an initial value for the reduction to be passed as a host value. An open question is how we would pass an appropriate value from Python to the underlying C++ layer. We would either need to define a ctypes struct type corresponding to the numba type (a rough sketch of this option is below), or have the C++ layer accept a pointer to device memory for the h_init argument.
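For example, the ctypes option might look roughly like the following. This is a hypothetical sketch: ctypes_struct_from_dtype and the way the pointer/size would be handed to the C++ layer are assumptions, not existing cuda.parallel APIs:

import ctypes
import numpy as np

pixel_dtype = np.dtype([("r", "int32"), ("g", "int32"), ("b", "int32")])

def ctypes_struct_from_dtype(dtype):
    # Build a ctypes.Structure whose fields mirror the structured dtype.
    fields = [(name, np.ctypeslib.as_ctypes_type(dtype.fields[name][0]))
              for name in dtype.names]
    return type("HostStruct", (ctypes.Structure,), {"_fields_": fields})

PixelStruct = ctypes_struct_from_dtype(pixel_dtype)
h_init = PixelStruct(0, 0, 0)

# The C++ layer could then receive the initial value by address and size, e.g.:
init_ptr = ctypes.byref(h_init)
init_nbytes = ctypes.sizeof(h_init)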

Admittedly, I'm not yet sure what other issues may arise here; I'll update this issue as I explore and learn more.

Describe alternatives you've considered

No response

Additional context

No response

@shwina shwina added the feature request New feature or request. label Dec 12, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Dec 12, 2024
@shwina shwina self-assigned this Dec 12, 2024
@shwina
Contributor Author

shwina commented Dec 14, 2024

I hacked together a prototype for @gpudataclass, an extremely primitive and limited version of @jitclass for the CUDA target. It enables easily registering a custom struct-like data type with numba and operating on arrays-of-structs on the GPU:

Below is what it looks like to use @gpudataclass with cuda.parallel:

"""
Using cuda.parallel to operate on structs ("dataclasses") on the GPU.
"""

import numpy as np
import cupy as cp

import cuda.parallel.experimental as cudax
from gpudataclass import gpudataclass


# The @gpudataclass decorator registers `Pixel` as a user-defined
# numba type.
@gpudataclass
class Pixel:
    r: np.dtype("int32")
    g: np.dtype("int32")
    b: np.dtype("int32")

# This is the comparator we want to pass to `reduce`. It takes
# two Pixel objects as input and returns the one with the
# larger `g` component as output:
def max_g_value(x, y):
    return x if x.g > y.g else y
    
# Next, we need to initialize data on the device. We'll construct
# a CuPy array of size (10, 3) to represent 10 RGB values
# and view it as a structured dtype:
dtype = np.dtype([("r", "int32"), ("g", "int32"), ("b", "int32")])
d_rgb = cp.random.randint(0, 256, (10, 3), dtype=cp.int32).view(dtype)

# Create an empty array to store the output:
d_out = cp.zeros(1, dtype)

# The initial value is provided as a Pixel object:
h_init = Pixel(0, 0, 0)

# Now, we can perform the reduction:

# compute temp storage:
reducer = cudax.reduce_into(d_rgb, d_out, max_g_value, h_init)
temp_storage_bytes = reducer(None, d_rgb, d_out, len(d_rgb), h_init)

# do the reduction:
d_temp_storage = cp.zeros(temp_storage_bytes, dtype=np.uint8)
_ = reducer(d_temp_storage, d_rgb, d_out, len(d_rgb), h_init)

# results:
print()
print("Input RGB values:")
print("-----------------")
print(d_rgb.get())
print()
print("Value with largest g component:")
print("-------------------------------")
print(d_out.get())
print()

output:

Input RGB values:
-----------------
[[( 21,  73,  65)]
 [(216, 238,  89)]
 [(154, 113, 147)]
 [(167, 229,  60)]
 [( 53,  23, 222)]
 [( 92, 162, 231)]
 [(255,  87, 126)]
 [( 64, 254, 241)]
 [(207,  50,  79)]
 [( 41, 166, 206)]]

Value with largest g component:
-------------------------------
[(64, 254, 241)]

The code for the example above, along with the gpudataclass decorator, is here. Making it work also requires some changes to cuda.parallel: https://github.com/shwina/cccl/tree/cuda-parallel-struct-types-hack.


It would be great to get some feedback on whether this is generally a good direction for the API/implementation and what features we want to support in an MVP.

@rwgk
Contributor

rwgk commented Dec 15, 2024

Wow, this looks better than an MVP!

In your comment you wrote:

Below is what it looks like to use @gpujitclass with cuda.parallel:

Did you mean @gpudataclass ?

I think what you have is great. Only two things come to mind looking through the code you posted in the comment above, and one isn't even related to your work:

  • Automatically inferring the numpy dtype from @gpudataclass. — I guess that's on your todo list already?

  • For such demos, having a public helper function that hides the d_temp_storage boilerplate would be great, and it seems super easy.
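For illustration, such a helper could be as simple as the following sketch, layered on top of the cudax.reduce_into API used in the example above; it is hypothetical and not part of cuda.parallel:

import numpy as np
import cupy as cp
import cuda.parallel.experimental as cudax

def reduce(d_in, d_out, op, h_init):
    reducer = cudax.reduce_into(d_in, d_out, op, h_init)
    # Size query: calling with no temporary storage returns the required bytes.
    temp_storage_bytes = reducer(None, d_in, d_out, len(d_in), h_init)
    d_temp_storage = cp.zeros(temp_storage_bytes, dtype=np.uint8)
    # Actual reduction using the allocated temporary storage.
    reducer(d_temp_storage, d_in, d_out, len(d_in), h_init)
    return d_out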

@gevtushenko gevtushenko moved this from Todo to In Progress in CCCL Dec 18, 2024