[FEA]: Provide generic and safe C++ interfaces for warp shuffle #2976

fbusato · 2024-11-27T20:13:11Z

Is this a duplicate?

I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

libcu++

Is your feature request related to a problem? Please describe.

CUDA provides warp shuffle intrinsics that support a limited set of types. Secondly, they there are not check to validate the inputs
see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions

Describe the solution you'd like

Provide:

cuda::shfl(T var, int srcLane, unsigned mask = 0xFFFFFFFF, int width=warpSize)
cuda::shfl_up(T var, int delta, unsigned mask = 0xFFFFFFFF, int width=warpSize)
cuda::shfl_down(T var, int delta, unsigned mask = 0xFFFFFFFF, int width=warpSize)
cuda::shfl_xor(T var, int laneMask, unsigned mask = 0xFFFFFFFF, int width=warpSize)

Features and checks:

All shuffle operations work with any trivially copyable types with arbitrary sizes + extended floating points
width is a power of two and 1 <= width <= WarpSize
mask is a subset of __activemask()
cuda::shfl
- srcLane is part of mask
- 0 <= srcLane < width ([optional] no modulo behavior)
cuda::shfl_up
- 1 < delta < width
- max(laneid - delta, 0) is part of mask
cuda::shfl_down
- 1 < delta < width
- min(laneid + delta, width) is part of mask
cuda::shfl_xor
- clamp(laneid ^ laneMask, 0, width) is part of mask
check that mask value is the same for all participating lanes (__match_all_sync()) [optional]

Describe alternatives you've considered

An alternative could use mask at the end of the parameter list

Additional context

No response

The text was updated successfully, but these errors were encountered:

soumikiith · 2024-12-11T06:58:08Z

Hi, if I understand correctly, you might want something like this. I have only done this for one of the functions mentioned. If this is correct or if any additional features need to be added, I would happily like to take up this problem.

template <typename T>
__device__ T shfl(T var, int srcLane, unsigned mask = 0xFFFFFFFF, int width = warpSize) {
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    assert(width <= warpSize && "Width must not exceed warp size");
    assert(isPowerofTwo(width) && "width must be power of 2");

    return __shfl_sync(mask, var, srcLane, width);

   //The rest of the code comes here.
}

NOTE: The code provided is like an MWE to understand whether it meets the actual expectations.

miscco · 2024-12-11T08:31:20Z

Hi @soumikiith ,

this is definitely going in the right direction. We would need some additional constraints though

The docs state the supported types, so I am not sure whether we can actually go with is_trivially_copyable
We would want to SFINAE away the functions if they do not meet the requirements (we have a set of macros _CCCL_TEMPLATE and _CCCL_REQUIRES that make this easier)
We need to use macros for a lot of stuff, like __device__ needs to be _CCCL_DEVICE and assert would become _CCCL_ASSERT
We want this most likely in either <cuda/cmath> or <cuda/functional>. But we need to ensure that we only define those functions of there is an actual cuda compiler used (See uses of _CCCL_HAS_CUDA_COMPILER)

soumikiith · 2024-12-11T08:46:17Z

Hi @miscco ,
Thank you for your feedback. I will look into this. I genuinely thank you because of the MACROS. I did not know that. I will surely implement using them.
Thank you again!!.

miscco · 2024-12-11T09:09:19Z

Also regarding extended floating point support, have a look at e.g cmath extensions

fbusato · 2024-12-11T17:35:41Z

please note that __shfl_sync only supports 32-bit values at HW-level. Any other type size must be cast from/to a common type (e.g. array of uint32_t).
This requires bit_cast or ::memcpy (for __half, __nv_bfloat16, composition of them, etc.)

miscco · 2024-12-11T18:29:02Z

We can bitcast __half and __nvbfloat just fine and should easily extend that to any other extended floating point vector type

fbusato · 2024-12-11T18:48:56Z

We can bitcast __half and __nvbfloat just fine and should easily extend that to any other extended floating point vector type

we need to extend this mechanism to raw array, cuda::std::array, cuda::std::pair, composition of them, etc..

soumikiith · 2024-12-16T10:59:55Z

Hi, I am currently finishing off on the issue. Sorry for the delayed response. I just want to check whether my implementation is aligned with the expectations of the issue. I am sharing a minimal code. If anything needs to be addressed, please let me know. Waiting for any further comments or any review. Thank You.

namespace cuda {
    // Helper to check if a type is supported for warp shuffle operations
    template <typename T>
    struct is_supported_type : std::false_type {};
   //more code here.
    #if defined(_CCCL_HAS_NVFP16)
        template <> struct is_supported_type<__half> : std::true_type {};
        template <> struct is_supported_type<__half2> : std::true_type {};
    #endif

//more code supporting floating vector types here.
 
    //Helper function to validate shuffle inputs
    inline void validate_shuffle_inputs(int width)
    {
        _CCCL_ASSERT((width > 0 && (width & (width - 1)) == 0), "Width must be a power of two");
    }

    template <typename T>
    _CCCL_DEVICE T shfl(T var, int srcLane, unsigned mask = 0xFFFFFFFF, int width = warpSize) {
    _CCCL_ASSERT(is_supported_type<T>::value, "T must be a supported type for warp shuffle operations");
    validate_shuffle_inputs(width);

//more code here

    if constexpr(sizeof(T) == 2)
    {
        float packed_data = _CCCL_BUILTIN_BIT_CAST(float, var);
        float result = __shfl_sync(mask, packed_data, srcLane, width);
        return _CCCL_BUILTIN_BIT_CAST(T, result);
    }
//more code here
}

P.S. I am implementing the code inside <cuda/cmath> as you mentioned.

miscco · 2024-12-16T11:35:00Z

Sorry for the delayed response.

No need to apologize, we deeply appreciate you putting your time in.

I just want to check whether my implementation is aligned with the expectations of the issue. I am sharing a minimal code. If anything needs to be addressed, please let me know. Waiting for any further comments or any review. Thank You.

Generally we prefer to discuss within a draft PR if there are still design issues. We recently changed our CI so that it does not run on draft PRs so there is no drawback in opening a draft PR, even if it is just a sketch.

struct is_supported_type : std::false_type {};
//more code here.
#if defined(_CCCL_HAS_NVFP16)
template <> struct is_supported_type<__half> : std::true_type {};
template <> struct is_supported_type<__half2> : std::true_type {};
#endif

We are generally prefering variable templates over types. Note that we might need this for C++11 support and compilers with insufficient variable templates support

//Helper function to validate shuffle inputs
inline void validate_shuffle_inputs(int width)
{
    _CCCL_ASSERT((width > 0 && (width & (width - 1)) == 0), "Width must be a power of two");
}

This should just be directly inlined into the code. This would add a function call in debug mode when its not necessary

P.S. I am implementing the code inside <cuda/cmath> as you mentioned.

To be sure, we want this code to live in something like <cuda/__cmath/safe_shuffle.h> and which is then included into <cuda/cmath>

vrajvaghela89 · 2024-12-19T08:19:05Z

"Hi, I'd like to work on this issue. Could you assign it to me?"

miscco · 2024-12-19T08:46:38Z

@vrajvaghela89 there is already @soumikiith working on it

…A#2976

pciolkosz · 2024-12-21T01:04:54Z

@fbusato Why not use cooperative groups shfl? Is it because of some very unusual mask pattern?
CG shfl has support for arbitrary trivially copyable types, not sure about some of the other requirements. But some of the points are probably already enforced by the tile model.

fbusato · 2024-12-21T01:13:27Z

that's a good point. Do you mean using CG as a backend?
anyway, is_trivially_copyable is not sufficient. I don't understand the restriction related to sizeof(T) <= 32

pciolkosz · 2024-12-21T01:28:14Z

Either backend or just use CG where the proposed functionality would be used.
I think the restriction for size <= 32 comes from performance analysis, I don't remember the details, but its possible that for larger types its better to use shared memory instead, since you can use a few very wide store/loads, instead of shuffling every 4 bytes. @wmaxey might remember more where the restriction comes from.
Regarding trivially_copyable, I think shuffle has the exact same restrictions like memcpy, but I can see some cases where some opt-in for a type that is technically not trivially copyable could be useful.

fbusato · 2024-12-21T01:33:03Z

I guess size <= 32 performance issue is related to register usage

Regarding trivially_copyable, I think shuffle has the exact same restrictions like memcpy, but I can see some cases where some opt-in for a type that is technically not trivially copyable could be useful.

the main use cases are the extended floating point types

fbusato added the feature request New feature or request. label Nov 27, 2024

github-project-automation bot added this to CCCL Nov 27, 2024

github-project-automation bot moved this to Todo in CCCL Nov 27, 2024

miscco added the good first issue Good for newcomers. label Nov 30, 2024

miscco assigned soumikiith Dec 19, 2024

soumikiith added a commit to soumikiith/cccl that referenced this issue Dec 20, 2024

Provide generic and safe C++ interfaces for warp shuffle: Issue NVIDI…

22344f4

…A#2976

soumikiith linked a pull request Dec 20, 2024 that will close this issue

Provide generic and safe C++ interfaces for warp shuffle: Issue #2976 #3210

Open

2 tasks

cccl-authenticator-app bot moved this from Todo to In Review in CCCL Dec 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA]: Provide generic and safe C++ interfaces for warp shuffle #2976

[FEA]: Provide generic and safe C++ interfaces for warp shuffle #2976

fbusato commented Nov 27, 2024 •

edited

Loading

soumikiith commented Dec 11, 2024 •

edited

Loading

miscco commented Dec 11, 2024

soumikiith commented Dec 11, 2024

miscco commented Dec 11, 2024

fbusato commented Dec 11, 2024

miscco commented Dec 11, 2024

fbusato commented Dec 11, 2024

soumikiith commented Dec 16, 2024

miscco commented Dec 16, 2024 •

edited

Loading

vrajvaghela89 commented Dec 19, 2024

miscco commented Dec 19, 2024

pciolkosz commented Dec 21, 2024 •

edited

Loading

fbusato commented Dec 21, 2024

pciolkosz commented Dec 21, 2024

fbusato commented Dec 21, 2024 •

edited

Loading

[FEA]: Provide generic and safe C++ interfaces for warp shuffle #2976

[FEA]: Provide generic and safe C++ interfaces for warp shuffle #2976

Comments

fbusato commented Nov 27, 2024 • edited Loading

Is this a duplicate?

Area

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

soumikiith commented Dec 11, 2024 • edited Loading

miscco commented Dec 11, 2024

soumikiith commented Dec 11, 2024

miscco commented Dec 11, 2024

fbusato commented Dec 11, 2024

miscco commented Dec 11, 2024

fbusato commented Dec 11, 2024

soumikiith commented Dec 16, 2024

miscco commented Dec 16, 2024 • edited Loading

vrajvaghela89 commented Dec 19, 2024

miscco commented Dec 19, 2024

pciolkosz commented Dec 21, 2024 • edited Loading

fbusato commented Dec 21, 2024

pciolkosz commented Dec 21, 2024

fbusato commented Dec 21, 2024 • edited Loading

fbusato commented Nov 27, 2024 •

edited

Loading

soumikiith commented Dec 11, 2024 •

edited

Loading

miscco commented Dec 16, 2024 •

edited

Loading

pciolkosz commented Dec 21, 2024 •

edited

Loading

fbusato commented Dec 21, 2024 •

edited

Loading