Apply a functor to a tensor, element-wise #1070

cdsousa · 2025-10-20T21:07:10Z

cdsousa
Oct 20, 2025

One thing that I've been trying to do with MatX is to be able to populate a tensor with the return of some function that is evaluated for each element.
Moreover it would be even greater if that functor could be called over the indices of each element.

Correct me if I'm wrong, but MatX has no means of doing that as of now.
Thrust can kind of do it, but it works only for 1D vectors allocated by Thrust.

I would like to be able to do something like this:

auto A = make_tensor<float>({20, 30});
auto f = [] __device__ (auto indices){ return static_cast<float>(indices[0] + indices[1]); };
(     A = apply( f, indices_of(A) )     ).run();

cliffburdick · 2025-10-20T21:13:49Z

cliffburdick
Oct 20, 2025
Maintainer

Hi @cdsousa , thanks for the suggestion! The closest we have to that is custom operators:

MatX/examples/black_scholes.cu

Line 56 in e713250

template <class O, class I1>

As you can see it's more verbose than your suggestion. The reason is that the operator needs to follow the operator interface: https://nvidia.github.io/MatX/basics/concepts.html#operator

Your lambda would qualify as a substitute for operator(), but it does not implement Size() or Rank(). Sometimes these are very simple, but other times they are not. For example, both Size() and Rank() refer to the output size and rank, and with something other than a simple element-wise transform, they may not match your inputs.

However, I can see how this might be useful for very simple operators that would like to simplify the code. We will look into this and get back to you.

0 replies

cliffburdick · 2025-10-20T22:43:13Z

cliffburdick
Oct 20, 2025
Maintainer

@cdsousa please check out #1072

It's slightly different than what you suggested in that instead of indices it takes operators and your function applies an operation across the different operators. I think this is a more general way to do what you suggested by something like this:

auto A = make_tensor<float>({20, 30});
auto f = [] __device__ (auto a, auto b){ return static_cast<float>(a + b); };
(     A = apply( f, slice<1>(A, {0, 0}, {matxEnd, matxDropDim}),slice<1>(A, {0, 1}, {matxEnd, matxDropDim}) )     ).run();

Note that you need extended lambda support enabled for cuda to accept this.

0 replies

cdsousa · 2025-10-21T17:07:54Z

cdsousa
Oct 21, 2025
Author

Wow, that was fast 😄 Yes, more generic is better! Thanks

I've been trying with this, but I've hit an issue.
This works fine:

auto A = make_tensor<float>({ 4 });
A.SetVals({ 1.1, 2.2, 3.3, 4.4 });
auto f = [] __device__(auto a, auto b) { return a+b; };
(A = apply(f, A, A)).run();

but this fails to compile:

auto A = make_tensor<float>({ 4 });
A.SetVals({ 1.1, 2.2, 3.3, 4.4 });
auto f = [] __device__(auto a) { return a; };
(A = apply(f, A)).run();

(compiler error)

``` [build] /usr/local/cuda-13.0/targets/x86_64-linux/include/cccl/cuda/std/__tuple_dir/tuple_size.h(71): error: incomplete type "cuda::std::__4::tuple_size> &>>" (aka "cuda::std::__4::tuple_size, cuda::std::__4::array, 1>>>") is not allowed [build] inline constexpr size_t tuple_size_v = tuple_size<_Tp>::value; [build] ^ [build] detected during: [build] instantiation of "const size_t cuda::std::__4::tuple_size_v [with _Tp=matx::tensor_t, cuda::std::__4::array, 1>>]" at line 1366 of /usr/local/cuda-13.0/targets/x86_64-linux/include/cccl/cuda/std/detail/libcxx/include/tuple [build] instantiation of "decltype(auto) cuda::std::__4::apply(_Fn &&, _Tuple &&) [with _Fn=lambda [](auto)->auto &, _Tuple=matx::tensor_t, cuda::std::__4::array, 1>> &]" at line 74 of /home.... ```

Also, as far as I understand, the slice in your example will not produce the indices of the tensor. I would like to get the indices of the tensor elements (could be as a tuple or separated), not their value. Ideally, those indices would be lazily computed.
Probably, that would be a different feature request.

6 replies

cdsousa Oct 21, 2025
Author

I haven't found index before, but that seems to be what I meant.
That exact syntax doesn't seem to work though. Only (A = index(0)).run() worked.

But let me describe what I want to achieve so that I don't create an XY problem
The common use case I have is "image warping", where the color of each output image pixel is a sample from the input image (with interpolation, using a CUDA texture as input image).
The place on the input from where the sample is taken is a function of the coordinates of the output pixel (so, the indices of the output tensor). The computation of the sample locations, the sampling from the texture, and the write to the output image can be parallelized.

I'm mostly used to doing the above through CUDA.jl in Julia.
There, I usually do something like:

function warp(outTensor::CuArray, inTexture::CuTexture, params)
    # `f` is a closure/lambda that captures `inTexture` and `params`:
    f = (outIndices) -> inTexture[some_function(outIndices..., params)...]

    # The "dot" notation broadcasts the assignment and the function `f`
    # `CartesianIndices` lazily generates a tensor containing the indices of outTensor
    outTensor .= f.(CartesianIndices(outTensor))
end

Basically, I was trying to do the same in C++, in a short piece of code, without having to use low-level CUDA kernel design/launch.

cliffburdick Oct 21, 2025
Maintainer

Got it. Let me play around with it and get back to you.

cdsousa Oct 21, 2025
Author

The following is working and is already very close to what I wanted.

auto A = make_tensor<float>({ 2,2 });
auto B = make_tensor<float>({ 2,2 });
B.SetVals({ {1.1, 2.2}, {3.3, 4.4} });
auto f = [B] __device__(auto _, auto i, auto j) { return B(i, j); };
(A = apply(f, A, index(0), index(1))).run();

Great!

cliffburdick Oct 21, 2025
Maintainer

Do you still need my help or is this good enough? I can't tell if that's quite the same since you aren't adding the indices anymore. I also don't think you need to pass in A?

cdsousa Oct 21, 2025
Author

Adding the indices was just an example to tell that I wanted to do computation with them.
Having A as the first argument seems to be required, or else it won't compile, but that's fine. Also, notice my earlier comment on the problem when there is a single argument to the functor.

Nevertheless, it seems I'll be able to do everything that I wanted, so I just have to thank you a lot for all the support and quick implementation of the feature. It's awesome, thank you!

cdsousa · 2025-10-21T23:17:44Z

cdsousa
Oct 21, 2025
Author

Hey @cliffburdick , thanks for #1072!
Here is an example of this feature being used in a Holoscan operator to convert video frames in YUV422 (aka YUYV) format to frames in RGB format. And, yes, I know there are more appropriate ways to do this (e.g. NPP), but this's just an example.

using yuyv_t = cuda::std::tuple<uchar2, uchar2>;
using rgb_t = uchar3;

// `tensor_in` and `tensor_out` are `holoscan::Tensor`s of uint8 with shapes H,W,2 and H,W,3 (yuvy and rgb), respectively
auto matx_tensor_in =
    matx::make_tensor<yuyv_t>(static_cast<yuyv_t*>(tensor_in->data()), { tensor_in->shape()[0], tensor_in->shape()[1] / 2 });
auto matx_tensor_out =
    matx::make_tensor<rgb_t>(static_cast<rgb_t*>(tensor_out->data()), { tensor_out->shape()[0], tensor_out->shape()[1] });

auto yuyv_to_rgb = [matx_tensor_in] __device__(auto, auto i, auto j) {
    auto in = matx_tensor_in(i, j / 2);
    auto y = (j % 2) == 0 ? cuda::std::get<0>(in).x : cuda::std::get<1>(in).x;
    auto u = cuda::std::get<0>(in).y;
    auto v = cuda::std::get<1>(in).y;
    auto r = static_cast<uint8_t>(cuda::std::clamp(y + 1.402f * (v - 128), 0.0f, 255.0f));
    auto g = static_cast<uint8_t>(cuda::std::clamp(y - 0.344136f * (u - 128) - 0.714136f * (v - 128), 0.0f, 255.0f));
    auto b = static_cast<uint8_t>(cuda::std::clamp(y + 1.772f * (u - 128), 0.0f, 255.0f));
    return rgb_t{ r, g, b };
};
(matx_tensor_out = matx::apply(yuyv_to_rgb, matx_tensor_out, matx::index(0), matx::index(1))).run();

0 replies

cliffburdick · 2025-10-23T16:59:37Z

cliffburdick
Oct 23, 2025
Maintainer

Hi @cdsousa, a couple updates:

We're about to push some more changes to apply that give about a 25% performance improvement over custom operators, which means you should not use custom operators in most cases
We are looking at adding an apply_idx that does an apply, but passes indices instead so you can customize the function a bit more.

Be on the lookout for #1 to merge probably tomorrow.

2 replies

cdsousa Oct 24, 2025
Author

Lovely!
It would also be great to have an index(tensor, axis) operator, as you imply in #1070 (reply in thread), but which doesn't seem to exist yet.

cliffburdick Oct 24, 2025
Maintainer

@cdsousa #1077

cliffburdick · 2025-10-23T23:54:39Z

cliffburdick
Oct 23, 2025
Maintainer

This has been merged

0 replies

Apply a functor to a tensor, element-wise #1070

Uh oh!

cdsousa Oct 20, 2025

Replies: 6 comments · 8 replies

Uh oh!

cliffburdick Oct 20, 2025 Maintainer

Uh oh!

cliffburdick Oct 20, 2025 Maintainer

Uh oh!

Uh oh!

cdsousa Oct 21, 2025 Author

Uh oh!

Uh oh!

cdsousa Oct 21, 2025 Author

Uh oh!

cliffburdick Oct 21, 2025 Maintainer

Uh oh!

cdsousa Oct 21, 2025 Author

Uh oh!

cliffburdick Oct 21, 2025 Maintainer

Uh oh!

cdsousa Oct 21, 2025 Author

Uh oh!

Uh oh!

cdsousa Oct 21, 2025 Author

Uh oh!

cliffburdick Oct 23, 2025 Maintainer

Uh oh!

cdsousa Oct 24, 2025 Author

Uh oh!

cliffburdick Oct 24, 2025 Maintainer

Uh oh!

cliffburdick Oct 23, 2025 Maintainer

cdsousa
Oct 20, 2025

Replies: 6 comments 8 replies

cliffburdick
Oct 20, 2025
Maintainer

cliffburdick
Oct 20, 2025
Maintainer

cdsousa
Oct 21, 2025
Author

cdsousa Oct 21, 2025
Author

cliffburdick Oct 21, 2025
Maintainer

cdsousa Oct 21, 2025
Author

cliffburdick Oct 21, 2025
Maintainer

cdsousa Oct 21, 2025
Author

cdsousa
Oct 21, 2025
Author

cliffburdick
Oct 23, 2025
Maintainer

cdsousa Oct 24, 2025
Author

cliffburdick Oct 24, 2025
Maintainer

cliffburdick
Oct 23, 2025
Maintainer