Implement parallel `cuda::std::remove{_if}` by miscco · Pull Request #7693 · NVIDIA/cccl

miscco · 2026-02-17T13:12:28Z

This implements the `remove` algorithm for the CUDA backend.

* `std::remove`: see https://en.cppreference.com/w/cpp/algorithm/remove.html
* `std::remove_if`: see https://en.cppreference.com/w/cpp/algorithm/remove.html

It provides tests and benchmarks similar to those in Thrust, plus some boilerplate for libcu++.

The functionality is publicly available yet and implemented in a private internal header
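For reference, the semantics being parallelized can be sketched on the host with the serial `std::remove` and `std::remove_if` from the C++ standard library. This sketch does not use the parallel overloads added in this PR; the helper names `drop_value` and `drop_even` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: drops every element equal to `value`.
inline std::vector<int> drop_value(std::vector<int> v, int value)
{
  // std::remove moves the retained elements to the front, keeps their
  // relative order, and returns the new logical end of the range.
  auto new_end = std::remove(v.begin(), v.end(), value);
  // The container still has its old size, so the tail is erased explicitly.
  v.erase(new_end, v.end());
  return v;
}

// Hypothetical helper: drops every even element via a predicate.
inline std::vector<int> drop_even(std::vector<int> v)
{
  auto new_end = std::remove_if(v.begin(), v.end(), [](int x) { return x % 2 == 0; });
  v.erase(new_end, v.end());
  return v;
}
```

Calling `drop_value({1, 42, 2, 42, 3}, 42)` leaves the retained elements in their original order.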

miscco · 2026-02-17T13:13:50Z

Performance looks similar to what we see for the other algorithms: on par with Thrust for many elements, slower for few.

['thrust_remove.json', 'pstl_remove.json']
### [0] NVIDIA RTX A6000

| T{ct} |     Elements     | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  | GlobalMem BW | BWUtil |
|-------|------------------|---------|------------|--------|------------|--------|----------|--------------|--------|
|    I8 |     2^16 = 65536 |    424x |  28.074 us |  3.28% |  23.903 us |  2.56% |   2.742G |   5.417 GB/s |  0.71% |
|    I8 |   2^20 = 1048576 |    440x |  31.883 us | 11.38% |  27.636 us | 13.32% |  37.942G |  75.002 GB/s |  9.76% |
|    I8 |  2^24 = 16777216 |    498x | 121.142 us |  0.92% | 116.927 us |  0.97% | 143.484G | 283.637 GB/s | 36.93% |
|    I8 | 2^28 = 268435456 |   1170x |   1.568 ms |  0.93% |   1.564 ms |  0.93% | 171.645G | 339.299 GB/s | 44.17% |
|   I16 |     2^16 = 65536 |    348x |  28.983 us |  4.90% |  24.638 us |  4.10% |   2.660G |  10.511 GB/s |  1.37% |
|   I16 |   2^20 = 1048576 |    578x |  32.998 us | 10.59% |  28.774 us | 11.97% |  36.441G | 144.071 GB/s | 18.76% |
|   I16 |  2^24 = 16777216 |    522x | 139.938 us |  1.37% | 135.767 us |  1.43% | 123.574G | 488.555 GB/s | 63.61% |
|   I16 | 2^28 = 268435456 |    842x |   1.884 ms |  0.64% |   1.880 ms |  0.64% | 142.806G | 564.581 GB/s | 73.50% |
|   I32 |     2^16 = 65536 |    472x |  28.635 us |  4.45% |  24.390 us |  4.91% |   2.687G |  21.236 GB/s |  2.76% |
|   I32 |   2^20 = 1048576 |    572x |  38.553 us |  7.17% |  34.398 us |  8.01% |  30.484G | 241.036 GB/s | 31.38% |
|   I32 |  2^24 = 16777216 |    668x | 225.856 us |  1.17% | 221.694 us |  1.19% |  75.677G | 598.389 GB/s | 77.91% |
|   I32 | 2^28 = 268435456 |    930x |   3.249 ms |  0.80% |   3.245 ms |  0.80% |  82.730G | 654.141 GB/s | 85.16% |
|   I64 |     2^16 = 65536 |    482x |  29.135 us |  3.93% |  24.762 us |  3.43% |   2.647G |  41.833 GB/s |  5.45% |
|   I64 |   2^20 = 1048576 |    478x |  49.931 us |  3.22% |  45.818 us |  3.52% |  22.886G | 361.913 GB/s | 47.12% |
|   I64 |  2^24 = 16777216 |    790x | 424.956 us |  0.70% | 420.794 us |  0.70% |  39.870G | 630.518 GB/s | 82.09% |
|   I64 | 2^28 = 268435456 |   1044x |   6.439 ms |  0.50% |   6.435 ms |  0.50% |  41.716G | 659.689 GB/s | 85.89% |
|  I128 |     2^16 = 65536 |    468x |  29.528 us |  4.45% |  25.201 us |  5.28% |   2.600G |  82.209 GB/s | 10.70% |
|  I128 |   2^20 = 1048576 |    568x |  75.798 us |  4.08% |  71.610 us |  4.28% |  14.643G | 463.129 GB/s | 60.30% |
|  I128 |  2^24 = 16777216 |    850x | 828.659 us |  0.45% | 824.492 us |  0.46% |  20.349G | 643.592 GB/s | 83.79% |
|  I128 | 2^28 = 268435456 |   1138x |  12.902 ms |  0.29% |  12.898 ms |  0.29% |  20.813G | 658.258 GB/s | 85.70% |
|   F32 |     2^16 = 65536 |    552x |  29.243 us |  2.21% |  24.833 us |  2.78% |   2.639G |  21.112 GB/s |  2.75% |
|   F32 |   2^20 = 1048576 |    542x |  40.061 us |  6.38% |  35.702 us |  7.11% |  29.370G | 234.961 GB/s | 30.59% |
|   F32 |  2^24 = 16777216 |    714x | 226.937 us |  1.87% | 222.715 us |  1.89% |  75.331G | 602.644 GB/s | 78.46% |
|   F32 | 2^28 = 268435456 |   1100x |   3.253 ms |  0.73% |   3.249 ms |  0.73% |  82.626G | 661.007 GB/s | 86.06% |
|   F64 |     2^16 = 65536 |    432x |  29.295 us |  4.56% |  24.904 us |  5.32% |   2.632G |  42.105 GB/s |  5.48% |
|   F64 |   2^20 = 1048576 |    516x |  50.234 us |  2.97% |  46.044 us |  3.26% |  22.773G | 364.373 GB/s | 47.44% |
|   F64 |  2^24 = 16777216 |    708x | 427.916 us |  0.67% | 423.741 us |  0.69% |  39.593G | 633.489 GB/s | 82.48% |
|   F64 | 2^28 = 268435456 |   1074x |   6.465 ms |  0.47% |   6.460 ms |  0.47% |  41.551G | 664.822 GB/s | 86.55% |
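As a sanity check on how to read the table, the `Elem/s` column is consistent with the element count divided by the GPU time. A minimal sketch of that arithmetic follows; the function name `elements_per_second` is made up for illustration, and small deviations from the table are expected because the printed times are rounded.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: throughput in elements per second,
// given an element count and a GPU time in seconds.
inline double elements_per_second(std::uint64_t elements, double gpu_time_s)
{
  return static_cast<double>(elements) / gpu_time_s;
}
```

For the `I32`, 2^28 row, `elements_per_second(268435456, 3.245e-3)` comes out close to the 82.730G reported in the table.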

['thrust_remove_if.json', 'pstl_remove_if.json']
# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^16    |  19.037 us |       3.17% |  23.805 us |       4.93% |  4.768 us |  25.04% |   SLOW   |
|   I8    |    2^20    |  24.917 us |       4.23% |  28.330 us |      13.78% |  3.414 us |  13.70% |   SLOW   |
|   I8    |    2^24    | 114.570 us |       1.10% | 116.225 us |       1.21% |  1.655 us |   1.44% |   SLOW   |
|   I8    |    2^28    |   1.517 ms |       1.15% |   1.543 ms |       0.91% | 25.793 us |   1.70% |   SLOW   |
|   I16   |    2^16    |  19.336 us |       4.85% |  24.616 us |       6.43% |  5.280 us |  27.31% |   SLOW   |
|   I16   |    2^20    |  26.478 us |       2.31% |  29.872 us |      12.12% |  3.394 us |  12.82% |   SLOW   |
|   I16   |    2^24    | 131.577 us |       1.26% | 135.186 us |       1.50% |  3.609 us |   2.74% |   SLOW   |
|   I16   |    2^28    |   1.857 ms |       0.63% |   1.873 ms |       0.67% | 15.264 us |   0.82% |   SLOW   |
|   I32   |    2^16    |  19.496 us |       3.37% |  24.322 us |       4.74% |  4.826 us |  24.76% |   SLOW   |
|   I32   |    2^20    |  30.268 us |       2.51% |  35.252 us |       4.93% |  4.984 us |  16.47% |   SLOW   |
|   I32   |    2^24    | 218.770 us |       0.84% | 222.291 us |       1.19% |  3.521 us |   1.61% |   SLOW   |
|   I32   |    2^28    |   3.239 ms |       0.72% |   3.243 ms |       0.71% |  3.606 us |   0.11% |   SAME   |
|   I64   |    2^16    |  20.996 us |       5.07% |  24.609 us |       4.22% |  3.613 us |  17.21% |   SLOW   |
|   I64   |    2^20    |  41.826 us |       2.58% |  45.323 us |       2.94% |  3.497 us |   8.36% |   SLOW   |
|   I64   |    2^24    | 417.077 us |       0.66% | 419.929 us |       0.89% |  2.852 us |   0.68% |   SLOW   |
|   I64   |    2^28    |   6.428 ms |       0.41% |   6.432 ms |       0.43% |  4.391 us |   0.07% |   SAME   |
|  I128   |    2^16    |  22.573 us |       3.95% |  25.174 us |      28.44% |  2.601 us |  11.52% |   SLOW   |
|  I128   |    2^20    |  67.655 us |       2.10% |  71.257 us |       3.80% |  3.603 us |   5.33% |   SLOW   |
|  I128   |    2^24    | 820.656 us |       0.62% | 824.786 us |       0.65% |  4.131 us |   0.50% |   SAME   |
|  I128   |    2^28    |  12.887 ms |       0.24% |  12.891 ms |       0.25% |  4.662 us |   0.04% |   SAME   |
|   F32   |    2^16    |  19.366 us |       2.42% |  24.395 us |       3.72% |  5.029 us |  25.97% |   SLOW   |
|   F32   |    2^20    |  30.019 us |       1.70% |  35.040 us |       6.10% |  5.021 us |  16.73% |   SLOW   |
|   F32   |    2^24    | 219.595 us |       2.19% | 222.010 us |       1.94% |  2.415 us |   1.10% |   SAME   |
|   F32   |    2^28    |   3.229 ms |       0.69% |   3.233 ms |       0.71% |  4.166 us |   0.13% |   SAME   |
|   F64   |    2^16    |  21.143 us |       4.87% |  24.604 us |       3.11% |  3.461 us |  16.37% |   SLOW   |
|   F64   |    2^20    |  41.642 us |       2.34% |  45.376 us |       3.23% |  3.734 us |   8.97% |   SLOW   |
|   F64   |    2^24    | 417.866 us |       0.73% | 420.821 us |       0.85% |  2.955 us |   0.71% |   SAME   |
|   F64   |    2^28    |   6.432 ms |       0.40% |   6.436 ms |       0.40% |  4.449 us |   0.07% |   SAME   |
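The `%Diff` column is consistent with the difference between the compared and the reference time, relative to the reference time. A minimal sketch of that arithmetic follows; the function name `percent_diff` is made up for illustration, and the last digit can differ from the table because the printed times are rounded.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: relative slowdown of `cmp_time` versus `ref_time`
// in percent. Both times must use the same unit.
inline double percent_diff(double ref_time, double cmp_time)
{
  return (cmp_time - ref_time) / ref_time * 100.0;
}
```

For the `I32`, 2^20 row, `percent_diff(30.268, 35.252)` is roughly 16.47, matching the table.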

oleksandr-pavlyk · 2026-02-25T13:18:53Z

libcudacxx/benchmarks/bench/remove/basic.cu

+
+  state.exec(nvbench::exec_tag::gpu | nvbench::exec_tag::no_batch | nvbench::exec_tag::sync,
+             [&](nvbench::launch& launch) {
+               cuda::std::remove(cuda_policy(alloc, launch), in.begin(), in.end(), T{42});


Question: is `do_not_optimize` needed here as well?

The issue here is that the function has no return, so one would have to resort to some crazy hacks. I therefore did not add it.

Fixes NVIDIA#7374

bernhardmgruber · 2026-02-25T17:22:18Z

libcudacxx/include/cuda/std/__pstl/cuda/remove_if.h

+    }
+
+    __stream.sync();


I would move the sync into the lifetime of __storage

Suggested change

-    }
-
-    __stream.sync();
+    __stream.sync();
+
+    }
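The reviewer does not spell out the motivation, but one plausible reading is that work queued on the stream may still refer to the temporary storage, so the storage should outlive the sync. The sketch below illustrates that ordering concern on the host with mock types only: `MockStream` and `storage_alive_when_work_runs` are hypothetical names, and nothing here is the actual libcu++ stream or storage API.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an asynchronous stream:
// queued work only runs when sync() is called.
struct MockStream
{
  std::vector<std::function<void()>> pending;

  void enqueue(std::function<void()> work) { pending.push_back(std::move(work)); }

  void sync()
  {
    for (auto& work : pending) { work(); }
    pending.clear();
  }
};

// Returns true if the temporary storage was still alive when the queued work ran.
inline bool storage_alive_when_work_runs(bool sync_inside_scope)
{
  MockStream stream;
  bool alive = false;
  {
    // Stand-in for the temporary storage. The weak_ptr lets the queued work
    // observe destruction without extending the storage's lifetime.
    auto storage = std::make_shared<std::vector<int>>(16, 0);
    std::weak_ptr<std::vector<int>> observer = storage;
    stream.enqueue([observer, &alive] { alive = !observer.expired(); });

    if (sync_inside_scope) { stream.sync(); } // work runs while storage exists
  } // storage is destroyed here
  if (!sync_inside_scope) { stream.sync(); } // work runs after storage is gone
  return alive;
}
```

In this mock, syncing inside the scope lets the queued work see live storage, while syncing after the closing brace does not.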

bernhardmgruber · 2026-02-25T17:24:45Z

libcudacxx/include/cuda/std/__pstl/remove_if.h

+    ::cuda::std::execution::__pstl_select_dispatch<::cuda::std::execution::__pstl_algorithm::__remove_if, _Policy>();
+  if constexpr (::cuda::std::execution::__pstl_can_dispatch<decltype(__dispatch)>)
+  {
+    _CCCL_NVTX_RANGE_SCOPE("cuda::std::remove_if");


Q: Why are we not putting the NVTX range across the entire algorithm, even if the range is empty? We are doing this for CUB (we only skip the NVTX range for the temporary storage allocation query).

github-actions · 2026-02-25T19:00:45Z

😬 CI Workflow Results

🟥 Finished in 2h 41m: Pass: 59%/150 | Total: 3d 03h | Max: 2h 34m | Hits: 80%/144364

miscco requested review from a team as code owners February 17, 2026 13:12

miscco requested review from davebayer and oleksandr-pavlyk February 17, 2026 13:12

miscco force-pushed the parallel_remove branch 2 times, most recently from b75bed0 to b8e1d28 Compare February 17, 2026 13:56

oleksandr-pavlyk reviewed Feb 17, 2026
