Implement parallel cuda::std::remove{_if}#7693

Open
miscco wants to merge 1 commit into NVIDIA:main from miscco:parallel_remove

@miscco miscco commented Feb 17, 2026

This implements the remove algorithm for the CUDA backend.

It provides tests and benchmarks similar to Thrust, plus some boilerplate for libcu++.

The functionality is not publicly available yet and is implemented in a private internal header.

Fixes #7374
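
For readers unfamiliar with the algorithm, here is a minimal serial sketch of the semantics that the parallel version is expected to mirror. It uses only the standard library `std::remove` and `std::remove_if`, not the CUDA backend from this PR; the helper names `remove_value` and `remove_even` are hypothetical and exist only for illustration.

```cpp
#include <algorithm>
#include <vector>

// Serial reference semantics (standard library, not the CUDA backend):
// std::remove moves the kept elements to the front, preserves their relative
// order, and returns an iterator to the new logical end of the range.
inline std::vector<int> remove_value(std::vector<int> v, int value)
{
  auto new_end = std::remove(v.begin(), v.end(), value);
  v.erase(new_end, v.end()); // drop the tail, whose contents are unspecified
  return v;
}

// Same idea with a predicate instead of a value (hypothetical helper).
inline std::vector<int> remove_even(std::vector<int> v)
{
  auto new_end = std::remove_if(v.begin(), v.end(), [](int x) { return x % 2 == 0; });
  v.erase(new_end, v.end());
  return v;
}
```

Calling `remove_value({1, 42, 2, 42, 3}, 42)` keeps the elements 1, 2, 3 in their original order. The benchmark snippet in this PR passes an execution policy as an additional first argument.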

@miscco miscco requested review from a team as code owners February 17, 2026 13:12
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 17, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 17, 2026

miscco commented Feb 17, 2026

Performance looks similar to the other algorithms: on par with Thrust for many elements, slower for few.

['thrust_remove.json', 'pstl_remove.json']
### [0] NVIDIA RTX A6000

| T{ct} |     Elements     | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  | GlobalMem BW | BWUtil |
|-------|------------------|---------|------------|--------|------------|--------|----------|--------------|--------|
|    I8 |     2^16 = 65536 |    424x |  28.074 us |  3.28% |  23.903 us |  2.56% |   2.742G |   5.417 GB/s |  0.71% |
|    I8 |   2^20 = 1048576 |    440x |  31.883 us | 11.38% |  27.636 us | 13.32% |  37.942G |  75.002 GB/s |  9.76% |
|    I8 |  2^24 = 16777216 |    498x | 121.142 us |  0.92% | 116.927 us |  0.97% | 143.484G | 283.637 GB/s | 36.93% |
|    I8 | 2^28 = 268435456 |   1170x |   1.568 ms |  0.93% |   1.564 ms |  0.93% | 171.645G | 339.299 GB/s | 44.17% |
|   I16 |     2^16 = 65536 |    348x |  28.983 us |  4.90% |  24.638 us |  4.10% |   2.660G |  10.511 GB/s |  1.37% |
|   I16 |   2^20 = 1048576 |    578x |  32.998 us | 10.59% |  28.774 us | 11.97% |  36.441G | 144.071 GB/s | 18.76% |
|   I16 |  2^24 = 16777216 |    522x | 139.938 us |  1.37% | 135.767 us |  1.43% | 123.574G | 488.555 GB/s | 63.61% |
|   I16 | 2^28 = 268435456 |    842x |   1.884 ms |  0.64% |   1.880 ms |  0.64% | 142.806G | 564.581 GB/s | 73.50% |
|   I32 |     2^16 = 65536 |    472x |  28.635 us |  4.45% |  24.390 us |  4.91% |   2.687G |  21.236 GB/s |  2.76% |
|   I32 |   2^20 = 1048576 |    572x |  38.553 us |  7.17% |  34.398 us |  8.01% |  30.484G | 241.036 GB/s | 31.38% |
|   I32 |  2^24 = 16777216 |    668x | 225.856 us |  1.17% | 221.694 us |  1.19% |  75.677G | 598.389 GB/s | 77.91% |
|   I32 | 2^28 = 268435456 |    930x |   3.249 ms |  0.80% |   3.245 ms |  0.80% |  82.730G | 654.141 GB/s | 85.16% |
|   I64 |     2^16 = 65536 |    482x |  29.135 us |  3.93% |  24.762 us |  3.43% |   2.647G |  41.833 GB/s |  5.45% |
|   I64 |   2^20 = 1048576 |    478x |  49.931 us |  3.22% |  45.818 us |  3.52% |  22.886G | 361.913 GB/s | 47.12% |
|   I64 |  2^24 = 16777216 |    790x | 424.956 us |  0.70% | 420.794 us |  0.70% |  39.870G | 630.518 GB/s | 82.09% |
|   I64 | 2^28 = 268435456 |   1044x |   6.439 ms |  0.50% |   6.435 ms |  0.50% |  41.716G | 659.689 GB/s | 85.89% |
|  I128 |     2^16 = 65536 |    468x |  29.528 us |  4.45% |  25.201 us |  5.28% |   2.600G |  82.209 GB/s | 10.70% |
|  I128 |   2^20 = 1048576 |    568x |  75.798 us |  4.08% |  71.610 us |  4.28% |  14.643G | 463.129 GB/s | 60.30% |
|  I128 |  2^24 = 16777216 |    850x | 828.659 us |  0.45% | 824.492 us |  0.46% |  20.349G | 643.592 GB/s | 83.79% |
|  I128 | 2^28 = 268435456 |   1138x |  12.902 ms |  0.29% |  12.898 ms |  0.29% |  20.813G | 658.258 GB/s | 85.70% |
|   F32 |     2^16 = 65536 |    552x |  29.243 us |  2.21% |  24.833 us |  2.78% |   2.639G |  21.112 GB/s |  2.75% |
|   F32 |   2^20 = 1048576 |    542x |  40.061 us |  6.38% |  35.702 us |  7.11% |  29.370G | 234.961 GB/s | 30.59% |
|   F32 |  2^24 = 16777216 |    714x | 226.937 us |  1.87% | 222.715 us |  1.89% |  75.331G | 602.644 GB/s | 78.46% |
|   F32 | 2^28 = 268435456 |   1100x |   3.253 ms |  0.73% |   3.249 ms |  0.73% |  82.626G | 661.007 GB/s | 86.06% |
|   F64 |     2^16 = 65536 |    432x |  29.295 us |  4.56% |  24.904 us |  5.32% |   2.632G |  42.105 GB/s |  5.48% |
|   F64 |   2^20 = 1048576 |    516x |  50.234 us |  2.97% |  46.044 us |  3.26% |  22.773G | 364.373 GB/s | 47.44% |
|   F64 |  2^24 = 16777216 |    708x | 427.916 us |  0.67% | 423.741 us |  0.69% |  39.593G | 633.489 GB/s | 82.48% |
|   F64 | 2^28 = 268435456 |   1074x |   6.465 ms |  0.47% |   6.460 ms |  0.47% |  41.551G | 664.822 GB/s | 86.55% |

['thrust_remove_if.json', 'pstl_remove_if.json']
# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^16    |  19.037 us |       3.17% |  23.805 us |       4.93% |  4.768 us |  25.04% |   SLOW   |
|   I8    |    2^20    |  24.917 us |       4.23% |  28.330 us |      13.78% |  3.414 us |  13.70% |   SLOW   |
|   I8    |    2^24    | 114.570 us |       1.10% | 116.225 us |       1.21% |  1.655 us |   1.44% |   SLOW   |
|   I8    |    2^28    |   1.517 ms |       1.15% |   1.543 ms |       0.91% | 25.793 us |   1.70% |   SLOW   |
|   I16   |    2^16    |  19.336 us |       4.85% |  24.616 us |       6.43% |  5.280 us |  27.31% |   SLOW   |
|   I16   |    2^20    |  26.478 us |       2.31% |  29.872 us |      12.12% |  3.394 us |  12.82% |   SLOW   |
|   I16   |    2^24    | 131.577 us |       1.26% | 135.186 us |       1.50% |  3.609 us |   2.74% |   SLOW   |
|   I16   |    2^28    |   1.857 ms |       0.63% |   1.873 ms |       0.67% | 15.264 us |   0.82% |   SLOW   |
|   I32   |    2^16    |  19.496 us |       3.37% |  24.322 us |       4.74% |  4.826 us |  24.76% |   SLOW   |
|   I32   |    2^20    |  30.268 us |       2.51% |  35.252 us |       4.93% |  4.984 us |  16.47% |   SLOW   |
|   I32   |    2^24    | 218.770 us |       0.84% | 222.291 us |       1.19% |  3.521 us |   1.61% |   SLOW   |
|   I32   |    2^28    |   3.239 ms |       0.72% |   3.243 ms |       0.71% |  3.606 us |   0.11% |   SAME   |
|   I64   |    2^16    |  20.996 us |       5.07% |  24.609 us |       4.22% |  3.613 us |  17.21% |   SLOW   |
|   I64   |    2^20    |  41.826 us |       2.58% |  45.323 us |       2.94% |  3.497 us |   8.36% |   SLOW   |
|   I64   |    2^24    | 417.077 us |       0.66% | 419.929 us |       0.89% |  2.852 us |   0.68% |   SLOW   |
|   I64   |    2^28    |   6.428 ms |       0.41% |   6.432 ms |       0.43% |  4.391 us |   0.07% |   SAME   |
|  I128   |    2^16    |  22.573 us |       3.95% |  25.174 us |      28.44% |  2.601 us |  11.52% |   SLOW   |
|  I128   |    2^20    |  67.655 us |       2.10% |  71.257 us |       3.80% |  3.603 us |   5.33% |   SLOW   |
|  I128   |    2^24    | 820.656 us |       0.62% | 824.786 us |       0.65% |  4.131 us |   0.50% |   SAME   |
|  I128   |    2^28    |  12.887 ms |       0.24% |  12.891 ms |       0.25% |  4.662 us |   0.04% |   SAME   |
|   F32   |    2^16    |  19.366 us |       2.42% |  24.395 us |       3.72% |  5.029 us |  25.97% |   SLOW   |
|   F32   |    2^20    |  30.019 us |       1.70% |  35.040 us |       6.10% |  5.021 us |  16.73% |   SLOW   |
|   F32   |    2^24    | 219.595 us |       2.19% | 222.010 us |       1.94% |  2.415 us |   1.10% |   SAME   |
|   F32   |    2^28    |   3.229 ms |       0.69% |   3.233 ms |       0.71% |  4.166 us |   0.13% |   SAME   |
|   F64   |    2^16    |  21.143 us |       4.87% |  24.604 us |       3.11% |  3.461 us |  16.37% |   SLOW   |
|   F64   |    2^20    |  41.642 us |       2.34% |  45.376 us |       3.23% |  3.734 us |   8.97% |   SLOW   |
|   F64   |    2^24    | 417.866 us |       0.73% | 420.821 us |       0.85% |  2.955 us |   0.71% |   SAME   |
|   F64   |    2^28    |   6.432 ms |       0.40% |   6.436 ms |       0.40% |  4.449 us |   0.07% |   SAME   |

@miscco miscco force-pushed the parallel_remove branch 2 times, most recently from b75bed0 to b8e1d28 Compare February 17, 2026 13:56
@miscco miscco force-pushed the parallel_remove branch 3 times, most recently from 27e1de9 to 96446aa Compare February 24, 2026 14:54

state.exec(nvbench::exec_tag::gpu | nvbench::exec_tag::no_batch | nvbench::exec_tag::sync,
[&](nvbench::launch& launch) {
cuda::std::remove(cuda_policy(alloc, launch), in.begin(), in.end(), T{42});

Question: is do_not_optimize needed here as well?

The issue here is that the function has no return, so one would have to resort to some crazy hacks; I simply did not add it.

This implements the `remove` algorithm for the CUDA backend.

* `std::remove`, see https://en.cppreference.com/w/cpp/algorithm/remove.html
* `std::remove_if`, see https://en.cppreference.com/w/cpp/algorithm/remove.html

It provides tests and benchmarks similar to Thrust, plus some boilerplate for libcu++.

The functionality is not publicly available yet and is implemented in a private internal header.

Fixes NVIDIA#7374
Comment on lines +113 to +115
}

__stream.sync();

I would move the sync into the lifetime of __storage

Suggested change
- }
- __stream.sync();
+ __stream.sync();
+ }
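
The ordering concern behind this suggestion can be sketched with plain C++ stand-ins. Everything below is hypothetical: `mock_storage` and `mock_stream` are not CCCL types, and the sketch assumes that releasing the temporary storage is not itself ordered on the stream. It only records the order of events for the two placements of the sync.

```cpp
#include <string>
#include <vector>

using event_log = std::vector<std::string>;

// Hypothetical temporary storage: logs when it is acquired and released.
struct mock_storage
{
  event_log& log;
  explicit mock_storage(event_log& l) : log(l) { log.push_back("allocate"); }
  ~mock_storage() { log.push_back("free"); }
};

// Hypothetical stream: logs asynchronous launches and blocking syncs.
struct mock_stream
{
  event_log& log;
  void launch() { log.push_back("launch"); }
  void sync() { log.push_back("sync"); }
};

// Sync after the storage scope has closed: "free" is logged before "sync".
inline event_log sync_outside_scope()
{
  event_log log;
  mock_stream stream{log};
  {
    mock_storage storage{log};
    stream.launch();
  }
  stream.sync();
  return log;
}

// Sync inside the storage scope: the work is finished before "free".
inline event_log sync_inside_scope()
{
  event_log log;
  mock_stream stream{log};
  {
    mock_storage storage{log};
    stream.launch();
    stream.sync();
  }
  return log;
}
```

With the sync outside the scope the recorded order is allocate, launch, free, sync; with the sync inside it is allocate, launch, sync, free, so the storage outlives the work that uses it.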

::cuda::std::execution::__pstl_select_dispatch<::cuda::std::execution::__pstl_algorithm::__remove_if, _Policy>();
if constexpr (::cuda::std::execution::__pstl_can_dispatch<decltype(__dispatch)>)
{
_CCCL_NVTX_RANGE_SCOPE("cuda::std::remove_if");

Q: Why are we not putting the NVTX range across the entire algorithm, even if the range is empty? We do this for CUB (we only skip the NVTX range for the temporary storage allocation query).
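
One way to read this question is as a choice about where the RAII scope opens. The sketch below compares the two placements with a hypothetical `mock_range` that only logs push and pop events; it is not the NVTX API or the `_CCCL_NVTX_RANGE_SCOPE` macro, and the "serial fallback" branch is an assumption about what happens when dispatch is not possible. It requires C++17 for `if constexpr`.

```cpp
#include <string>
#include <vector>

using event_log = std::vector<std::string>;

// Hypothetical RAII profiler range: logs on construction and destruction.
struct mock_range
{
  event_log& log;
  mock_range(event_log& l, const char* name) : log(l) { log.push_back(std::string("push ") + name); }
  ~mock_range() { log.push_back("pop"); }
};

// Range opened only inside the dispatchable branch (the placement under review).
template <bool CanDispatch>
event_log range_inside_dispatch()
{
  event_log log;
  if constexpr (CanDispatch)
  {
    mock_range range{log, "remove_if"};
    log.push_back("parallel");
  }
  else
  {
    log.push_back("serial fallback"); // assumed fallback, not shown in the PR excerpt
  }
  return log;
}

// Range opened at function entry, covering every path through the algorithm.
template <bool CanDispatch>
event_log range_around_algorithm()
{
  event_log log;
  {
    mock_range range{log, "remove_if"};
    if constexpr (CanDispatch)
    {
      log.push_back("parallel");
    }
    else
    {
      log.push_back("serial fallback"); // assumed fallback, not shown in the PR excerpt
    }
  }
  return log;
}
```

With the range at function entry, the non-dispatchable path also shows up in a profile; with the range inside the branch, that path leaves no trace.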

@github-actions
😬 CI Workflow Results

🟥 Finished in 2h 41m: Pass: 59%/150 | Total: 3d 03h | Max: 2h 34m | Hits: 80%/144364

See results here.

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[FEA]: Implement CUDA backend for parallel cuda::std::remove

4 participants