
Optimize non fixed size segmented reduce for small segments using max_segment_size#7718

Open
srinivasyadav18 wants to merge 3 commits into NVIDIA:main from srinivasyadav18:opt_non_fixed_seg_reduce

Conversation

@srinivasyadav18
Contributor

Description

closes #6898

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@srinivasyadav18 srinivasyadav18 requested review from a team as code owners February 19, 2026 02:20
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 19, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 19, 2026
@srinivasyadav18 srinivasyadav18 changed the title Opt non fixed size segmented reduce for small segments using max_segment_size Optimize non fixed size segmented reduce for small segments using max_segment_size Feb 19, 2026
@github-actions
Contributor

😬 CI Workflow Results

🟥 Finished in 2h 44m: Pass: 37%/104 | Total: 4d 10h | Max: 2h 44m | Hits: 89%/39834

See results here.

Contributor

@bernhardmgruber bernhardmgruber left a comment

I think this PR is massively complicated by the fact that the segmented reduction dispatch was already refactored to the new tuning API, while the fixed size segmented dispatch was not. I strongly suggest refactoring the fixed size dispatch first (#7641) and then rebasing this PR.

Comment on lines +1 to +2
// SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
// SPDX-License-Identifier: BSD-3
Contributor

Critical: Please use the correct license header. See https://github.com/NVIDIA/cccl/wiki/Cpp-Coding-Guidelines. Applies to all new files.
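For context on the review comment: `BSD-3` is not a valid SPDX license identifier. Assuming the file is meant to carry the BSD 3-Clause license, a valid header would use `BSD-3-Clause`; the exact copyright wording should follow the CCCL wiki page linked above, so this is a sketch, not the canonical header:

```cpp
// SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
// SPDX-License-Identifier: BSD-3-Clause
```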

Comment on lines +6 to +8
using value_types = nvbench::type_list<int32_t, int64_t, float, double>;
using op_t = cub::detail::arg_max;
using some_offset_types = nvbench::type_list<int32_t>;
Contributor

Critical: Please apply the build time optimization as documented here: https://nvidia.github.io/cccl/unstable/cub/tuning.html#nvbench-attributes. Applies to variable_sum.cu as well.

Comment on lines +233 to 247
[[nodiscard]] _CCCL_API constexpr auto operator()(::cuda::arch_id /*arch*/) const -> segmented_reduce_policy
{
constexpr auto policies =
policy_selector{classify_type<AccumT>, classify_op<ReductionOpT>, int{sizeof(OffsetT)}, int{sizeof(AccumT)}};
return policies(arch);
using fs = typename policy_hub<AccumT, OffsetT, ReductionOpT>::MaxPolicy;
using rp = typename fs::ReducePolicy;
using sp = typename fs::SmallReducePolicy;
using mp = typename fs::MediumReducePolicy;
const auto base = reduce::agent_reduce_policy{
rp::BLOCK_THREADS, rp::ITEMS_PER_THREAD, rp::VECTOR_LOAD_LENGTH, rp::BLOCK_ALGORITHM, rp::LOAD_MODIFIER};
return segmented_reduce_policy{
base,
agent_warp_reduce_policy{
base.block_threads, sp::WARP_THREADS, sp::ITEMS_PER_THREAD, sp::VECTOR_LOAD_LENGTH, sp::LOAD_MODIFIER},
agent_warp_reduce_policy{
base.block_threads, mp::WARP_THREADS, mp::ITEMS_PER_THREAD, mp::VECTOR_LOAD_LENGTH, mp::LOAD_MODIFIER}};
}
Contributor

Critical: This is breaking the tuning API design, since it decouples the policy_selector_from_types from its corresponding policy_selector. The former must always be implemented as the latter.

Contributor

This should also fix the policy mismatch error you are seeing in the c.parallel and Python tests

ctk_path,
"-rdc=true",
"-dlto",
"-DCUB_DISABLE_CDP",
Contributor

Important: you need to add "-default-device" to be able to compile the new lambda you added to the kernel, see transform.cu for example
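The fix amounts to appending the flag to the compiler option list quoted above; `-default-device` is NVRTC's short form of `--device-as-default-execution-space`, which treats functions without an execution-space annotation as device code, so the new lambda compiles. A sketch in Python (the wrapper function name is hypothetical; the flags are from the snippet above):

```python
def build_compile_options(ctk_path: str) -> list[str]:
    # Hypothetical helper assembling NVRTC options for the c.parallel tests.
    return [
        ctk_path,
        "-rdc=true",
        "-dlto",
        "-DCUB_DISABLE_CDP",
        # Treat unannotated functions as __device__ so the kernel's new
        # lambda is compiled as device code (see transform.cu).
        "-default-device",
    ]
```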

}
};

using dispatch_t = cub::detail::reduce::DispatchFixedSizeSegmentedReduce<
Contributor

Nit: it seems that this alias was useful, I would reintroduce it


// Generate input data
thrust::device_vector<T> in = generate(elements);
thrust::device_vector<output_t> out(num_segments);
Contributor

Nit:

Suggested change
- thrust::device_vector<output_t> out(num_segments);
+ thrust::device_vector<output_t> out(num_segments, thrust::no_init);

Comment on lines +66 to +78
auto get_in = [&] {
if constexpr (is_argmin || is_argmax)
{
return d_indexed_in;
}
else
{
return d_raw_in;
}
};

using input_it_t = decltype(get_in());
input_it_t d_in = get_in();
Contributor

Suggested change
- auto get_in = [&] {
-   if constexpr (is_argmin || is_argmax)
-   {
-     return d_indexed_in;
-   }
-   else
-   {
-     return d_raw_in;
-   }
- };
- using input_it_t = decltype(get_in());
- input_it_t d_in = get_in();
+ auto d_in = [&] {
+   if constexpr (is_argmin || is_argmax)
+   {
+     return d_indexed_in;
+   }
+   else
+   {
+     return d_raw_in;
+   }
+ }();


// Create wrapped iterator for argmin/argmax operations
[[maybe_unused]] auto d_indexed_in = thrust::make_transform_iterator(
thrust::counting_iterator<::cuda::std::int64_t>{0},
Contributor

Suggestion: if possible

Suggested change
- thrust::counting_iterator<::cuda::std::int64_t>{0},
+ cuda::counting_iterator<::cuda::std::int64_t>{0},

{},
guaranteed_max_seg_size);

thrust::device_vector<nvbench::uint8_t> temp(temp_size);
Contributor

Suggested change
- thrust::device_vector<nvbench::uint8_t> temp(temp_size);
+ thrust::device_vector<nvbench::uint8_t> temp(temp_size, thrust::no_init);


Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Optimize device_segment_reduce for small and medium variable segment sizes

3 participants