
Optimize non fixed size segmented reduce for small segments using max_segment_size#7718

Open
srinivasyadav18 wants to merge 3 commits into NVIDIA:main from srinivasyadav18:opt_non_fixed_seg_reduce

Conversation

@srinivasyadav18
Contributor

Description

closes #6898

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@srinivasyadav18 srinivasyadav18 requested review from a team as code owners February 19, 2026 02:20
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 19, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 19, 2026
@srinivasyadav18 srinivasyadav18 changed the title Opt non fixed size segmented reduce for small segments using max_segment_size Optimize non fixed size segmented reduce for small segments using max_segment_size Feb 19, 2026
@github-actions
Contributor

😬 CI Workflow Results

🟥 Finished in 2h 44m: Pass: 37%/104 | Total: 4d 10h | Max: 2h 44m | Hits: 89%/39834

See results here.

Contributor

@bernhardmgruber bernhardmgruber left a comment

I think this PR is massively complicated by the fact that the segmented reduction dispatch was already refactored to the new tuning API, while the fixed size segmented dispatch was not. I strongly suggest refactoring the fixed size dispatch first (#7641) and then rebasing this PR.

Comment on lines +1 to +2
// SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
// SPDX-License-Identifier: BSD-3
Contributor

Critical: Please use the correct license header. See https://github.com/NVIDIA/cccl/wiki/Cpp-Coding-Guidelines. Applies to all new files.
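For context on the review comment: `BSD-3` is not a valid SPDX license identifier. Assuming the file is meant to carry the BSD 3-Clause license, a valid header would use `BSD-3-Clause`; the exact copyright wording should follow the CCCL wiki page linked above, so this is a sketch, not the canonical header:

```cpp
// SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
// SPDX-License-Identifier: BSD-3-Clause
```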

Comment on lines +6 to +8
using value_types = nvbench::type_list<int32_t, int64_t, float, double>;
using op_t = cub::detail::arg_max;
using some_offset_types = nvbench::type_list<int32_t>;
Contributor

Critical: Please apply the build time optimization as documented here: https://nvidia.github.io/cccl/unstable/cub/tuning.html#nvbench-attributes. Applies to variable_sum.cu as well.

Comment on lines +233 to 247
[[nodiscard]] _CCCL_API constexpr auto operator()(::cuda::arch_id /*arch*/) const -> segmented_reduce_policy
{
constexpr auto policies =
policy_selector{classify_type<AccumT>, classify_op<ReductionOpT>, int{sizeof(OffsetT)}, int{sizeof(AccumT)}};
return policies(arch);
using fs = typename policy_hub<AccumT, OffsetT, ReductionOpT>::MaxPolicy;
using rp = typename fs::ReducePolicy;
using sp = typename fs::SmallReducePolicy;
using mp = typename fs::MediumReducePolicy;
const auto base = reduce::agent_reduce_policy{
rp::BLOCK_THREADS, rp::ITEMS_PER_THREAD, rp::VECTOR_LOAD_LENGTH, rp::BLOCK_ALGORITHM, rp::LOAD_MODIFIER};
return segmented_reduce_policy{
base,
agent_warp_reduce_policy{
base.block_threads, sp::WARP_THREADS, sp::ITEMS_PER_THREAD, sp::VECTOR_LOAD_LENGTH, sp::LOAD_MODIFIER},
agent_warp_reduce_policy{
base.block_threads, mp::WARP_THREADS, mp::ITEMS_PER_THREAD, mp::VECTOR_LOAD_LENGTH, mp::LOAD_MODIFIER}};
}
Contributor

Critical: This is breaking the tuning API design, since it decouples the policy_selector_from_types from its corresponding policy_selector. The former must always be implemented as the latter.

Contributor

This should also fix the policy mismatch error you are seeing in the c.parallel and Python tests

ctk_path,
"-rdc=true",
"-dlto",
"-DCUB_DISABLE_CDP",
Contributor

Important: you need to add "-default-device" to be able to compile the new lambda you added to the kernel, see transform.cu for example
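The fix amounts to appending the flag to the compiler option list quoted above; `-default-device` is NVRTC's short form of `--device-as-default-execution-space`, which treats functions without an execution-space annotation as device code, so the new lambda compiles. A sketch in Python (the wrapper function name is hypothetical; the flags are from the snippet above):

```python
def build_compile_options(ctk_path: str) -> list[str]:
    # Hypothetical helper assembling NVRTC options for the c.parallel tests.
    return [
        ctk_path,
        "-rdc=true",
        "-dlto",
        "-DCUB_DISABLE_CDP",
        # Treat unannotated functions as __device__ so the kernel's new
        # lambda is compiled as device code (see transform.cu).
        "-default-device",
    ]
```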

}
};

using dispatch_t = cub::detail::reduce::DispatchFixedSizeSegmentedReduce<
Contributor

Nit: it seems that this alias was useful, I would reintroduce it


// Generate input data
thrust::device_vector<T> in = generate(elements);
thrust::device_vector<output_t> out(num_segments);
Contributor

Nit:

Suggested change
- thrust::device_vector<output_t> out(num_segments);
+ thrust::device_vector<output_t> out(num_segments, thrust::no_init);

Comment on lines +66 to +78
auto get_in = [&] {
if constexpr (is_argmin || is_argmax)
{
return d_indexed_in;
}
else
{
return d_raw_in;
}
};

using input_it_t = decltype(get_in());
input_it_t d_in = get_in();
Contributor

Suggested change
- auto get_in = [&] {
-   if constexpr (is_argmin || is_argmax)
-   {
-     return d_indexed_in;
-   }
-   else
-   {
-     return d_raw_in;
-   }
- };
- using input_it_t = decltype(get_in());
- input_it_t d_in = get_in();
+ auto d_in = [&] {
+   if constexpr (is_argmin || is_argmax)
+   {
+     return d_indexed_in;
+   }
+   else
+   {
+     return d_raw_in;
+   }
+ }();


// Create wrapped iterator for argmin/argmax operations
[[maybe_unused]] auto d_indexed_in = thrust::make_transform_iterator(
thrust::counting_iterator<::cuda::std::int64_t>{0},
Contributor

Suggestion: if possible

Suggested change
- thrust::counting_iterator<::cuda::std::int64_t>{0},
+ cuda::counting_iterator<::cuda::std::int64_t>{0},

{},
guaranteed_max_seg_size);

thrust::device_vector<nvbench::uint8_t> temp(temp_size);
Contributor

Suggested change
- thrust::device_vector<nvbench::uint8_t> temp(temp_size);
+ thrust::device_vector<nvbench::uint8_t> temp(temp_size, thrust::no_init);


Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Optimize device_segment_reduce for small and medium variable segment sizes

3 participants