CCCL 2.3.0
What’s New
In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.
System Headers and Warnings
Users don't want to see warnings from CCCL headers. The typical way to accomplish this with header libraries is to use -isystem. However, this causes problems when using CCCL from GitHub, because the GitHub copy will conflict with the CCCL headers in the CTK. Therefore, you should always include CCCL headers via -I.
To achieve the same effect as -isystem, CCCL headers now use the system_header pragma. For more information, see #527.
TL;DR: You should never see warnings emitted from a CCCL header ever again!
Linkage Issues
Using CUB and Thrust in shared libraries is a known source of issues. Previously, the solution to these issues consisted of using the THRUST_CUB_WRAPPED_NAMESPACE macro so that different shared libraries have different symbol names. However, this solution has poor discoverability, since issues present themselves in the form of segmentation faults, hangs, wrong results, etc. As of the 2.3 release, linkage issues are addressed by default without the need for THRUST_CUB_WRAPPED_NAMESPACE. Although the fix is API compatible, it might cause ABI compatibility issues. For more details, see issue #443.
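The idea behind the older namespace-wrapping workaround can be sketched in plain C++. This is a toy illustration only, not Thrust's or CUB's actual implementation: SKETCH_DEFINE_WRAPPED_LIBRARY, toy_algo, lib_a, and lib_b are hypothetical names invented for this example. It shows why wrapping the same header-only code in different outer namespaces gives each shared library its own symbols.

```cpp
#include <cassert>

// Hypothetical stand-in for a header-only library whose namespace can be
// wrapped by the including project. All names here are illustrative.
#define SKETCH_DEFINE_WRAPPED_LIBRARY(WRAP_NS, TUNING)                 \
  namespace WRAP_NS { namespace toy_algo {                             \
    /* Same function name, but a distinct symbol per wrapper. */       \
    inline int items_per_thread() { return TUNING; }                   \
  } }

// Two "shared libraries" that were built with different configurations.
// Without the outer namespaces, both definitions would share one symbol
// name and the linker or loader could pick either of them.
SKETCH_DEFINE_WRAPPED_LIBRARY(lib_a, 4)
SKETCH_DEFINE_WRAPPED_LIBRARY(lib_b, 8)

// Both variants coexist in one program without clashing.
inline int sum_of_tunings() {
  return lib_a::toy_algo::items_per_thread() +
         lib_b::toy_algo::items_per_thread();
}
```

With the real macro, a project would define THRUST_CUB_WRAPPED_NAMESPACE to its own namespace name before including any Thrust or CUB header. As of 2.3 this should no longer be needed for the linkage issues described above.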
Thrust
thrust::tuple, thrust::pair, and thrust::complex have been replaced with cuda::std alternatives. This can be a breaking change, but should be source compatible.
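As a rough illustration of what "source compatible" means here, the sketch below uses only the std-style tuple and pair interface (get, make_tuple, make_pair) that cuda::std provides. The namespace alias compat and the helpers sum_fields and min_max are hypothetical names for this example. As an assumption made purely so the snippet builds without CCCL on the include path, it falls back to the host standard library when <cuda/std/tuple> is not found.

```cpp
#include <cassert>

// Prefer the libcudacxx headers when they are available.
#ifdef __has_include
#  if __has_include(<cuda/std/tuple>)
#    include <cuda/std/tuple>
#    include <cuda/std/utility>
#    define SKETCH_HAVE_CCCL 1
#  endif
#endif

#ifdef SKETCH_HAVE_CCCL
namespace compat = cuda::std;  // the types now backing the Thrust ones
#else
#  include <tuple>
#  include <utility>
namespace compat = std;        // fallback stand-in (assumption, see above)
#endif

// Hypothetical helper: code written against the std-style interface
// does not care which implementation sits behind the alias.
inline int sum_fields(const compat::tuple<int, int, int>& t) {
  return compat::get<0>(t) + compat::get<1>(t) + compat::get<2>(t);
}

// Hypothetical helper returning the smaller value first.
inline compat::pair<int, int> min_max(int a, int b) {
  return a < b ? compat::make_pair(a, b) : compat::make_pair(b, a);
}
```

Code that relied on details of the old Thrust implementations, rather than on this common interface, is where breakage would be most likely.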
CUB
Up to 60% performance improvement for cub::DeviceSelect::UniqueByKey, cub::DeviceScan::ExclusiveSumByKey, and cub::DeviceReduce::ReduceByKey on A100. cub::DeviceSegmentedReduce now supports 64-bit indexing.
libcudacxx
- The cuda::ptx namespace and <cuda/ptx> header are now available and provide access to inline PTX functions that wrap various async memcpy and barrier intrinsics (#379).
- Added experimental bulk TMA memcpy under <cuda/barrier>
What's Changed
- Port cub::DeviceSegmentedReduce tests to catch2 by @elstehle in #303
- Branch/2.2.x by @gevtushenko in #305
- Tune unique by key on A100 by @gevtushenko in #306
- Merge branch/2.2.x to main by @jrhemstad in #308
- Add example cmake project by @jrhemstad in #177
- Adds catch2 tests for reduce-by-key by @elstehle in #311
- Tune scan by key on A100 by @gevtushenko in #325
- Replace diag_suppress by nv_diag_suppress in documentation by @ahendriksen in #281
- Fix MSVC / CUB tests build by @gevtushenko in #336
- gdb pretty printer: handle non-cuda device vectors by @siboehm in #264
- Add a nvrtc configuration for libcu++ by @miscco in #202
- GH Infra: project automation and issue template fixes by @jarmak-nv in #297
- Tune reduce by key on A100 by @gevtushenko in #346
- Merge commits from 2.2 branch by @miscco in #350
- Fix a shadow warning in thrust's execute_with_dependencies.h by @hageboeck in #334
- Assorted fixes for MSVC 2017 by @miscco in #341
- [skip-tests] Guard inline variables with _LIBCUDACXX_INLINE_VAR macro by @miscco in #355
- Port cub::DeviceScan tests to catch2 by @elstehle in #347
- Remove _NOEXCEPT macro in favor of noexcept in libcu++ by @Blonck in #349
- Project Automation: add conditional steps due to context errors by @jarmak-nv in #353
- Work around strange gcc bug by @miscco in #363
- Implement iter_swap CPO by @miscco in #332
- Replace default, constexpr, and delete macros by original keywords by @Blonck in #360
- Add clang16 devcontainer and CI job by @miscco in #362
- [skip-tests] Skip merge conflict from old iter_swap PR by @miscco in #369
- [skip-tests] Also skip all CI runs that require a GPU when [skip-tests] is set by @miscco in #370
- Remove _LIBCUDACXX_CXX03_LANG macro and all encapsulated code by @Blonck in #368
- Remove checks against _LIBCUDACXX_STD_VER < 11 by @Blonck in #375
- Use copy-pr-bot by @ajschmidt8 in #381
- Implement the permutable concept by @miscco in #367
- [NFC] We missed some _NOEXCEPT_ macro uses by @miscco in #371
- Implement identity changes for c++20 by @miscco in #383
- Hide third party cmake options in our cmake developer builds. by @allisonvacanti in #300
- Port cub::DeviceScanByKey tests to Catch2 by @elstehle in #380
- Fixes a race in DeviceRunLengthEncode::NonTrivialRuns by @elstehle in #399
- Add commit information to the test output by @miscco in #401
- Project Automation: Handle PRs opened as non-draft + multiple bug fixes by @jarmak-nv in #387
- Project Automation: set Roadmap project value on issue/pr close and Auto-type new issues by @jarmak-nv in #389
- Add support for tests that should fail at runtime by @ahendriksen in #418
- Port DeviceAdjacentDifference::SubtractRight tests to catch2 by @miscco in #390
- Project automation - Fix indentation for continue-on-error by @jarmak-nv in #425
- [BUG] Ensure that all headers build on their own by @miscco in #200
- Remove util_device.cuh from iterator headers to enable online compilation by @leofang in #412
- Fix ci-overview example by @gevtushenko in #428
- Port cub::DeviceRunLengthEncode tests to catch2 by @miscco in #411
- Add cuda::device::barrier_arrive tx by @ahendriksen in #358
- Fix CubDebug by @gevtushenko in #430
- Do not use static member functions to initialize static member variables. by @miscco in #438
- Implement the projected helper struct by @miscco in #385
- Add PTX wrapping functions for TMA features by @ahendriksen in #379
- Clarify docstring for num_items parameter of DeviceSegmentedRadixSort by @HapeMask in #320
- Enable lit to determine the compute architectures by @miscco in #447
- Add NVRTC_SKIP_KERNEL_RUN tag to compile, but skip running NVRTC test by @ahendriksen in #434
- Improve documentation of cuda::barrier by @ahendriksen in #440
- Extend thrust::complex unit tests to prepare for upcoming replacement with std::complex by @Blonck in #413
- Remove having two install rules for -header-search.cmake by @robertmaynard in #298
- Run .devcontainer/launch.sh with bash + add error checking by @wence- in #407
- Remove C++03 compatability from unit tests by @Blonck in #378
- [libcu++] Fix use of __ppc64__ by @miscco in #451
- Update the README by @jrhemstad in #291
- [libcu++] Try to avoid gcc misscompilation issues by @miscco in #452
- Consolidate matrix logic into single script/job by @jrhemstad in #361
- Implement the indirectly_comparable concept by @miscco in #445
- Fix compute matrix dropping trailing zeros by @jrhemstad in #466
- Avoid integer promotion warnings with MSVC by @miscco in #460
- Implement ranges comparison objects by @miscco in #464
- Fix CUB/MSVC/RDC tests by @gevtushenko in #469
- Fix Thrust/CUB Linkage Issues by @gevtushenko in #443
- Script for Running CUB Benchmarks by @gevtushenko in #472
- [skip ci] Add list of CCCL users to README by @jrhemstad in #474
- constexpr all the things by @pb-dseifert in #476
- Add Gonzalo/Allard to trustees by @jrhemstad in #482
- Implement the sortable concept by @miscco in #471
- [libcu++] Add _LIBCUDACXX_CUDACC_BELOW_12_3 macro by @gonzalobg in #479
- Refactor thrust::complex as a struct derived from cuda::std::complex by @Blonck in #454
- Add ci scripts for windows by @miscco in #251
- Enable complex interop on MSVC by @miscco in #490
- [skip ci] Add related projects to readme. by @jrhemstad in #492
- Reenable nvrtc tests by @miscco in #488
- Implement the mergeable concept by @miscco in #484
- 64-bit indexing for DeviceSegmentedReduce by @jecs in #414
- Implement move_sentinel by @miscco in #496
- Support skipped benches in run script by @gevtushenko in #508
- Implement unreachable_sentinel by @miscco in #506
- Disable flaky barrier tests by @miscco in #510
- Add constant initialization of managed variable to silence gcc warning by @miscco in #509
- Add verbose flag to ninja build. by @jrhemstad in #491
- Add devcontainer readme by @jrhemstad in #481
- Add contributor guide by @jrhemstad in #500
- [skip ci] Fix devcontainer guide link by @jrhemstad in #518
- [skip ci] Add example godbolt link. by @jrhemstad in #519
- Replace cuda::atomic with legacy functions for old arch compatibility. by @allisonvacanti in #516
- Simplify examples matrix. by @jrhemstad in #517
- Disable PR workflow triggering on pushes to main. by @jrhemstad in #532
- Add CI job to verify devcontainers are always up to date by @jrhemstad in #514
- [CI] Sink error when git repo is missing from build. by @wmaxey in #533
- Rework our tuple implementation to work with older MSVC by @miscco in #530
- Add jobs using clang as CUDA compiler by @jrhemstad in #493
- Remove cudaDeviceSetSharedMemConfig from CUB tests by @gevtushenko in #538
- Implement __bounded_iter by @miscco in #540
- Fix cub::BlockAdjacentDifference documentation by @pauleonix in #542
- Add cuda::device::memcpy_async_tx by @ahendriksen in #405
- Introduce Thrust benchmarks by @gevtushenko in #534
- Fix MSVC benchmarks build by @gevtushenko in #536
- Fix nvc++ as host compiler by @gevtushenko in #560
- Add missing overload definition of thrust::complex operator!= by @srinivasyadav18 in #564
- Make template parameters consistent in thrust::complex operators by @srinivasyadav18 in #555
- Migrate CI configs to CMake presets. by @allisonvacanti in #324
- Replace thrust::detail::integral_constant with libcudacxx implementation by @ZelboK in #561
- Add cuda::device::barrier_expect_tx by @ahendriksen in #498
- Add ARM build configs for latest gcc/clang. by @jrhemstad in #468
- Fea/486 Improve thrust::complex operators compile time throughput by @srinivasyadav18 in #567
- Define compiler env vars for CMake in dev containers. by @allisonvacanti in #576
- Revert back to working nvbench commit by @miscco in #582
- use clang-format in dev containers by @miscco in #513
- Introduce CCCL clang-format by @gevtushenko in #551
- Add cp.async.bulk global -> shared support to cuda::memcpy_async by @ahendriksen in #501
- [skip ci] Also update the base image by @miscco in #584
- Replace thrust::tuple implementation with cuda::std::tuple by @miscco in #262
- Fix clangd integration by @gevtushenko in #588
- Always treat CCCL as system headers by @miscco in #531
- Refactor inline comments by @gevtushenko in #581
- Relax Catch2 include order requirements by @gevtushenko in #601
- Project Automation - Fix issue/pr sync workflow by @jarmak-nv in #504
- [skip-tests] Add a preset that builds all configs of all projects. by @allisonvacanti in #580
- Implement ranges::advance by @miscco in #546
- Update status check job to check status of precursor jobs by @jrhemstad in #605
- Report times for libcudacxx tests in CI by @jrhemstad in #606
- Fix bug in the construct_at optimization by @miscco in #608
- [skip-tests] Disable rdc tests for windows. by @miscco in #615
- Implement ranges::next by @miscco in #611
- Support FP8 in radix sort by @gevtushenko in #623
- Fix examples/cccl_infra mixup in ci. by @wmaxey in #633
- Fixes block-scope run-length decode one-past-the-end memory access into smem TempStorage by @elstehle in #626
- Harmonize CUB includes by @gevtushenko in #632
- Create NVRTCC, a utility for running tests under NVRTC by @wmaxey in #494
- Fix typo and grammar errors by @VaibhavWakde52 in #639
- [Backport branch/2.3.x] Add CCCL_VERSION and script for updating version by @github-actions in #667
- Backport 574 ptx by @miscco in #663
- [Backport branch/2.3.x] Fix C++11 support of recently added tests by @github-actions in #658
- [Backport branch/2.3.x] Update CUDA newest to CTK 12.3 by @github-actions in #1072
- [Backport to branch/2.3.x] Rework our system header approach to be more error proof (#661) by @miscco in #675
- [Backport branch/2.3.x] Fix fallback when checking git repo by @github-actions in #1086
- [Backport branch/2.3.x] Currently the verbose option does not work beacuse of a typo in the argument handling by @github-actions in #1090
- [Backport branch/2.3.x] Add cuda::ptx::st_async by @github-actions in #1093
- [Backport branch/2.3.x] Add cuda::ptx::red_async by @github-actions in #1094
- Backport PR #1075 by @wmaxey in #1100
- [Backport branch/2.3.x] Add cuda::ptx:mbarrier_{try/test}_wait{_parity} by @github-actions in #1106
- [Backport branch/2.3.x] Fix cuda::ptx::red.async for int32_t types by @github-actions in #1107
- [Backport branch/2.3.x] Fix local test runs with lit by @github-actions in #1110
- [Backport branch/2.3.x] Fix config when only non-CDPv1 arches are enabled. by @github-actions in #1111
- [Backport branch/2.3.x] Fix GCC6 / FP8 warning by @github-actions in #1131
- [Backport branch/2.3.x] Fix ptx.st.async.compile.pass.cpp failing in C++11. by @github-actions in #1136
- BACKPORT: Fix _LIBCUDACXX_UNREACHABLE for old MSVC (#1114) by @miscco in #1143
- [2.3.x] Backport benchmarking PRs by @wmaxey in #1168
- Backport P0 filter commit. by @wmaxey in #1172
- [BACKPORT] Implement math functions for thrust::complex by @miscco in #1191
- Backport fix icc / cub (#1152) by @wmaxey in #1171
- [BACKPORT]: Fix availability of is_constant_evaluated on old MSVC by @miscco in #1198
- [BACKPORT] Add icc to the ci matrix by @miscco in #1209
- [BACKPORT]: Add missing overloads for thrust::pow by @miscco in #1223
New Contributors
- @siboehm made their first contribution in #264
- @hageboeck made their first contribution in #334
- @Blonck made their first contribution in #349
- @leofang made their first contribution in #412
- @HapeMask made their first contribution in #320
- @jecs made their first contribution in #414
- @pauleonix made their first contribution in #542
- @srinivasyadav18 made their first contribution in #564
- @ZelboK made their first contribution in #561
- @VaibhavWakde52 made their first contribution in #639
Full Changelog: v2.2.0...2.3.0