Releases: NVIDIA/cub
CUB 1.6.1 (previously 1.5.4)
Summary
CUB 1.6.1 (previously 1.5.4) is a minor release.
Bug Fixes
- Fix radix sorting bug introduced by scan refactorization.
CUB 1.6.0 (previously 1.5.3)
Summary
CUB 1.6.0 changes the scan and reduce interfaces. Exclusive scans now accept an "initial value" instead of an "identity value". Scans and reductions now support differing input and output sequence types. Additionally, many bugs have been fixed.
Breaking Changes
- Device/block/warp-wide exclusive scans have been revised to now accept an "initial value" (instead of an "identity value") for seeding the computation with an arbitrary prefix.
- Device-wide reductions and scans can now have input sequence types that are different from output sequence types (as long as they are convertible).
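As a rough illustration of the two changes above, here is a minimal host-side reference sketch in plain C++. It is not CUB's implementation and does not use the CUB API; the helper name exclusive_scan_ref is hypothetical. It shows an exclusive scan seeded with an arbitrary initial value, with an output type that differs from the input type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not part of CUB): exclusive scan seeded
// with an arbitrary initial value. OutT may differ from InT as long as InT
// is convertible to OutT.
template <typename OutT, typename InT, typename ScanOp>
std::vector<OutT> exclusive_scan_ref(const std::vector<InT>& in,
                                     OutT initial_value,
                                     ScanOp op)
{
    std::vector<OutT> out(in.size());
    OutT running = initial_value;            // the seed becomes the first output
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;                    // result at i excludes in[i]
        running = op(running, static_cast<OutT>(in[i]));
    }
    return out;
}
```

For example, scanning the float inputs {1.5, 2.5, 3.0} with addition and a double initial value of 10.0 yields the double outputs {10.0, 11.5, 14.0}. With an identity value the first output would always have to be the operator's identity; an initial value lifts that restriction.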
Other Enhancements
- Reduce repository size by moving the doxygen binary to the doc repository.
- Minor reduction in cub::BlockScan instruction counts.
Bug Fixes
- Issue #55: Warning in cub/device/dispatch/dispatch_reduce_by_key.cuh.
- Issue #59: cub::DeviceScan::ExclusiveSum can't prefix sum of float into double.
- Issue #58: Infinite loop in cub::CachingDeviceAllocator::NearestPowerOf.
- Issue #47: cub::CachingDeviceAllocator needs to clean up CUDA global error state upon successful retry.
- Issue #46: Very high amount of needed memory from the cub::DeviceHistogram::HistogramEven.
- Issue #45: cub::CachingDeviceAllocator fails with debug output enabled.
CUB 1.5.2
Summary
CUB 1.5.2 enhances cub::CachingDeviceAllocator and improves scan performance for SM5x (Maxwell).
Enhancements
- Improved medium-size scan performance on SM5x (Maxwell).
- Refactored cub::CachingDeviceAllocator:
  - Now spends less time locked.
  - Uses C++11's std::mutex when available.
  - Failure to allocate a block from the runtime will retry once after freeing cached allocations.
  - Now respects max-bin, fixing an issue where blocks in excess of max-bin were still being retained in the free cache.
Bug Fixes
- Fix for generic-type reduce-by-key cub::WarpScan for SM3x and newer GPUs.
CUB 1.5.1
Summary
CUB 1.5.1 is a minor release.
Bug Fixes
- Fix for incorrect cub::DeviceRadixSort output for some small problems on SM52 (Maxwell) GPUs.
- Fix for macro redefinition warnings when compiling thrust::sort.
CUB 1.5.0
CUB 1.5.0 introduces segmented sort and reduction primitives.
New Features:
- Segmented variants of the device-wide sort and reduction primitives.
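To make "segmented" concrete, the following host-side C++ sketch sums each segment of a flat input independently. It is a reference for the semantics only, not CUB's API or implementation. The function name segmented_sum_ref and the choice to describe segments with a single array of num_segments + 1 offsets are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not CUB's API): sum each segment of
// `values`. Segment s covers the half-open range [offsets[s], offsets[s+1]),
// so `offsets` holds num_segments + 1 entries.
std::vector<int> segmented_sum_ref(const std::vector<int>& values,
                                   const std::vector<std::size_t>& offsets)
{
    assert(!offsets.empty());
    std::vector<int> sums(offsets.size() - 1, 0);
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        assert(offsets[s] <= offsets[s + 1] && offsets[s + 1] <= values.size());
        for (std::size_t i = offsets[s]; i < offsets[s + 1]; ++i)
            sums[s] += values[i];            // reduce within this segment only
    }
    return sums;
}
```

With values {1, 2, 3, 4, 5, 6} and offsets {0, 2, 2, 6}, the three segments reduce to {3, 0, 18}; the empty middle segment produces the sum's identity.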
Bug Fixes:
- #36: cub::ThreadLoad generates compiler errors when loading from pointer-to-const.
- #29: cub::DeviceRadixSort::SortKeys<bool> yields compiler errors.
- #26: Misaligned address after cub::DeviceRadixSort::SortKeys.
- #25: Fix for incorrect results and crashes when radix sorting 0-length problems.
- Fix CUDA 7.5 issues on SM52 GPUs with SHFL-based warp-scan and warp-reduction on non-primitive data types (e.g. user-defined structs).
- Fix small radix sorting problems where 0 temporary bytes were required and user code was invoking malloc(0) on some systems where that returns NULL. CUB assumed the user was asking for the size again and did not run the sort.
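The malloc(0) problem in the last item is easier to see with a small host-side model of a two-phase call, in which a null temporary-storage pointer means "only report the required size". The sketch below is plain C++ with a hypothetical function name and signature; it is not CUB's real interface, and it models only the hazard, not the fix that was applied.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a two-phase device-wide call (not CUB's real
// signature). A null `temp_storage` means "only report the required size".
// Returns true if the sort actually ran.
bool two_phase_sort(void* temp_storage, std::size_t& temp_storage_bytes,
                    std::vector<int>& keys)
{
    if (temp_storage == nullptr) {
        temp_storage_bytes = 0;   // assume this small problem needs no scratch
        return false;             // size query only; keys are left untouched
    }
    std::sort(keys.begin(), keys.end());
    return true;
}
```

If the caller allocates the reported 0 bytes and the allocator returns NULL, the second call looks exactly like another size query, so the keys are never sorted.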
CUB 1.4.1
Summary
CUB 1.4.1 is a minor release.
Enhancements
- Allow cub::DeviceRadixSort and cub::BlockRadixSort on bool types.
Bug Fixes
- Fix minor CUDA 7.0 performance regressions in cub::DeviceScan and cub::DeviceReduceByKey.
- Remove requirement for callers to define the CUB_CDP macro when invoking CUB device-wide routines using CUDA dynamic parallelism.
- Fix headers not being included in the proper order (or missing includes) for some block-wide functions.
CUB 1.4.0
Summary
CUB 1.4.0 adds cub::DeviceSpmv and cub::DeviceRunLengthEncode::NonTrivialRuns, improves cub::DeviceHistogram, and introduces support for SM5x (Maxwell) GPUs.
New Features:
- cub::DeviceSpmv methods for multiplying sparse matrices by dense vectors, load-balanced using a merge-based parallel decomposition.
- cub::DeviceRadixSort sorting entry-points that always return the sorted output into the specified buffer, as opposed to the cub::DoubleBuffer in which it could end up in either buffer.
- cub::DeviceRunLengthEncode::NonTrivialRuns for finding the starting offsets and lengths of all non-trivial runs (i.e., length > 1) of keys in a given sequence. Useful for top-down partitioning algorithms like MSD sorting of very-large keys.
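The "non-trivial runs" described above can be pinned down with a short host-side reference in plain C++. This is a sketch of the semantics only, not the device-wide algorithm or CUB's API; the Run struct and the function name non_trivial_runs_ref are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Run {
    std::size_t offset;   // index of the first key in the run
    std::size_t length;   // number of consecutive equal keys
};

// Hypothetical host-side reference (not CUB's API): report the starting
// offset and length of every run of equal keys whose length exceeds 1.
std::vector<Run> non_trivial_runs_ref(const std::vector<int>& keys)
{
    std::vector<Run> runs;
    std::size_t start = 0;
    while (start < keys.size()) {
        std::size_t end = start + 1;
        while (end < keys.size() && keys[end] == keys[start])
            ++end;                               // extend the current run
        if (end - start > 1)
            runs.push_back({start, end - start}); // keep only length > 1
        start = end;
    }
    return runs;
}
```

For the keys {1, 1, 2, 3, 3, 3, 4}, this reports two runs: offset 0 with length 2, and offset 3 with length 3. The singleton keys 2 and 4 are skipped.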
Other Enhancements
- Support and performance tuning for SM5x (Maxwell) GPUs.
- Updated cub::DeviceHistogram implementation that provides the same "histogram-even" and "histogram-range" functionality as IPP/NPP. Provides extremely fast and, perhaps more importantly, very uniform performance response across diverse real-world datasets, including pathological (homogeneous) sample distributions.
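As a reading aid for the "histogram-even" mode, here is a minimal host-side C++ sketch. It assumes that bins are equal-width intervals over a half-open range [lower, upper) and that samples outside the range are not counted; those details, and the function name histogram_even_ref, are assumptions for illustration rather than a statement of CUB's exact interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not CUB's API): "histogram-even" bins
// are equal-width intervals covering [lower, upper). Samples outside that
// range are not counted in this sketch.
std::vector<int> histogram_even_ref(const std::vector<float>& samples,
                                    int num_bins, float lower, float upper)
{
    assert(num_bins > 0 && lower < upper);
    std::vector<int> counts(static_cast<std::size_t>(num_bins), 0);
    const float bin_width = (upper - lower) / static_cast<float>(num_bins);
    for (float s : samples) {
        if (s < lower || s >= upper)
            continue;                    // out-of-range sample: skip
        int bin = static_cast<int>((s - lower) / bin_width);
        if (bin >= num_bins)             // guard against floating-point round-up
            bin = num_bins - 1;
        ++counts[static_cast<std::size_t>(bin)];
    }
    return counts;
}
```

With four bins over [0, 4), the samples {0.0, 0.5, 1.0, 2.5, 3.9, 4.0, -1.0} produce the counts {2, 1, 1, 1}. A "histogram-range" variant would instead take explicit, possibly uneven, bin boundaries.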
CUB 1.3.2
Summary
CUB 1.3.2 is a minor release.
Bug Fixes
- Fix cub::DeviceReduce where reductions of small problems (small enough to only dispatch a single thread block) would run in the default stream (stream zero) regardless of whether an alternate stream was specified.
CUB 1.3.1
Summary
CUB 1.3.1 is a minor release.
Bug Fixes
- Workaround for a benign WAW race warning reported by cuda-memcheck in cub::BlockScan specialized for BLOCK_SCAN_WARP_SCANS algorithm.
- Fix bug in cub::DeviceRadixSort where the algorithm may sort more key bits than the caller specified (up to the nearest radix digit).
- Fix for ~3% cub::DeviceRadixSort performance regression on SM2x (Fermi) and SM3x (Kepler) GPUs.
CUB 1.3.0
Summary
CUB 1.3.0 improves how thread blocks are expressed in block- and warp-wide primitives and adds an enhanced version of cub::WarpScan.
Breaking Changes
- CUB's collective (block-wide, warp-wide) primitives underwent a minor interface refactoring:
  - To provide the appropriate support for multidimensional thread blocks, the interfaces for collective classes are now template-parameterized by X, Y, and Z block dimensions (with BLOCK_DIM_Y and BLOCK_DIM_Z being optional, and BLOCK_DIM_X replacing BLOCK_THREADS). Furthermore, the constructors that accept remapped linear thread-identifiers have been removed: all primitives now assume a row-major thread-ranking for multidimensional thread blocks.
  - To allow the host program (compiled by the host-pass) to accurately determine the device-specific storage requirements for a given collective (compiled for each device-pass), the interfaces for collective classes are now (optionally) template-parameterized by the desired PTX compute capability. This is useful when aliasing collective storage to shared memory that has been allocated dynamically by the host at the kernel call site.
  - Most CUB programs having typical 1D usage should not require any changes to accommodate these updates.
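The row-major thread-ranking mentioned above can be written out as a small formula. The sketch below is a host-side C++ helper with a hypothetical name, assuming that x varies fastest, then y, then z; it is meant to clarify the ordering, not to reproduce CUB's internal code.

```cpp
#include <cassert>

// Hypothetical host-side helper (not CUB's API): row-major linear rank of
// thread (x, y, z) in a block of dim_x * dim_y * dim_z threads, with x
// varying fastest.
unsigned row_major_rank(unsigned x, unsigned y, unsigned z,
                        unsigned dim_x, unsigned dim_y, unsigned dim_z)
{
    assert(x < dim_x && y < dim_y && z < dim_z);
    return (z * dim_y + y) * dim_x + x;
}
```

In a 4 x 2 x 2 block, thread (3, 1, 1) has rank 15 and thread (1, 0, 1) has rank 9. For a 1D block the rank is simply x, which is consistent with typical 1D usage needing no changes.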
New Features
- Added "combination" cub::WarpScan methods for efficiently computing both inclusive and exclusive prefix scans (and sums).
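The idea behind a "combination" scan is that the inclusive and exclusive results share the same running total, so both can be produced in one pass. The following host-side C++ sketch illustrates that for prefix sums; the struct and function names are hypothetical, and it says nothing about how the warp-wide version is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ScanPair {
    std::vector<int> inclusive;
    std::vector<int> exclusive;
};

// Hypothetical host-side reference (not CUB's API): compute the inclusive
// and exclusive prefix sums together in a single pass.
ScanPair combined_sum_ref(const std::vector<int>& in)
{
    ScanPair out{std::vector<int>(in.size()), std::vector<int>(in.size())};
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out.exclusive[i] = running;   // sum of the elements before i
        running += in[i];
        out.inclusive[i] = running;   // sum of the elements up to and including i
    }
    return out;
}
```

For the input {1, 2, 3, 4}, the exclusive sums are {0, 1, 3, 6} and the inclusive sums are {1, 3, 6, 10}.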
Bug Fixes
- Fix for bug in cub::WarpScan (which affected cub::BlockScan and cub::DeviceScan) where incorrect results (e.g., NAN) would often be returned when parameterized for floating-point types (fp32, fp64).
- Workaround for ptxas error when compiling with -G flag on Linux (for debug instrumentation).
- Fixes for certain scan scenarios using custom scan operators where code compiled for SM1x is run on newer GPUs of higher compute-capability: the compiler could not tell which memory space was being used by collective operations and was mistakenly using global ops instead of shared ops.