v1.4.4
New Features and Enhancements
Core
Implemented asymmetric memory support {PR #1000}
Enhanced error handling and resource cleanup {PR #960, #951}
Improved service team handling {PR #1046}
Fixed triggered post for zero-size collectives {PR #960}
CL/HIER
Added allgatherv support {PR #1111}
Implemented node subgroup unpacking {PR #1103}
Added reduce to supported collectives {PR #997}
Fixed integer overflow in alltoall {PR #944}
TL/UCP
Split single-threaded and multithreaded send/receive operations {PR #1109}
Added knomial allgather with CUDA memory support {PR #1095}
Implemented reduce SRG knomial algorithm {PR #1058}
Added radix selection to knomial operations {PR #1072}
Added sliding window allreduce implementation {PR #958}
Added knomial allgatherv support {PR #1008}
Added sparbit algorithm for allgather {PR #940}
Extended broadcast active set support for size > 2 {PR #926}
Added knomial algorithm for reduce-scatter {PR #970}
TL/MLX5
Added multicast-based zero-copy broadcast {PR #1087}
Implemented mcast multi-group support {PR #1060}
Added non-blocking CUDA memory copy support {PR #1040}
Added device memory multicast broadcast {PR #989}
Enhanced staging-based mcast allgather algorithm {PR #994}
Improved one-sided mcast reliability initialization {PR #980}
Applied various performance optimizations to alltoall {PR #1067}
Fixed fences in all-to-all WQEs {PR #1069}
Added context option to disable all-to-all operations {PR #1062}
Improved error handling and device checks {PR #1102}
Disabled mcast in thread-multiple mode {PR #961}
TL/SHARP
Added support for allgather operation {PR #1081}
Enabled reduce-scatter with SAT support {PR #1084}
Added SHARP multi-channel support {PR #1049}
Fixed service team OOB handling {PR #1001}
Improved internal OOB usage {PR #986}
CUDA
Added linear broadcast implementation {PR #948}
Batched CUDA stream memory operations to reduce CPU and GPU execution overhead {PR #1093}
Enhanced error handling for CUDA context operations {PR #1025}
Fixed context cleanup in CUDA operations {PR #954}
Build and Test
Added support for targeting specific GPU architectures with ROCm {PR #987}
Added UCC pkg-config support {PR #1036}
Fixed build compatibility with NVC compiler {PR #1052}
Enhanced config parser functionality {PR #1092}
Enhanced ASAN/LSAN memory leak detection {PR #1074}
Added error checking and exit handling in gtests {PR #1083}
Documentation
Updated README with UCC publication information {PR #1028}
Added DOCA_UROM documentation {PR #999}
Fixed Doxygen documentation issues {PR #1038}
Enhanced code style consistency {PR #1020}
CL/DOCA_UROM
Implemented a new DOCA UROM plugin {PR #978}
Added support for offloading collective operations to DPUs
Implemented allreduce collective