|
6 | 6 |
|
7 | 7 | ## Current
|
8 | 8 |
|
| 9 | +## 1.4.0 (TBD) |
| 10 | + |
| 11 | +## New Features and Enhancements |
| 12 | + |
| 13 | +### Core |
| 14 | +- Implemented asymmetric memory support {PR #1000} |
| 15 | +- Enhanced error handling and resource cleanup {PR #960, #951} |
| 16 | +- Improved service team handling {PR #1046} |
| 17 | +- Fixed triggered post for zero size collectives {PR #960} |
| 18 | + |
| 19 | +### CL/HIER |
| 20 | +- Added allgatherv support {PR #1111} |
| 21 | +- Implemented node subgroup unpacking {PR #1103} |
| 22 | +- Added reduce to supported collectives {PR #997} |
| 23 | +- Fixed integer overflow in alltoall {PR #944} |
| 24 | + |
| 25 | +### TL/UCP |
| 26 | +- Split single and multithreaded send/receive operations {PR #1109} |
| 27 | +- Added knomial allgather with CUDA memory support {PR #1095} |
| 28 | +- Implemented reduce SRG knomial algorithm {PR #1058} |
| 29 | +- Added radix selection to knomial operations {PR #1072} |
| 30 | +- Added sliding window allreduce implementation {PR #958} |
| 31 | +- Added knomial allgatherv support {PR #1008} |
| 32 | +- Added sparbit algorithm for allgather {PR #940} |
| 33 | +- Extended broadcast active set support for size > 2 {PR #926} |
| 34 | +- Added knomial algorithm for reduce-scatter {PR #970} |
| 35 | + |
| 36 | +### TL/MLX5 |
| 37 | +- Added multicast-based zero-copy broadcast {PR #1087} |
| 38 | +- Implemented mcast multi-group support {PR #1060} |
| 39 | +- Added non-blocking CUDA memory copy support {PR #1040} |
| 40 | +- Added device memory multicast broadcast {PR #989} |
| 41 | +- Enhanced mcast allgather staging-based algorithm {PR #994} |
| 42 | +- Improved one-sided mcast reliability initialization {PR #980} |
| 43 | +- Made various performance optimizations in alltoall {PR #1067} |
| 44 | +- Fixed fences in alltoall WQEs {PR #1069} |
| 45 | +- Added context option to disable alltoall operations {PR #1062} |
| 46 | +- Improved error handling and device checks {PR #1102} |
| 47 | +- Disabled mcast for thread multiple mode {PR #961} |
| 48 | + |
| 49 | +### TL/SHARP |
| 50 | +- Added support for allgather operation {PR #1081} |
| 51 | +- Enabled reduce-scatter with SAT support {PR #1084} |
| 52 | +- Added SHARP multi-channel support {PR #1049} |
| 53 | +- Fixed service team OOB handling {PR #1001} |
| 54 | +- Improved internal OOB usage {PR #986} |
| 55 | + |
| 56 | +### CUDA |
| 57 | +- Added linear broadcast implementation {PR #948} |
| 58 | +- Batched CUDA stream memory operations to reduce CPU and GPU execution overhead {PR #1093} |
| 59 | +- Enhanced error handling for CUDA context operations {PR #1025} |
| 60 | +- Fixed context cleanup in CUDA operations {PR #954} |
| 61 | + |
| 62 | +### Build and Test |
| 63 | +- Added support for specific GPU architectures with ROCm {PR #987} |
| 64 | +- Added UCC pkg-config support {PR #1036} |
| 65 | +- Fixed build compatibility with NVC compiler {PR #1052} |
| 66 | +- Enhanced config parser functionality {PR #1092} |
| 67 | +- Enhanced ASAN/LSAN memory leak detection {PR #1074} |
| 68 | +- Added error checking and exit handling in gtests {PR #1083} |
| 69 | + |
| 70 | +### Documentation |
| 71 | +- Updated README with UCC publication information {PR #1028} |
| 72 | +- Added DOCA_UROM documentation {PR #999} |
| 73 | +- Fixed Doxygen documentation issues {PR #1038} |
| 74 | +- Enhanced code style consistency {PR #1020} |
| 75 | + |
| 76 | +### CL/DOCA_UROM |
| 77 | +- Implemented new DOCA UROM plugin {PR #978} |
| 78 | +- Added support for offloading collective operations to DPUs |
| 79 | +- Implemented allreduce collective |
| 80 | + |
| 81 | +## 1.3.0 (April 18th, 2024) |
| 82 | + |
9 | 83 | ## New Features and Enhancements
|
10 | 84 |
|
11 | 85 | ### CL/HIER
|
|
207 | 281 | - Added support for multithreaded context progress
|
208 | 282 | - Added support for nonblocking team destroy
|
209 | 283 |
|
210 |
| -#### CL |
| 284 | +#### CL |
211 | 285 |
|
212 | 286 | - Added support for hierarchical collectives
|
213 | 287 | - Added support for hierarchical allreduce collective operation
|
|
219 | 293 |
|
220 | 294 | ##### UCP
|
221 | 295 |
|
222 |
| -- Added Bcast SAG algorithm for large messages |
223 |
| -- Added Knomial based reduce algorithm |
| 296 | +- Added Bcast SAG algorithm for large messages |
| 297 | +- Added Knomial based reduce algorithm |
224 | 298 | - Made allgather and alltoall agree with the API
|
225 | 299 | - Added SRA knomial allreduce algorithm
|
226 | 300 | - Added pairwise alltoall and alltoallv algorithms
|
227 |
| -- Added allgather and allgatherv ring algorithms |
| 301 | +- Added allgather and allgatherv ring algorithms |
228 | 302 | - Added support for collective operations based on one-sided semantics
|
229 | 303 | - Added support for alltoall with one-sided transfer semantics
|
230 | 304 | - Bug fixes
|
|
237 | 311 | scatter, bcast, allgather and allgatherv
|
238 | 312 |
|
239 | 313 | #### Tests
|
240 |
| -- Updated tests to test the newly added algorithms and operations |
| 314 | +- Updated tests to test the newly added algorithms and operations |
241 | 315 |
|
242 | 316 |
|
243 | 317 | ## 0.1.0 (TBD)
|
|
256 | 330 | - Added support for configuring UCC library and contexts
|
257 | 331 |
|
258 | 332 |
|
259 |
| -#### CL |
| 333 | +#### CL |
260 | 334 |
|
261 | 335 | - Added support for collectives where the source and destination are either in
|
262 |
| - CPU or device (GPU) |
| 336 | + CPU or device (GPU) |
263 | 337 | - Added support for UCC_THREAD_MULTIPLE
|
264 |
| -- Added support for CUDA stream-based collectives |
| 338 | +- Added support for CUDA stream-based collectives |
265 | 339 |
|
266 | 340 |
|
267 | 341 | #### TL
|
|