Releases: ROCm/rocBLAS

rocBLAS 4.3.0 for ROCm 6.3.0

03 Dec 19:49
8ebd6c1

Added

  • Level 3 and EX functions have an additional ILP64 API for both C and FORTRAN (_64 name suffix) with int64_t function arguments

Changed

  • amdclang is used as the default compiler instead of hipcc
  • Internal performance scripts use amd-smi instead of the deprecated rocm-smi

Optimized

  • Improved performance of Level 2 gbmv
  • Improved performance of Level 2 gemv for float and double precisions for problem sizes where (TransA == N && m == n && m % 128 == 0), measured on a gfx942 GPU

Resolved issues

  • Fixed stbsv_strided_batched_64 Fortran binding

Upcoming changes

  • rocblas_Xgemm_kernel_name APIs are deprecated

rocBLAS 4.2.4 for ROCm 6.2.4

06 Nov 19:55
3171316

Additions

  • gfx1151 support

rocBLAS 4.2.1 for ROCm 6.2.2

27 Sep 16:01

rocBLAS code for ROCm 6.2.2 did not change. The library was rebuilt for the updated ROCm 6.2.2 stack.

rocBLAS 4.2.1 for ROCm 6.2.1

20 Sep 19:58

Removals

  • Removed the Device_Memory_Allocation.pdf link from the documentation

Fixes

  • Fixed an error/warning message emitted during a rocblas_set_stream() call

rocBLAS 4.2.0 for ROCm 6.2.0

02 Aug 16:15
54f305c

Additions

  • Level 2 functions and Level 3 trsm have an additional ILP64 API for both C and FORTRAN (_64 name suffix) with int64_t function arguments
  • Cache flush timing for gemm_batched_ex, gemm_strided_batched_ex, and axpy
  • Benchmark class for common timing code
  • An environment variable, ROCBLAS_DEFAULT_ATOMICS_MODE, to set the default atomics mode when a rocblas_handle is created
  • Extended dot_ex to support single-precision (fp32_r) input and double-precision (fp64_r) output and compute types

Optimizations

  • Improved performance of Level 1 dot_batched and dot_strided_batched for all precisions. Performance improved 6x for larger problem sizes, measured on an MI210 GPU

Changes

  • Linux AOCL dependency updated to release 4.2 gcc build
  • Windows vcpkg dependencies updated to release 2024.02.14
  • Increased the default device workspace from 32 MiB to 128 MiB for gfx9xx architectures with xx >= 40

Deprecations

  • rocblas_gemm_ex3, gemm_batched_ex3, and gemm_strided_batched_ex3 are deprecated and will be removed in the next major release of rocBLAS. Refer to hipBLASLt for future 8-bit float usage: https://github.com/ROCm/hipBLASLt

rocBLAS 4.1.2 for ROCm 6.1.2

04 Jun 16:53
8443539

Fixes

  • Fixed BF16 TT get_solutions

Optimizations

  • Tuned gfx942 BBS TN and TT

rocBLAS 4.1.0 for ROCm 6.1.1

08 May 18:00
5b85f2d

rocBLAS code for ROCm 6.1.1 did not change. The library was rebuilt for the updated ROCm 6.1.1 stack.

rocBLAS 4.1.0 for ROCm 6.1.0

16 Apr 19:10
cefa4a9

Additions

  • Level 1 and Level 1 Extension functions have an additional ILP64 API for both C and FORTRAN (_64 name suffix) with int64_t function arguments.
  • Cache flush timing for gemm_ex.

Changes

  • Some Level 2 function argument names changed from 'm' to 'n' to match Legacy BLAS; there was no change in implementation.
  • Standardized the use of non-blocking streams for copying results from device to host.

Fixes

  • Fixed host-pointer mode reductions for non-blocking streams.

rocBLAS 4.0.0 for ROCm 6.0.2

31 Jan 20:12
88df972

rocBLAS code for ROCm 6.0.2 did not change. The library was rebuilt for the updated ROCm 6.0.2 stack.

rocBLAS 4.0.0 for ROCm 6.0.0

15 Dec 18:30
88df972

Added

  • Added beta APIs rocblas_gemm_batched_ex3 and rocblas_gemm_strided_batched_ex3
  • Added input/output type f16_r/bf16_r and execution type f32_r support for Level 2 gemv_batched and gemv_strided_batched
  • Added rocblas_status_excluded_from_build, to be used when calling functions that require Tensile from a rocBLAS build without Tensile
  • Added a system for asynchronous kernel launches that sets a failure rocblas_status based on a hipPeekAtLastError discrepancy

Optimized

  • Improved trsm performance for small sizes (m < 32 && n < 32)

Deprecated

  • In a future release, atomic operations will be disabled by default so that results are repeatable. Atomic operations can always be enabled or disabled using the function rocblas_set_atomics_mode. Enabling atomic operations can improve performance.

Removed

  • rocblas_gemm_ext2 API function is removed
  • The in-place trmm API from Legacy BLAS is removed. It is replaced by an API that supports both in-place and out-of-place trmm
  • int8x4 support is removed. int8 support is unchanged
  • The #define __STDC_WANT_IEC_60559_TYPES_EXT__ has been removed from rocblas-types.h. Users who want ISO/IEC TS 18661-3:2015 functionality must define __STDC_WANT_IEC_60559_TYPES_EXT__ before including float.h, math.h, and rocblas.h
  • The default build removes device code for gfx803 architecture from the fat binary

Fixed

  • Made offset calculations for rocBLAS functions 64-bit safe. This fixes potential overflow with very large leading dimensions or increments in:
    • Level 2: gbmv, gemv, hbmv, sbmv, spmv, tbmv, tpmv, tbsv, tpsv
  • Lazy loading to support heterogeneous-architecture setups and load the appropriate Tensile library files based on each device's architecture
  • Guarded against no-op kernel launches that could result in an error being reported by hipGetLastError

Changed

  • Reduced the default verbosity of rocblas-test. To see all tests, set the environment variable GTEST_LISTENER=PASS_LINE_IN_LOG