squarePacked GEMM. #586

madanm3 · 2021-12-15T11:48:07Z

The squarePacked (sqp) GEMM framework mainly targets real and complex GEMM operations on small, square matrices (m = k = n <= 512, with m a multiple of 8).
The first-level implementation covers double precision and double-complex precision only.
The framework uses a set of column-major dgemm kernels, the mx8 set: {8mx6n, 8mx5n, 8mx4n, 8mx3n, 8mx2n, 8mx1n}.
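As an illustration of how a kernel set with n widths 6 down to 1 can cover an arbitrary number of columns, here is a minimal sketch. The greedy widest-first policy and the names `KERNEL_WIDTHS` and `split_columns` are assumptions made for this example; the PR does not state how the framework actually selects kernels.

```python
# n widths of the mx8 kernel set listed in the description (8mx6n ... 8mx1n).
KERNEL_WIDTHS = (6, 5, 4, 3, 2, 1)


def split_columns(n, widths=KERNEL_WIDTHS):
    """Cover n columns with kernel widths, widest first (hypothetical policy)."""
    plan = []
    remaining = n
    for w in widths:
        while remaining >= w:
            plan.append(w)
            remaining -= w
    return plan
```

With this policy, 16 columns would be handled as two 6-wide calls followed by one 4-wide call.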
In dgemm, the A matrix is always packed by default, and for AtxB operations the transpose of A is performed during packing.
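The idea of folding the transpose into the packing step can be sketched as follows. This is only an illustration: the panel layout (panels of `mr` rows, stored column by column) and the name `pack_a_panels` are assumptions, not the layout used in the PR.

```python
def pack_a_panels(a, m, k, lda, transpose=False, mr=8):
    """Pack op(A), an m x k operand, into contiguous panels of mr rows.

    `a` is a flat column-major buffer with leading dimension `lda`.
    If transpose is False, op(A)[i, p] is read from a[i + p * lda].
    If transpose is True, the buffer holds A^T (k x m), so op(A)[i, p]
    is read from a[p + i * lda]; no separate transpose pass is needed.
    """
    if m % mr != 0:
        raise ValueError("this sketch assumes m is a multiple of mr")
    packed = []
    for i0 in range(0, m, mr):          # one panel of mr rows at a time
        for p in range(k):              # walk the k dimension inside the panel
            for i in range(i0, i0 + mr):
                src = (p + i * lda) if transpose else (i + p * lda)
                packed.append(a[src])
    return packed
```

Both code paths produce the same packed buffer for the same logical operand, so a kernel reading the packed data does not need to know whether A was transposed.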
These real dgemm kernels are reused in the induced zgemm implementation, which is based on the 3m algorithm.
The 3m implementation in sqp packs all three matrices (A, B and C), splitting each into its real and imaginary components.
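For readers unfamiliar with 3m, the following sketch shows the standard identity it relies on: a complex product is formed from three real products instead of four. It uses plain nested lists and illustrative names (`matmul`, `zgemm_3m`); it is not the packed, vectorized code path of this PR.

```python
def matmul(a, b):
    """Triple-loop product of two matrices stored as nested row lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]


def zgemm_3m(a, b):
    """Complex product a @ b built from three real matrix products."""
    ar = [[z.real for z in row] for row in a]
    ai = [[z.imag for z in row] for row in a]
    br = [[z.real for z in row] for row in b]
    bi = [[z.imag for z in row] for row in b]
    # Sums of real and imaginary parts feed the third product.
    a_sum = [[r + i for r, i in zip(rr, ri)] for rr, ri in zip(ar, ai)]
    b_sum = [[r + i for r, i in zip(rr, ri)] for rr, ri in zip(br, bi)]

    p1 = matmul(ar, br)        # Ar * Br
    p2 = matmul(ai, bi)        # Ai * Bi
    p3 = matmul(a_sum, b_sum)  # (Ar + Ai) * (Br + Bi)

    # Cr = p1 - p2, Ci = p3 - p1 - p2
    return [[complex(p1[i][j] - p2[i][j], p3[i][j] - p1[i][j] - p2[i][j])
             for j in range(len(p1[0]))] for i in range(len(p1))]
```

The three real products are where real dgemm kernels can be reused; the extra additions and subtractions are cheap compared with the saved matrix product.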
The new 3m method offers up to 25% gain over the other zgemm implementations in BLIS for the targeted sizes.
Although the implementation in this PR is most efficient for small, square matrices, it exposes parameters that can be tuned for other sizes and shapes.
For both real and complex GEMM, multiplication by alpha or beta is replaced with addition or subtraction when the value is +/-1.
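A minimal sketch of this special-casing, applied elementwise to C = beta*C + alpha*(A*B), is shown below. The function name `update_c` and the flat-list layout are assumptions for illustration. The beta == 0 branch, which skips reading C entirely, is also an assumption here; it mirrors the "Beta = 0 and C = NAN" fix mentioned in the commit messages rather than reproducing it.

```python
def update_c(c, ab, alpha, beta):
    """Return beta * c + alpha * ab elementwise, avoiding multiplies for +/-1."""
    out = []
    for c_ij, ab_ij in zip(c, ab):
        if beta == 1.0:
            acc = c_ij            # no multiply needed
        elif beta == -1.0:
            acc = -c_ij           # sign flip instead of multiply
        elif beta == 0.0:
            acc = 0.0             # C is not read, so NaN in C cannot propagate
        else:
            acc = beta * c_ij

        if alpha == 1.0:
            acc += ab_ij          # add instead of multiply-add
        elif alpha == -1.0:
            acc -= ab_ij          # subtract instead of multiply-add
        else:
            acc += alpha * ab_ij
        out.append(acc)
    return out
```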
A basic multithreading implementation that distributes work items along the m dimension is included; it could be useful for large m with relatively small n and k. It is currently disabled because a more generic implementation, which needs further framework modification, is in progress.
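One way to split work along m while keeping every thread on whole 8-row blocks is sketched below. The function `partition_rows` and its balancing rule are hypothetical and are not taken from the PR.

```python
def partition_rows(m, num_threads, mr=8):
    """Split m rows into contiguous per-thread ranges aligned to mr rows."""
    if m % mr != 0:
        raise ValueError("this sketch assumes m is a multiple of mr")
    blocks = m // mr
    base, extra = divmod(blocks, num_threads)
    ranges = []
    start = 0
    for t in range(num_threads):
        nblk = base + (1 if t < extra else 0)  # first `extra` threads get one more block
        end = start + nblk * mr
        if nblk:
            ranges.append((start, end))        # half-open row range [start, end)
        start = end
    return ranges
```

Each range can then be processed independently, since every thread writes a disjoint set of rows of C.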
By default, k is not partitioned, since the focus is on small matrix sizes; as a result, the C matrix is loaded and stored only once.
The implementation does, however, allow k partitioning by changing the kx parameter.
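The trade-off between a single pass over k and a kx-partitioned loop can be illustrated with the sketch below. Only the parameter name `kx` comes from the description; the function `gemm_k_partitioned`, its nested-list layout, and the returned pass count are assumptions made for the example.

```python
def gemm_k_partitioned(a, b, c, kx=None):
    """Accumulate a @ b into c in place, splitting k into chunks of kx.

    Returns the number of passes over C. With kx equal to k (the default
    here), each element of C is loaded and stored exactly once.
    """
    m, k, n = len(a), len(b), len(b[0])
    kx = k if kx is None else kx
    passes = 0
    for p0 in range(0, k, kx):
        p1 = min(p0 + kx, k)
        for i in range(m):
            for j in range(n):
                acc = c[i][j]                  # load C once per k chunk
                for p in range(p0, p1):
                    acc += a[i][p] * b[p][j]
                c[i][j] = acc                  # store C once per k chunk
        passes += 1
    return passes
```

A smaller kx keeps the working set of A and B smaller at the cost of extra loads and stores of C, which matters less for the small sizes this PR targets.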
The complete implementation is lightweight and limited to two C files: one for the framework and one for the kernels.
New kernel sets can be added alongside the mx8 set currently in use (new kernel sets are WIP).

1. In zgemm, mkernel outperforms nkernel for both m > n and n > m.
2. Irrespective of mu and nu sizes, mkernel is forced for zgemm based on the analysis done.
Change-Id: Iafb7ddb2519c17cf2225da84d6cc74ed985cc21e
AMD-Internal: [CPUPL-1352]

1. The SquarePacked algorithm focuses on an efficient zgemm/dgemm implementation for square matrix sizes (m=k=n).
2. A variation of the 3m algorithm (3m_sqp) is implemented to allow a single load and store of the C matrix in the kernel.
3. Currently the method supports only m that is a multiple of 8. Residue cases are to be implemented later.
4. The real dgemm kernel (dgemm_sqp) is implemented without alpha and beta multiplication, since real alpha and beta scaling is handled in the 3m_sqp framework.
5. gemm_sqp supports dgemm when alpha = +/-1.0 and beta = 1.0.
Change-Id: I49becaf6079da4be29be5b06057ff4e50770a7d8
AMD-Internal: [CPUPL-1352]

1. Added comments.
AMD-Internal: [CPUPL-1429]
Change-Id: Ie37e24e58cd8bf836038a2258ebd09c3912fab9e

1. bli_malloc replaced with normal malloc plus address alignment within 3m_sqp.
2. Function added to pack A real, imag and sum.
3. Function added to pack B real, imag and sum.
4. Function added to pack C real, imag and handle beta.
5. Sum and sub vectorized.
AMD-Internal: [CPUPL-1352]
Change-Id: I514e9efb053d529caef2de413d74d0dac2ceca54

1. mx1, mx4 kernel addition and framework modification.
2. 8mx6n kernel addition.
3. NULL check added in dgemm_sqp malloc.
4. Mem tracing added.
5. Restricted 3m_sqp to limited matrix sizes.
6. Induced methods disabled temporarily for debug.
AMD-Internal: [CPUPL-1352]
Change-Id: I31671859b32bfbb359687fb7c9056f9eb904c8b2

1. Re-enabling 3m methods for zgemm.
2. Vectorization of pack_sum routines re-enabled with bug fix.
3. 8mx6n kernel added.
AMD-Internal: [CPUPL-1352]
Change-Id: Id9f010ba763afc52d268c2e68805f069919b8810

1. kx partitions added to the k loop for dgemm and zgemm.
2. mx-loop-based threading model added for dgemm as a prototype for zgemm.
3. nx loop added for 3m_sqp and dgemm_sqp.
4. Single 3m_sqp workspace allocation with smaller memory footprint.
5. sqp framework done from dgemm and zgemm.
6. sqp kernels moved to a separate kernel file.
7. Residue kernel core added to handle mx<8.
8. Multi-instance tuning for 3m_sqp done.
9. User can set env "BLIS_MULTI_INSTANCE" to 1 for better multi-instance behavior of 3m_sqp.
AMD-Internal: [CPUPL-1521]
Change-Id: Ibef50a8a37fe99f164edb4621acb44fc0c86514c

1. 3m_sqp support for A matrix with conjugate_no_transpose and conjugate_transpose added.
AMD-Internal: [CPUPL-1521]
Change-Id: Ie6e5c49cf86f7d3b95d78705cf445e57f20b3d1f

1. Induced method turned off until the path is fully tested for different alpha, beta conditions.
2. Fix for beta = 0 and C = NaN done.
Change-Id: I5a7bd1393ac245c2ebb72f9a634728af4c0d4000

1. New err_t param in bli_malloc_user added.
2. AOCL_DTL log removed.

This reverts commit 231a464.

madanm3 added 13 commits December 13, 2021 15:26

sup zgemm improvement

231a464


sqp commenting

5dc5ffa


Enabling 3m_sqp and 3m1 methods

acfec6a


3m_sqp conjugate support added

35ad5d8


Induced method turned off, fix for beta=0 & C = NAN

93e3d7a


compile error fixes

7cd7968


Revert "sup zgemm improvement"

59029ee


code clean and comments added

b3e82ba

bug fix and print added when bli_gemm_sqp fails

0f984c5
