squarePacked GEMM. #586

Open
wants to merge 13 commits into
base: master

madanm3
Contributor

@madanm3 madanm3 commented Dec 15, 2021

  • The squarePacked (sqp) GEMM framework targets real and complex GEMM on small square matrices (m = k = n <= 512, with m a multiple of 8).
  • The first-level implementation covers double precision and double-complex precision only.
  • The framework uses a set of column-major dgemm kernels, the mx8 set: {8mx6n, 8mx5n, 8mx4n, 8mx3n, 8mx2n, 8mx1n}.
  • In dgemm, the A matrix is always packed by default; for AtxB operations, A is transposed during packing.
  • These real dgemm kernels are reused in the induced zgemm implementation, which uses the 3m algorithm.
  • The 3m implementation in sqp packs all three matrices (A, B and C), splitting each into its real and imaginary components.
  • The new 3m method offers up to 25% gain over the other zgemm implementations in blis for the targeted sizes.
  • Although this PR is most efficient for small square matrices, the implementation allows its parameters to be tuned for other sizes and shapes.
  • For both real and complex GEMM, multiplication is replaced with addition or subtraction when alpha, beta = +/-1.
  • A basic multithreading implementation that distributes work items along the m dimension is included; it could be useful for large m with relatively small n and k. It is currently disabled because a more generic implementation, requiring further framework modification, is in progress.
  • By default, k is not partitioned, since the focus is on small matrix sizes. This gives a single load and store of the C matrix.
  • The implementation does, however, provide for k partitioning by changing the kx parameter.
  • The complete implementation is lightweight and limited to two C files: one for the framework and one for the kernels.
  • New kernel sets can be added alongside the mx8 set currently in use (new kernel sets are WIP).
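
For readers unfamiliar with the 3m idea mentioned above, here is a minimal sketch of how one complex product can be built from three real products on split real/imaginary planes. It is not the code in this PR: `real_gemm_ref` and `zgemm_3m_sketch` are hypothetical names, the real kernel is a plain triple loop rather than the mx8 kernels, and the workspace layout is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reference kernel: P = A * B for n x n column-major doubles. */
static void real_gemm_ref(size_t n, const double *a, const double *b, double *p)
{
    memset(p, 0, n * n * sizeof *p);
    for (size_t j = 0; j < n; ++j)
        for (size_t l = 0; l < n; ++l)
            for (size_t i = 0; i < n; ++i)
                p[i + j * n] += a[i + l * n] * b[l + j * n];
}

/* C += A * B for complex matrices held as separate real/imag planes.
 * work must provide 5 * n * n doubles of scratch space (assumed layout). */
static void zgemm_3m_sketch(size_t n,
                            const double *ar, const double *ai,
                            const double *br, const double *bi,
                            double *cr, double *ci, double *work)
{
    assert(work != NULL);
    size_t nn = n * n;
    double *a_sum = work;          /* Ar + Ai */
    double *b_sum = work + nn;     /* Br + Bi */
    double *p1 = work + 2 * nn;    /* Ar * Br */
    double *p2 = work + 3 * nn;    /* Ai * Bi */
    double *p3 = work + 4 * nn;    /* (Ar + Ai) * (Br + Bi) */

    for (size_t i = 0; i < nn; ++i) {
        a_sum[i] = ar[i] + ai[i];
        b_sum[i] = br[i] + bi[i];
    }

    /* Three real matrix products instead of the usual four. */
    real_gemm_ref(n, ar, br, p1);
    real_gemm_ref(n, ai, bi, p2);
    real_gemm_ref(n, a_sum, b_sum, p3);

    /* Cr += P1 - P2; Ci += P3 - P1 - P2. C is read and written once. */
    for (size_t i = 0; i < nn; ++i) {
        cr[i] += p1[i] - p2[i];
        ci[i] += p3[i] - p1[i] - p2[i];
    }
}
```

The final loop touches each element of C exactly once, which is the property the description refers to as a single load and store of C.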

1. In zgemm, mkernel outperforms nkernel for both m > n and n > m.
2. Based on this analysis, mkernel is forced for zgemm irrespective of the mu and nu sizes.

Change-Id: Iafb7ddb2519c17cf2225da84d6cc74ed985cc21e
AMD-Internal: [CPUPL-1352]
1. SquarePacked algorithm focuses on efficient zgemm/dgemm implementation for square matrix sizes (m=k=n)
2. A variation of the 3m algorithm (3m_sqp) is implemented to allow a single load and store of the C matrix in the kernel.
3. Currently the method supports only m that is a multiple of 8. Residue cases are to be implemented later.
4. The real dgemm kernel (dgemm_sqp) is implemented without alpha and beta multiplication,
    since real alpha and beta scaling is done in the 3m_sqp framework.
5. gemm_sqp supports dgemm when alpha = +/-1.0 and beta = 1.0.

Change-Id: I49becaf6079da4be29be5b06057ff4e50770a7d8
AMD-Internal: [CPUPL-1352]
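
As an illustration of the alpha = +/-1.0, beta = 1.0 case above, the following sketch shows how the update of C can avoid a multiply when alpha is exactly +1 or -1. `update_c_sketch` is a hypothetical helper, not a function from this PR, and it assumes the product A*B is already available in a buffer `ab`.

```c
#include <assert.h>
#include <stddef.h>

/* C := C + alpha * AB with beta fixed at 1.0 (assumption for this sketch).
 * alpha == +1 or -1 is handled with a plain add or subtract. */
static void update_c_sketch(size_t len, double alpha,
                            const double *ab, double *c)
{
    assert(ab != NULL && c != NULL);
    if (alpha == 1.0) {
        for (size_t i = 0; i < len; ++i)
            c[i] += ab[i];              /* add only, no multiply */
    } else if (alpha == -1.0) {
        for (size_t i = 0; i < len; ++i)
            c[i] -= ab[i];              /* subtract only, no multiply */
    } else {
        for (size_t i = 0; i < len; ++i)
            c[i] += alpha * ab[i];      /* general path */
    }
}
```
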
1. Added comments.

AMD-Internal: [CPUPL-1429]
Change-Id: Ie37e24e58cd8bf836038a2258ebd09c3912fab9e
1. bli_malloc replaced with normal malloc plus address alignment within 3m_sqp.
2. Function added to pack A real, imag and sum.
3. Function added to pack B real, imag and sum.
4. Function added to pack C real, imag and handle beta.
5. Sum and sub vectorized.

AMD-Internal: [CPUPL-1352]
Change-Id: I514e9efb053d529caef2de413d74d0dac2ceca54
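
A rough sketch of what packing "real, imag and sum" can look like for an interleaved double-complex panel is given below. The name `pack_split_sketch`, the argument order, and the contiguous output layout are assumptions for illustration; the packing routines in this PR are vectorized and may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Split an m x k column-major panel of interleaved complex doubles
 * (re, im, re, im, ...) into three contiguous real panels:
 * real part, imaginary part, and their sum (needed by the 3m product).
 * lda is the leading dimension of the source, in complex elements. */
static void pack_split_sketch(size_t m, size_t k,
                              const double *a, size_t lda,
                              double *a_re, double *a_im, double *a_sum)
{
    assert(lda >= m);
    for (size_t p = 0; p < k; ++p) {
        for (size_t i = 0; i < m; ++i) {
            double re = a[2 * (i + p * lda)];
            double im = a[2 * (i + p * lda) + 1];
            a_re[i + p * m]  = re;
            a_im[i + p * m]  = im;
            a_sum[i + p * m] = re + im;
        }
    }
}
```
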
1. mx1, mx4 kernel addition and framework modification.
2. 8mx6n kernel addition.
3. NULL check added in dgemm_sqp malloc.
4. mem tracing added.
5. Restricted 3m_sqp to limited matrix sizes.
6. Induced methods disabled temporarily for debug.

AMD-Internal: [CPUPL-1352]
Change-Id: I31671859b32bfbb359687fb7c9056f9eb904c8b2
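
To show how a kernel set with column widths 6 down to 1 can cover an arbitrary n, here is a small sketch that splits n into panel widths. The greedy widest-first order and the name `plan_column_panels` are assumptions; the PR does not state how the framework chooses between the 8mx6n ... 8mx1n kernels.

```c
#include <assert.h>
#include <stddef.h>

/* Split n columns into panels no wider than 6, widest first.
 * Writes the widths into widths[] and returns how many panels were planned.
 * Each width w in 1..6 would map to a hypothetical 8mx{w}n kernel. */
static size_t plan_column_panels(size_t n, size_t *widths, size_t max_panels)
{
    assert(widths != NULL);
    size_t count = 0;
    while (n > 0 && count < max_panels) {
        size_t w = (n < 6) ? n : 6;
        widths[count++] = w;
        n -= w;
    }
    return count;
}
```

For example, n = 16 is covered by two 6-wide panels and one 4-wide panel under this scheme.
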
1. Re-enabling 3m methods for zgemm.
2. Vectorization of pack_sum routines re-enabled with bug fix.
3. 8mx6n kernel added.

AMD-Internal: [CPUPL-1352]
Change-Id: Id9f010ba763afc52d268c2e68805f069919b8810
1. kx partitions added to k loop for dgemm and zgemm.
2. mx loop based threading model added for dgemm as prototype of zgemm.
3. nx loop added for 3m_sqp and dgemm_sqp.
4. single 3m_sqp workspace allocation with smaller memory footprint.
5. sqp framework done from dgemm and zgemm.
6. sqp kernels moved to separate kernel file.
7. residue kernel core added to handle mx<8.
8. multi-instance tuning for 3m_sqp done.
9. The user can set the environment variable "BLIS_MULTI_INSTANCE" to 1 for better multi-instance behavior of 3m_sqp.

AMD-Internal: [CPUPL-1521]
Change-Id: Ibef50a8a37fe99f164edb4621acb44fc0c86514c
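
The kx partitioning of the k loop can be pictured with the sketch below. It is a plain reference loop, not the sqp kernels, and `gemm_k_partitioned_sketch` is a hypothetical name. Treating kx = 0 as "no partition" is also an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B with the k dimension processed in chunks of kx.
 * A is m x k, B is k x n, C is m x n, all column major and contiguous.
 * kx == 0 or kx >= k means a single pass, so C is updated only once. */
static void gemm_k_partitioned_sketch(size_t m, size_t n, size_t k, size_t kx,
                                      const double *a, const double *b,
                                      double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    if (kx == 0 || kx > k)
        kx = k;
    for (size_t p0 = 0; p0 < k; p0 += kx) {
        size_t kb = (k - p0 < kx) ? (k - p0) : kx;   /* last chunk may be short */
        for (size_t j = 0; j < n; ++j)
            for (size_t p = p0; p < p0 + kb; ++p)
                for (size_t i = 0; i < m; ++i)
                    c[i + j * m] += a[i + p * m] * b[p + j * k];
    }
}
```

With more than one chunk, C is revisited once per chunk, which is the cost that the default (no k partition) avoids for small sizes.
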
1. 3m_sqp support for A matrix with conjugate_no_transpose and conjugate_transpose added.

AMD-Internal: [CPUPL-1521]
Change-Id: Ie6e5c49cf86f7d3b95d78705cf445e57f20b3d1f
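
One way to support conjugate_no_transpose and conjugate_transpose for A is to fold both operations into the packing step, as sketched below. `pack_op_a_sketch` and its integer flags are hypothetical; this PR may handle the cases differently.

```c
#include <assert.h>
#include <stddef.h>

/* Pack op(A) (m x k) from interleaved complex storage into split planes.
 * trans != 0 reads A(p, i) instead of A(i, p);
 * conj  != 0 negates the imaginary part.
 * lda is the leading dimension of the stored A, in complex elements. */
static void pack_op_a_sketch(size_t m, size_t k,
                             const double *a, size_t lda,
                             int trans, int conj,
                             double *a_re, double *a_im)
{
    assert(a != NULL && a_re != NULL && a_im != NULL);
    for (size_t p = 0; p < k; ++p) {
        for (size_t i = 0; i < m; ++i) {
            size_t src = trans ? (p + i * lda) : (i + p * lda);
            double im = a[2 * src + 1];
            a_re[i + p * m] = a[2 * src];
            a_im[i + p * m] = conj ? -im : im;
        }
    }
}
```
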
1. Induced method turned off until the path is fully tested for different alpha, beta conditions.
2. Fix done for beta = 0 with C = NaN.

Change-Id: I5a7bd1393ac245c2ebb72f9a634728af4c0d4000
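
The beta = 0 case matters because 0 * NaN is NaN in IEEE arithmetic, so scaling C by beta would keep any NaN already present in C. The sketch below shows the usual remedy of overwriting C instead of scaling it; `scale_c_sketch` is a hypothetical helper, and the actual fix in this PR is not shown here.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Demonstrates the problem: multiplying NaN by 0.0 still yields NaN. */
static int naive_scaling_keeps_nan(void)
{
    double c = NAN;
    c *= 0.0;
    return isnan(c);
}

/* Apply beta to C before accumulation.
 * beta == 0 overwrites C so that NaN or Inf in the input cannot leak through;
 * beta == 1 leaves C untouched; anything else scales. */
static void scale_c_sketch(size_t len, double beta, double *c)
{
    assert(c != NULL || len == 0);
    if (beta == 0.0) {
        for (size_t i = 0; i < len; ++i)
            c[i] = 0.0;
    } else if (beta != 1.0) {
        for (size_t i = 0; i < len; ++i)
            c[i] *= beta;
    }
}
```
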
1. New err_t param in bli_malloc_user added.
2. AOCL_DTL log removed.