Armv8-A Row-major Kernel Improvements #698

Open: 5 commits to be merged into base: master

xrq-phys (Collaborator) commented Dec 16, 2022

Status
This is an 8x6 row-major kernel for ARMv8-A. Its internal structure is essentially the same as that of the current 6x8 column-preferring kernel.
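One way to see why an 8x6 row-major kernel can mirror a 6x8 column-major one: a row-major 8x6 tile of C = A·B occupies memory exactly like a column-major 6x8 tile of Cᵀ = Bᵀ·Aᵀ. The plain-C sketch below illustrates that identity only. The function names and the packed-panel layout (8 elements of A and 6 elements of B per k step) are assumptions for illustration, not the PR's assembly, and alpha/beta scaling is left out.

```c
#include <stddef.h>

/* Row-major 8x6 tile: c[i*6 + j] = sum_p a[p*8 + i] * b[p*6 + j].
 * a holds 8 values per k step, b holds 6 values per k step. */
void tile_8x6_row_major(size_t k, const double *a, const double *b, double *c)
{
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 6; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += a[p * 8 + i] * b[p * 6 + j];
            c[i * 6 + j] = sum;
        }
}

/* Column-major 6x8 tile: c[i + j*6] = sum_p a[p*6 + i] * b[p*8 + j].
 * Here a holds 6 values per k step, b holds 8 values per k step. */
void tile_6x8_col_major(size_t k, const double *a, const double *b, double *c)
{
    for (int j = 0; j < 8; ++j)
        for (int i = 0; i < 6; ++i) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += a[p * 6 + i] * b[p * 8 + j];
            c[i + j * 6] = sum;
        }
}
```

Calling `tile_6x8_col_major(k, b, a, c)` with the two packed panels swapped writes the same 48 values to the same addresses as `tile_8x6_row_major(k, a, b, c)`.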

Updates

  • Instead of clearing the C-microtile registers at the beginning of the assembly, execute the first k iteration with fmul instead of fmla. The code path inside the assembly is arranged so that this adds (essentially) no extra branching cost.
  • Scatter the prefetch instructions for C throughout the microkernel loops.
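The two updates above can be sketched in plain C. This is a minimal illustration, not the PR's assembly: the function names, the packed layout, and the one-row-per-iteration prefetch schedule are assumptions, `__builtin_prefetch` is a GCC/Clang builtin, and alpha/beta scaling is omitted. The first variant zeroes the accumulators and then accumulates every k step; the second initializes them with a plain multiply on the first k step and spreads the prefetches of C over the following iterations.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 8, NR = 6 };

/* Reference: clear the accumulators, then accumulate on every k step
 * (the "zero registers, then fmla" scheme). */
void ukr_zero_then_fma(size_t k, const double *a, const double *b, double *c)
{
    double acc[MR][NR] = {{0.0}};
    for (size_t p = 0; p < k; ++p)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += a[p * MR + i] * b[p * NR + j];
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            c[i * NR + j] = acc[i][j];
}

/* Variant: the first k step writes the accumulators with a plain multiply
 * (the fmul analogue), so no separate clearing pass is needed. */
void ukr_mul_first(size_t k, const double *a, const double *b, double *c)
{
    assert(k >= 1); /* assumption: the k == 0 case is handled elsewhere */
    double acc[MR][NR];
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            acc[i][j] = a[i] * b[j];
    for (size_t p = 1; p < k; ++p) {
#if defined(__GNUC__)
        /* Hypothetical schedule: touch one row of C per k step rather than
         * issuing all C prefetches in a single burst. */
        if (p - 1 < MR)
            __builtin_prefetch(&c[(p - 1) * NR], 1, 3);
#endif
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += a[p * MR + i] * b[p * NR + j];
    }
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            c[i * NR + j] = acc[i][j];
}
```

Both functions produce the same tile for k >= 1; the variant simply drops the MR x NR clearing stores. In this C sketch the prefetch guard is an ordinary branch, whereas unrolled assembly would place each prefetch at a fixed position in the loop body.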

Restrictions
This kernel assumes that the hardware prefetches the packed A/B blocks, so as not to burden the pipeline with additional instructions or the DMA with additional loads.
Older chips such as ThunderX2 may perform poorly with it, since they may have no hardware prefetching at all, while newer ones such as Amazon's C6g tend to do better.

This update also contains changes that are, to some extent, prerequisites for my gemmsup+packm work here, which I'd also like to merge later as a BLIS sandbox.

- Only DGEMM at this moment.
- Prefetch whole lines.
- Scatter prefetching insts.
Instead of clearing C rows, deploy first-k FMUL so that instructions are saved.
Instead of loading from stack, directly pass regs in.
Arm64 has 30 regs for use. This may or may not speed things up a tiny bit.
Forgot to commit header for ad73717.
- Init k-loop clears C.
- Scattered C preloading.
fgvanzee (Member) commented

Thanks @xrq-phys! I've asked Jeff to take a look at the new kernel for feedback. I think he and his application could stand to benefit from this, given the inherent advantage row-preferring kernels have with left-sided trsm (which is the only trsm code path that BLIS implements).

Happy holidays! 🎄 🎁 🍾

GodTamIt commented Oct 3, 2023

Hi there, I know this is a bit old, but I came across this change from this paper.

I was just wondering what the status is for getting this (and other changes) merged upstream, and whether there is a plan to do so.

fgvanzee (Member) commented Oct 3, 2023

Hey @GodTamIt, thanks for your inquiry. I guess we're still waiting on @jdiamondGitHub to look over this PR. I'll reach out to him separately as well.

fgvanzee added a commit that referenced this pull request Oct 6, 2023
Details:
- Integrated changes from PR #698 to enable testing in the context of
  the 'stable' branch. These changes add row-preferential sgemm and
  dgemm microkernels for the armv8a kernel set.
- Updated the 'altra' subconfig to easily switch between the previous
  (column-preferential) ukernel and the aforementioned row-pref ukernel.