Move +dotprod AArch64 intrinsics behind compiler options#2862
Move +dotprod AArch64 intrinsics behind compiler options#2862MDevereau wants to merge 1 commit intogoogle:masterfrom
Conversation
Move +dotprod AArch64 intrinsics behind compiler options. Also remove their reliance on NEON_BF16 to be enabled on other NEON targets.
|
@MDevereau and @jan-wassenberg - Should DotProd and I8MM be enabled by default for the NEON_BF16 target on Apple OS's as all Apple Silicon AArch64 CPU's that support the AArch64 BF16 extension (Apple A15 and later and Apple M2 and later) are known to have support for both DotProd and I8MM according to https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64Processors.td? All Apple Silicon Macs (but not the iPhone X/XR/XS or earlier or 8th generation iPad or earlier) have support for the AArch64 DotProd extension, including Apple Silicon Macs with the Apple M1 CPU, according to https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64Processors.td. Should HWY_NEON be the default baseline target for Apple OS's as all Apple Silicon CPU's support the AArch64 AES extension (even the Apple A7 CPU found on the first Apple devices with AArch64 CPU's such as the iPhone 5S, iPad Mini 2, and original iPad Air) according to https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64Processors.td (but there are some non-Apple Silicon AArch64 CPU's out there that do not support the AArch64 AES extension)? |
| // ------------------------------ SumOfMulQuadAccumulate | ||
|
|
||
| #if HWY_TARGET == HWY_NEON_BF16 | ||
| #ifdef __ARM_FEATURE_DOTPROD |
There was a problem hiding this comment.
Unfortunately, we are rarely able to add flags to builds.
Hence we'd like to detect what's available at runtime and then dispatch to various targets.
We can certainly discuss what features map to which targets, and there's perhaps also space for another target.
What's the main driver for you here: do we want to add dot to NEON, or remove it from NEON_BF16?
Depending on the answer: would it work if we defined(feature) || TARGET==NEON_BF16?
There was a problem hiding this comment.
We want to remove its dependance on NEON_BF16.
We have seen issues with certain HWCAPS being reported as unavailable in certain circumstances, which can result in the entire target of NEON_BF16 being denied because of the absence of i8mm. There are also cores such as the Apple A14 which (according to LLVM) have dotprod and aes, but not bf16 and i8mm which contributes to the problem of some cores being gated out of their full instruction set.
defined(feature) || TARGET==NEON_BF16 sounds like it would work.
Rather than create another target based flag combinations, would it make sense to have a target for specific cores? That way, a baseline NEON target could have all features optional for customisation, but a neoverse-v2 core target could still use runtime detection to enable its full instruction set.
There was a problem hiding this comment.
Got it. I believe John has already fixed the issue of the A14.
It sounds like defined(feature) || TARGET==NEON_BF16 will help but not be a full solution for what you want (allow bf16 without i8mm). We'd at least have to add another target.
We're open to suggestions on what combination of features to group into a target. On x86 in practice we also have avx2=Haswell, avx3=Skylake, avx3_DL=Icelake, and an explicit SPR and Zen4+ target.
The Arm bitfield has 3 free spots below HWY_SVE2_128, and arguably 2 more which we could reclaim by rearranging/compacting gaps around HWY_NEON_BF16. The last time we did such a reordering, I do not recall any reported issues.
Just bear in mind that a lower bit value means better, and the lowest one reported as supported will be used. This leads to an interesting question: if we have i8mm\BF16, and BF16\i8mm, which is better?
Thanks for checking that. Makes sense to me. I'm happy to enable as much as we can without requiring build flags, which as mentioned are problematic in our setting.
Also sounds good to me.
This as well :) |
|
A neon target with i8mm would be very helpful for scann, https://github.com/google-research/google-research/tree/master/scann. Either NEON_BF16 or a new target. i8mm is a big win there, but we don't have any easy way to set build flags (-march +i8mm). |
|
ACK, thanks. @MDevereau any thoughts on which targets we should add? |
|
II don't have any particular targets in mind, at least not for now. This PR was mainly about making +dotprod intrinsics available from both NEON and NEON_BF16 individually for testing different combinations, and to streamline it to be similar to how SVE incorporates its bf16 instructions. If this isn't worth the trade-off for additional flags and performance then I'll close it with no further asks. |
|
Got it, thanks. We're happy to enable this change/your PR in modified form, if you change the if(feature) to |
Move +dotprod AArch64 intrinsics behind compiler options. Also remove their reliance on NEON_BF16 to be enabled on other NEON targets.