Teach the SIMD metadata group match to defer masking #4595
Conversation
Generally LG, though I have some comments since I can't poke at this too much due to the lack of a Neon machine.
if constexpr (ByteEncodingMask != 0) {
  // Apply an increment mask to the bits first. This is used with the byte
  // encoding when the mask isn't needed until we begin incrementing.
  static_assert(BitIndexT::ByteEncoding);
I'm looking at this because it's the only use of BitIndexT::ByteEncoding (the surrounding code doesn't access it outside the static_assert). I believe that because this is inside if constexpr, removing it is not a compile failure on most platforms (i.e., I'm on x86 and can freely revert the static constexpr bool addition without a compile error).
Had you considered shifting this to make it a compile error, like a class-level static_assert(ByteEncodingMask == 0 || BitIndexT::ByteEncoding); or maybe something with requires?
Good idea, done (with requires as that seems cleaner)!
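As a rough illustration of that suggestion (a minimal sketch with assumed names and a stand-in index type, not the actual Carbon code), a class-level requires clause makes the mismatch a compile error on every platform, even when the if constexpr branch is never instantiated:

#include <cstdint>

// Minimal stand-in for the real bit-index type; only the members the
// constraint needs are shown.
struct ByteEncodedBitIndex {
  using BitsT = uint64_t;
  static constexpr bool ByteEncoding = true;
};

// The class-level constraint is checked whenever the specialization is
// named, on every platform, rather than only when the masking branch of
// the `if constexpr` happens to be instantiated.
template <typename BitIndexT, typename BitIndexT::BitsT ByteEncodingMask>
  requires(ByteEncodingMask == 0 || BitIndexT::ByteEncoding)
class BitIndexRange {
  // ...
};

// OK: a non-zero mask is only accepted together with a byte-encoded index.
using MaskedRange =
    BitIndexRange<ByteEncodedBitIndex, 0x8080'8080'8080'8080ULL>;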
template <typename FriendBitIndexT,
          FriendBitIndexT::BitsT FriendByteEncodingMask>
friend class BitIndexRange;
How is this used? Is it something specific to Arm? I was messing around and it causes issues for requires due to the different template argument names; I tried deleting it and that worked fine, but maybe it's something with SIMDMatchPresent? Maybe it's worth a comment and/or something that causes a cross-platform compilation error?
It's the heterogeneous operator== above that ends up requiring this. I've just added the requires to both.
Ah, I see now on the build bots. I think this is a problem with Clang-16, sadly. I'll use a static_assert variation, I guess.
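For context, here is a hedged sketch (with illustrative member names, not the real code) of why a heterogeneous operator== forces the cross-instantiation friend declaration: different instantiations of a class template are unrelated types, so one cannot read another's private members without it.

#include <cstdint>

// Minimal stand-in for the real bit-index type.
struct ByteEncodedBitIndex {
  using BitsT = uint64_t;
  static constexpr bool ByteEncoding = true;
};

template <typename BitIndexT, typename BitIndexT::BitsT ByteEncodingMask>
class BitIndexRange {
 public:
  explicit BitIndexRange(typename BitIndexT::BitsT bits) : bits_(bits) {}

  // Heterogeneous comparison: the right-hand side may use a different mask,
  // so this member must read that other specialization's private `bits_`.
  template <typename BitIndexT::BitsT OtherMask>
  auto operator==(const BitIndexRange<BitIndexT, OtherMask>& rhs) const
      -> bool {
    return bits_ == rhs.bits_;
  }

  // Without this, the access to `rhs.bits_` above is ill-formed: each
  // instantiation is a distinct, unrelated class type.
  template <typename FriendBitIndexT,
            typename FriendBitIndexT::BitsT FriendByteEncodingMask>
  friend class BitIndexRange;

 private:
  typename BitIndexT::BitsT bits_;
};

// Usage: comparing ranges instantiated with different masks.
inline auto SameBits(
    const BitIndexRange<ByteEncodedBitIndex, 0x8080'8080'8080'8080ULL>& lhs,
    const BitIndexRange<ByteEncodedBitIndex, 0>& rhs) -> bool {
  return lhs == rhs;
}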
// Return whichever result we're using. This uses an invoked lambda to deduce
// the type from only the selected return statement, allowing them to be
// different types.
return [&] {
  if constexpr (UseSIMD) {
    return simd_result;
  } else {
    return portable_result;
  }
}();
Given you do this twice, and it's kind of subtle, had you considered a helper function? i.e., something like:
// Return whichever result we're using. This uses an invoked lambda to deduce
// the type from only the selected return statement, allowing them to be
// different types.
template <bool If, typename ThenT, typename ElseT>
inline auto ConstexprTernary(ThenT then_val, ElseT else_val) -> auto {
return [&] {
if constexpr (If) {
return then_val;
} else {
return else_val;
}
}();
}
While thinking about this, I was also wondering whether there was a good template solution, which got me thinking about requires. So here's that thought:
// Behaves as a ternary, but allowing different types on the return.
template <bool If, typename ThenT, typename ElseT> requires (If)
inline auto ConstexprTernary(ThenT then_val, ElseT /*else_val*/) -> ThenT {
return then_val;
}
template <bool If, typename ThenT, typename ElseT> requires (!If)
inline auto ConstexprTernary(ThenT /*then_val*/, ElseT else_val) -> ElseT {
return else_val;
}
Allowing (either way):
return ConstexprTernary<UseSIMD>(simd_result, portable_result);
Sure, done. Went with if constexpr as I try to use that over overloads whenever I can, since it seems conceptually simpler. But there's no need for the nested lambda once it's in a template function.
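The resulting helper is presumably something along these lines (a sketch reusing the ConstexprTernary name from the suggestion above; the actual signature may differ). Inside a function template, the discarded branch of if constexpr is never instantiated, so return-type deduction only sees the selected return statement and the wrapping lambda becomes unnecessary:

// Behaves as a ternary, but allows the two results to have different types.
template <bool If, typename ThenT, typename ElseT>
inline auto ConstexprTernary(ThenT then_val, ElseT else_val) {
  if constexpr (If) {
    return then_val;
  } else {
    return else_val;
  }
}

// Usage at the call sites described above:
//   return ConstexprTernary<UseSIMD>(simd_result, portable_result);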
force-pushed from 1fb70a7 to e20f161
Co-authored-by: Danila Kutenin <[email protected]>
Thanks, I think all suggestions are done, so merging!
When using a byte encoding for matched group metadata, we need to mask down to a single bit in each matching byte to make iteration over a range of match indices work. In most cases this mask can be folded into the overall match computation, but for Arm Neon there is avoidable overhead from it. Instead, we can defer the mask until we start iterating. Doing more than one iteration is relatively rare, so this doesn't accumulate much waste and makes the common paths a bit faster.
For the M1, this makes the SIMD match path about 2-4% faster. That isn't enough to catch the portable match code path on the M1, though.
For some Neoverse cores the difference is more significant (>10% improvement), and it makes the SIMD and scalar code paths comparable in latency. It's still not clear which is better, as the latency is comparable and beyond latency the factors are very hard to analyze -- port pressure on different parts of the CPU, etc.
Leaving the selected code path as portable since that's so much better on the M1, and I'm hoping to avoid different code paths for different Arm CPUs for a while.
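As a toy illustration of the deferral (a sketch of the idea only, not the actual Neon match code): with the byte encoding, a matching byte can have many bits set, and only the iteration step needs that reduced to one bit per byte, so the mask can be applied lazily.

#include <bit>
#include <cstdint>

// One bit per byte; this is the kind of mask the iteration needs.
constexpr uint64_t kByteMask = 0x8080'8080'8080'8080ULL;

// The first match index never needs the mask: counting trailing zero bits
// and dividing by the byte width is unaffected by how many bits are set
// within the matching byte. Requires match_bits != 0.
inline auto FirstMatchByteIndex(uint64_t match_bits) -> int {
  return std::countr_zero(match_bits) / 8;
}

// Only when iteration continues do we reduce to one bit per byte and clear
// the lowest set bit to step past the first match. This is where the
// deferred mask is finally paid for.
inline auto NextMatchBits(uint64_t match_bits) -> uint64_t {
  uint64_t masked = match_bits & kByteMask;
  return masked & (masked - 1);
}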