Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Teach the SIMD metadata group match to defer masking #4595

Merged
merged 4 commits into from
Dec 20, 2024

Conversation

chandlerc
Copy link
Contributor

When using a byte-encoding for matched group metadata we need to mask down to a single bit in each matching byte to make the iteration of a range of match indices work. In most cases, this mask can be folded into the overall match computation, but for Arm Neon, there is avoidable overhead from this. Instead, we can defer the mask until starting to iterate. Doing more than one iteration is relative rare so this doesn't accumulate much waste and makes common paths a bit faster.

For the M1 this makes the SIMD match path about 2-4% faster. This isn't enough to catch the portable match code path on the M1 though.

For some Neoverse cores the difference here is more significant (>10% improvement) and it makes the SIMD and scalar code paths have comparable latency. Still not clear which is better as the latency is comparable and beyond latency the factors are very hard to analyze -- port pressure on different parts of the CPU, etc.

Leaving the selected code path as portable since that's so much better on the M1, and I'm hoping to avoid different code paths for different Arm CPUs for a while.

Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LG, though some comments because I can't poke too much on due to lack of a neon machine.

if constexpr (ByteEncodingMask != 0) {
// Apply an increment mask to the bits first. This is used with the byte
// encoding when the mask isn't needed until we begin incrementing.
static_assert(BitIndexT::ByteEncoding);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking at this because it's the only use of BitIndexT::ByteEncoding (the surrounding code doesn't access it outside the static_assert). I believe that because this is inside if constexpr removing it is not a compile failure on most platforms (i.e., I'm on x86, I can freely revert the static constexpr bool addition without a compile error).

Had you considered shifting this to make it a compile error, like a class-level static_assert(ByteEncodingMask == 0 || BitIndexT::ByteEncoding); or maybe something with requires?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done (with requires as that seems cleaner)!

Comment on lines +275 to +278
template <typename FriendBitIndexT,
FriendBitIndexT::BitsT FriendByteEncodingMask>
friend class BitIndexRange;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this used? Is it something specific to arm? I was messing around and it causes issues for requires due to the different template argument names; I tried deleting it and that worked fine, but maybe it's something with SIMDMatchPresent? Maybe something suitable for comments and/or something to cause a cross-platform compilation error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the operator== above that is heterogenous that ends up requiring this.

I've just added the requires to both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see now on the build bots. I think this is a problem with Clang-16 sadly. I'll use a static_assert variation I guess.

Comment on lines 628 to 646
// Return whichever result we're using. This uses an invoked lambda to deduce
// the type from only the selected return statement, allowing them to be
// different types.
return [&] {
if constexpr (UseSIMD) {
return simd_result;
} else {
return portable_result;
}
}();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given you do this twice, and it's kind of subtle, had you considered a helper function? i.e., something like:

// Return whichever result we're using. This uses an invoked lambda to deduce
// the type from only the selected return statement, allowing them to be
// different types.
template <bool If, typename ThenT, typename ElseT>
inline auto ConstexprTernary(ThenT then_val, ElseT else_val) -> auto {
  return [&] {
    if constexpr (If) {
      return then_val;
    } else {
      return else_val;
    }
  }();
}

While thinking about this, I also was wondering whether there was a good template solution, which got me thinking about requires. So here's that thought:

// Behaves as a ternary, but allowing different types on the return.
template <bool If, typename ThenT, typename ElseT> requires (If)
inline auto ConstexprTernary(ThenT then_val, ElseT /*else_val*/) -> ThenT {
  return then_val;
}
template <bool If, typename ThenT, typename ElseT> requires (!If)
inline auto ConstexprTernary(ThenT /*then_val*/, ElseT else_val) -> ElseT {
  return else_val;
}

Allowing (either way):

return ConstexprTernary<UseSIMD>(simd_result, portable_result);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done. Went with if constexpr as I try to use that over overloads whenever I can as it seems conceptually simpler. But no need for the nested lambda once its in a template function.

chandlerc and others added 2 commits December 20, 2024 01:52
When using a byte-encoding for matched group metadata we need to mask
down to a single bit in each matching byte to make the iteration of
a range of match indices work. In most cases, this mask can be folded
into the overall match computation, but for Arm Neon, there is avoidable
overhead from this. Instead, we can defer the mask until starting to
iterate. Doing more than one iteration is relative rare so this doesn't
accumulate much waste and makes common paths a bit faster.

For the M1 this makes the SIMD match path about 2-4% faster. This isn't
enough to catch the portable match code path on the M1 though.

For some Neoverse cores the difference here is more significant (>10%
improvement) and it makes the SIMD and scalar code paths have comparable
latency. Still not clear which is better as the latency is comparable
and beyond latency the factors are very hard to analyze -- port pressure
on different parts of the CPU, etc.

Leaving the selected code path as portable since that's so much better
on the M1, and I'm hoping to avoid different code paths for different
Arm CPUs for a while.

Co-authored-by: Danila Kutenin <[email protected]>
Copy link
Contributor Author

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think all suggestion done so merging!

if constexpr (ByteEncodingMask != 0) {
// Apply an increment mask to the bits first. This is used with the byte
// encoding when the mask isn't needed until we begin incrementing.
static_assert(BitIndexT::ByteEncoding);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done (with requires as that seems cleaner)!

Comment on lines +275 to +278
template <typename FriendBitIndexT,
FriendBitIndexT::BitsT FriendByteEncodingMask>
friend class BitIndexRange;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the operator== above that is heterogenous that ends up requiring this.

I've just added the requires to both.

Comment on lines 628 to 646
// Return whichever result we're using. This uses an invoked lambda to deduce
// the type from only the selected return statement, allowing them to be
// different types.
return [&] {
if constexpr (UseSIMD) {
return simd_result;
} else {
return portable_result;
}
}();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done. Went with if constexpr as I try to use that over overloads whenever I can as it seems conceptually simpler. But no need for the nested lambda once its in a template function.

@chandlerc chandlerc added this pull request to the merge queue Dec 20, 2024
Merged via the queue into carbon-language:trunk with commit 3ae968a Dec 20, 2024
7 of 8 checks passed
@chandlerc chandlerc deleted the simd-tweak branch December 20, 2024 04:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants