Teach the SIMD metadata group match to defer masking #4595
Conversation
Generally LG, though I have some comments since I can't poke at this too much due to the lack of a Neon machine.
if constexpr (ByteEncodingMask != 0) {
  // Apply an increment mask to the bits first. This is used with the byte
  // encoding when the mask isn't needed until we begin incrementing.
  static_assert(BitIndexT::ByteEncoding);
I'm looking at this because it's the only use of BitIndexT::ByteEncoding (the surrounding code doesn't access it outside the static_assert). I believe that because this is inside if constexpr, removing it is not a compile failure on most platforms (i.e., I'm on x86 and can freely revert the static constexpr bool addition without a compile error).
Had you considered shifting this to make it a compile error, like a class-level static_assert(ByteEncodingMask == 0 || BitIndexT::ByteEncoding); or maybe something with requires?
Good idea, done (with requires as that seems cleaner)!
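As a rough illustration of that suggestion (a minimal sketch with assumed names and a stand-in index type, not the actual Carbon code), a class-level requires clause makes the mismatch a compile error on every platform, even when the if constexpr branch is never instantiated:

#include <cstdint>

// Minimal stand-in for the real bit-index type; only the members the
// constraint needs are shown.
struct ByteEncodedBitIndex {
  using BitsT = uint64_t;
  static constexpr bool ByteEncoding = true;
};

// The class-level constraint is checked whenever the specialization is
// named, on every platform, rather than only when the masking branch of
// the `if constexpr` happens to be instantiated.
template <typename BitIndexT, typename BitIndexT::BitsT ByteEncodingMask>
  requires(ByteEncodingMask == 0 || BitIndexT::ByteEncoding)
class BitIndexRange {
  // ...
};

// OK: a non-zero mask is only accepted together with a byte-encoded index.
using MaskedRange =
    BitIndexRange<ByteEncodedBitIndex, 0x8080'8080'8080'8080ULL>;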
template <typename FriendBitIndexT,
          FriendBitIndexT::BitsT FriendByteEncodingMask>
friend class BitIndexRange;
How is this used? Is it something specific to Arm? I was messing around and it causes issues for requires due to the different template argument names; I tried deleting it and that worked fine, but maybe it's something with SIMDMatchPresent? Maybe it's worth a comment and/or something that causes a cross-platform compilation error?
It's the heterogeneous operator== above that ends up requiring this. I've just added the requires to both.
Ah, I see now on the build bots. I think this is a problem with Clang-16, sadly. I'll use a static_assert variation, I guess.
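For context, here is a hedged sketch (with illustrative member names, not the real code) of why a heterogeneous operator== forces the cross-instantiation friend declaration: different instantiations of a class template are unrelated types, so one cannot read another's private members without it.

#include <cstdint>

// Minimal stand-in for the real bit-index type.
struct ByteEncodedBitIndex {
  using BitsT = uint64_t;
  static constexpr bool ByteEncoding = true;
};

template <typename BitIndexT, typename BitIndexT::BitsT ByteEncodingMask>
class BitIndexRange {
 public:
  explicit BitIndexRange(typename BitIndexT::BitsT bits) : bits_(bits) {}

  // Heterogeneous comparison: the right-hand side may use a different mask,
  // so this member must read that other specialization's private `bits_`.
  template <typename BitIndexT::BitsT OtherMask>
  auto operator==(const BitIndexRange<BitIndexT, OtherMask>& rhs) const
      -> bool {
    return bits_ == rhs.bits_;
  }

  // Without this, the access to `rhs.bits_` above is ill-formed: each
  // instantiation is a distinct, unrelated class type.
  template <typename FriendBitIndexT,
            typename FriendBitIndexT::BitsT FriendByteEncodingMask>
  friend class BitIndexRange;

 private:
  typename BitIndexT::BitsT bits_;
};

// Usage: comparing ranges instantiated with different masks.
inline auto SameBits(
    const BitIndexRange<ByteEncodedBitIndex, 0x8080'8080'8080'8080ULL>& lhs,
    const BitIndexRange<ByteEncodedBitIndex, 0>& rhs) -> bool {
  return lhs == rhs;
}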
// Return whichever result we're using. This uses an invoked lambda to deduce
// the type from only the selected return statement, allowing them to be
// different types.
return [&] {
  if constexpr (UseSIMD) {
    return simd_result;
  } else {
    return portable_result;
  }
}();
Given you do this twice, and it's kind of subtle, had you considered a helper function? i.e., something like:
// Return whichever result we're using. This uses an invoked lambda to deduce
// the type from only the selected return statement, allowing them to be
// different types.
template <bool If, typename ThenT, typename ElseT>
inline auto ConstexprTernary(ThenT then_val, ElseT else_val) -> auto {
return [&] {
if constexpr (If) {
return then_val;
} else {
return else_val;
}
}();
}
While thinking about this, I was also wondering whether there was a good template solution, which got me thinking about requires. So here's that thought:
// Behaves as a ternary, but allowing different types on the return.
template <bool If, typename ThenT, typename ElseT> requires (If)
inline auto ConstexprTernary(ThenT then_val, ElseT /*else_val*/) -> ThenT {
return then_val;
}
template <bool If, typename ThenT, typename ElseT> requires (!If)
inline auto ConstexprTernary(ThenT /*then_val*/, ElseT else_val) -> ElseT {
return else_val;
}
Allowing (either way):
return ConstexprTernary<UseSIMD>(simd_result, portable_result);
Sure, done. Went with if constexpr as I try to use that over overloads whenever I can, since it seems conceptually simpler. But there's no need for the nested lambda once it's in a template function.
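The resulting helper is presumably something along these lines (a sketch reusing the ConstexprTernary name from the suggestion above; the actual signature may differ). Inside a function template, the discarded branch of if constexpr is never instantiated, so return-type deduction only sees the selected return statement and the wrapping lambda becomes unnecessary:

// Behaves as a ternary, but allows the two results to have different types.
template <bool If, typename ThenT, typename ElseT>
inline auto ConstexprTernary(ThenT then_val, ElseT else_val) {
  if constexpr (If) {
    return then_val;
  } else {
    return else_val;
  }
}

// Usage at the call sites described above:
//   return ConstexprTernary<UseSIMD>(simd_result, portable_result);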
force-pushed from 1fb70a7 to e20f161
Co-authored-by: Danila Kutenin <[email protected]>
Thanks, I think all suggestions are done, so merging!
When using a byte encoding for matched group metadata, we need to mask down to a single bit in each matching byte to make iteration over a range of match indices work. In most cases this mask can be folded into the overall match computation, but for Arm Neon there is avoidable overhead from it. Instead, we can defer the mask until we start iterating. Doing more than one iteration is relatively rare, so this doesn't accumulate much waste and makes the common paths a bit faster.
For the M1, this makes the SIMD match path about 2-4% faster. That isn't enough to catch the portable match code path on the M1, though.
For some Neoverse cores the difference is more significant (>10% improvement), and it makes the SIMD and scalar code paths comparable in latency. It's still not clear which is better, as the latency is comparable and beyond latency the factors are very hard to analyze -- port pressure on different parts of the CPU, etc.
Leaving the selected code path as portable since that's so much better on the M1, and I'm hoping to avoid different code paths for different Arm CPUs for a while.
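As a toy illustration of the deferral (a sketch of the idea only, not the actual Neon match code): with the byte encoding, a matching byte can have many bits set, and only the iteration step needs that reduced to one bit per byte, so the mask can be applied lazily.

#include <bit>
#include <cstdint>

// One bit per byte; this is the kind of mask the iteration needs.
constexpr uint64_t kByteMask = 0x8080'8080'8080'8080ULL;

// The first match index never needs the mask: counting trailing zero bits
// and dividing by the byte width is unaffected by how many bits are set
// within the matching byte. Requires match_bits != 0.
inline auto FirstMatchByteIndex(uint64_t match_bits) -> int {
  return std::countr_zero(match_bits) / 8;
}

// Only when iteration continues do we reduce to one bit per byte and clear
// the lowest set bit to step past the first match. This is where the
// deferred mask is finally paid for.
inline auto NextMatchBits(uint64_t match_bits) -> uint64_t {
  uint64_t masked = match_bits & kByteMask;
  return masked & (masked - 1);
}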