Updated the get mask utility for ~10% performance gain on apple silicon #1668

Open
lsawade wants to merge 1 commit into devel from simd-all-true-mask

Conversation

@lsawade (Collaborator) commented Feb 25, 2026

Description

The mask is constructed for every chunk/GLL-point combination, but for ~99% of chunks every lane of the mask is true.

Problem

Every masked SIMD kernel call constructed its mask via a per-lane lambda, even for the ~99% of chunks that are full and trivially all-true:

// runs on every chunk — opaque to optimizer, can't constant-fold
mask_type mask([&](std::size_t lane) { return int(lane) < number_elements; });

This construction showed up as hot at 13 call sites across field I/O, Jacobians, and boundary conditions.

Fix

Added get_mask<simd_type>() to point::index and point::assembly_index to fast-path the common case:

if (number_elements >= simd_type::size())
    return mask_type(true);  // single vmov.i32 on NEON; branch is ~always taken
return mask_type([&](std::size_t lane) { return int(lane) < number_elements; });

All 13 sites updated. Partial-chunk behavior is unchanged.
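
To illustrate why the fast path leaves results unchanged, here is a minimal, self-contained sketch. It does not use the project's real types: `toy_mask`, `mask_per_lane`, `mask_fast_path`, `masks_agree`, and `run_checks` are hypothetical names, and a `std::array<bool, N>` stands in for the SIMD mask. It only checks that the all-true shortcut and the per-lane predicate produce the same lanes for every chunk occupancy.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a SIMD mask: one bool per lane.
// This is NOT the project's mask_type, only an illustration.
template <std::size_t N>
using toy_mask = std::array<bool, N>;

// Per-lane construction, mirroring the original lambda:
// lane i is active when i < number_elements.
template <std::size_t N>
toy_mask<N> mask_per_lane(int number_elements) {
  toy_mask<N> mask{};
  for (std::size_t lane = 0; lane < N; ++lane) {
    mask[lane] = static_cast<int>(lane) < number_elements;
  }
  return mask;
}

// Fast path, mirroring the proposed get_mask: a full chunk gets an
// all-true mask directly; partial chunks fall back to the per-lane form.
template <std::size_t N>
toy_mask<N> mask_fast_path(int number_elements) {
  if (number_elements >= static_cast<int>(N)) {
    toy_mask<N> mask{};
    mask.fill(true);
    return mask;
  }
  return mask_per_lane<N>(number_elements);
}

// Compare both constructions for every occupancy from 0 to 2N.
template <std::size_t N>
bool masks_agree() {
  for (int n = 0; n <= static_cast<int>(2 * N); ++n) {
    if (mask_per_lane<N>(n) != mask_fast_path<N>(n)) return false;
  }
  return true;
}

// Example usage: check a 4-lane mask, including one partial chunk.
inline void run_checks() {
  assert(masks_agree<4>());
  const toy_mask<4> partial = mask_fast_path<4>(3);
  assert(partial[2] && !partial[3]);  // three active lanes, last one off
}
```

The sketch casts the lane count to `int` before comparing so that the comparison stays signed; whether the real `get_mask` needs the same cast depends on the type of `number_elements`, which is assumed here to be `int`.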

Result

Runtime (Apple M3, 2D fluid-solid benchmark): ~10% faster
Correctness impact: None

Issue Number

If there is an issue created for these changes, link it here

Checklist

Please make sure to check developer documentation on specfem docs.

  • I ran the code through pre-commit to check style
  • THE DOCUMENTATION BUILDS WITHOUT WARNINGS/ERRORS
  • I have added labels to the PR (see right hand side of the PR page)
  • My code passes all the integration tests
  • I have added sufficient unittests to test my changes
  • I have added/updated documentation for the changes I am proposing
  • I have updated CMakeLists to ensure my code builds
  • My code builds across all platforms

@lsawade lsawade requested review from Rohit-Kakodkar and icui and removed request for Rohit-Kakodkar February 25, 2026 20:14
@lsawade lsawade changed the title from "Updated the get mask utility for ~10% performance gain" to "Updated the get mask utility for ~10% performance gain on apple silicon" Feb 25, 2026