ARM NEON has a byte-wise popcount instruction, which helps to optimize `select_bit` and `PopCount::count`. Use it for AArch64 (64-bit ARM).

15% speedup for `Rank1`, 4% for `Select0` and 3% for `Select1`. (60% for `PopCount::count` itself.)

This gives a 9% speedup on `select0` and 7% on `select1`. (Tested on Pixel 3 in armeabi-v7a mode.)
This is likely because the branches of this unrolled linear
search are more predictable than the binary search that was
used previously.
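
To illustrate the difference between the two search shapes, here is a minimal sketch. It assumes a hypothetical `BlockRanks` struct holding seven non-decreasing cumulative counts for a block split into eight sub-blocks. The struct and both function names are made up for illustration and are not the library's actual code. Both functions return the index of the first sub-block whose cumulative count exceeds `i`.

```cpp
#include <cstddef>

// Hypothetical stand-in for a rank entry: rel[k] is the number of matching
// bits in sub-blocks 0..k of one block (k = 0..6), so rel is non-decreasing.
struct BlockRanks {
  unsigned rel[7];
};

// Binary search: three data-dependent branches, each close to a coin flip
// on random queries.
inline std::size_t find_sub_block_binary(const BlockRanks& r, unsigned i) {
  if (i < r.rel[3]) {
    if (i < r.rel[1]) return i < r.rel[0] ? 0 : 1;
    return i < r.rel[2] ? 2 : 3;
  }
  if (i < r.rel[5]) return i < r.rel[4] ? 4 : 5;
  return i < r.rel[6] ? 6 : 7;
}

// Unrolled linear search: up to seven comparisons, but each branch is
// mostly not taken until the final one.
inline std::size_t find_sub_block_linear(const BlockRanks& r, unsigned i) {
  if (i < r.rel[0]) return 0;
  if (i < r.rel[1]) return 1;
  if (i < r.rel[2]) return 2;
  if (i < r.rel[3]) return 3;
  if (i < r.rel[4]) return 4;
  if (i < r.rel[5]) return 5;
  if (i < r.rel[6]) return 6;
  return 7;
}
```

The two functions compute the same result; only the branch pattern differs, which is the property the measurement above is attributed to.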
Instead of computing `(counts | MASK_80) - ((i + 1) * MASK_01)`, we pre-compute a lookup table, then use `counts + PREFIX_SUM_OVERFLOW[i]`. This uses a `UInt64[64]`, or 0.5 KiB, lookup table. The trick is from:

Gog, Simon and Matthias Petri. “Optimized succinct data structures for massive data.” Software: Practice and Experience 44 (2014): 1287-1314.
https://www.semanticscholar.org/paper/Optimized-succinct-data-structures-for-massive-data-Gog-Petri/c7e7f02f441ebcc0aeffdcad2964185926551ec3
This gives a 2-3% speedup for `BitVector::select0`/`select1`.
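
The equivalence of the two expressions can be checked with a small sketch. It assumes that every byte of `counts` holds a prefix popcount of at most 64, that `0 <= i < 64`, and that the table entry is `(0x7F - i) * MASK_01`. That entry formula is derived here from the original expression and is not copied from the patch. The helper names are illustrative, not the library's.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uint64_t MASK_01 = 0x0101010101010101ULL;
constexpr std::uint64_t MASK_80 = 0x8080808080808080ULL;

// Byte k of the result holds the popcount of the low 8 * (k + 1) bits of unit.
inline std::uint64_t byte_prefix_counts(std::uint64_t unit) {
  std::uint64_t c = unit - ((unit >> 1) & 0x5555555555555555ULL);
  c = (c & 0x3333333333333333ULL) + ((c >> 2) & 0x3333333333333333ULL);
  c = (c + (c >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
  return c * MASK_01;
}

// Assumed table contents: entry i is (0x7F - i) replicated into every byte.
struct PrefixSumOverflow {
  std::uint64_t value[64];
  PrefixSumOverflow() {
    for (std::uint64_t i = 0; i < 64; ++i) value[i] = (0x7F - i) * MASK_01;
  }
};

inline const PrefixSumOverflow& prefix_sum_overflow() {
  static const PrefixSumOverflow table;  // 64 * 8 bytes = 0.5 KiB
  return table;
}

// Original form: an OR, a multiply and a subtract per query.
inline std::uint64_t overflow_by_arithmetic(std::uint64_t counts, std::size_t i) {
  return (counts | MASK_80) - ((i + 1) * MASK_01);
}

// Table form: one load and one add per query.
inline std::uint64_t overflow_by_table(std::uint64_t counts, std::size_t i) {
  return counts + prefix_sum_overflow().value[i];
}
```

In both forms, the high bit of byte `k` ends up set exactly when the prefix count in byte `k` is greater than `i`. Because each byte stays within 64..191, no carry or borrow crosses a byte boundary, so the two 64-bit results are identical.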