Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
I was looking through how you implemented unxorshift, and noticed it seems over-complicated and could be optimized using a clever implementation.
The current implementation un-xorshifts the value
shift
bits at a time. The idea there is clear: only the firstshift
bits are unaffected by the xorshift, so you first use those to recover the nextshift
bits, which are used to recover the nextshift
bits, etc. The current implementation chose to implement this repeat recursively instead of iteratively for some reason, which may be less optimal if the compiler doesn't optimize it (which it is less likely to do since it's not tail recursion).My implementation makes use of the fact that if
y = x ^ (x >> c)
, theny ^ (y >> c) = x ^ (x >> 2c)
:This allows for a simple recursive step:
unxorshift(x, bits, shift) -> unxorshift(x ^ (x >> shift), bits, shift * 2)
. Since this is tail recursion, this can be trivially optimized as an iterative solution, which I went ahead and did manually in this PR (since it was pretty simple to do). I could have implemented it as a for loop, but since the initial loop condition check is unnecessary given the preconditionshift < bits
, implementing it with a do-while loop removes a conditional jump from the compiled assembly.This new implementation also scales better with larger input sizes. The current implementation recurses
ceil(bits / shift - 1)
times, which, assuming a constant value forshift
, is linear. My implementation loopsceil(log_2(bits / shift))
times, which is logarithmic.You can see just how much this optimizes the implementation by comparing the resulting assembly code (shown for 32-bit values): https://godbolt.org/z/E64G7vq83 (Using Clang because GCC does some optimizations that make the comparison look worse than it is).