-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix bias in returned values #1
Open
starius
wants to merge
4
commits into
encryptio:master
Choose a base branch
from
starius:fix-bias
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
There were two issues. * Bias of distribution of remainders of Int31. Demonstrated in TestTail. Items from the first half of the table get one addition remainder. Can be fixed without overhead by rethrowing RNG in that rare case when its value goes belongs to that tail. * The random number used to choose bucket is not independent from the random number used to decide whether to switch to alias. Demonstrated in TestBalanceInsideBucket. I don't know how to fix without overhead of second call to RNG. Both issues were resolved in this commit. Statistical tests added. Impact on performance: Before: BenchmarkGen5-8 100000000 24.1 ns/op BenchmarkGen50-8 50000000 28.7 ns/op BenchmarkGen500-8 50000000 27.2 ns/op BenchmarkGen5000-8 50000000 27.9 ns/op BenchmarkGen50000-8 50000000 31.3 ns/op After: BenchmarkGen5-8 50000000 38.8 ns/op BenchmarkGen50-8 30000000 53.0 ns/op BenchmarkGen500-8 30000000 40.9 ns/op BenchmarkGen5000-8 30000000 41.5 ns/op BenchmarkGen50000-8 30000000 43.9 ns/op
Now performance is good again.
The original Alias is based on floats, which can result in biased distributions for some inputs because of float errors. The new Alias variant returns results with probabilities that are exactly proportional to input weights. In int based algo number 1.0 was replaced with an average weight (avg). If the average weight of input array is not an an integer, we add a dummy element with such a weight that the average weight becomes an integer. We do not return the dummy element as a result of Gen(): if it is about to be returned, we jump to the beginning of Gen() method using goto. For float based Alias average weight is always (1<<31)-1. The random value rj (used inside a piece to decide whether to switch to an alias) is now in range [0, avg). Member field divRj stores avg. Member maxRj stores maximum allowed value of the second RNG. (The second RNG provides input for the choice inside piece.) New kind of Alias is generated with NewInt() function. Both of alias kinds use the same struct Alias. New fields were added to this structto support integer based alias: dummy, maxRj and divRj. They have fixed values in float based Alias. Operation of method Gen() was slowed down by this change, but for float based Alias this effect was small: divRj is a power a 2, so modulus operation is fast; maxRj is the maximum possible int32, so it never causes rethrowing; dummy is always set to a value that can not be returned, so it also never causes rethrowing. Int based Alias operates slower, mostly due to rethrowing. Probability of rethrowing can be decreased by using small input values (this effect was demonstrated in the benchmark). Benchmark results on the same machine. Before: BenchmarkGen5-4 100000000 17.5 ns/op BenchmarkGen50-4 100000000 21.9 ns/op BenchmarkGen500-4 100000000 20.7 ns/op BenchmarkGen5000-4 100000000 21.6 ns/op BenchmarkGen50000-4 50000000 24.6 ns/op After: BenchmarkGen5-4 100000000 19.7 ns/op BenchmarkGen50-4 50000000 23.8 ns/op BenchmarkGen500-4 100000000 22.6 ns/op BenchmarkGen5000-4 100000000 23.3 ns/op BenchmarkGen50000-4 50000000 25.5 ns/op BenchmarkGenIntLargeP5-4 50000000 42.4 ns/op BenchmarkGenIntLargeP50-4 50000000 23.7 ns/op BenchmarkGenIntLargeP500-4 30000000 41.6 ns/op BenchmarkGenIntLargeP5000-4 30000000 42.9 ns/op BenchmarkGenIntLargeP50000-4 50000000 44.4 ns/op BenchmarkGenIntSmallP5-4 50000000 31.0 ns/op BenchmarkGenIntSmallP50-4 50000000 25.7 ns/op BenchmarkGenIntSmallP500-4 50000000 25.6 ns/op BenchmarkGenIntSmallP5000-4 50000000 24.1 ns/op BenchmarkGenIntSmallP50000-4 50000000 29.2 ns/op
I merged these commits to master branch of my fork: https://github.com/starius/alias/ |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There were two issues.
Bias of distribution of remainders of Int31.
Demonstrated in TestTail. Items from the first half of the table
get one addition remainder. Can be fixed without overhead
by rethrowing RNG in that rare case when its value goes belongs
to that tail.
The random number used to choose bucket is not independent from
the random number used to decide whether to switch to alias.
Demonstrated in TestBalanceInsideBucket. I don't know how to
fix without overhead of second call to RNG.
Both issues were resolved in this commit. Statistical tests added.
Impact on performance:
Before:
After: