Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix bias in returned values #1

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

starius
Copy link

@starius starius commented Mar 10, 2017

There were two issues.

  • Bias of distribution of remainders of Int31.
    Demonstrated in TestTail. Items from the first half of the table
    get one addition remainder. Can be fixed without overhead
    by rethrowing RNG in that rare case when its value goes belongs
    to that tail.

  • The random number used to choose bucket is not independent from
    the random number used to decide whether to switch to alias.
    Demonstrated in TestBalanceInsideBucket. I don't know how to
    fix without overhead of second call to RNG.

Both issues were resolved in this commit. Statistical tests added.

Impact on performance:

Before:

BenchmarkGen5-8                 100000000               24.1 ns/op
BenchmarkGen50-8                50000000                28.7 ns/op
BenchmarkGen500-8               50000000                27.2 ns/op
BenchmarkGen5000-8              50000000                27.9 ns/op
BenchmarkGen50000-8             50000000                31.3 ns/op

After:

BenchmarkGen5-8                 50000000                38.8 ns/op
BenchmarkGen50-8                30000000                53.0 ns/op
BenchmarkGen500-8               30000000                40.9 ns/op
BenchmarkGen5000-8              30000000                41.5 ns/op
BenchmarkGen50000-8             30000000                43.9 ns/op

starius added 4 commits March 10, 2017 17:24
There were two issues.

 * Bias of distribution of remainders of Int31.
   Demonstrated in TestTail. Items from the first half of the table
   get one addition remainder. Can be fixed without overhead
   by rethrowing RNG in that rare case when its value goes belongs
   to that tail.

 * The random number used to choose bucket is not independent from
   the random number used to decide whether to switch to alias.
   Demonstrated in TestBalanceInsideBucket. I don't know how to
   fix without overhead of second call to RNG.

Both issues were resolved in this commit. Statistical tests added.

Impact on performance:

Before:

BenchmarkGen5-8                 100000000               24.1 ns/op
BenchmarkGen50-8                50000000                28.7 ns/op
BenchmarkGen500-8               50000000                27.2 ns/op
BenchmarkGen5000-8              50000000                27.9 ns/op
BenchmarkGen50000-8             50000000                31.3 ns/op

After:

BenchmarkGen5-8                 50000000                38.8 ns/op
BenchmarkGen50-8                30000000                53.0 ns/op
BenchmarkGen500-8               30000000                40.9 ns/op
BenchmarkGen5000-8              30000000                41.5 ns/op
BenchmarkGen50000-8             30000000                43.9 ns/op
Now performance is good again.
The original Alias is based on floats, which can result in biased
distributions for some inputs because of float errors. The new
Alias variant returns results with probabilities that are exactly
proportional to input weights.

In int based algo number 1.0 was replaced with an average weight (avg).
If the average weight of input array is not an an integer, we add
a dummy element with such a weight that the average weight becomes
an integer. We do not return the dummy element as a result of Gen():
if it is about to be returned, we jump to the beginning of Gen() method
using goto. For float based Alias average weight is always (1<<31)-1.

The random value rj (used inside a piece to decide whether to switch
to an alias) is now in range [0, avg). Member field divRj stores avg.
Member maxRj stores maximum allowed value of the second RNG.
(The second RNG provides input for the choice inside piece.)

New kind of Alias is generated with NewInt() function. Both of alias kinds
use the same struct Alias. New fields were added to this structto support
integer based alias: dummy, maxRj and divRj. They have fixed values in float
based Alias. Operation of method Gen() was slowed down by this change, but
for float based Alias this effect was small: divRj is a power a 2, so
modulus operation is fast; maxRj is the maximum possible int32, so it never
causes rethrowing; dummy is always set to a value that can not be returned,
so it also never causes rethrowing. Int based Alias operates slower, mostly
due to rethrowing. Probability of rethrowing can be decreased by using small
input values (this effect was demonstrated in the benchmark).

Benchmark results on the same machine.

Before:

BenchmarkGen5-4         100000000           17.5 ns/op
BenchmarkGen50-4        100000000           21.9 ns/op
BenchmarkGen500-4       100000000           20.7 ns/op
BenchmarkGen5000-4      100000000           21.6 ns/op
BenchmarkGen50000-4     50000000            24.6 ns/op

After:

BenchmarkGen5-4                 100000000           19.7 ns/op
BenchmarkGen50-4                50000000            23.8 ns/op
BenchmarkGen500-4               100000000           22.6 ns/op
BenchmarkGen5000-4              100000000           23.3 ns/op
BenchmarkGen50000-4             50000000            25.5 ns/op
BenchmarkGenIntLargeP5-4        50000000            42.4 ns/op
BenchmarkGenIntLargeP50-4       50000000            23.7 ns/op
BenchmarkGenIntLargeP500-4      30000000            41.6 ns/op
BenchmarkGenIntLargeP5000-4     30000000            42.9 ns/op
BenchmarkGenIntLargeP50000-4    50000000            44.4 ns/op
BenchmarkGenIntSmallP5-4        50000000            31.0 ns/op
BenchmarkGenIntSmallP50-4       50000000            25.7 ns/op
BenchmarkGenIntSmallP500-4      50000000            25.6 ns/op
BenchmarkGenIntSmallP5000-4     50000000            24.1 ns/op
BenchmarkGenIntSmallP50000-4    50000000            29.2 ns/op
@starius
Copy link
Author

starius commented Mar 13, 2017

I merged these commits to master branch of my fork: https://github.com/starius/alias/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant