fix bias in returned values #1

starius · 2017-03-10T16:26:59Z

There were two issues.

Bias of distribution of remainders of Int31.
Demonstrated in TestTail. Items from the first half of the table
get one addition remainder. Can be fixed without overhead
by rethrowing RNG in that rare case when its value goes belongs
to that tail.
The random number used to choose bucket is not independent from
the random number used to decide whether to switch to alias.
Demonstrated in TestBalanceInsideBucket. I don't know how to
fix without overhead of second call to RNG.

Both issues were resolved in this commit. Statistical tests added.

Impact on performance:

Before:

BenchmarkGen5-8                 100000000               24.1 ns/op
BenchmarkGen50-8                50000000                28.7 ns/op
BenchmarkGen500-8               50000000                27.2 ns/op
BenchmarkGen5000-8              50000000                27.9 ns/op
BenchmarkGen50000-8             50000000                31.3 ns/op

After:

BenchmarkGen5-8                 50000000                38.8 ns/op
BenchmarkGen50-8                30000000                53.0 ns/op
BenchmarkGen500-8               30000000                40.9 ns/op
BenchmarkGen5000-8              30000000                41.5 ns/op
BenchmarkGen50000-8             30000000                43.9 ns/op

There were two issues. * Bias of distribution of remainders of Int31. Demonstrated in TestTail. Items from the first half of the table get one addition remainder. Can be fixed without overhead by rethrowing RNG in that rare case when its value goes belongs to that tail. * The random number used to choose bucket is not independent from the random number used to decide whether to switch to alias. Demonstrated in TestBalanceInsideBucket. I don't know how to fix without overhead of second call to RNG. Both issues were resolved in this commit. Statistical tests added. Impact on performance: Before: BenchmarkGen5-8 100000000 24.1 ns/op BenchmarkGen50-8 50000000 28.7 ns/op BenchmarkGen500-8 50000000 27.2 ns/op BenchmarkGen5000-8 50000000 27.9 ns/op BenchmarkGen50000-8 50000000 31.3 ns/op After: BenchmarkGen5-8 50000000 38.8 ns/op BenchmarkGen50-8 30000000 53.0 ns/op BenchmarkGen500-8 30000000 40.9 ns/op BenchmarkGen5000-8 30000000 41.5 ns/op BenchmarkGen50000-8 30000000 43.9 ns/op

Now performance is good again.

The original Alias is based on floats, which can result in biased distributions for some inputs because of float errors. The new Alias variant returns results with probabilities that are exactly proportional to input weights. In int based algo number 1.0 was replaced with an average weight (avg). If the average weight of input array is not an an integer, we add a dummy element with such a weight that the average weight becomes an integer. We do not return the dummy element as a result of Gen(): if it is about to be returned, we jump to the beginning of Gen() method using goto. For float based Alias average weight is always (1<<31)-1. The random value rj (used inside a piece to decide whether to switch to an alias) is now in range [0, avg). Member field divRj stores avg. Member maxRj stores maximum allowed value of the second RNG. (The second RNG provides input for the choice inside piece.) New kind of Alias is generated with NewInt() function. Both of alias kinds use the same struct Alias. New fields were added to this structto support integer based alias: dummy, maxRj and divRj. They have fixed values in float based Alias. Operation of method Gen() was slowed down by this change, but for float based Alias this effect was small: divRj is a power a 2, so modulus operation is fast; maxRj is the maximum possible int32, so it never causes rethrowing; dummy is always set to a value that can not be returned, so it also never causes rethrowing. Int based Alias operates slower, mostly due to rethrowing. Probability of rethrowing can be decreased by using small input values (this effect was demonstrated in the benchmark). Benchmark results on the same machine. Before: BenchmarkGen5-4 100000000 17.5 ns/op BenchmarkGen50-4 100000000 21.9 ns/op BenchmarkGen500-4 100000000 20.7 ns/op BenchmarkGen5000-4 100000000 21.6 ns/op BenchmarkGen50000-4 50000000 24.6 ns/op After: BenchmarkGen5-4 100000000 19.7 ns/op BenchmarkGen50-4 50000000 23.8 ns/op BenchmarkGen500-4 100000000 22.6 ns/op BenchmarkGen5000-4 100000000 23.3 ns/op BenchmarkGen50000-4 50000000 25.5 ns/op BenchmarkGenIntLargeP5-4 50000000 42.4 ns/op BenchmarkGenIntLargeP50-4 50000000 23.7 ns/op BenchmarkGenIntLargeP500-4 30000000 41.6 ns/op BenchmarkGenIntLargeP5000-4 30000000 42.9 ns/op BenchmarkGenIntLargeP50000-4 50000000 44.4 ns/op BenchmarkGenIntSmallP5-4 50000000 31.0 ns/op BenchmarkGenIntSmallP50-4 50000000 25.7 ns/op BenchmarkGenIntSmallP500-4 50000000 25.6 ns/op BenchmarkGenIntSmallP5000-4 50000000 24.1 ns/op BenchmarkGenIntSmallP50000-4 50000000 29.2 ns/op

starius · 2017-03-13T00:31:51Z

I merged these commits to master branch of my fork: https://github.com/starius/alias/

starius added 4 commits March 10, 2017 17:24

add sanity check

6d01a3e

use rand.Int63 instead of two rand.Int31

db3cba3

Now performance is good again.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix bias in returned values #1

fix bias in returned values #1

starius commented Mar 10, 2017

starius commented Mar 13, 2017

fix bias in returned values #1

Are you sure you want to change the base?

fix bias in returned values #1

Conversation

starius commented Mar 10, 2017

starius commented Mar 13, 2017