Skip to content

merge is invalid #15

@kurtbrose

Description

@kurtbrose

The merge operation is not valid for this algorithm. It is not a distributed algorithm, and intermediate results from two streams cannot be combined.

The definition of g from the paper, Delta in the go code: "gi is the difference between the lowest possible rank of item i and the lowest possible rank of item i − 1"

Consider two sequences, with stored nodes marked with ():

...(1)111(3)...

...0(2)...

The node with value=3 will have g/Delta of 4. After the lists are merged:

...0(1)111(2)(3)...

The node with value=3 still has g/Delta of 4, but this is incorrect. The actual g/Delta should be 1. The invariant of the algorithm is broken, and therefore the guaranteed error bounds are also broken.

If there are no longer guaranteed error bounds, you might as well use a reservoir sample.

The fix is to remove the merge() operation. (Multiple goroutines would need to either push points into a queue, or use a lock around the data structure, or some other fancy architecture.)

Q-digest is an example of a distributed algorithm in this space:
http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf

Edit: the original authors of the paper also extended it to be a distributed algorithm.
http://www.cis.upenn.edu/~mbgreen/papers/pods04.pdf
The data structure used is a bit different, but similar. This would require a total re-write of the core algorithm, but would allow for a valid merge() operation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions