Implementation decisions of reservoir_sample aggregate function

We are adding the aggregate function [reservoir_sample](https://prestodb.io/docs/current/functions/aggregate.html#reservoir_sample) to Presto Native/C++. There are several places in the Presto [ReservoirSampleFunction](https://github.com/prestodb/presto/blob/47a708022aba003ebfda483ea3c940231fdbd282/presto-main-base/src/main/java/com/facebook/presto/operator/aggregation/reservoirsample/ReservoirSampleFunction.java) class today where we think efficiency could be improved, and we need help to understand the rationale behind the implementation decisions. We are trying to see if different implementation approach exists so that it could be more performant when we implement it on top of the Native aggregate function framework.

The general question is, is the reservoir_sample implementation today based on a certain *distributed* reservoir sampling algorithm? Any pointer to the resource?

The main questions we have are
1. Today's implementation carries the InitialSample all the way to the end and merge it at the last at the final aggregation. What is the rationale behind it, compared to starting to merge it at the beginning at the partial aggregation? Does it increase the probability of the InitialSample elements being selected in the final result? We are trying to see if we can save space by not carrying InitialSample all the way but merging it at the beginning.
2. The [combine()](https://github.com/prestodb/presto/blob/47a708022aba003ebfda483ea3c940231fdbd282/presto-main-base/src/main/java/com/facebook/presto/operator/aggregation/reservoirsample/ReservoirSampleFunction.java#L83) method and [merge()](https://github.com/prestodb/presto/blob/e03eff1b10329554af6c58828fb2a39c5204cffe/presto-main-base/src/main/java/com/facebook/presto/operator/aggregation/reservoirsample/ReservoirSample.java#L240-L243) method shuffle the two inputs and then randomly select elements from the two inputs alternately. Will it work if we either shuffle or randomly select but not both?

PR https://github.com/prestodb/presto/pull/21296



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implementation decisions of reservoir_sample aggregate function #26781

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Implementation decisions of reservoir_sample aggregate function #26781

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions