Skip to content

Implementation decisions of reservoir_sample aggregate function #26781

@gggrace14

Description

@gggrace14

We are adding the aggregate function reservoir_sample to Presto Native/C++. There are several places in the Presto ReservoirSampleFunction class today where we think efficiency could be improved, and we need help to understand the rationale behind the implementation decisions. We are trying to see if different implementation approach exists so that it could be more performant when we implement it on top of the Native aggregate function framework.

The general question is, is the reservoir_sample implementation today based on a certain distributed reservoir sampling algorithm? Any pointer to the resource?

The main questions we have are

  1. Today's implementation carries the InitialSample all the way to the end and merge it at the last at the final aggregation. What is the rationale behind it, compared to starting to merge it at the beginning at the partial aggregation? Does it increase the probability of the InitialSample elements being selected in the final result? We are trying to see if we can save space by not carrying InitialSample all the way but merging it at the beginning.
  2. The combine() method and merge() method shuffle the two inputs and then randomly select elements from the two inputs alternately. Will it work if we either shuffle or randomly select but not both?

PR #21296

Metadata

Metadata

Assignees

Labels

Type

Projects

Status

🆕 Unprioritized

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions