-
Notifications
You must be signed in to change notification settings - Fork 5.5k
Open
Labels
Description
We are adding the aggregate function reservoir_sample to Presto Native/C++. There are several places in the Presto ReservoirSampleFunction class today where we think efficiency could be improved, and we need help to understand the rationale behind the implementation decisions. We are trying to see if different implementation approach exists so that it could be more performant when we implement it on top of the Native aggregate function framework.
The general question is, is the reservoir_sample implementation today based on a certain distributed reservoir sampling algorithm? Any pointer to the resource?
The main questions we have are
- Today's implementation carries the InitialSample all the way to the end and merge it at the last at the final aggregation. What is the rationale behind it, compared to starting to merge it at the beginning at the partial aggregation? Does it increase the probability of the InitialSample elements being selected in the final result? We are trying to see if we can save space by not carrying InitialSample all the way but merging it at the beginning.
- The combine() method and merge() method shuffle the two inputs and then randomly select elements from the two inputs alternately. Will it work if we either shuffle or randomly select but not both?
PR #21296
Metadata
Metadata
Assignees
Labels
Type
Projects
Status
🆕 Unprioritized