Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Previously, reading Parquet data meant loading a whole row group into memory and then extracting the parts we needed, which is clearly wasteful.
I would therefore like to read only the pages we need and cache those pages for future reads.
The first part (reading only the needed pages) is solved by #7850, and I started working on the caching part after that PR was released.
Describe the solution you'd like
I propose adding a cache mechanism to decode_page in impl RowGroupReader for SerializedRowGroupReader. Repeated reads of the same page could then skip the decode and decompress cost.
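To make the idea concrete, here is a minimal sketch of what such a cache could look like. It is not the parquet crate's API: PageKey, DecodedPage, PageCache, and get_or_decode are hypothetical names, the key layout (row group, column, page offset) is an assumption, and the decode step is passed in as a closure standing in for the real decode and decompress work. Eviction is a simple LRU bounded by total decoded bytes.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Arc;

/// Hypothetical cache key: identifies one page within a file.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct PageKey {
    pub row_group: usize,
    pub column: usize,
    /// Byte offset of the page within the file (assumed to be unique per page).
    pub offset: u64,
}

/// Decoded (already decompressed) page bytes, cheap to share between readers.
pub type DecodedPage = Arc<Vec<u8>>;

/// A small LRU cache bounded by the total number of decoded bytes it holds.
pub struct PageCache {
    capacity_bytes: usize,
    used_bytes: usize,
    pages: HashMap<PageKey, DecodedPage>,
    /// Keys ordered from least to most recently used.
    order: VecDeque<PageKey>,
}

impl PageCache {
    pub fn new(capacity_bytes: usize) -> Self {
        Self {
            capacity_bytes,
            used_bytes: 0,
            pages: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    /// Returns the cached page if present. Otherwise runs `decode`
    /// (a stand-in for the real decode and decompress step) and caches the result.
    pub fn get_or_decode<E>(
        &mut self,
        key: PageKey,
        decode: impl FnOnce() -> Result<Vec<u8>, E>,
    ) -> Result<DecodedPage, E> {
        if let Some(page) = self.pages.get(&key).cloned() {
            self.touch(key);
            return Ok(page);
        }
        let page = Arc::new(decode()?);
        self.insert(key, Arc::clone(&page));
        Ok(page)
    }

    /// Number of pages currently held.
    pub fn cached_pages(&self) -> usize {
        self.pages.len()
    }

    /// Marks `key` as the most recently used entry.
    fn touch(&mut self, key: PageKey) {
        if let Some(pos) = self.order.iter().position(|k| *k == key) {
            self.order.remove(pos);
        }
        self.order.push_back(key);
    }

    /// Stores a page that is not yet cached, evicting old pages if needed.
    fn insert(&mut self, key: PageKey, page: DecodedPage) {
        // A page larger than the whole budget is returned to the caller but not cached.
        if page.len() > self.capacity_bytes {
            return;
        }
        // Evict least recently used pages until the new page fits.
        while self.used_bytes + page.len() > self.capacity_bytes {
            match self.order.pop_front() {
                Some(old) => {
                    if let Some(evicted) = self.pages.remove(&old) {
                        self.used_bytes -= evicted.len();
                    }
                }
                None => break,
            }
        }
        self.used_bytes += page.len();
        self.pages.insert(key, page);
        self.order.push_back(key);
    }
}
```

Bounding the cache by bytes rather than by page count is a design choice made for this sketch, since decoded pages can vary a lot in size. A real implementation would also need to decide how the cache is shared across column readers and threads, which this sketch leaves out.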
Describe alternatives you've considered
I also considered adding a cache to the filter stage, but that part is already implemented.
I also considered page-level prefetching, but I do not think it would be profitable enough to justify.