The ParquetFileReader provides a PageIndexReader, through which we can eventually get to a ColumnIndex and an OffsetIndex. So far so good. Those indexes provide page-based information, but virtually all APIs abstract the concept of pages away completely. For higher-level APIs that makes sense, but even at the level of the PageReader we can only read all pages serially, one after the other. The only way I found to skip pages is the PageReader's data page filter, but that operates only on the page's metadata and does not use the index. I did not find a way to load a specific page (e.g., via index or file offset), so I don't see how one can make use of the PageIndex with the current API. Did I miss anything?
Component(s)
C++, Parquet
You're right. There has been a discussion about introducing a RowRanges API that leverages the page index to skip pages: https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM. There is a stale PR, but its author no longer works on it. I recently started picking it up by implementing the RowRanges API, which will take some time to finish: #45234
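To make the idea of skipping pages by row range concrete, here is a minimal, self-contained sketch of the selection step. It is not the Arrow API and not the design from the linked document: `PageLocation` below is a hypothetical stand-in that only mirrors the fields an offset index entry carries (file offset, compressed size, first row index), and `RowRange` and `PagesOverlapping` are names made up for illustration. It assumes the requested range is a closed interval of row indices within a single row group.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one offset index entry (not the Arrow type).
struct PageLocation {
  int64_t offset;                // file offset of the page
  int32_t compressed_page_size;  // page size in bytes
  int64_t first_row_index;       // first row of the page within the row group
};

// Hypothetical closed interval [first, last] of row indices in a row group.
struct RowRange {
  int64_t first;
  int64_t last;
};

// Returns the indices of all pages that contain at least one requested row.
// `pages` must be ordered by first_row_index; `num_rows` is the row count of
// the row group and bounds the last page.
std::vector<std::size_t> PagesOverlapping(const std::vector<PageLocation>& pages,
                                          int64_t num_rows, RowRange range) {
  assert(range.first <= range.last);
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    const int64_t page_first = pages[i].first_row_index;
    // A page ends one row before the next page starts (or at the row group end).
    const int64_t next_first =
        (i + 1 < pages.size()) ? pages[i + 1].first_row_index : num_rows;
    const int64_t page_last = next_first - 1;
    if (page_first <= range.last && range.first <= page_last) {
      selected.push_back(i);
    }
  }
  return selected;
}
```

For example, with three pages starting at rows 0, 1000 and 2000 in a row group of 2500 rows, the range 1500..2100 selects pages 1 and 2, so a reader that can seek to a page's `offset` would never need to touch page 0.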