Describe the enhancement requested
I'm using Parquet via PyArrow (versions 17.0 and 18.0).
I have tried many options (setting pre_buffer=True, use_threads=False, increasing buffer_size) with pq.read_table(), but I can't find a way to reduce seeking and increase the size of read() calls.
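For reference, a minimal sketch of the combinations tried (the file name is just a stand-in for the real file on slow storage):

```python
import pyarrow.parquet as pq

path = "data.parquet"  # stand-in for the actual file on slow storage

# Prefetch column chunks in larger, coalesced reads.
t1 = pq.read_table(path, pre_buffer=True)

# Single-threaded decode, to rule out interleaved seeks from worker threads.
t2 = pq.read_table(path, use_threads=False)

# Larger read buffer for the underlying file handle.
t3 = pq.read_table(path, buffer_size=8 * 1024 * 1024)
```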
As I understand it, the Parquet format can be read by fetching the footer once or twice, then streaming the row groups in file order and handing the bytes off to decompression and decoding.
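In other words, something like the following access pattern (a sketch; `process` is a placeholder for whatever consumes the data):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")       # footer/metadata is read here
for i in range(pf.metadata.num_row_groups):
    rg = pf.read_row_group(i)             # read + decompress + decode one row group
    process(rg)                           # placeholder downstream consumer
    del rg                                # processed row group can be dropped
```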
If I have a reasonable number of files, having a few sequential readers (one reader per file) should provide the most throughput when the storage is inefficient with small random reads.
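Roughly this shape, assuming `files` is the list of Parquet files to read:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow.parquet as pq

def read_sequentially(path):
    pf = pq.ParquetFile(path)
    # iter_batches walks the row groups in order, so each reader only
    # moves forward through its own file.
    return [batch for batch in pf.iter_batches()]

files = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]  # placeholder paths
with ThreadPoolExecutor(max_workers=len(files)) as pool:
    results = list(pool.map(read_sequentially, files))
```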
Currently, I use pq.read_table(BytesIO(Path().read_bytes())). This is wasteful and clunky: it allocates the entire file as one contiguous buffer, and nothing is freed for row groups that have already been processed.
I understand this can get complex when filters and column projection are involved, but developing an I/O plan where I can specify in-order reads and a minimum read size could work (e.g., reading the small gaps and discarding unused data afterward).
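To illustrate, the byte ranges for such a plan can already be derived from the footer today; the missing piece is a way to make the reader issue them as large in-order reads. A rough sketch (the GAP threshold is an assumed knob, not an existing option):

```python
import pyarrow.parquet as pq

GAP = 1 * 1024 * 1024  # assumed: maximum gap worth reading through and discarding

md = pq.ParquetFile("data.parquet").metadata
ranges = []
for rg_idx in range(md.num_row_groups):
    rg = md.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        chunk = rg.column(col_idx)
        start = chunk.dictionary_page_offset or chunk.data_page_offset
        ranges.append((start, start + chunk.total_compressed_size))

# Coalesce ranges whose gap is small enough that reading (and discarding)
# the gap beats an extra seek.
ranges.sort()
plan = [ranges[0]]
for start, end in ranges[1:]:
    prev_start, prev_end = plan[-1]
    if start - prev_end <= GAP:
        plan[-1] = (prev_start, max(prev_end, end))
    else:
        plan.append((start, end))

# `plan` is now a short list of large, in-order reads that could be issued sequentially.
```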
Component(s)
Parquet, Python