Describe the enhancement requested
I'm using Parquet via PyArrow (versions 17.0 and 18.0).
I have tried many options (setting pre_buffer=True, use_threads=False, increasing buffer_size) with pq.read_table(), but I can't find a way to reduce seeking and increase the size of read() calls.
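For reference, a minimal sketch of the combinations tried (the file name is just a stand-in for the real file on slow storage):

```python
import pyarrow.parquet as pq

path = "data.parquet"  # stand-in for the actual file on slow storage

# Prefetch column chunks in larger, coalesced reads.
t1 = pq.read_table(path, pre_buffer=True)

# Single-threaded decode, to rule out interleaved seeks from worker threads.
t2 = pq.read_table(path, use_threads=False)

# Larger read buffer for the underlying file handle.
t3 = pq.read_table(path, buffer_size=8 * 1024 * 1024)
```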
As I understand it, the Parquet format can be read by fetching the footer once or twice, then streaming the row groups in file order and handing the bytes off to decompression and decoding.
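In other words, something like the following access pattern (a sketch; `process` is a placeholder for whatever consumes the data):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")       # footer/metadata is read here
for i in range(pf.metadata.num_row_groups):
    rg = pf.read_row_group(i)             # read + decompress + decode one row group
    process(rg)                           # placeholder downstream consumer
    del rg                                # processed row group can be dropped
```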
If I have a reasonable number of files, having a few sequential readers (one reader per file) should provide the most throughput when the storage is inefficient with small random reads.
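Roughly this shape, assuming `files` is the list of Parquet files to read:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow.parquet as pq

def read_sequentially(path):
    pf = pq.ParquetFile(path)
    # iter_batches walks the row groups in order, so each reader only
    # moves forward through its own file.
    return [batch for batch in pf.iter_batches()]

files = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]  # placeholder paths
with ThreadPoolExecutor(max_workers=len(files)) as pool:
    results = list(pool.map(read_sequentially, files))
```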
Currently, I use pq.read_table(BytesIO(Path().read_bytes())). This is wasteful and clunky: it allocates the entire file as one contiguous buffer, and nothing is freed for row groups that have already been processed.
I understand this can get complex when filters and column projection are involved, but developing an I/O plan where I can specify in-order reads and a minimum read size could work (e.g., reading the small gaps and discarding unused data afterward).
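To illustrate, the byte ranges for such a plan can already be derived from the footer today; the missing piece is a way to make the reader issue them as large in-order reads. A rough sketch (the GAP threshold is an assumed knob, not an existing option):

```python
import pyarrow.parquet as pq

GAP = 1 * 1024 * 1024  # assumed: maximum gap worth reading through and discarding

md = pq.ParquetFile("data.parquet").metadata
ranges = []
for rg_idx in range(md.num_row_groups):
    rg = md.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        chunk = rg.column(col_idx)
        start = chunk.dictionary_page_offset or chunk.data_page_offset
        ranges.append((start, start + chunk.total_compressed_size))

# Coalesce ranges whose gap is small enough that reading (and discarding)
# the gap beats an extra seek.
ranges.sort()
plan = [ranges[0]]
for start, end in ranges[1:]:
    prev_start, prev_end = plan[-1]
    if start - prev_end <= GAP:
        plan[-1] = (prev_start, max(prev_end, end))
    else:
        plan.append((start, end))

# `plan` is now a short list of large, in-order reads that could be issued sequentially.
```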
Component(s)
Parquet, Python