[C++][Parquet][Python] Sequential reading of parquet files #45298

Open
alippai opened this issue Jan 17, 2025 · 0 comments

alippai commented Jan 17, 2025

Describe the enhancement requested

I'm using Parquet via PyArrow (versions 17.0 and 18.0).

I have tried many options (setting pre_buffer=True, use_threads=False, increasing buffer_size) with pq.read_table(), but I can't find a way to reduce seeking and increase the size of read() calls.

As I understand it, a Parquet file can be read by fetching the footer in one or two reads, then streaming the row groups in order and handing the data off to decompression and decoding.
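To make the "once or twice" part concrete, here is a minimal sketch of locating the footer using only the standard library. It relies on the documented Parquet layout (the file ends with a 4-byte little-endian footer length followed by the `PAR1` magic). The function name `read_footer` and the `tail_guess` parameter are hypothetical, not PyArrow APIs, and the sketch ignores encrypted files.

```python
import io
import struct

MAGIC = b"PAR1"  # trailing magic of an unencrypted Parquet file


def read_footer(f, tail_guess=64 * 1024):
    """Return (footer_bytes, footer_offset, n_reads) for a seekable binary file.

    One speculative read of the file tail is issued first. A second read is
    only needed when the footer is larger than the guessed tail.
    """
    size = f.seek(0, io.SEEK_END)
    if size < 12:  # leading magic + footer length + trailing magic
        raise ValueError("file too small to be a Parquet file")

    tail_len = max(min(tail_guess, size), 8)
    f.seek(size - tail_len)
    tail = f.read(tail_len)
    n_reads = 1

    if tail[-4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    footer_offset = size - 8 - footer_len
    if footer_offset < len(MAGIC):
        raise ValueError("footer length is larger than the file")

    if footer_len + 8 <= len(tail):
        # The speculative read already covers the whole footer.
        footer = tail[len(tail) - 8 - footer_len : len(tail) - 8]
    else:
        # Footer is bigger than the guess: one more read, now with a known size.
        f.seek(footer_offset)
        footer = f.read(footer_len)
        n_reads = 2
    return footer, footer_offset, n_reads
```

Everything before `footer_offset` is column data, which is the part I would like to stream front to back.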

If I have a reasonable number of files, having a few sequential readers (one reader per file) should provide the most throughput when the storage is inefficient with small random reads.
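A rough sketch of the access pattern I mean, using only the standard library: each file gets exactly one reader that walks it front to back with large `read()` calls, and only a few files are read at once. The names `read_sequentially` and `read_files` and the default sizes are hypothetical, and nothing here decodes Parquet. It only shows the I/O shape.

```python
from concurrent.futures import ThreadPoolExecutor


def read_sequentially(path, chunk_size=8 * 1024 * 1024):
    """Read one file front to back; return (total_bytes, number_of_read_calls)."""
    total = 0
    calls = 0
    # buffering=0 gives a raw file object, so each read() asks the OS for
    # up to chunk_size bytes instead of going through a small Python buffer.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
            calls += 1
            # A real reader would hand `chunk` to decompression/decoding here
            # and then drop it, so memory use stays bounded by chunk_size.
    return total, calls


def read_files(paths, max_readers=4, chunk_size=8 * 1024 * 1024):
    """Run one sequential reader per file, at most max_readers at a time."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_readers) as pool:
        results = list(pool.map(lambda p: read_sequentially(p, chunk_size), paths))
    return {str(p): r for p, r in zip(paths, results)}
```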

Currently, I use pq.read_table(BytesIO(Path().read_bytes())). This is wasteful and clunky: it allocates too much contiguous memory, and the read doesn't free the processed row groups.

I understand this can get complex when filters and column projection are involved, but developing an I/O plan where I can specify in-order reads and a minimum read size could work (e.g., reading the small gaps and discarding unused data afterward).
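As an illustration of what such an I/O plan could look like, here is a small standard-library sketch. It takes the byte ranges a query needs (for example, the selected column chunks), sorts them, merges neighbours separated by small gaps, and pads short reads up to a minimum size. `plan_reads`, `execute_plan`, `min_read_size`, and `max_gap` are hypothetical names for this example, not existing PyArrow options.

```python
def plan_reads(ranges, file_size, min_read_size, max_gap):
    """Turn (offset, length) ranges into in-order, non-overlapping reads.

    Ranges whose gap to the previous read is at most max_gap bytes are merged
    into it (the gap is read and discarded). Each new read is padded forward
    to at least min_read_size bytes, clamped to file_size.
    """
    reads = []
    for offset, length in sorted(ranges):
        end = offset + length
        if reads and offset - reads[-1]["end"] <= max_gap:
            cur = reads[-1]
            cur["end"] = max(cur["end"], end)
            cur["wanted"].append((offset, length))
        else:
            padded_end = min(max(end, offset + min_read_size), file_size)
            reads.append({"start": offset, "end": padded_end,
                          "wanted": [(offset, length)]})
    return reads


def execute_plan(f, reads):
    """Issue one read() per planned read; keep wanted slices, drop the rest."""
    out = {}
    for r in reads:
        f.seek(r["start"])
        buf = f.read(r["end"] - r["start"])
        for offset, length in r["wanted"]:
            rel = offset - r["start"]
            out[(offset, length)] = buf[rel:rel + length]
    return out
```

Because the reads come out sorted by offset and never overlap, a reader can process them strictly front to back. The trade-off is the extra bytes read for gaps and padding, which is the cost I would accept on storage that handles small random reads poorly.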

Component(s)

Parquet, Python

@amoeba amoeba changed the title Sequential reading of parquet files [C++][Parquet][Python] Sequential reading of parquet files Jan 18, 2025