
[C++] Metadata related memory leak when reading parquet dataset #45287

Open
icexelloss opened this issue Jan 16, 2025 · 5 comments

@icexelloss
Contributor

icexelloss commented Jan 16, 2025

Describe the bug, including details regarding any error messages, version, and platform.

Hi,

I have observed a memory leak when loading a parquet dataset, which I think is related to the metadata files.

I ran with PyArrow 19.0.0. Here is the code to repro:

import pyarrow.parquet as pq
t = pq.read_table("bamboo-streaming-parquet-test-data/10000col_2_short_name", columns=['time', 'id'])
print(t)

Here is the description of the dataset:

  • It is a daily-partitioned Parquet dataset: total size is 1 GB, each Parquet file / partition is 3.7 MB, 260 Parquet files in total.
  • Each partition has a single row and 10k double columns.

The dataset roughly looks like this

                         time  id      md_0      md_1      md_2      md_3      md_4      md_5      md_6      md_7      md_8      md_9     md_10     md_11     md_12     md_13     md_14  ...   md_9986   md_9987   md_9988   md_9989   md_9990   md_9991   md_9992   md_9993   md_9994   md_9995   md_9996   md_9997   md_9998   md_9999  year  month  day
0   2023-01-02 09:00:00+00:00   0  0.345584  0.821618  0.330437 -1.303157  0.905356  0.446375 -0.536953  0.581118  0.364572  0.294132  0.028422  0.546713 -0.736454 -0.162910 -0.482119  ... -0.559077  0.422268 -0.694504 -0.024630 -1.142861  2.203289 -0.293591 -1.076218 -2.264640  1.424887  1.601123  0.301252 -0.771280  0.185484  2023      1    2
1   2023-01-03 09:00:00+00:00   0 -0.581676 -0.889318  0.487676  0.678370 -0.834241  0.990142 -0.502560 -3.089640 -1.354553  0.669394  0.173036  0.904321  0.528163  1.386469 -1.018272  ...  2.348579  0.682227 -0.212912  0.404263 -1.527967 -0.636490 -1.094308 -0.049889  0.290552 -0.428462 -0.688299  1.856678  1.714070  0.228840  2023      1    3
2   2023-01-04 09:00:00+00:00   0 -0.436375  1.554100  1.583000 -0.427829 -0.105547 -1.210442 -1.995322 -0.676878  0.957899 -1.569809  0.411940  0.190030 -1.502412 -0.006992  0.086427  ... -0.039152 -0.325682 -3.200570  0.415924 -1.892018 -0.324783 -0.397570  1.310791  1.284943  0.148449  0.844266 -0.045938  0.745099  1.037851  2023      1    4
3   2023-01-05 09:00:00+00:00   0 -0.158549 -1.239811 -4.030404  1.357348  0.323645 -1.222858 -0.285377  0.963126 -0.531556 -0.652767  0.161818 -0.727889 -0.845209  2.557909  0.192841  ...  0.349263  1.362306  0.993748 -0.198351 -0.270906  0.667339  0.265590 -0.344429 -0.025954 -0.751611 -0.614933  0.629236 -0.765841  1.214225  2023      1    5
4   2023-01-06 09:00:00+00:00   0  0.165239  1.645823  1.345670 -0.966753 -1.149769  0.245695  0.731457 -0.902745  1.270495  2.031029  0.312967 -1.554449  1.177362 -0.843873 -0.216501  ... -0.070219  1.582911  0.146530 -2.169505 -0.474960  0.896453 -1.591739  0.560348 -1.130101  1.137671  1.327553 -0.383506 -0.346886 -0.189187  2023      1    6
..                        ...  ..       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...   ...    ...  ...
255 2023-12-25 09:00:00+00:00   0 -1.346927  0.359054  0.539482  0.367916 -1.574514  0.986346 -0.695192  0.658779  1.335143  1.846663 -0.341364  0.817412 -0.797522  0.073098  0.821410  ... -1.186771  0.887036  1.411563 -0.292395  0.430151  1.141385  0.496770 -0.644220 -0.799314 -1.696699  0.862889  2.979495  0.630375  1.303667  2023     12   25
256 2023-12-26 09:00:00+00:00   0  0.227574 -1.466949 -0.333808 -1.710143  1.314850 -0.322474  0.048659  0.470558 -0.045580  1.193444 -1.826998 -1.368194  0.489085  0.947896  0.640531  ...  0.914886  0.261353 -0.691675 -0.399880  2.045703 -2.356994  1.374474  0.398776 -1.112503 -0.821812  1.238957 -0.940858 -0.912673 -0.784034  2023     12   26
257 2023-12-27 09:00:00+00:00   0  0.054617 -1.524966  0.890249  0.360648  2.271556 -0.964410  1.819533 -0.050139  1.859295 -0.590993  0.306090  0.354523  0.094928  0.191593 -0.225309  ... -0.488067 -0.309505  0.544273 -0.408513 -0.111164  0.974175 -0.441507  2.331777  0.726422 -0.165301 -1.163866  0.077637  0.404457  1.498559  2023     12   27
258 2023-12-28 09:00:00+00:00   0  0.827725  1.090989  0.273126  0.586210  0.753180 -1.544673  0.180036 -1.136032  0.919575 -0.733295 -0.661449  0.194519  0.228403 -0.531628 -0.226339  ... -0.986043  0.099540 -0.729874  0.692716 -0.506130 -0.122421  0.321638 -2.592867  0.083722  0.418742 -0.076682  1.067173 -0.331503  0.617221  2023     12   28
259 2023-12-29 09:00:00+00:00   0  0.527097  0.358271 -0.659745  1.500467 -0.977564  1.198143  0.650929  0.876694 -0.144450  1.175169  0.749327 -0.475795 -0.978405 -0.888626  0.041753  ... -0.090532 -2.414195  1.619769 -0.005002 -0.672586  0.638271  1.819008 -0.446535 -0.629320 -1.241598  0.926157 -0.304448 -0.129029  0.750146  2023     12   29

[260 rows x 10005 columns]

When running the code above with "time -v", it shows memory usage of about 6 GB, which is significantly larger than the data loaded, so I think there is some metadata-related memory leak. I also noticed that the memory usage increases if I use longer column names: e.g., if I prepend a 128-character prefix to the column names, the memory usage is about 11 GB.
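
(For reference, here is a rough sketch of how one could split the measurement between memory tracked by Arrow's default memory pool and the overall process footprint; this assumes Linux, where ru_maxrss is reported in kilobytes:)

import resource

import pyarrow as pa
import pyarrow.parquet as pq

# Bytes tracked by Arrow's default memory pool (roughly the loaded data).
before = pa.total_allocated_bytes()

t = pq.read_table(
    "bamboo-streaming-parquet-test-data/10000col_2_short_name",
    columns=["time", "id"],
)

after = pa.total_allocated_bytes()

# Peak RSS of the whole process; on Linux ru_maxrss is in kilobytes.
peak_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(f"Arrow-allocated bytes for the table: {after - before}")
print(f"Process peak RSS: {peak_rss_kb / 1024:.0f} MB")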

This issue probably has the same root cause as #37630.

There is a script that can be used to generate the dataset for the repro, but it has permissioned access (due to company policy); I'm happy to give permission to whoever is looking into this:
https://github.com/twosigma/bamboo-streaming/blob/master/notebooks/generate_parquet_test_data.ipynb

Component(s)

Parquet, C++

@icexelloss icexelloss changed the title Metadata related memory leak when reading parquet dataset [C++] Metadata related memory leak when reading parquet dataset Jan 21, 2025
@icexelloss
Contributor Author

icexelloss commented Jan 21, 2025

Data can be generated with

python -m datagen.batch_process \
    BatchStreamParquetFileWriter \
    test.parquet \
    TableMDGenerator \
    '{"begin_date": "2023-01-01", "end_date": "2023-12-31", "seed": 1, "freq": "1d", "ids": 1, "cols": 10000}'

and the code in the notebook above

@pitrou
Member

pitrou commented Jan 21, 2025

I haven't tried to track down the precise source of memory consumption (yet?), but some quick comments already:

When running the code above with "time -v", it shows memory usage of about 6 GB, which is significantly larger than the data loaded, so I think there is some metadata-related memory leak

A quick back-of-the-envelope calculation says that this is roughly 2 kB per column per file.

I also noticed that the memory usage increases if I use longer column names: e.g., if I prepend a 128-character prefix to the column names, the memory usage is about 11 GB.

Interesting data point. That would be 4 kB per column per file, so quite a bit of additional overhead just for 128 additional characters...
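
(For reference, the arithmetic behind both estimates, taking the ~6 GB and ~11 GB figures quoted above as given:)

files = 260
columns = 10_000

# Short column names: ~6 GB total memory usage.
print(6e9 / (files * columns))   # ~2300 bytes per column per file

# With a 128-character prefix on each column name: ~11 GB total.
print(11e9 / (files * columns))  # ~4200 bytes per column per file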

Each partition has a single row, and 10k double columns.

I would stress that "a single row and 10k columns" is never going to be a good use case for Parquet, which is designed from the ground up as a columnar format. If you're storing less than e.g. 1k rows (regardless of the number of columns), the format will certainly impose a lot of overhead.

Of course, we can still try to find out if there's some low-hanging fruit that would allow reducing the memory usage of metadata.

@icexelloss
Copy link
Contributor Author

icexelloss commented Jan 21, 2025

A quick back-of-the-envelope calculation says that this is roughly 2 kB per column per file.

I was expecting the metadata memory usage to be O(C), where C=number_columns, instead of O(C * F), where C=number_columns and F=number_files. Once a Parquet file is loaded into a PyArrow Table, we shouldn't need to keep its metadata around (all files have the same schema), but perhaps I am misunderstanding how reading Parquet works.
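
(One hypothetical way to check whether per-file metadata retention is the culprit would be to read the files one at a time and concatenate, so that each file's metadata can be released as soon as that file has been read. This is only a diagnostic sketch: it bypasses the dataset API and drops the hive partition columns (year/month/day), so it is not a real workaround.)

import glob

import pyarrow as pa
import pyarrow.parquet as pq

paths = sorted(
    glob.glob(
        "bamboo-streaming-parquet-test-data/10000col_2_short_name/**/*.parquet",
        recursive=True,
    )
)
# Read each file independently; its metadata can be freed once the table is built.
tables = [pq.read_table(p, columns=["time", "id"]) for p in paths]
t = pa.concat_tables(tables)
print(t.num_rows)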

Interesting data point. That would be 4 kB per column per file, so quite a bit of additional overhead just for 128 additional characters...

Yeah, it certainly feels like there are multiple copies of the column-name strings, even though all files/partitions have the same schema.

I would stress that "a single row and 10k columns" is never going to be a good use case for Parquet

Yeah, this is an extreme case just to show the repro. In practice there are a couple of thousand rows per file.

Of course, we can still try to find out if there's some low-hanging fruit that would allow reducing the memory usage of metadata.

It would be great to reduce metadata memory usage when the files being read all have the same schema, since this is quite a common case, I think.

@pitrou
Copy link
Member

pitrou commented Jan 22, 2025

I was expecting the metadata memory usage to be O(C), where C=number_columns, instead of O(C * F), where C=number_columns and F=number_files. Once a Parquet file is loaded into a PyArrow Table, we shouldn't need to keep its metadata around (all files have the same schema), but perhaps I am misunderstanding how reading Parquet works.

Hmm, this needs clarifying a bit then :) What do the memory usage numbers you posted represent? Is it peak memory usage? Is it memory usage after loading the dataset as an Arrow table? Is the dataset object still alive at that point?

It would be great to reduce metadata memory usage when the files being read all have the same schema, since this is quite a common case, I think.

Definitely.

@pitrou
Copy link
Member

pitrou commented Jan 22, 2025

Yeah, this is an extreme case just to show the repro. In practice there are a couple of thousand rows per file.

How many row groups per file (or rows per row group)? It turns out much of the Parquet metadata consumption is in ColumnChunk entries. A Thrift-deserialized ColumnChunk is 640 bytes long, and there are O(C * R * F) ColumnChunks in your dataset, with C=number_columns, R=number_row_groups_per_file and F=number_files.
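
(To illustrate with the repro dataset above, assuming a single row group per file, which matches the single-row-per-file layout:)

columns = 10_000
row_groups_per_file = 1   # assumption: one row per file -> one row group per file
files = 260
column_chunk_bytes = 640  # size of a Thrift-deserialized ColumnChunk struct

total = columns * row_groups_per_file * files * column_chunk_bytes
print(total / 1e9)  # ~1.7 GB for ColumnChunk structs alone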
