Arrow schema is missing from the parquet metadata, for files written by ParquetSink. #11770

wiedld · 2024-08-01T18:49:01Z

Describe the bug

We have been using two parquet writers: ArrowWriter and ParquetSink (parallelized writes). We discovered that ArrowWriter includes the arrow schema (by default) in the parquet metadata on write, whereas datafusion's ParquetSink does not include the arrow schema in the file metadata (a.k.a. it's missing here). This arrow schema metadata is important, as its inclusion aids with later reading.

To Reproduce

Write parquet with ParquetSink.
Write parquet with ArrowWriter (default options).
Attempt to read the arrow schema from the parquet metadata, using the below/linked APIs:

// `FileMetaData` is the parquet crate's type (parquet::file::metadata)
let file_metadata: FileMetaData = <get from file per API>;

let arrow_schema = parquet_to_arrow_schema(
    file_metadata.schema_descr(),
    file_metadata.key_value_metadata(),
);

An error is returned for parquet written by ParquetSink.
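
A quick way to tell the two kinds of files apart is to look for the `ARROW:schema` key in the file's key-value metadata, which is where the arrow writer stores the encoded schema. The sketch below models that check with plain Rust types so it runs without dependencies. The `KeyValue` struct is a stand-in for the parquet crate's type of the same name (assumed here to have just these two fields), and `has_embedded_arrow_schema` is a hypothetical helper, not an existing API.

```rust
/// Stand-in for the parquet crate's `KeyValue` (assumption: same two fields).
#[derive(Debug, Clone, PartialEq)]
pub struct KeyValue {
    pub key: String,
    pub value: Option<String>,
}

/// Metadata key under which the arrow writer stores the encoded arrow schema.
pub const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";

/// Hypothetical helper: returns true if the key-value metadata carries a
/// non-empty entry under `ARROW:schema`.
pub fn has_embedded_arrow_schema(key_value_metadata: Option<&Vec<KeyValue>>) -> bool {
    key_value_metadata
        .map(|kvs| {
            kvs.iter().any(|kv| {
                kv.key == ARROW_SCHEMA_META_KEY
                    && kv.value.as_deref().map_or(false, |v| !v.is_empty())
            })
        })
        .unwrap_or(false)
}
```

With the real crate, the argument would presumably come from `file_metadata.key_value_metadata()`; per this report, the check would fail for files written by ParquetSink and pass for files written by ArrowWriter under defaults.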

Expected behavior

Parquet written by ParquetSink should have the same default behavior (to include the arrow schema in the parquet metadata) as the ArrowWriter.

Additional context

No response

wiedld · 2024-08-01T18:49:09Z

take

wiedld · 2024-08-01T19:00:04Z

Also, some proposed follow-up work (to handle the general case):

Consider whether we have a testing gap for ParquetSink:

Do we need more e2e tests to ensure that the parquet encoders all encode uniformly?
e.g. parquet encoded by either ArrowWriter (under defaults) or ParquetSink (under defaults) should be identical.
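
One possible shape for such a uniformity check is to compare the metadata keys each writer emits and report anything the reference writer wrote that the other did not. This is only a sketch under assumptions: `missing_metadata_keys` is a hypothetical helper, the metadata is modeled as plain `(key, value)` string pairs rather than the parquet crate's types, and a real test would read both lists from files written by each writer.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: metadata keys written by the reference writer
/// that are absent from the candidate writer's output, sorted and deduplicated.
pub fn missing_metadata_keys(
    reference: &[(String, String)],
    candidate: &[(String, String)],
) -> Vec<String> {
    // Collect the candidate's keys once so each lookup is cheap.
    let candidate_keys: BTreeSet<&str> =
        candidate.iter().map(|(k, _)| k.as_str()).collect();

    let mut missing: Vec<String> = reference
        .iter()
        .map(|(k, _)| k.as_str())
        .filter(|k| !candidate_keys.contains(k))
        .map(str::to_owned)
        .collect();
    missing.sort();
    missing.dedup();
    missing
}
```

An e2e test could then assert that the result is empty when ArrowWriter (under defaults) is the reference and ParquetSink (under defaults) is the candidate. Comparing keys only is a weaker check than byte-for-byte identical files, but it would have caught the missing arrow schema.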

alamb · 2024-08-01T19:47:52Z

BTW I think you can write a test for the embedded schema in a parquet file using the describe command

DataFusion CLI v40.0.0
> copy (values (1)) to '/tmp/foo.parquet';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.051 seconds.

> describe '/tmp/foo.parquet';
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| column1     | Int64     | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.

Perhaps we could extend some of the tests in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/copy.slt with describe as a way to verify your fix.

kylebarron · 2024-12-19T22:42:30Z

This was hit in datafusion-python here: apache/datafusion-python#957

I'd argue this is an important bug, because there are a bunch of Arrow types that aren't representable as a Parquet logical type. Additionally, this isn't just a case of a wrong default: there's no option today to force the Parquet writer to include the Arrow metadata.

@wiedld did you get anywhere with this issue?

wiedld · 2024-12-20T06:04:52Z

Since we did our own hack around it, I wasn't able to justify prioritizing the fix.

Seems like that has changed. I'll queue it up (fyi to @alamb).

wiedld added the bug (Something isn't working) label Aug 1, 2024

wiedld mentioned this issue Aug 1, 2024

Improve ability for users to create their own writers using the low level parquet APIs. apache/arrow-rs#6177

Open

kylebarron mentioned this issue Dec 19, 2024

fixed size list type is not retained when writing to parquet apache/datafusion-python#957

Open

wiedld linked a pull request Dec 21, 2024 that will close this issue

WIP: ParquetSink should be aware of arrow schema encoding for the file metadata. #13866

Draft
