Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow schema is missing from the parquet metadata, for files written by ParquetSink. #11770

Open
wiedld opened this issue Aug 1, 2024 · 5 comments · May be fixed by #13866
Open

Arrow schema is missing from the parquet metadata, for files written by ParquetSink. #11770

wiedld opened this issue Aug 1, 2024 · 5 comments · May be fixed by #13866
Labels
bug Something isn't working

Comments

@wiedld
Copy link
Contributor

wiedld commented Aug 1, 2024

Describe the bug

We have been using two parquet writers: ArrowWriter vs ParquetSink (parallelized writes). We discovered a bug where the ArrowWriter includes the arrow schema (by default) in the parquet metadata on write. Whereas datafusion's ParquetSink does not include the arrow schema in the file metadata (a.k.a. it's missing here). This missing arrow schema metadata is important, as it's inclusion aids with later reading.

To Reproduce

  1. Write parquet with ParquetSink.
  2. Write parquet with ArrowWriter (default options).
  3. Attempt to read the arrow schema from the parquet metadata, using the below/linked APIs:

let file_metadata: FileMetadata = <get from file per API>;

let arrow_schema = parquet_to_arrow_schema(
file_metadata.schema_descr(),
file_metadata.key_value_metadata(),
);

  1. An error is returned for parquet written by ParquetSink.

Expected behavior

Parquet written by ParquetSink should have the same default behavior (to include the arrow schema in the parquet metadata) as the ArrowWriter.

Additional context

No response

@wiedld wiedld added the bug Something isn't working label Aug 1, 2024
@wiedld
Copy link
Contributor Author

wiedld commented Aug 1, 2024

take

@wiedld
Copy link
Contributor Author

wiedld commented Aug 1, 2024

Also, a proposed followup work (to handle the general case):

Consider whether we have a testing gap for ParquetSink:

  • do we need to have more e2es which ensure that the parquet encoders all encode uniformly?
  • e.g. parquet encoded by either ArrowWriter (under defaults) or ParquetSink (under defaults) should be identical.

@alamb
Copy link
Contributor

alamb commented Aug 1, 2024

BTW I think you can write a test for the embedded schema in a parquet file using the describe command

DataFusion CLI v40.0.0
> copy (values (1)) to '/tmp/foo.parquet';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.051 seconds.

> describe '/tmp/foo.parquet';
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| column1     | Int64     | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.

Perhaps we could extend some of the tests in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/copy.slt with describe as a way to verify your fix

@kylebarron
Copy link
Contributor

kylebarron commented Dec 19, 2024

This was hit in datafusion-python here: apache/datafusion-python#957

I'd argue this is an important bug, because there are a bunch of Arrow types that aren't representable as a Parquet logical type. Additionally it's not the case of a wrong default, there's no option today to force the Parquet writer to include the Arrow metadata.

@wiedld did you get anywhere with this issue?

@wiedld
Copy link
Contributor Author

wiedld commented Dec 20, 2024

Since we did our own hack around it, I wasn't able to justify prioritizing the fix.

Seems like that has changed. I'll queue it up (fyi to @alamb ).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants