-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arrow schema is missing from the parquet metadata, for files written by ParquetSink. #11770
Comments
take |
Also, a proposed followup work (to handle the general case): Consider whether we have a testing gap for ParquetSink:
|
BTW I think you can write a test for the embedded schema in a parquet file using the DataFusion CLI v40.0.0
> copy (values (1)) to '/tmp/foo.parquet';
+-------+
| count |
+-------+
| 1 |
+-------+
1 row(s) fetched.
Elapsed 0.051 seconds.
> describe '/tmp/foo.parquet';
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| column1 | Int64 | YES |
+-------------+-----------+-------------+
1 row(s) fetched. Perhaps we could extend some of the tests in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/copy.slt with |
This was hit in datafusion-python here: apache/datafusion-python#957 I'd argue this is an important bug, because there are a bunch of Arrow types that aren't representable as a Parquet logical type. Additionally it's not the case of a wrong default, there's no option today to force the Parquet writer to include the Arrow metadata. @wiedld did you get anywhere with this issue? |
Since we did our own hack around it, I wasn't able to justify prioritizing the fix. Seems like that has changed. I'll queue it up (fyi to @alamb ). |
take |
Describe the bug
We have been using two parquet writers: ArrowWriter vs ParquetSink (parallelized writes). We discovered a bug where the ArrowWriter includes the arrow schema (by default) in the parquet metadata on write. Whereas datafusion's ParquetSink does not include the arrow schema in the file metadata (a.k.a. it's missing here). This missing arrow schema metadata is important, as it's inclusion aids with later reading.
To Reproduce
let file_metadata: FileMetadata = <get from file per API>;
let arrow_schema = parquet_to_arrow_schema(
file_metadata.schema_descr(),
file_metadata.key_value_metadata(),
);
Expected behavior
Parquet written by ParquetSink should have the same default behavior (to include the arrow schema in the parquet metadata) as the ArrowWriter.
Additional context
No response
The text was updated successfully, but these errors were encountered: