
Improve ability for users to create their own writers using the low level parquet APIs. #6177

Open
wiedld opened this issue Aug 1, 2024 · 1 comment
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@wiedld
Contributor

wiedld commented Aug 1, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We have been using at least two parquet writers that both utilize the low-level APIs provided by the parquet crate (e.g. SerializedFileWriter). One of the writers (ArrowWriter) is provided as part of the parquet crate itself, whereas the other, datafusion's parallel ParquetSink, is not. In both cases, however, we later read the resulting files using the parquet crate's readers.

The challenge is that we keep encountering unexpected differences in the parquet files written by these two writers. The most recent example: the embedded arrow schema is missing from files produced by datafusion's parallel writer, whereas ArrowWriter includes it on write.

Describe the solution you'd like

Can we update the lower-level APIs (in the parquet crate) to make it easier for users to create their own parquet writers -- without encountering surprising differences from the behavior of parquet's ArrowWriter? Provide better documentation? Provide guidance for testing when creating your own parquet writer?

Describe alternatives you've considered

Alternatively, we could consider this problem the responsibility of users who create their own writers. We already plan to file a datafusion ticket proposing integration tests to ensure byte equivalency between its output parquet and parquet written by ArrowWriter.

Additional context

This is not the first time we have discovered differences between the parquet output of ArrowWriter and datafusion's ParquetSink. However, we are unclear on the best way to divide responsibility between more testing (on the user side) and API design (in the parquet crate).

@wiedld wiedld added the enhancement (Any new improvement worthy of an entry in the changelog) label Aug 1, 2024
@alamb
Contributor

alamb commented Aug 1, 2024

In particular DataFusion uses the ArrowColumnWriter: https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html

The example on that struct's documentation includes:

```rust
// Compute the parquet schema
let parquet_schema = arrow_to_parquet_schema(schema.as_ref()).unwrap();
let props = Arc::new(WriterProperties::default());
```

This does not encode the arrow schema in the metadata of `props` the way that ArrowWriter does:
https://docs.rs/parquet/latest/src/parquet/arrow/arrow_writer/mod.rs.html#188-191

At least one reason it doesn't is that the `add_encoded_arrow_schema_to_metadata` function is not public:

https://docs.rs/parquet/latest/parquet/?search=add_encoded_arrow_schema_to_metadata

(Screenshot of the docs.rs search results, 2024-08-01)
