WIP: ParquetSink should be aware of arrow schema encoding for the file metadata. #13866

Draft
wants to merge 7 commits into
base: main
Conversation

@wiedld wiedld commented Dec 21, 2024

WIP: I still have a test to debug -- it looks like partitioning needs to be checked.

Which issue does this PR close?

Closes #11770

Rationale for this change

The ArrowWriter with its default ArrowWriterOptions will encode the arrow schema into the parquet kv_metadata, unless explicitly told to skip it. Skipping is done via ArrowWriterOptions::with_skip_arrow_metadata.

In datafusion's ParquetSink, we can write in either single-threaded or parallelized mode. In single-threaded mode, we use the default ArrowWriterOptions and the arrow schema is inserted into the file kv_metadata. However, when performing parallelized writes we do not use the ArrowWriter and instead rely upon the SerializedFileWriter. As a result, the arrow schema metadata is missing from the parquet files (see the issue ticket).

ArrowWriterOptions vs WriterProperties

The SerializedFileWriter, along with the other associated writers, relies upon WriterProperties. The WriterProperties differ from the ArrowWriterOptions only in the skip_arrow_metadata and schema_root fields:

pub struct ArrowWriterOptions {
    properties: WriterProperties,
    skip_arrow_metadata: bool,
    schema_root: Option<String>,
}

The skip_arrow_metadata config is only used to decide if the schema should be added to the WriterProperties.kv_metadata.

Proposed Solution

Since we have WriterProperties, not ArrowWriterOptions, I focused on solutions which construct the proper WriterProperties.kv_metadata (with or without the arrow schema).

Our established pattern is to take the TableParquetOptions configuration and build WriterProperties from it. Therefore I updated those conversion methods to account for arrow schema insertion.

What changes are included in this PR?

  • add a new configuration ParquetOptions.skip_arrow_metadata
  • have ParquetSink single-threaded writes, which use the ArrowWriter, respect this configuration
  • have ParquetSink multi-threaded (parallelized) writes, which use WriterProperties, respect this configuration

Are these changes tested?

Yes.

Are there any user-facing changes?

We have new APIs:

  • ParquetOptions.skip_arrow_metadata configuration
  • deprecate ParquetWriterOptions::try_from(&TableParquetOptions) and replace it with methods which explicitly handle the arrow schema

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate proto Related to proto crate labels Dec 21, 2024
@wiedld wiedld force-pushed the 11770/parquet-sink-metadata branch from 93278ee to 1a9da6f Compare December 21, 2024 01:10
Comment on lines +61 to +66
impl TableParquetOptions {
    #[deprecated(
        since = "44.0.0",
        note = "Please use `TableParquetOptions::into_writer_properties_builder` and `TableParquetOptions::into_writer_properties_builder_with_arrow_schema`"
    )]
    pub fn try_from(table_opts: &TableParquetOptions) -> Result<ParquetWriterOptions> {
I cannot deprecate trait impls. Instead, this at least gives a deprecation notice under some conditions.

Comment on lines +165 to +170
/// Encodes the Arrow schema into the IPC format, and base64 encodes it
///
/// TODO: make arrow schema encoding available in a public API.
/// i.e. make the currently private `add_encoded_arrow_schema_to_metadata` and `encode_arrow_schema` public.
/// <https://github.com/apache/arrow-rs/blob/2908a80d9ca3e3fb0414e35b67856f1fb761304c/parquet/src/arrow/schema/mod.rs#L172-L221>
fn encode_arrow_schema(schema: &Arc<Schema>) -> String {
If we are in agreement on need, I'll go make the arrow-rs PR.
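For reference, the framing step that the private arrow-rs helper applies before base64-encoding can be sketched with the standard library alone: the IPC-serialized schema message is prefixed with a 0xFFFFFFFF continuation marker and a little-endian length. The IPC serialization itself (arrow's IpcDataGenerator) and the base64 step are elided here; names are illustrative:

```rust
/// Prefix an IPC-serialized schema message with the legacy stream framing:
/// a 0xFFFFFFFF continuation marker followed by the little-endian length.
fn frame_ipc_message(ipc_message: &[u8]) -> Vec<u8> {
    let mut framed = Vec::with_capacity(ipc_message.len() + 8);
    framed.extend_from_slice(&[0xFF, 0xFF, 0xFF, 0xFF]); // continuation marker
    framed.extend_from_slice(&(ipc_message.len() as u32).to_le_bytes()); // length
    framed.extend_from_slice(ipc_message); // the flatbuffer schema message
    framed
}

fn main() {
    // Stand-in bytes; in the real helper these come from arrow's IPC
    // serializer, and the framed result is then base64-encoded.
    let framed = frame_ipc_message(b"schema-bytes");
    assert_eq!(&framed[..4], &[0xFF, 0xFF, 0xFF, 0xFF][..]);
    assert_eq!(&framed[4..8], &12u32.to_le_bytes()[..]);
}
```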

Comment on lines 2347 to +2353
async fn parquet_sink_write() -> Result<()> {
    let parquet_sink = create_written_parquet_sink("file:///").await?;

    // assert written
    let mut written = parquet_sink.written();
    let written = written.drain();
    assert_eq!(
        written.len(),
        1,
        "expected a single parquet file to be written, instead found {}",
        written.len()
    );

    // assert written to proper path
    let (path, file_metadata) = get_written(parquet_sink)?;
    let path_parts = path.parts().collect::<Vec<_>>();
    assert_eq!(path_parts.len(), 1, "should not have path prefix");
I refactored the tests first, into a series of helpers. May be easier to review in this commit: 09004c5

}

#[tokio::test]
async fn parquet_sink_parallel_write() -> Result<()> {
New test added.

}

#[tokio::test]
async fn parquet_sink_write_insert_schema_into_metadata() -> Result<()> {
New test added.

@wiedld wiedld force-pushed the 11770/parquet-sink-metadata branch from 1a9da6f to 0b960d9 Compare December 21, 2024 01:53
@wiedld wiedld changed the title ParquetSink should be aware of arrow schema encoding (configurable) in the file metadata. WIP: ParquetSink should be aware of arrow schema encoding (configurable) in the file metadata. Dec 21, 2024
@wiedld wiedld changed the title WIP: ParquetSink should be aware of arrow schema encoding (configurable) in the file metadata. WIP: ParquetSink should be aware of arrow schema encoding for the file metadata. Dec 21, 2024
Successfully merging this pull request may close these issues.

Arrow schema is missing from the parquet metadata, for files written by ParquetSink.