WIP: ParquetSink should be aware of arrow schema encoding for the file metadata. #13866

Draft
wants to merge 7 commits into
base: main
Conversation

@wiedld wiedld commented Dec 21, 2024

WIP: I still have a test to debug -- it looks like partitioning needs to be checked.

Which issue does this PR close?

Closes #11770

Rationale for this change

The ArrowWriter with its default ArrowWriterOptions will encode the arrow schema into the parquet kv_metadata, unless explicitly told to skip it. Skipping is done via ArrowWriterOptions::with_skip_arrow_metadata.

In datafusion's ParquetSink, we can write in either single-threaded or parallelized mode. In single-threaded mode, we use the default ArrowWriterOptions and the arrow schema is inserted into the file kv_metadata. However, when performing parallelized writes we do not use the ArrowWriter and instead rely upon the SerializedFileWriter. As a result, the arrow schema metadata is missing from the parquet files (see the issue ticket).

ArrowWriterOptions vs WriterProperties

The SerializedFileWriter, along with the other associated writers, relies upon WriterProperties. The WriterProperties differ from the ArrowWriterOptions only in the skip_arrow_metadata and schema_root fields:

pub struct ArrowWriterOptions {
    properties: WriterProperties,
    skip_arrow_metadata: bool,
    schema_root: Option<String>,
}

The skip_arrow_metadata config is only used to decide if the schema should be added to the WriterProperties.kv_metadata.

Proposed Solution

Since we have WriterProperties, not ArrowWriterOptions, I focused on solutions which construct the proper WriterProperties.kv_metadata (with or without the arrow schema).

Our established pattern is to take the TableParquetOptions configuration and build WriterProperties from it. Therefore I updated those conversion methods to account for arrow schema insertion.

What changes are included in this PR?

  • add a new configuration ParquetOptions.skip_arrow_metadata
  • have ParquetSink single-threaded writes, which use the ArrowWriter, respect this configuration
  • have ParquetSink multi-threaded (parallelized) writes, which use WriterProperties, respect this configuration

Are these changes tested?

Yes.

Are there any user-facing changes?

We have new APIs:

  • ParquetOptions.skip_arrow_metadata configuration
  • deprecate ParquetWriterOptions::try_from(&TableParquetOptions) and replace it with methods which explicitly handle the arrow schema

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate proto Related to proto crate labels Dec 21, 2024
@wiedld wiedld force-pushed the 11770/parquet-sink-metadata branch from 93278ee to 1a9da6f Compare December 21, 2024 01:10
Comment on lines +61 to +66
impl TableParquetOptions {
    #[deprecated(
        since = "44.0.0",
        note = "Please use `TableParquetOptions::into_writer_properties_builder` and `TableParquetOptions::into_writer_properties_builder_with_arrow_schema`"
    )]
    pub fn try_from(table_opts: &TableParquetOptions) -> Result<ParquetWriterOptions> {
I cannot deprecate trait impls. Instead, this at least gives a deprecation notice under some conditions.

Comment on lines +165 to +170
/// Encodes the Arrow schema into the IPC format, and base64 encodes it
///
/// TODO: make arrow schema encoding available in a public API.
/// i.e. make the currently private `add_encoded_arrow_schema_to_metadata` and `encode_arrow_schema` public.
/// <https://github.com/apache/arrow-rs/blob/2908a80d9ca3e3fb0414e35b67856f1fb761304c/parquet/src/arrow/schema/mod.rs#L172-L221>
fn encode_arrow_schema(schema: &Arc<Schema>) -> String {
If we are in agreement on need, I'll go make the arrow-rs PR.
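For reference, the framing step that the private arrow-rs helper applies before base64-encoding can be sketched with the standard library alone: the IPC-serialized schema message is prefixed with a 0xFFFFFFFF continuation marker and a little-endian length. The IPC serialization itself (arrow's IpcDataGenerator) and the base64 step are elided here; names are illustrative:

```rust
/// Prefix an IPC-serialized schema message with the legacy stream framing:
/// a 0xFFFFFFFF continuation marker followed by the little-endian length.
fn frame_ipc_message(ipc_message: &[u8]) -> Vec<u8> {
    let mut framed = Vec::with_capacity(ipc_message.len() + 8);
    framed.extend_from_slice(&[0xFF, 0xFF, 0xFF, 0xFF]); // continuation marker
    framed.extend_from_slice(&(ipc_message.len() as u32).to_le_bytes()); // length
    framed.extend_from_slice(ipc_message); // the flatbuffer schema message
    framed
}

fn main() {
    // Stand-in bytes; in the real helper these come from arrow's IPC
    // serializer, and the framed result is then base64-encoded.
    let framed = frame_ipc_message(b"schema-bytes");
    assert_eq!(&framed[..4], &[0xFF, 0xFF, 0xFF, 0xFF][..]);
    assert_eq!(&framed[4..8], &12u32.to_le_bytes()[..]);
}
```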

Comment on lines 2347 to +2353
async fn parquet_sink_write() -> Result<()> {
    let parquet_sink = create_written_parquet_sink("file:///").await?;

    // assert written
    let mut written = parquet_sink.written();
    let written = written.drain();
    assert_eq!(
        written.len(),
        1,
        "expected a single parquet file to be written, instead found {}",
        written.len()
    );

    // assert written to proper path
    let (path, file_metadata) = get_written(parquet_sink)?;
    let path_parts = path.parts().collect::<Vec<_>>();
    assert_eq!(path_parts.len(), 1, "should not have path prefix");
I refactored the tests first, into a series of helpers. May be easier to review in this commit: 09004c5

}

#[tokio::test]
async fn parquet_sink_parallel_write() -> Result<()> {
New test added.

}

#[tokio::test]
async fn parquet_sink_write_insert_schema_into_metadata() -> Result<()> {
New test added.

@wiedld wiedld force-pushed the 11770/parquet-sink-metadata branch from 1a9da6f to 0b960d9 Compare December 21, 2024 01:53
@wiedld wiedld changed the title ParquetSink should be aware of arrow schema encoding (configurable) in the file metadata. WIP: ParquetSink should be aware of arrow schema encoding (configurable) in the file metadata. Dec 21, 2024
@wiedld wiedld changed the title WIP: ParquetSink should be aware of arrow schema encoding (configurable) in the file metadata. WIP: ParquetSink should be aware of arrow schema encoding for the file metadata. Dec 21, 2024
Successfully merging this pull request may close these issues.

Arrow schema is missing from the parquet metadata, for files written by ParquetSink.