WIP: ParquetSink should be aware of arrow schema encoding for the file metadata. #13866
base: main
Conversation
Force-pushed from 93278ee to 1a9da6f
impl TableParquetOptions {
    #[deprecated(
        since = "44.0.0",
        note = "Please use `TableParquetOptions::into_writer_properties_builder` and `TableParquetOptions::into_writer_properties_builder_with_arrow_schema`"
    )]
    pub fn try_from(table_opts: &TableParquetOptions) -> Result<ParquetWriterOptions> {
I cannot deprecate trait impls. Instead, this at least gives a deprecation notice under some conditions.
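For reviewers unfamiliar with the limitation: rustc ignores `#[deprecated]` on trait impl blocks, so one common workaround is to keep the `TryFrom` impl but have it forward to a deprecated inherent function. A minimal, self-contained sketch of that pattern (simplified stand-in types, not the exact PR code):

// Simplified stand-ins for the real datafusion types (sketch only).
struct TableParquetOptions;
struct ParquetWriterOptions;
type Result<T> = std::result::Result<T, String>; // stand-in for datafusion's Result

impl TableParquetOptions {
    /// Deprecated inherent constructor: direct callers get the warning.
    #[deprecated(
        since = "44.0.0",
        note = "use the `into_writer_properties_builder*` methods instead"
    )]
    pub fn try_from(_table_opts: &TableParquetOptions) -> Result<ParquetWriterOptions> {
        Ok(ParquetWriterOptions)
    }
}

// `#[deprecated]` on this impl block would be ignored by rustc, so the trait
// impl simply forwards to the deprecated inherent fn (allowing the lint locally).
impl TryFrom<&TableParquetOptions> for ParquetWriterOptions {
    type Error = String;

    fn try_from(opts: &TableParquetOptions) -> Result<ParquetWriterOptions> {
        #[allow(deprecated)]
        TableParquetOptions::try_from(opts)
    }
}

With this shape, code calling `TableParquetOptions::try_from(...)` directly sees the deprecation warning, while uses going through the `TryFrom` trait do not -- hence "under some conditions".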
/// Encodes the Arrow schema into the IPC format, and base64 encodes it
///
/// TODO: make arrow schema encoding available in a public API, i.e. make the
/// currently private `add_encoded_arrow_schema_to_metadata` and `encode_arrow_schema` public.
/// <https://github.com/apache/arrow-rs/blob/2908a80d9ca3e3fb0414e35b67856f1fb761304c/parquet/src/arrow/schema/mod.rs#L172-L221>
fn encode_arrow_schema(schema: &Arc<Schema>) -> String {
If we are in agreement on need, I'll go make the arrow-rs PR.
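For context, the private arrow-rs helper serializes the schema as an IPC schema message, prepends the legacy length prefix, and base64-encodes the result. A rough sketch of that logic, assuming the public `IpcDataGenerator::schema_to_bytes` API and the `base64` crate (exact details, e.g. dictionary-tracker handling, vary between arrow-rs versions):

use std::sync::Arc;

use arrow::datatypes::Schema;
use arrow::ipc::writer::{IpcDataGenerator, IpcWriteOptions};
use base64::prelude::*;

/// Serialize the Arrow schema as an IPC schema message and base64-encode it,
/// roughly mirroring the private arrow-rs helper.
fn encode_arrow_schema(schema: &Arc<Schema>) -> String {
    let options = IpcWriteOptions::default();
    let data_gen = IpcDataGenerator::default();
    let mut encoded = data_gen.schema_to_bytes(schema, &options);

    // Legacy IPC framing: continuation marker (0xFFFFFFFF) + little-endian length.
    let schema_len = encoded.ipc_message.len();
    let mut framed = Vec::with_capacity(schema_len + 8);
    framed.extend_from_slice(&[0xFF, 0xFF, 0xFF, 0xFF]);
    framed.extend_from_slice(&(schema_len as u32).to_le_bytes());
    framed.append(&mut encoded.ipc_message);

    // The resulting string is stored in the parquet key/value metadata under
    // the "ARROW:schema" key (parquet::arrow::ARROW_SCHEMA_META_KEY).
    BASE64_STANDARD.encode(&framed)
}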
async fn parquet_sink_write() -> Result<()> {
    let parquet_sink = create_written_parquet_sink("file:///").await?;

    // assert written
    let mut written = parquet_sink.written();
    let written = written.drain();
    assert_eq!(
        written.len(),
        1,
        "expected a single parquet file to be written, instead found {}",
        written.len()
    );

    // assert written to proper path
    let (path, file_metadata) = get_written(parquet_sink)?;
    let path_parts = path.parts().collect::<Vec<_>>();
    assert_eq!(path_parts.len(), 1, "should not have path prefix");
I refactored the tests first into a series of helpers. It may be easier to review in this commit: 09004c5
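For orientation, the extracted helpers look roughly like the following; `get_written` is reconstructed here from the test context above and assumes `ParquetSink::written()` returns a `HashMap<Path, FileMetaData>` (a sketch, not the exact PR code):

use std::sync::Arc;

use datafusion::datasource::file_format::parquet::ParquetSink;
use datafusion::error::Result;
use datafusion::parquet::format::FileMetaData;
use object_store::path::Path;

/// Pull out the single (path, file metadata) pair recorded by the sink.
fn get_written(parquet_sink: Arc<ParquetSink>) -> Result<(Path, FileMetaData)> {
    let mut written = parquet_sink.written();
    let mut written = written.drain();
    assert_eq!(
        written.len(),
        1,
        "expected a single parquet file to be written, instead found {}",
        written.len()
    );
    let (path, file_metadata) = written.next().expect("one written file");
    Ok((path, file_metadata))
}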
}

#[tokio::test]
async fn parquet_sink_parallel_write() -> Result<()> {
New test added.
}

#[tokio::test]
async fn parquet_sink_write_insert_schema_into_metadata() -> Result<()> {
New test added.
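A small helper like the following (a sketch, using the thrift-level `parquet::format::FileMetaData`) is enough for the new test to check that the encoded schema is present in the footer metadata -- and absent when `skip_arrow_metadata` is set:

use datafusion::parquet::format::FileMetaData;

/// Returns true if the parquet footer's key/value metadata carries the
/// base64-encoded Arrow schema (stored under the standard "ARROW:schema" key).
fn has_arrow_schema_metadata(file_metadata: &FileMetaData) -> bool {
    file_metadata
        .key_value_metadata
        .as_ref()
        .map(|kvs| {
            kvs.iter()
                .any(|kv| kv.key == "ARROW:schema" && kv.value.is_some())
        })
        .unwrap_or(false)
}

The test can then assert `has_arrow_schema_metadata(&file_metadata)` holds by default, and that it is false when `ParquetOptions.skip_arrow_metadata` is enabled.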
…ady writing the arrow schema in the kv_meta, and allow disablement
…herent to TableParquetOptions and therefore we should explicitly make the API apparent that you have to include the arrow schema or not
…file metadata, based on the ParquetOptions
Force-pushed from 1a9da6f to 0b960d9
WIP: I still have a test to debug -- it looks like partitioning needs to be checked
Which issue does this PR close?
Closes #11770
Rationale for this change
The ArrowWriter with its default ArrowWriterOptions will encode the arrow schema in the parquet kv_metadata, unless explicitly skipped. Skipping is done via ArrowWriterOptions::with_skip_arrow_metadata.

In datafusion's ParquetSink, we can write in either single-threaded or parallelized mode. In single-threaded mode, we use the default ArrowWriterOptions and the arrow schema is inserted into the file kv_meta. However, when performing parallelized writes we do not use the ArrowWriter and instead rely upon the SerializedFileWriter. As a result, we are missing the arrow schema metadata in the parquet files (see the issue ticket).

ArrowWriterOptions vs WriterProperties
The SerializedFileWriter, along with the other associated writers, relies upon the WriterProperties. The WriterProperties differ from the ArrowWriterOptions only in terms of the skip_arrow_metadata and the schema_root: the skip_arrow_metadata config is only used to decide whether the schema should be added to the WriterProperties.kv_metadata.
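To make the distinction concrete, here is a hedged sketch of the two configuration surfaces: the ArrowWriter path toggles the behavior via `with_skip_arrow_metadata`, while the SerializedFileWriter path only sees `WriterProperties` and needs the encoded schema placed into kv_metadata explicitly (the encoded string being what the private arrow-rs helper produces):

use parquet::arrow::arrow_writer::ArrowWriterOptions;
use parquet::file::properties::WriterProperties;
use parquet::format::KeyValue;

/// Sketch of the two configuration surfaces described above.
fn writer_configs(encoded_arrow_schema: String) -> (ArrowWriterOptions, WriterProperties) {
    // Single-threaded path: ArrowWriter is driven by ArrowWriterOptions, which
    // embeds the arrow schema into the kv_metadata unless told to skip it.
    let arrow_opts = ArrowWriterOptions::new().with_skip_arrow_metadata(true);

    // Parallel path: SerializedFileWriter only sees WriterProperties, so the
    // base64 IPC schema has to be added to kv_metadata by whoever builds them.
    let props = WriterProperties::builder()
        .set_key_value_metadata(Some(vec![KeyValue::new(
            "ARROW:schema".to_string(),
            encoded_arrow_schema,
        )]))
        .build();

    (arrow_opts, props)
}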
.Proposed Solution
Since we have WriterProperties, not ArrowWriterOptions, I focused on solutions which construct the proper WriterProperties.kv_metadata (with or without the arrow schema). Our established pattern is to take TableParquetOptions configurations and provide WriterProperties from those, so I updated those conversion methods to account for arrow schema insertion (see the sketch after the change list below).

What changes are included in this PR?
ParquetOptions.skip_arrow_metadata
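As a usage sketch of the proposed conversion methods (the method names come from the deprecation note above; their exact signatures and error handling are assumptions here, not the final API):

use std::sync::Arc;

use datafusion::arrow::datatypes::Schema;
use datafusion::config::TableParquetOptions;
use datafusion::error::Result;
use datafusion::parquet::file::properties::WriterProperties;

/// Build WriterProperties from TableParquetOptions, embedding the arrow schema
/// in kv_metadata unless `skip_arrow_metadata` is requested. Signatures of the
/// `into_writer_properties_builder*` methods are assumed for illustration.
fn build_writer_props(
    opts: &TableParquetOptions,
    schema: &Arc<Schema>,
    skip_arrow_metadata: bool,
) -> Result<WriterProperties> {
    let builder = if skip_arrow_metadata {
        opts.into_writer_properties_builder()?
    } else {
        opts.into_writer_properties_builder_with_arrow_schema(schema)?
    };
    Ok(builder.build())
}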
Are these changes tested?
yes.
Are there any user-facing changes?
We have new APIs:
- new ParquetOptions.skip_arrow_metadata configuration
- deprecate ParquetWriterOptions::try_from(&TableParquetOptions) and replace it with methods which explicitly handle the arrow schema