Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use projected_table_schema for projection in DeltaSchemaAdapter #3068

Merged
merged 1 commit into from
Dec 20, 2024

Conversation

jkylling
Copy link
Contributor

Description

After upgrading from deltalake 0.20.1 to 0.22.3 it looks like Parquet column projection is broken when using DeltaTable::scan. Instead of scanning only the a single column, it looks like all columns are fetched from storage.

Inspection with a debugger reveals that the adapted_projections are wrong here:
https://github.com/apache/datafusion/blob/88f58bf929167c5c5e2250ad87caa88d4dff11e5/datafusion/core/src/datasource/physical_plan/parquet/opener.rs#L153-L159 The adapted_projections are obtained in

let mut projection = Vec::with_capacity(file_schema.fields().len());
for (file_idx, file_field) in file_schema.fields.iter().enumerate() {
if self.table_schema.fields().find(file_field.name()).is_some() {
projection.push(file_idx);
}
}
Ok((
Arc::new(SchemaMapping {
projected_schema: self.projected_table_schema.clone(),
table_schema: self.table_schema.clone(),
}),
projection,
))
Changing line 49 to use the projected_table_schema seems to solve the problem.

Opening this to see if we have test coverage.

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Dec 18, 2024
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@jkylling jkylling force-pushed the use-projected_table_schema branch 2 times, most recently from 1840bb7 to 35fdee8 Compare December 18, 2024 19:31
@jkylling
Copy link
Contributor Author

For reference, the corresponding code in Datafusion 42 is https://github.com/apache/datafusion/blob/a43ce8bd67e8d649e0e4f5260f8a5e6e10d62dbc/datafusion/core/src/datasource/physical_plan/parquet/opener.rs#L80-L82 When upgrading from 42 to 43 the meaning of table_schema was changed from being the projected schema, to be the full table schema, and a new argument projected_schema was added. This was mapped to projected_table_schema and table_schema in our code, but we forgot to update all uses of table_schema in the DeltaSchemaAdapter to use the projected_table_schema.

The changes to the SchemaAdapter API were introduced in apache/datafusion#12135

@rtyler rtyler self-assigned this Dec 19, 2024
@rtyler rtyler added the bug Something isn't working label Dec 19, 2024
@rtyler rtyler added this to the v0.23 milestone Dec 19, 2024
Copy link
Member

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have this merged one main goes green again

@rtyler rtyler enabled auto-merge December 19, 2024 14:18
@ion-elgreco ion-elgreco force-pushed the use-projected_table_schema branch from 35fdee8 to 43f2b8c Compare December 20, 2024 07:49
@ion-elgreco ion-elgreco changed the title Use projected_table_schema for projection in DeltaSchemaAdapter ix: use projected_table_schema for projection in DeltaSchemaAdapter Dec 20, 2024
@ion-elgreco ion-elgreco changed the title ix: use projected_table_schema for projection in DeltaSchemaAdapter fix: use projected_table_schema for projection in DeltaSchemaAdapter Dec 20, 2024
After upgrading from deltalake 0.20.1 to 0.22.3 it looks like Parquet
column projection is  broken when using DeltaTable::scan. Instead of
scanning only the a single column, it looks like all columns are
fetched from storage.

Inspection with a debugger revelas that the adapted_projections are
wrong here:
https://github.com/apache/datafusion/blob/88f58bf929167c5c5e2250ad87caa88d4dff11e5/datafusion/core/src/datasource/physical_plan/parquet/opener.rs#L153-L159
The adapted_projections are obtained in
https://github.com/delta-io/delta-rs/blob/5b2f46b06e0eb508f932a8b39feb11b568a78a32/crates/core/src/delta_datafusion/schema_adapter.rs#L46-L60
Changing line 49 to use the projected_table_schema seems to solve the
problem.

Signed-off-by: Jonas Irgens Kylling <[email protected]>
@ion-elgreco ion-elgreco force-pushed the use-projected_table_schema branch from 43f2b8c to 7a89c3b Compare December 20, 2024 08:11
auto-merge was automatically disabled December 20, 2024 08:11

Head branch was pushed to by a user without write access

Copy link

codecov bot commented Dec 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.42%. Comparing base (b241bf6) to head (7a89c3b).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3068      +/-   ##
==========================================
+ Coverage   72.40%   72.42%   +0.02%     
==========================================
  Files         128      128              
  Lines       41007    41012       +5     
  Branches    41007    41012       +5     
==========================================
+ Hits        29690    29703      +13     
+ Misses       9425     9424       -1     
+ Partials     1892     1885       -7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ion-elgreco ion-elgreco added this pull request to the merge queue Dec 20, 2024
Merged via the queue into delta-io:main with commit 2272ff7 Dec 20, 2024
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants