feat: improve partition pruning during file listing #238

Open
wants to merge 2 commits into
base: main

Member

@TheR1sing3un TheR1sing3un commented Jan 8, 2025

Description

  1. Improve partition pruning during file listing.
  2. Fix entries being added incorrectly to FileSystemView::partition_to_file_groups.

Closes #205

How are the changes test-covered

  • N/A
  • Automated tests (unit and/or integration tests)
  • Manual tests
    • Details are described below

codecov bot commented Jan 8, 2025

Codecov Report

Attention: Patch coverage is 94.73684% with 4 lines in your changes missing coverage. Please review.

Project coverage is 92.60%. Comparing base (1779e77) to head (d8b2792).

Files with missing lines             Patch %   Lines
crates/core/src/table/partition.rs   92.10%    3 Missing ⚠️
crates/core/src/table/fs_view.rs     97.36%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #238      +/-   ##
==========================================
+ Coverage   92.50%   92.60%   +0.10%     
==========================================
  Files          29       29              
  Lines        1361     1420      +59     
==========================================
+ Hits         1259     1315      +56     
- Misses        102      105       +3     


@TheR1sing3un
Member Author

Hi @xushiyan, could you review this when you have time? Thank you!

@xushiyan xushiyan self-assigned this Jan 8, 2025
@xushiyan xushiyan added the refactor Code refactoring without any functionality or behavior change label Jan 8, 2025
@xushiyan xushiyan added this to the release-0.3.0 milestone Jan 8, 2025
1. improve partition pruning during file listing
2. fix incorrect adding `FileSystemView::partition_to_file_groups`

style: clippy the code

1. clippy the code

style: fix code style

1. fix code style
2. add more partition-pruning unit tests
@TheR1sing3un TheR1sing3un force-pushed the improve_partition_listing branch from 6b21e72 to d8b2792 Compare January 9, 2025 02:33
@xushiyan xushiyan added performance rust Related to Rust codebase and removed refactor Code refactoring without any functionality or behavior change labels Jan 9, 2025
xushiyan commented Jan 9, 2025

Thanks @TheR1sing3un, I will go through this over the weekend.

@xushiyan xushiyan self-requested a review January 9, 2025 19:04
@xushiyan
Member

I'm looking into the conflicts, as I have more context.

@xushiyan xushiyan left a comment

@TheR1sing3un thanks for the good work. I cleaned up the relevant code paths in #251, which the improvement here can build upon. I left concrete ideas in the comments below.

Comment on lines 90 to 93
if partition_paths.is_empty() {
    // TODO: reconsider is it reasonable to add empty partition path? For partitioned table, we should return empty vec rather than vec with empty string
    partition_paths.push("".to_string())
}

Thanks for pointing this out; this should be improved, and I've made changes for it in #251. In short, we should return "" for non-partitioned tables, which aligns with the convention used in timeline commit metadata, where "" is also the key for the partition write stats. For a partitioned table that does not yet have any partition written, we should return an empty list, as you suggested.
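
The convention described above could be sketched roughly as follows. This is only an illustration, not the code from #251: the function name `normalize_partition_paths` and its `is_partitioned` flag are hypothetical.

```rust
/// Hypothetical helper (not part of hudi-rs): applies the convention that a
/// non-partitioned table is represented by the single partition path "",
/// while a partitioned table with nothing written yields an empty list.
fn normalize_partition_paths(is_partitioned: bool, found: Vec<String>) -> Vec<String> {
    if !is_partitioned {
        // Non-partitioned table: "" is the only partition path, matching the
        // key used for partition write stats in timeline commit metadata.
        return vec![String::new()];
    }
    // Partitioned table: return exactly what was found, possibly nothing.
    found
}
```

With this shape, callers never need to special-case a vec containing an empty string for partitioned tables.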

Comment on lines +99 to +104
async fn list_partition_paths_with_leveled_pruning(
    storage: &Storage,
    path: String,
    partition_pruner: &PartitionPruner,
    current_level: usize,
) -> Result<Vec<String>> {

Can this be merged with get_leaf_dirs(), so that we don't need two code paths, one that does multi-level pruning and one that does not? We can use multi-level pruning by default and return early when possible.

Check out this line https://github.com/apache/hudi-rs/pull/251/files#diff-a6c5a16b1d83ac9b5d186863a123fd1b273daf8710c12444c9f1b6f6fec96914R140

We probably need to move get_leaf_dirs() into FileLister, so that it is the lister's responsibility to use the pruner to check partition segments.

PartitionPruner owns the partition schema, so we should make use of it to understand which level is currently being parsed and how many levels there should be.

To make the code logic easy to understand, an iterative approach may be better than a recursive one here: just loop through all the top-level dirs and go down according to the schema.
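
A minimal sketch of the iterative idea, under stated assumptions: `LeveledPruner`, `DirTree`, and `list_partition_paths` are hypothetical stand-ins, not the hudi-rs `PartitionPruner`, `Storage`, or `FileLister` APIs. The pruner here only supports an optional exact-match filter per partition level, and storage is replaced by an in-memory map so the example stays synchronous and self-contained.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a pruner that knows the partition schema:
/// one optional required segment per partition level (None = no filter).
struct LeveledPruner {
    required: Vec<Option<String>>,
}

impl LeveledPruner {
    /// Number of partition levels according to the (assumed) schema.
    fn num_levels(&self) -> usize {
        self.required.len()
    }

    /// Whether a directory segment at the given level survives pruning.
    fn should_include(&self, level: usize, segment: &str) -> bool {
        match &self.required[level] {
            Some(want) => want.as_str() == segment,
            None => true,
        }
    }
}

/// In-memory stand-in for storage: directory path -> child directory names.
/// The table root is keyed by "".
type DirTree = BTreeMap<String, Vec<String>>;

/// Walks the tree one partition level at a time, pruning at every level.
fn list_partition_paths(tree: &DirTree, pruner: &LeveledPruner) -> Vec<String> {
    if pruner.num_levels() == 0 {
        // Non-partitioned table: "" is the single partition path.
        return vec![String::new()];
    }
    let mut frontier = vec![String::new()];
    for level in 0..pruner.num_levels() {
        let mut next = Vec::new();
        for parent in &frontier {
            let children = tree.get(parent).map(Vec::as_slice).unwrap_or(&[]);
            for child in children {
                if pruner.should_include(level, child) {
                    if parent.is_empty() {
                        next.push(child.clone());
                    } else {
                        next.push(format!("{}/{}", parent, child));
                    }
                }
            }
        }
        frontier = next;
        if frontier.is_empty() {
            // Everything was pruned: stop early, no deeper listing needed.
            break;
        }
    }
    frontier
}
```

Because pruned directories never enter the next frontier, their subdirectories are never listed, which is where the saving over listing all leaf dirs first and filtering afterwards would come from.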
