-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: improve partition pruning during file listing #238
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #238 +/- ##
==========================================
+ Coverage 92.50% 92.60% +0.10%
==========================================
Files 29 29
Lines 1361 1420 +59
==========================================
+ Hits 1259 1315 +56
- Misses 102 105 +3 ☔ View full report in Codecov by Sentry. |
@xushiyan Hi, thank you for reviewing it for me when you have time! |
1. improve partition pruning during file listing 2. fix incorrect adding `FileSystemView::partition_to_file_groups` style: clippy the code 1. clippy the code style: fix code style 1. fix code style
1. add more partition-pruning ut
6b21e72
to
d8b2792
Compare
thanks @TheR1sing3un will go through this over the weekend. |
I'm looking into the conflicts as I have more context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TheR1sing3un thanks for the good work. i cleaned up relevant code paths in #251 which the improvement here can be built upon. Left concrete ideas in the comments below.
if partition_paths.is_empty() { | ||
// TODO: reconsider is it reasonable to add empty partition path? For partitioned table, we should return empty vec rather than vec with empty string | ||
partition_paths.push("".to_string()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for pointing this out. This should be improved. I've made changes for this in #251 . In short, we should return ""
for non-partitioned tables, this aligns with the convention we're using in timeline commit metadata where ""
is also used as a key for the partition write stats. We should return empty list like you said here for partitioned table if there is not yet any partition written.
async fn list_partition_paths_with_leveled_pruning( | ||
storage: &Storage, | ||
path: String, | ||
partition_pruner: &PartitionPruner, | ||
current_level: usize, | ||
) -> Result<Vec<String>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be merged with the get_leaf_dirs()
? so that we don't need to go 2 paths that one does multi-level pruning and the other does not. We can by default use multi-level pruning and early return when possible.
Check out this line https://github.com/apache/hudi-rs/pull/251/files#diff-a6c5a16b1d83ac9b5d186863a123fd1b273daf8710c12444c9f1b6f6fec96914R140
we probably need to move get_leaf_dirs()
into FileLister
such that it's the lister's responsibility to use pruner to check partition segments.
PartitionPruner owns the partition schema, so we should make use of it to understand which level is currently parsed and how many levels there should be.
To make code logic easy to understand, maybe iterative is better than recursive here - just loop through all the top level dirs and go down according to the schema.
Description
FileSystemView::partition_to_file_groups
Closes #205
How are the changes test-covered