Open
Labels
type:bug (Something isn't working)
Description
What happened?
public boolean isMinorNecessary() {
  int smallFileCount = fragmentFileCount + equalityDeleteFileCount;
  return smallFileCount >= config.getMinorLeastFileCount()
      || (smallFileCount > 1 && reachMinorInterval())
      || combinePosSegmentFileCount > 0;
}

protected boolean reachMinorInterval() {
  return config.getMinorLeastInterval() >= 0
      && planTime - lastMinorOptimizingTime > config.getMinorLeastInterval();
}
If a table has some partitions with many small files and others with only two or three, the condition (smallFileCount > 1 && reachMinorInterval()) never evaluates to true for the partitions with only two or three small files. lastMinorOptimizingTime appears to be tracked for the table as a whole, so each minor optimization triggered by the busy partitions resets it before getMinorLeastInterval() can elapse. As a result, the partitions with few small files are never included in minor optimization. reachMinorInterval() should be evaluated at the partition level rather than the table level.
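To make the starvation concrete, here is a minimal simulation. It is a sketch and not Amoro code: the class name, the constants, the 10-minute planning cadence and the file counts are all hypothetical, and it assumes that lastMinorOptimizingTime is shared by the whole table and reset whenever any partition is selected for minor optimization.

```java
public class MinorIntervalStarvationDemo {

  // Hypothetical stand-ins for config.getMinorLeastFileCount() and config.getMinorLeastInterval().
  static final int MINOR_LEAST_FILE_COUNT = 12;
  static final long MINOR_LEAST_INTERVAL_MS = 60 * 60 * 1000L; // 1 hour

  // Same condition as isMinorNecessary() in the issue, without the combinePosSegmentFileCount term.
  static boolean isMinorNecessary(int smallFileCount, long planTime, long lastMinorOptimizingTime) {
    boolean reachMinorInterval =
        MINOR_LEAST_INTERVAL_MS >= 0
            && planTime - lastMinorOptimizingTime > MINOR_LEAST_INTERVAL_MS;
    return smallFileCount >= MINOR_LEAST_FILE_COUNT
        || (smallFileCount > 1 && reachMinorInterval);
  }

  // Plans every 10 minutes for 24 hours. Returns how often the quiet partition
  // (always 3 small files) is selected while the busy partition always holds busyFileCount files.
  static int countQuietPartitionSelections(int busyFileCount) {
    long tenMinutes = 10 * 60 * 1000L;
    long oneDay = 24 * 60 * 60 * 1000L;
    long lastMinorOptimizingTime = 0L; // assumption: one value for the whole table
    int quietSelections = 0;
    for (long planTime = tenMinutes; planTime <= oneDay; planTime += tenMinutes) {
      boolean busy = isMinorNecessary(busyFileCount, planTime, lastMinorOptimizingTime);
      boolean quiet = isMinorNecessary(3, planTime, lastMinorOptimizingTime);
      if (quiet) {
        quietSelections++;
      }
      if (busy || quiet) {
        // Assumption: any minor optimization on the table resets the shared timestamp.
        lastMinorOptimizingTime = planTime;
      }
    }
    return quietSelections;
  }

  public static void main(String[] args) {
    System.out.println("with busy partition:    " + countQuietPartitionSelections(20));
    System.out.println("without busy partition: " + countQuietPartitionSelections(0));
  }
}
```

Under these assumptions the quiet partition is never selected while the busy partition keeps crossing the file-count threshold, but it is picked up once the interval has passed when no busy partition exists.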
Affects Versions
0.8.1
What table formats are you seeing the problem on?
Iceberg
What engines are you seeing the problem on?
Spark
How to reproduce
No response
Relevant log output
Anything else
protected boolean reachMinorInterval() {
  if (config.getMinorLeastInterval() < 0) {
    return false;
  }
  long interval = planTime - lastMinorOptimizingTime;
  if (interval > config.getMinorLeastInterval()) {
    return true;
  }
  return isDifferentDay(lastMinorOptimizingTime, planTime);
}
Perhaps reachMinorInterval() could be changed to follow the logic above, so that it evaluates to true at least once per day. Partitions with only two or three small files would then also get a chance to be optimized.
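The snippet above calls isDifferentDay without defining it. One possible shape for that helper is sketched below, assuming both timestamps are epoch milliseconds and that "day" means the calendar day in a chosen time zone. The class name, the three-argument overload and the use of the system default zone are assumptions made for illustration, not part of the proposal.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

public class DayBoundary {

  // Two-argument form matching the call in the proposal; the system default zone is an assumption.
  static boolean isDifferentDay(long earlierMillis, long laterMillis) {
    return isDifferentDay(earlierMillis, laterMillis, ZoneId.systemDefault());
  }

  // Returns true when the two epoch-millisecond timestamps fall on different calendar days in the zone.
  static boolean isDifferentDay(long earlierMillis, long laterMillis, ZoneId zone) {
    LocalDate earlierDay = Instant.ofEpochMilli(earlierMillis).atZone(zone).toLocalDate();
    LocalDate laterDay = Instant.ofEpochMilli(laterMillis).atZone(zone).toLocalDate();
    return !earlierDay.equals(laterDay);
  }

  public static void main(String[] args) {
    ZoneId utc = ZoneId.of("UTC");
    long lastMillisOfFirstDay = 24 * 60 * 60 * 1000L - 1;
    // Same UTC day: start of the epoch and the last millisecond of that day.
    System.out.println(isDifferentDay(0L, lastMillisOfFirstDay, utc));
    // One millisecond later the calendar day has changed.
    System.out.println(isDifferentDay(lastMillisOfFirstDay, lastMillisOfFirstDay + 1, utc));
  }
}
```

Comparing calendar days rather than elapsed hours means the check can fire shortly after midnight even if the last minor optimization ran minutes earlier, which may or may not be the intended behavior.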
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct