[Bug]: The reachMinorInterval function may never evaluate to true. #4055

@lintingbin

What happened?

  public boolean isMinorNecessary() {
    int smallFileCount = fragmentFileCount + equalityDeleteFileCount;
    return smallFileCount >= config.getMinorLeastFileCount()
        || (smallFileCount > 1 && reachMinorInterval())
        || combinePosSegmentFileCount > 0;
  }

  protected boolean reachMinorInterval() {
    return config.getMinorLeastInterval() >= 0
        && planTime - lastMinorOptimizingTime > config.getMinorLeastInterval();
  }

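If a table has some partitions with many small files and others with only two or three, the condition (smallFileCount > 1 && reachMinorInterval()) may never evaluate to true for the partitions with only two or three small files. Presumably this is because lastMinorOptimizingTime is tracked for the whole table: every time the partitions with many small files trigger a minor optimization, the timestamp is refreshed, so planTime - lastMinorOptimizingTime never exceeds config.getMinorLeastInterval(). Consequently, the partitions with few small files are never included in minor optimizations. Essentially, reachMinorInterval should be evaluated at the partition level rather than the table level.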
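The starvation described above can be reproduced with a small standalone simulation. This is only a sketch under stated assumptions: the class name, the constants, the ten-minute planning period, and the rule that any minor run refreshes a single table-level lastMinorOptimizingTime are all hypothetical and are not taken from the project's code.

```java
class MinorIntervalStarvation {
  // Hypothetical values chosen for illustration only.
  static final int MINOR_LEAST_FILE_COUNT = 12;
  static final long MINOR_LEAST_INTERVAL_MS = 60 * 60 * 1000L; // 1 hour
  static final long PLAN_PERIOD_MS = 10 * 60 * 1000L;          // planner runs every 10 minutes

  // Assumption: one timestamp shared by every partition of the table.
  static long lastMinorOptimizingTime = 0L;

  static boolean reachMinorInterval(long planTime) {
    return MINOR_LEAST_INTERVAL_MS >= 0
        && planTime - lastMinorOptimizingTime > MINOR_LEAST_INTERVAL_MS;
  }

  static boolean isMinorNecessary(int smallFileCount, long planTime) {
    return smallFileCount >= MINOR_LEAST_FILE_COUNT
        || (smallFileCount > 1 && reachMinorInterval(planTime));
  }

  /** Returns how often the quiet partition was selected across the given planning rounds. */
  static int simulate(int rounds) {
    lastMinorOptimizingTime = 0L;
    int quietSelected = 0;
    for (int i = 1; i <= rounds; i++) {
      long planTime = i * PLAN_PERIOD_MS;
      boolean busy = isMinorNecessary(20, planTime); // partition with many small files
      boolean quiet = isMinorNecessary(3, planTime); // partition with only three small files
      if (quiet) {
        quietSelected++;
      }
      if (busy || quiet) {
        // Assumption: any minor run refreshes the table-level timestamp.
        lastMinorOptimizingTime = planTime;
      }
    }
    return quietSelected;
  }

  public static void main(String[] args) {
    System.out.println(simulate(1000)); // prints 0
  }
}
```

Under these assumptions the busy partition qualifies in every round, so the gap between planTime and lastMinorOptimizingTime stays at ten minutes and the quiet partition is never selected.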

Affects Versions

0.8.1

What table formats are you seeing the problem on?

Iceberg

What engines are you seeing the problem on?

Spark

How to reproduce

No response

Relevant log output

Anything else

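protected boolean reachMinorInterval() {
    // A negative interval disables interval-based minor optimizing.
    if (config.getMinorLeastInterval() < 0) {
        return false;
    }

    long interval = planTime - lastMinorOptimizingTime;

    // Existing behavior: enough time has passed since the last minor optimizing.
    if (interval > config.getMinorLeastInterval()) {
        return true;
    }

    // Proposed addition: also trigger once the calendar day has changed.
    // isDifferentDay is a helper this proposal assumes; it is not shown here.
    return isDifferentDay(lastMinorOptimizingTime, planTime);
}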

Perhaps reachMinorInterval could be modified to follow the logic above, so that it evaluates to true at least once per day. That way, partitions with only two or three small files would also get a chance to be optimized.
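The isDifferentDay helper is not defined in this issue. A minimal sketch of how it might look follows, assuming both timestamps are epoch milliseconds and that the caller supplies the time zone that defines a "day". The class name, the extra ZoneId parameter, and the parameterized reachMinorInterval signature are hypothetical, introduced only to keep the example self-contained.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

class DayBoundary {
  /**
   * Hypothetical helper: true when the two epoch-millisecond timestamps
   * fall on different calendar days in the given zone.
   */
  static boolean isDifferentDay(long earlierMillis, long laterMillis, ZoneId zone) {
    LocalDate earlier = Instant.ofEpochMilli(earlierMillis).atZone(zone).toLocalDate();
    LocalDate later = Instant.ofEpochMilli(laterMillis).atZone(zone).toLocalDate();
    return !earlier.equals(later);
  }

  /** Standalone version of the proposed check, with all state passed in as arguments. */
  static boolean reachMinorInterval(
      long minorLeastInterval, long lastMinorOptimizingTime, long planTime, ZoneId zone) {
    if (minorLeastInterval < 0) {
      return false; // interval-based triggering disabled
    }
    if (planTime - lastMinorOptimizingTime > minorLeastInterval) {
      return true; // original condition
    }
    return isDifferentDay(lastMinorOptimizingTime, planTime, zone); // proposed fallback
  }
}
```

With this fallback, the first plan after midnight would see a lastMinorOptimizingTime from the previous day and return true even if the table-level timestamp was refreshed only minutes earlier. Which time zone to use is a design choice this sketch leaves open.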

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Assignees

No one assigned

Labels

type:bug (Something isn't working)
