Skip to content

[HUDI-7030] Commit-based Clustering Plan Strategy#18251

Open
prashantwason wants to merge 2 commits intoapache:masterfrom
prashantwason:pw_oss_commit_porting_53
Open

[HUDI-7030] Commit-based Clustering Plan Strategy#18251
prashantwason wants to merge 2 commits intoapache:masterfrom
prashantwason:pw_oss_commit_porting_53

Conversation

@prashantwason
Copy link
Member

Summary

This PR introduces a new commit-based clustering plan strategy for Apache Hudi that processes data files based on their commit/ingestion order. This is particularly useful for streaming ingestion workloads where:

  • Newly generated small files need to be merged promptly to avoid performance degradation
  • Files produced earlier should be merged first to ensure accurate latency measurement
  • Already-clustered files should be excluded from future clustering plans

Key Changes

  • New CommitBasedClusteringPlanStrategy class - Processes commits in order and generates clustering plans based on commit patterns
  • Checkpoint-based tracking - Uses a checkpoint to track the last processed commit, enabling incremental clustering across runs
  • Replace commit handling - Properly excludes files that have been replaced by previous clustering jobs
  • Remaining partition handling - Creates clustering groups for partitions that haven't been fully processed

Configuration

Property Description
hoodie.clustering.plan.strategy.class Set to CommitBasedClusteringPlanStrategy
hoodie.clustering.plan.strategy.max.bytes.per.group Maximum bytes per clustering group
hoodie.clustering.plan.strategy.max.num.groups Maximum number of clustering groups per plan
hoodie.clustering.plan.last.commit Checkpoint for the last processed commit

Test plan

  • Unit tests for CommitBasedClusteringPlanStrategy (7 tests, all passing)
    • testGenerateClusteringPlanWithNoCommits
    • testGenerateClusteringPlanWithoutCheckpoint
    • testGenerateClusteringPlanWithSingleCommitSinglePartition (parameterized)
    • testGenerateClusteringPlanWithSingleCommitMultiPartitions
    • testGenerateClusteringPlanWithMultiCommits
    • testGenerateClusteringPlanWithReplaceCommit

🤖 Generated with Claude Code

Summary: The first version of commit based clustering plan is scope to append only ingestion jobs

Test Plan: unit tests

Reviewers: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, #ldap_hudi, vamshi, bkrishen, ureview

Reviewed By: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, #ldap_hudi, vamshi, bkrishen

Tags: #has_java

JIRA Issues: HUDI-7030

Differential Revision: https://code.uberinternal.com/D18833959
Summary: create clustering groups for remaining partitions

Test Plan: unit tests

Reviewers: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, #ldap_hudi, ureview, vamshi

Reviewed By: O955 Project Hoodie Project Reviewer: Add blocking reviewers, #hoodie_blocking_reviewers, #ldap_hudi, vamshi

Subscribers: vamshi

Tags: #has_java

JIRA Issues: HUDI-7030, HUDI-7197

Differential Revision: https://code.uberinternal.com/D18935003
@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Feb 26, 2026
@codecov-commenter
Copy link

Codecov Report

❌ Patch coverage is 4.31655% with 133 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.23%. Comparing base (5fb1b34) to head (54f071b).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
...er/strategy/CommitBasedClusteringPlanStrategy.java 0.00% 130 Missing ⚠️
...org/apache/hudi/config/HoodieClusteringConfig.java 75.00% 2 Missing ⚠️
...java/org/apache/hudi/config/HoodieWriteConfig.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18251      +/-   ##
============================================
- Coverage     57.29%   57.23%   -0.07%     
- Complexity    18542    18559      +17     
============================================
  Files          1945     1946       +1     
  Lines        106202   106395     +193     
  Branches      13130    13151      +21     
============================================
+ Hits          60848    60891      +43     
- Misses        39628    39780     +152     
+ Partials       5726     5724       -2     
Flag Coverage Δ
hadoop-mr-java-client 45.31% <4.31%> (-0.08%) ⬇️
spark-java-tests 47.37% <4.31%> (-0.05%) ⬇️
spark-scala-tests 45.45% <4.31%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...java/org/apache/hudi/config/HoodieWriteConfig.java 84.94% <0.00%> (-0.06%) ⬇️
...org/apache/hudi/config/HoodieClusteringConfig.java 87.73% <75.00%> (-0.33%) ⬇️
...er/strategy/CommitBasedClusteringPlanStrategy.java 0.00% <0.00%> (ø)

... and 12 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants