
feat(blob): RFC-100: Update plan for cleaner #18259

Open
the-other-tim-brown wants to merge 1 commit into apache:master from the-other-tim-brown:rfc-100-cleaner-execution-plan

Conversation

@the-other-tim-brown
Contributor

Describe the issue this Pull Request addresses

Summary and Changelog

  • Updates the plan for how the cleaning will work

Impact

  • Documents the approach we will be taking

Risk Level

None

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Feb 26, 2026
Member

@vinothchandar vinothchandar left a comment


Few qs and wrote down some scenarios


**Implementation Details**:

The main assumption for out-of-line, managed blobs is that they will be used exactly once. This implies that a blob will not be referenced by multiple rows in the dataset. Similarly, once a row is updated to point to a new blob, the old blob will no longer be referenced.
Member


we could not validate this either way in the weekly sync yesterday

The cleaner plan will remain the same, but during cleaner execution we will search for blobs that are no longer referenced by iterating through the files being removed and creating a dataset of the managed blob references contained in those files. Then we will create a dataset of the remaining blob references and use the `HoodieEngineContext` to left-join it against the removed blob references to identify the unreferenced blobs. These unreferenced blobs will then be deleted from storage.
The blob deletion must therefore happen before removing the files marked for deletion. If the cleaner crashes during execution, we should be able to re-run the plan in an idempotent manner. To account for this, we can skip any files that are already deleted when searching for de-referenced blobs.
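The execution flow described above can be modeled outside Hudi as a minimal sketch. This is hypothetical Python, not Hudi code: plain sets stand in for the distributed datasets that would be built through `HoodieEngineContext`, and `read_blob_refs` / `file_exists` are invented stand-ins for reading managed blob references out of a data file and probing storage.

```python
# Hypothetical model of the cleaner execution described above (NOT
# Hudi code): plain Python sets stand in for the datasets that would
# be built through HoodieEngineContext on a distributed engine.

def find_unreferenced_blobs(removed_files, retained_files,
                            read_blob_refs, file_exists):
    """Return blob paths referenced only by the files being removed."""
    removed_refs = set()
    for f in removed_files:
        # Idempotent re-run: a prior crashed execution may already have
        # deleted some of these files; skip them instead of failing.
        if not file_exists(f):
            continue
        removed_refs |= set(read_blob_refs(f))

    retained_refs = set()
    for f in retained_files:
        retained_refs |= set(read_blob_refs(f))

    # The "left join keeping non-matches" step: refs in removed files
    # with no match on the retained side are garbage.
    return removed_refs - retained_refs

def execute_clean(removed_files, retained_files, read_blob_refs,
                  file_exists, delete_blob, delete_file):
    # Blobs are deleted BEFORE the data files, so a crash mid-way
    # leaves the removed files intact and the search can be repeated.
    for blob in find_unreferenced_blobs(removed_files, retained_files,
                                        read_blob_refs, file_exists):
        delete_blob(blob)
    for f in removed_files:
        delete_file(f)
```

Re-running `execute_clean` after a crash is safe under this sketch: already-deleted data files are skipped by the `file_exists` check, and deleting an already-deleted blob can be made a no-op by the storage layer.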

If global updates are enabled for the table, we will need to search through all the file slices since the data can move between partitions. If global updates are not enabled, we can limit the search with the following optimizations:
Member


by global updates -- do we mean updating the partition path of the record? if so, can we clarify that?


Member


Thinking through the cases, with the assumptions stated above:

This means that we will be reading a different snapshot of the table, R, than the snapshot S the clean was planned at. Every file group in the clean plan will have at least 2 file slices (file slices A and B) at the time of clean planning (done within a lock). File slice A is the one cleaned in the plan.

Initially, let's assume R is just 1 completed action ahead of S.

  1. if R = S, then all is good; no changes between clean planning and execution.
  2. if R = replacecommit, then it's handled like regular replace commit handling in blob cleaning. Blob references may still be live on the new file group?
  3. if R = deltacommit, then it could have updated/deleted some blob references, causing more blob references to become garbage (not referenced anymore) during clean execution vs. clean planning. Still all good? Do we need special handling for deletes?
  4. if R = commit, then it's a new base file produced from compaction (MoR) / merge (CoW). Updates produce more garbage, which clean execution will now delete.

Importantly, the main problematic race condition -- a concurrent write adds a new file slice C during clean execution, which may require retaining blob references that a check of just A and B would delete -- is side-stepped because we ensure 2 file slices at the time of planning.

  • The latest file slice as of clean planning (file slice B) will retain those unchanged blob references.
  • The latest file slice as of clean planning, B, will also ensure any blob references changed in the concurrent file slice C will eventually be cleaned up when file slice B is itself cleaned.

I think it'll be good to capture the correctness argument in detail, and we should add a test around this.
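A toy model of the race argument (hypothetical Python, not Hudi code; slice names and blob ids invented) could serve as a starting point for such a test:

```python
# Toy model of the race (hypothetical, NOT Hudi code): clean execution
# compares removed slice A only against the slices visible at planning
# (here B), so a ref dropped by the concurrent, unseen slice C is
# retained for now and collected later, when B itself is cleaned.

def unreferenced(removed_refs, retained_refs_list):
    """Refs in the removed slice with no match in any retained slice."""
    retained = set()
    for refs in retained_refs_list:
        retained |= refs
    return removed_refs - retained

slice_a = {"b1", "b2"}        # oldest slice, marked for cleaning
slice_b = {"b2", "b3"}        # latest slice at clean planning time
slice_c = {"b3", "b4"}        # written concurrently, unseen by cleaner

# Pass 1: clean A against the planning-time view {B}. b2 survives even
# though C no longer references it -- deletion is deferred, not wrong.
first_pass = unreferenced(slice_a, [slice_b])    # {"b1"}

# Pass 2: a later clean of B (which now sees C) picks up b2.
second_pass = unreferenced(slice_b, [slice_c])   # {"b2"}
```

No blob still referenced by C is ever deleted in pass 1, because retaining B's references covers every ref that C left unchanged; refs that C did change are simply collected one clean cycle later.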

Contributor Author


Given the assumption that the blob reference is not reused, we should also be fine to limit the set of retained files to the view of the table at the time of the cleaning. This may make it easier to debug any issues in the future.

There is special handling for deletes that I forgot to mention: we have to read the unmerged rows so that a delete in a log file does not hide any previous state of a row.
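The delete-handling point can be illustrated with a small, hypothetical sketch contrasting a merged and an unmerged read of log records (record/blob names invented; not Hudi code):

```python
# Hypothetical sketch (NOT Hudi code) of why blob references must be
# collected from log files UNMERGED: a delete record would otherwise
# hide the earlier row version and its blob reference, and that blob
# would never be identified for cleanup.

def merged_refs(log_records):
    """Merge per key, later record wins; a delete drops the key."""
    latest = {}
    for key, blob, op in log_records:
        if op == "DELETE":
            latest.pop(key, None)
        else:
            latest[key] = blob
    return {b for b in latest.values() if b is not None}

def unmerged_refs(log_records):
    """Scan every record version, ignoring merge semantics."""
    return {blob for _key, blob, op in log_records
            if op != "DELETE" and blob is not None}

log = [
    ("r1", "blob-1", "INSERT"),
    ("r1", None, "DELETE"),   # hides blob-1 under a merged read
]
```

A merged read of `log` returns no references, so `blob-1` would be missed and leak; the unmerged read still surfaces it so it can be accounted for during cleaning.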

Member


yeah, agree. if it's all correct, then we should limit the view, so it's deterministic and isolated.


If global updates are enabled for the table, we will need to search through all the file slices since the data can move between partitions. If global updates are not enabled, we can limit the search with the following optimizations:
- For files that are being removed but have a newer file slice for the file group, we can limit the search to files within the same file group.
- For files that are being removed and do not have a newer file slice for the file group (this will occur during replace commits & clustering), we will need to inspect all the retained files in the partition that were created after the creation of the removed file slice since the data can move between file groups within the same partition.
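A minimal sketch of this scope selection, assuming a simplified in-memory table layout of partition -> file group -> list of slices (hypothetical Python, not Hudi code):

```python
# Hypothetical sketch (NOT Hudi code) of the search-scope selection
# above. The table is modeled as partition -> file group -> slices.
from collections import namedtuple

FileSlice = namedtuple("FileSlice", "partition file_group instant")

def blob_search_scope(removed_slice, table, global_updates_enabled):
    """Return the file slices that must be scanned for live references
    to blobs used by removed_slice."""
    if global_updates_enabled:
        # Records can change partition path, so every file slice in
        # the table may hold a reference.
        return [s for part in table.values()
                for group in part.values() for s in group]

    partition = table[removed_slice.partition]
    group = partition[removed_slice.file_group]
    if any(s.instant > removed_slice.instant for s in group):
        # A newer slice exists in the same file group: updates stay
        # within the group, so the search is limited to it.
        return list(group)

    # No newer slice (replace commit / clustering): records may have
    # moved to other file groups, so scan everything in the partition
    # created after the removed slice.
    return [s for grp in partition.values() for s in grp
            if s.instant > removed_slice.instant]
```

Under this model, the per-file-group scope is the cheap common case, and only replaced/clustered file slices fall back to a partition-wide scan.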
Member


can we run this through a concurrent replace commit scenario as well?

Contributor Author


Can you describe the sequence of commits you are imagining? I will write up a test case for it

Member


Same as above, the problematic race, except R is a concurrent replacecommit that clean execution does not see.

