
feat(blob): RFC-100: Update plan for cleaner #18259

Open
the-other-tim-brown wants to merge 1 commit into apache:master from the-other-tim-brown:rfc-100-cleaner-execution-plan

Conversation

@the-other-tim-brown
Contributor

Describe the issue this Pull Request addresses

Summary and Changelog

  • Updates the plan for how the cleaning will work

Impact

  • Documents the approach we will be taking

Risk Level

None

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Feb 26, 2026
Member

@vinothchandar vinothchandar left a comment


Few qs and wrote down some scenarios


**Implementation Details**:

The main assumption for out-of-line, managed blobs is that they will be used exactly once. This implies that a blob will not be referenced by multiple rows in the dataset. Similarly, once a row is updated to point to a new blob, the old blob will no longer be referenced.
Member


we could not validate this either way in the weekly sync yesterday

The cleaner plan will remain the same, but during cleaner execution we will search for blobs that are no longer referenced by iterating through the files being removed and creating a dataset of the managed blob references contained in those files. Then we will create a dataset of the remaining blob references and use the `HoodieEngineContext` to left-join it against the removed blob references to identify the unreferenced blobs. These unreferenced blobs will then be deleted from storage.
The blob deletion must therefore happen before removing the files marked for deletion. If the cleaner crashes during execution, we should be able to re-run the plan in an idempotent manner. To account for this, we can skip any files that are already deleted when searching for de-referenced blobs.
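The execution flow described above can be modeled outside Hudi as a minimal sketch. This is hypothetical Python, not Hudi code: plain sets stand in for the distributed datasets that would be built through `HoodieEngineContext`, and `read_blob_refs` / `file_exists` are invented stand-ins for reading managed blob references out of a data file and probing storage.

```python
# Hypothetical model of the cleaner execution described above (NOT
# Hudi code): plain Python sets stand in for the datasets that would
# be built through HoodieEngineContext on a distributed engine.

def find_unreferenced_blobs(removed_files, retained_files,
                            read_blob_refs, file_exists):
    """Return blob paths referenced only by the files being removed."""
    removed_refs = set()
    for f in removed_files:
        # Idempotent re-run: a prior crashed execution may already have
        # deleted some of these files; skip them instead of failing.
        if not file_exists(f):
            continue
        removed_refs |= set(read_blob_refs(f))

    retained_refs = set()
    for f in retained_files:
        retained_refs |= set(read_blob_refs(f))

    # The "left join keeping non-matches" step: refs in removed files
    # with no match on the retained side are garbage.
    return removed_refs - retained_refs

def execute_clean(removed_files, retained_files, read_blob_refs,
                  file_exists, delete_blob, delete_file):
    # Blobs are deleted BEFORE the data files, so a crash mid-way
    # leaves the removed files intact and the search can be repeated.
    for blob in find_unreferenced_blobs(removed_files, retained_files,
                                        read_blob_refs, file_exists):
        delete_blob(blob)
    for f in removed_files:
        delete_file(f)
```

Re-running `execute_clean` after a crash is safe under this sketch: already-deleted data files are skipped by the `file_exists` check, and deleting an already-deleted blob can be made a no-op by the storage layer.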

If global updates are enabled for the table, we will need to search through all the file slices since the data can move between partitions. If global updates are not enabled, we can limit the search with the following optimizations:
Member


by global updates -- do we mean updating the partition path of the record? if so, can we clarify that?


Member


Thinking through the cases, with the assumptions stated above:

This means that we will be reading a different snapshot of the table, R, than the snapshot S the clean was planned at. Every file group in the clean plan will have at least 2 file slices (file slices A and B) at the time of clean planning (done within a lock). File slice A is the one cleaned in the plan.

Initially, let's assume R is just 1 completed action ahead of S.

  1. if R = S, then all is good; no changes between clean planning and execution.
  2. if R = replacecommit, then it's handled like regular replace commit handling in blob cleaning. Blob references may still be live on the new file group?
  3. if R = deltacommit, then it could have updated/deleted some blob references, causing more blob references to become garbage (not referenced anymore) during clean execution vs. clean planning. Still all good? Do we need special handling for deletes?
  4. if R = commit, then it's a new base file produced from compaction (MoR) / merge (CoW). Updates produce more garbage, which clean execution will now delete.

Importantly, the main problematic race condition -- a concurrent write adds a new file slice C during clean execution, which may require retaining blob references that a check of just A and B would delete -- is side-stepped because we ensure 2 file slices at the time of planning.

  • The latest file slice as of clean planning (file slice B) will retain those unchanged blob references.
  • The latest file slice as of clean planning, B, will also ensure any blob references changed in the concurrent file slice C will eventually be cleaned up when file slice B is itself cleaned.

I think it'll be good to capture the correctness argument in detail, and we should add a test around this.
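A toy model of the race argument (hypothetical Python, not Hudi code; slice names and blob ids invented) could serve as a starting point for such a test:

```python
# Toy model of the race (hypothetical, NOT Hudi code): clean execution
# compares removed slice A only against the slices visible at planning
# (here B), so a ref dropped by the concurrent, unseen slice C is
# retained for now and collected later, when B itself is cleaned.

def unreferenced(removed_refs, retained_refs_list):
    """Refs in the removed slice with no match in any retained slice."""
    retained = set()
    for refs in retained_refs_list:
        retained |= refs
    return removed_refs - retained

slice_a = {"b1", "b2"}        # oldest slice, marked for cleaning
slice_b = {"b2", "b3"}        # latest slice at clean planning time
slice_c = {"b3", "b4"}        # written concurrently, unseen by cleaner

# Pass 1: clean A against the planning-time view {B}. b2 survives even
# though C no longer references it -- deletion is deferred, not wrong.
first_pass = unreferenced(slice_a, [slice_b])    # {"b1"}

# Pass 2: a later clean of B (which now sees C) picks up b2.
second_pass = unreferenced(slice_b, [slice_c])   # {"b2"}
```

No blob still referenced by C is ever deleted in pass 1, because retaining B's references covers every ref that C left unchanged; refs that C did change are simply collected one clean cycle later.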

Contributor Author


Given the assumption that the blob reference is not reused, we should also be fine to limit the set of retained files to the view of the table at the time of the cleaning. This may make it easier to debug any issues in the future.

There is special handling for deletes that I forgot to mention: we have to read the unmerged rows so that a delete in a log file does not hide any previous state of a row.
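The delete-handling point can be illustrated with a small, hypothetical sketch contrasting a merged and an unmerged read of log records (record/blob names invented; not Hudi code):

```python
# Hypothetical sketch (NOT Hudi code) of why blob references must be
# collected from log files UNMERGED: a delete record would otherwise
# hide the earlier row version and its blob reference, and that blob
# would never be identified for cleanup.

def merged_refs(log_records):
    """Merge per key, later record wins; a delete drops the key."""
    latest = {}
    for key, blob, op in log_records:
        if op == "DELETE":
            latest.pop(key, None)
        else:
            latest[key] = blob
    return {b for b in latest.values() if b is not None}

def unmerged_refs(log_records):
    """Scan every record version, ignoring merge semantics."""
    return {blob for _key, blob, op in log_records
            if op != "DELETE" and blob is not None}

log = [
    ("r1", "blob-1", "INSERT"),
    ("r1", None, "DELETE"),   # hides blob-1 under a merged read
]
```

A merged read of `log` returns no references, so `blob-1` would be missed and leak; the unmerged read still surfaces it so it can be accounted for during cleaning.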

Member


yeah, agree. if it's all correct, then we should limit the view, so it's deterministic and isolated.


If global updates are enabled for the table, we will need to search through all the file slices since the data can move between partitions. If global updates are not enabled, we can limit the search with the following optimizations:
- For files that are being removed but have a newer file slice for the file group, we can limit the search to files within the same file group.
- For files that are being removed and do not have a newer file slice for the file group (this will occur during replace commits & clustering), we will need to inspect all the retained files in the partition that were created after the creation of the removed file slice since the data can move between file groups within the same partition.
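A minimal sketch of this scope selection, assuming a simplified in-memory table layout of partition -> file group -> list of slices (hypothetical Python, not Hudi code):

```python
# Hypothetical sketch (NOT Hudi code) of the search-scope selection
# above. The table is modeled as partition -> file group -> slices.
from collections import namedtuple

FileSlice = namedtuple("FileSlice", "partition file_group instant")

def blob_search_scope(removed_slice, table, global_updates_enabled):
    """Return the file slices that must be scanned for live references
    to blobs used by removed_slice."""
    if global_updates_enabled:
        # Records can change partition path, so every file slice in
        # the table may hold a reference.
        return [s for part in table.values()
                for group in part.values() for s in group]

    partition = table[removed_slice.partition]
    group = partition[removed_slice.file_group]
    if any(s.instant > removed_slice.instant for s in group):
        # A newer slice exists in the same file group: updates stay
        # within the group, so the search is limited to it.
        return list(group)

    # No newer slice (replace commit / clustering): records may have
    # moved to other file groups, so scan everything in the partition
    # created after the removed slice.
    return [s for grp in partition.values() for s in grp
            if s.instant > removed_slice.instant]
```

Under this model, the per-file-group scope is the cheap common case, and only replaced/clustered file slices fall back to a partition-wide scan.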
Member


can we run this through a concurrent replace commit scenario as well?

Contributor Author


Can you describe the sequence of commits you are imagining? I will write up a test case for it

Member


Same as above, the problematic race, except R is a concurrent replacecommit that clean execution does not see.

