feat(blob): RFC-100: Update plan for cleaner #18259
the-other-tim-brown wants to merge 1 commit into apache:master
Conversation
vinothchandar left a comment:
Few qs and wrote down some scenarios
> **Implementation Details**:
>
> The main assumption for out-of-line, managed blobs is that they will be used once. This implies that a blob will not be used by multiple rows in the dataset. Similarly, once a row is updated to point to a new blob, the old blob will not be referenced anymore.
we could not validate this either way in the weekly sync yesterday
> The cleaner plan will remain the same, but during cleaner execution we will search for blobs that are no longer referenced by iterating through the files being removed and creating a dataset of the managed blob references contained in those files. Then we will create a dataset of the remaining blob references and use the `HoodieEngineContext` to left-join with the removed blob references to identify the unreferenced blobs. These unreferenced blobs will then be deleted from storage.
>
> The blob deletion must therefore happen before removing the files marked for deletion. If the cleaner crashes during execution, we should be able to re-run the plan in an idempotent manner. To account for this, we can skip any files that are already deleted when searching for de-referenced blobs.
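The anti-join described above can be sketched with plain Python sets (a hedged, illustrative model only; in Hudi this would run as a distributed join through the `HoodieEngineContext`, and the function name here is hypothetical):

```python
def find_unreferenced_blobs(removed_file_refs, retained_file_refs):
    """Return blob refs that appear only in files being removed.

    Equivalent to a left anti-join of the removed-file references
    against the retained-file references.
    """
    removed = set()
    for refs in removed_file_refs:
        if refs is not None:  # skip already-deleted files, so re-runs stay idempotent
            removed.update(refs)
    retained = set()
    for refs in retained_file_refs:
        retained.update(refs)
    return removed - retained

# Example: "b1" is referenced only by a removed file slice, so it is garbage;
# "b2" is still referenced by a retained slice and must be kept.
removed = [["b1", "b2"], None]  # None models a file already deleted by a prior run
retained = [["b2", "b3"]]
print(sorted(find_unreferenced_blobs(removed, retained)))  # ['b1']
```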
> If global updates are enabled for the table, we will need to search through all the file slices since the data can move between partitions. If global updates are not enabled, we can limit the search with the following optimizations:
by global updates -- do we mean updating the partition path of the record? if so, can we clarify that?
> The main assumption for out-of-line, managed blobs is that they will be used once. This implies that a blob will not be used by multiple rows in the dataset. Similarly, once a row is updated to point to a new blob, the old blob will not be referenced anymore.
>
> The cleaner plan will remain the same, but during cleaner execution we will search for blobs that are no longer referenced by iterating through the files being removed and creating a dataset of the managed blob references contained in those files. Then we will create a dataset of the remaining blob references and use the `HoodieEngineContext` to left-join with the removed blob references to identify the unreferenced blobs. These unreferenced blobs will then be deleted from storage.
Thinking through the cases, with the assumptions stated above:

This means that we will be reading a different snapshot of the table, R, than the snapshot S the clean was planned at. Every file group in the clean plan will have at least 2 file slices (file slices A, B) at the time of clean planning (done within a lock). File slice A is cleaned in the plan.

Initially, just assuming R is just 1 more completed action than S:

- if R = S, then all is good. no changes between clean planning and execution.
- if R = replacecommit, then it's handled like the regular handling of a replace commit in blob cleaning. Blob references may still be live on the new file group?
- if R = deltacommit, then it could have updated/deleted some blob references, causing more blob references to now be garbage (not referenced anymore) during clean execution vs clean planning. still all good? do we need special handling for deletes?
- if R = commit, then it's a new base file produced from compaction (MoR) / merge (CoW). updates produce more garbage, which clean execution will now delete.

Importantly, the main problematic race condition -- a concurrent write adds a new file slice C during clean execution that may want to retain more blob references vs deleting them, based on checking just A and B -- is side-stepped because we ensure 2 file slices at the time of the planning logic.

- The latest file slice as of clean planning (file slice B) will retain those unchanged blob references.
- The latest file slice as of clean planning, B, will also ensure any blob references changed in concurrent file slice C will eventually be cleaned up when file slice B is cleaned.

I think it'll be good to capture the correctness argument in detail, and we should add a test around this.
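The race argument above can be modeled with a small sketch (a hypothetical model, not Hudi code; slice contents and blob names are invented). It checks both halves of the claim: the cleaner, executing against its planned snapshot, never deletes anything a concurrent slice C still references, and blobs dropped by C are reclaimed once B itself is cleaned:

```python
def unreferenced_blobs(removed_slices, retained_slices):
    # Blobs referenced only by the slices being removed are garbage.
    removed = set().union(*removed_slices) if removed_slices else set()
    retained = set().union(*retained_slices) if retained_slices else set()
    return removed - retained

# Clean plan (taken within a lock): the file group has slices A (to remove)
# and B (the latest slice at planning time).
slice_a = {"blob1", "blob2"}   # oldest slice, scheduled for cleaning
slice_b = {"blob2", "blob3"}   # latest slice at planning time

# A concurrent write lands slice C during execution; the cleaner does not see it.
# Single-use blobs mean C can only drop references or add brand-new blobs.
slice_c = {"blob3", "blob4"}

# The cleaner deletes only blobs referenced solely by A.
deleted = unreferenced_blobs([slice_a], [slice_b])
assert deleted == {"blob1"}

# Safety: nothing C references was deleted, since anything C carried forward
# was still referenced by B at planning time.
assert not (deleted & slice_c)

# Liveness: blob2 survives this round, but is reclaimed later when B is
# cleaned and C is the retained slice.
assert unreferenced_blobs([slice_b], [slice_c]) == {"blob2"}
```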
Given the assumption that a blob reference is not reused, we should also be fine to limit the set of retained files to the view of the table at the time of the cleaning. This may make it easier to debug any issues in the future.

There is special handling for deletes that I forgot to mention: we have to read the unmerged rows so that a delete in a log file does not hide any previous state of a row.
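The delete-handling point can be illustrated with a toy merge (an illustrative model only; the record shapes and the `_hoodie_is_deleted` marker usage here are simplified assumptions). A merged read applies the delete and drops the row, hiding the blob reference it carried, while an unmerged read still surfaces the earlier version so its blob can be scheduled for deletion:

```python
base_file = [
    {"key": "k1", "blob_ref": "blob-A"},
]
log_file = [
    {"key": "k1", "_hoodie_is_deleted": True},  # delete record, carries no blob ref
]

def merged_read(base, log):
    # Latest record per key wins; deletes remove the key entirely.
    latest = {r["key"]: r for r in base}
    for r in log:
        if r.get("_hoodie_is_deleted"):
            latest.pop(r["key"], None)
        else:
            latest[r["key"]] = r
    return list(latest.values())

def unmerged_read(base, log):
    # Every record version stays visible, including the pre-delete state.
    return base + log

merged_refs = {r.get("blob_ref") for r in merged_read(base_file, log_file)} - {None}
unmerged_refs = {r.get("blob_ref") for r in unmerged_read(base_file, log_file)} - {None}

print(merged_refs)    # set() -- blob-A is invisible, so it would never be cleaned
print(unmerged_refs)  # {'blob-A'} -- the cleaner can see it and delete it
```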
yeah, agree. if it's all correct, then we should limit the view, so it's deterministic and isolated.
> If global updates are enabled for the table, we will need to search through all the file slices since the data can move between partitions. If global updates are not enabled, we can limit the search with the following optimizations:
> - For files that are being removed but have a newer file slice for the file group, we can limit the search to files within the same file group.
> - For files that are being removed and do not have a newer file slice for the file group (this will occur during replace commits & clustering), we will need to inspect all the retained files in the partition that were created after the creation of the removed file slice, since the data can move between file groups within the same partition.
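The scoping rules quoted above can be sketched as a selection function (a hedged sketch over an invented data model: file slices as dicts with `group` and `instant` keys, not actual Hudi types):

```python
def search_scope(removed_slice, file_group_slices, partition_slices,
                 global_updates_enabled, all_slices):
    """Return the retained slices that must be scanned for live blob refs."""
    if global_updates_enabled:
        # Records can change partition path, so every slice must be checked.
        return all_slices
    newer_in_group = [s for s in file_group_slices
                      if s["instant"] > removed_slice["instant"]]
    if newer_in_group:
        # Updates stay within the file group: scan only its newer slices.
        return newer_in_group
    # Replace commit / clustering: data may have moved to other file groups
    # in the same partition, so scan retained slices created afterwards.
    return [s for s in partition_slices
            if s["instant"] > removed_slice["instant"]]

fg1_old = {"group": "fg1", "instant": 1}
fg1_new = {"group": "fg1", "instant": 2}
fg2_new = {"group": "fg2", "instant": 2}

# Update path: fg1 has a newer slice, so only that slice needs scanning.
print(search_scope(fg1_old, [fg1_old, fg1_new], [], False, []))  # [{'group': 'fg1', 'instant': 2}]

# Clustering path: fg1 was replaced, so newer slices in the partition are scanned.
print(search_scope(fg1_old, [fg1_old], [fg1_old, fg2_new], False, []))  # [{'group': 'fg2', 'instant': 2}]
```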
can we run this through a concurrent replace commit scenario as well?
Can you describe the sequence of commits you are imagining? I will write up a test case for it
Same as the problematic race above, except R is a concurrent replacecommit that clean execution does not see.
Describe the issue this Pull Request addresses
Summary and Changelog
Impact
Risk Level: None
Documentation Update
Contributor's checklist