start gc only after all the appended log before crash are replayed #389

Open
JacksonYao287 wants to merge 5 commits into eBay:main from JacksonYao287:issue-388

Conversation

@JacksonYao287 JacksonYao287 commented Feb 9, 2026

We only start GC after all the appended logs are replayed; otherwise, the log replay of put_blob might lead to a data error issue like the following:

Let's say cp_lsn and dc_lsn are both 10, lsn 11 is put_blob (blob -> pba-chunk-1), and lsn 12 is seal_shard(shard-1, chunk-1).

1. Before the crash, lsn 11 and lsn 12 are both committed. As a result, we have blob -> pba-chunk-1 in the wbcache of the index table and a persisted superblk of shard-1 with state sealed.

2. A crash happens. After restart, blob -> pba-chunk-1 is lost since it only existed in the wbcache and was not flushed to disk, but shard-1 has a state of sealed since the shard superblk was persisted before the crash. Now, since there is no open shard in chunk-1, chunk-1 is selected for GC, all the blobs of shard-1 are moved to chunk-2, and chunk-1 becomes a reserved chunk.

3. Since dc_lsn is 10, after log replay we start committing lsn 11. Since blob -> pba-chunk-1 does not exist in the pg-index-table, on_blob_put_commit will insert a new item blob -> pba-chunk-1 into the pg-index-table. This is where the issue happens: the blob belongs to shard-1, which has been moved to chunk-2, but on_blob_put_commit adds the blob to the index table with a stale pba belonging to chunk-1, which is now a reserved chunk and will be purged later.

#388
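
To make the intended gating concrete, here is a minimal sketch of the idea; the class and member names (GcGate, on_log_replay_done, can_start_gc) are illustrative assumptions, not the actual homeobject code. GC stays disabled until the commit LSN catches up with the last LSN that was appended before the crash.

```cpp
// Illustrative sketch only -- names and structure are assumptions, not the
// actual homeobject implementation.
#include <atomic>
#include <cstdint>
#include <limits>

using repl_lsn_t = int64_t;

class GcGate {
public:
    // Called once log replay finishes: remember the last LSN that was
    // appended before the crash.
    void on_log_replay_done(repl_lsn_t last_appended_lsn) {
        replay_target_lsn_.store(last_appended_lsn);
    }

    // Called whenever the commit LSN advances (e.g. from a periodic
    // notify_committed_lsn callback).
    void notify_committed_lsn(repl_lsn_t committed_lsn) {
        if (!gc_enabled_.load() && committed_lsn >= replay_target_lsn_.load()) {
            gc_enabled_.store(true); // every pre-crash log has been re-committed
        }
    }

    // The GC scheduler consults this before selecting chunks; until it
    // returns true, no chunk (e.g. chunk-1 above) can be picked for GC.
    bool can_start_gc() const { return gc_enabled_.load(); }

private:
    // Start at max so GC stays disabled until replay reports its target.
    std::atomic<repl_lsn_t> replay_target_lsn_{std::numeric_limits<repl_lsn_t>::max()};
    std::atomic<bool> gc_enabled_{false};
};
```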

codecov-commenter commented Feb 9, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.52%. Comparing base (1746bcc) to head (e0e4c21).
⚠️ Report is 149 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #389      +/-   ##
==========================================
- Coverage   63.15%   54.52%   -8.64%     
==========================================
  Files          32       35       +3     
  Lines        1900     4950    +3050     
  Branches      204      627     +423     
==========================================
+ Hits         1200     2699    +1499     
- Misses        600     1966    +1366     
- Partials      100      285     +185     

☔ View full report in Codecov by Sentry.

@xiaoxichen xiaoxichen left a comment


lgtm

@JacksonYao287
Copy link
Collaborator Author

@xiaoxichen please give an approval if it looks good to you, thanks.

xiaoxichen previously approved these changes Feb 9, 2026
@xiaoxichen xiaoxichen left a comment


Sorry, I think you made some changes... please let me review again.

// issue as following:
/*
T1: shard=1, blob=100 committed at LSN 100.
T2: SM crashed with cp_lsn=90 and dc_lsn=110.

What if cp_lsn=1 and dc_lsn=1? All of the logs (up to lsn 100) will be committed after GC starts.

@JacksonYao287 JacksonYao287 Feb 9, 2026


Good question! Like I said here, #388 (comment), the right solutions are:

1. postpone persisting the shard meta superblk to cp flush, or
2. before committing seal_shard, trigger gc to make sure the cp lsn is up to the one prior to seal_shard, or
3. before gc, wait until all the appended logs are committed.

In any case, we should guarantee that if the state of a shard is sealed (which means it can be scheduled for gc), no put_blob will be committed to this shard.


@xiaoxichen I updated the PR using solution 3, which makes no changes to the current write path.

@JacksonYao287 JacksonYao287 force-pushed the issue-388 branch 2 times, most recently from 58fd345 to e3a6b02 on February 10, 2026 00:00
@xiaoxichen xiaoxichen left a comment


General Feedback:

[Comments in code]
Can you use AI to polish the comments, and/or let AI create a better doc (as an MD file in GitHub) for the corner cases?

[Testing]

  1. Can we add some UTs to test the behavior? Not necessarily the GC itself but more the coordination of GC, i.e. when it could start. I have a low confidence level in our implementation being fully correct, based on the number of times bugs were found in the solution.
  2. In most of the GC UTs, we check integrity by reading back data; can we check the index? Because without being overwritten, even if the index points to the old chunk (the move_from chunk), it can still be read.

[Feature]

  1. Can we update active_blob_count after GC to fix the counting issue?

         // active_blob_count is updated by put/delete blob, not change it here.
    

as well cc @yuwmao

[Correctness]

  1. If the PG went through a BR after restarting, notify_committed_lsn will be called on the next log commit; is that OK? If this is fine, do we need to take care of the rollback?

[Code style]

  1. notify_committed_lsn has two consumers, EGC and enable_gc_flag, and their logic is very similar: basically each is a CB that wants to be executed when a given LSN is committed. Can we have a more general implementation for this, like a map[lsn] -> lambda?
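
A rough sketch of what such a generalization could look like; this is purely illustrative, and the registry class and its methods are assumptions, not existing homeobject code:

```cpp
// Illustrative sketch of the suggested map[lsn] -> lambda mechanism.
// Names are assumptions, not the actual homeobject code.
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <vector>

using repl_lsn_t = int64_t;

class LsnCallbackRegistry {
public:
    // Register a callback to run once the commit LSN reaches target_lsn.
    // Both the EGC handling and the enable-gc flag could be registered this way.
    void run_when_committed(repl_lsn_t target_lsn, std::function<void()> cb) {
        std::lock_guard lock(mtx_);
        callbacks_.emplace(target_lsn, std::move(cb));
    }

    // Called from notify_committed_lsn: fire every callback whose target
    // LSN is now covered by the committed LSN.
    void notify_committed_lsn(repl_lsn_t committed_lsn) {
        std::vector<std::function<void()>> ready;
        {
            std::lock_guard lock(mtx_);
            auto end = callbacks_.upper_bound(committed_lsn);
            for (auto it = callbacks_.begin(); it != end; ++it) {
                ready.push_back(std::move(it->second));
            }
            callbacks_.erase(callbacks_.begin(), end);
        }
        for (auto& cb : ready) { cb(); } // run outside the lock
    }

private:
    std::mutex mtx_;
    std::multimap<repl_lsn_t, std::function<void()>> callbacks_;
};
```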

const auto [target_lsn, chunk_id] = get_no_space_left_error_info();
if (std::numeric_limits< homestore::repl_lsn_t >::max() == target_lsn) {
    // no pending no_space_left error to handle
    return;

If there is no EGC task, it returns here, so the newly added code would not be executed.

@JacksonYao287 JacksonYao287 changed the title from "start gc only after log replay done" to "start gc only after all the appended log before crash are replayed" on Feb 11, 2026
@JacksonYao287
Collaborator Author

[Comments in code] Can you use AI to polish the comments, and/or let AI create a better doc (as an MD file in GitHub) for the corner cases?

Sure, will try to polish the comments. It's a good idea to collect all the corner cases (not only GC related) into an MD file; I believe this will be very helpful for AI to deeply understand how homeobject works.

[Testing]

  1. Can we add some UTs to test the behavior? Not necessarily the GC itself but more the coordination of GC, i.e. when it could start. I have a low confidence level in our implementation being fully correct, based on the number of times bugs were found in the solution.
  2. In most of the GC UTs, we check integrity by reading back data; can we check the index? Because without being overwritten, even if the index points to the old chunk (the move_from chunk), it can still be read.

1. The coordination part of GC is hard to precisely control and write into a UT. What's more, without a specific issue case, it is hard to imagine from scratch what corner cases could bring issues; if we do not know a specific scenario, we cannot write the UT code to cover it accordingly. I can use a separate PR to try to add UTs for the two found issues with the help of AI (the build-pdev-heap case and the log replay case), not in this PR.

2. I get your point. We can change the start command of the GC UT to use only one reserved chunk for each pdev, so that all the GC tasks will share only one reserved chunk, and thus this reserved chunk (the move_from_chunk) will definitely be overwritten in the next GC.

[Feature]

  1. Can we update active_blob_count after GC to fix the counting issue?
         // active_blob_count is updated by put/delete blob, not change it here.
    

as well cc @yuwmao

I'm not quite clear about this suggestion; what counting issue do you mean? GC should never change the active blob count. Am I missing anything? cc @yuwmao

[Correctness]

  1. If the PG went through a BR after restarting, notify_committed_lsn will be called on the next log commit; is that OK? If this is fine, do we need to take care of the rollback?

1. notify_committed_lsn is always called periodically (in a homestore timer), and will not stop being called even if the PG is in BR.
2. The rollback case has already been handled in this PR for the normal case (not BR).
3. If BR happens, a later rollback will not move back past snapshot_lsn, since no rollback goes to an lsn before the snapshot lsn.
So this is fine.

[Code style]

  1. notify_committed_lsn has two consumers, EGC and enable_gc_flag, and their logic is very similar: basically each is a CB that wants to be executed when a given LSN is committed. Can we have a more general implementation for this, like a map[lsn] -> lambda?

There is a nuance between enable_gc_flag and the EGC-related no_space_error_info: enable_gc_flag is PG related, while no_space_error_info is state-machine related. Will think about it later.

Actually, the current code is just a workaround (and also seems a little complicated), but it does not solve this issue thoroughly. The root cause is that seal_shard breaks the assumption that when we commit lsn n (even in log replay), none of the changes made by logs with lsn > n should have been made or be visible. I am very concerned about this potential problem, and maybe it will bring other issues later. Should we drop this workaround and postpone the superblk update of the shard to cp_flush, like I said in #388 (comment)?

@JacksonYao287
Collaborator Author

I think a simple solution is to force a cp_flush before updating the shard metablk (changing the persisted shard state to sealed) in on_commit_seal_shard, so that we can make sure no put_blob log will be re-committed for a sealed shard in the log replay phase after restart. @Besroy @xiaoxichen also cc @yuwmao
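
A minimal sketch of what this proposal could look like; all names here (ShardState, trigger_cp_flush_and_wait, persist_shard_superblk, on_commit_seal_shard) are hypothetical stand-ins, not the actual homeobject/homestore API:

```cpp
// Hypothetical sketch of the "flush before sealing" idea; not the real code.
#include <cstdint>

enum class ShardState { OPEN, SEALED };

struct ShardInfo {
    uint64_t id{0};
    ShardState state{ShardState::OPEN};
};

// Stand-ins for homestore's CP flush and the shard superblk write.
void trigger_cp_flush_and_wait() { /* block until the checkpoint flush completes */ }
void persist_shard_superblk(const ShardInfo&) { /* write the shard meta blk */ }

void on_commit_seal_shard(ShardInfo& shard) {
    // Flush the checkpoint first, so every blob-index entry of this shard
    // (written at lower LSNs) is durable before the sealed state is.
    trigger_cp_flush_and_wait();

    // Only then persist the sealed state. After a crash, replay cannot
    // re-commit a put_blob with a stale pba for a shard whose durable state
    // is sealed, because everything up to this LSN is already on disk.
    shard.state = ShardState::SEALED;
    persist_shard_superblk(shard);
}
```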


xiaoxichen commented Feb 11, 2026

assumption that when we commit lsn n (even in log replay), none of the changes made by logs with lsn > n should have been made or be visible.

This assumption is never true even without seal_shard: it is possible to see the put_blob of LSN Y but not the put_blob of LSN Y-1, if their index was partially flushed.

GC is changing the state machine outside of raft; this is somewhere we "break" the state machine, and we believe we don't break it based on the observation of LSN X. However, if the SM has not yet reached LSN X, then the change (GC) might be problematic. GC relies on the assumption you mentioned above, but that assumption is never true in our SM implementation.

If we believe we need it, we can add a seal_lsn into shard_info; shard_info has a version number, which makes it easier to make incompatible changes.

@xiaoxichen
Collaborator

The coordination part of GC is hard to precisely control and write into a UT. What's more, without a specific issue case, it is hard to imagine from scratch what corner cases could bring issues.

The issues we hit, for example this one: can it be turned from a randomly reproduced SH test into a reproducible HO test case? See, we have had a wrong implementation for this fix in this PR several times, and none of them was caught by UT; this is a serious problem.

@xiaoxichen
Collaborator

Can we, in on_log_replay_done, store the last_log_lsn as gc_start_lsn in the PG, and then add a can_gc() interface to the PG which simply compares the committed_lsn vs gc_start_lsn?

Ignoring rollback --- normal GC can start later, or even never start; if there is no new update to the PG, it is fine that GC in this PG never starts.

Ignoring snapshot recovery --- the lsn is increasing monotonically.
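
Sketched out, the suggestion might look roughly like this; the PG struct and its members here are assumptions for illustration, not the real homeobject PG class:

```cpp
// Illustrative sketch of the suggested per-PG gate; not the actual PG class.
#include <cstdint>

using repl_lsn_t = int64_t;

struct PG {
    // Set once in on_log_replay_done(): the last LSN appended before the crash.
    repl_lsn_t gc_start_lsn{0};
    // Assumed to track the repl_dev's current committed LSN.
    repl_lsn_t committed_lsn{0};

    // GC may only pick chunks of this PG once every pre-crash log has been
    // committed again.
    bool can_gc() const { return committed_lsn >= gc_start_lsn; }
};
```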


JacksonYao287 commented Feb 12, 2026

Ignoring rollback --- normal GC can start later, or even never start; if there is no new update to the PG, it is fine that GC in this PG never starts.

If gc_start_lsn is 100, a rollback happens and rolls the log back to lsn 90, new logs are committed up to lsn 94, and then emergent gc is triggered at lsn 95: since the current commit_lsn is 94 (less than gc_start_lsn, which is 100), when we submit an emergent gc task this PG cannot be GCed and the emergent gc fails. This follower is stuck.

I have rethought the gc_start_lsn-related solution many times, and IMO too many corner cases would probably be involved.

Triggering a cp_flush before updating the shard metablk for seal_shard will not involve any new logic, and can guarantee that if a shard is sealed, every blob index entry of this shard has been persisted. The only cost is one more cp flush before sealing a shard, which is very fast on NVMe SSDs.

@JacksonYao287
Collaborator Author

none of them was caught by UT; this is a serious problem.

Yes, I agree. We should spend more time on this to get stronger UTs.


xiaoxichen commented Feb 12, 2026

Ignoring rollback --- normal GC can start later, or even never start; if there is no new update to the PG, it is fine that GC in this PG never starts.

If gc_start_lsn is 100, a rollback happens and rolls the log back to lsn 90, new logs are committed up to lsn 94, and then emergent gc is triggered at lsn 95: since the current commit_lsn is 94 (less than gc_start_lsn, which is 100), when we submit an emergent gc task this PG cannot be GCed and the emergent gc fails. This follower is stuck.

I have rethought the gc_start_lsn-related solution many times, and IMO too many corner cases would probably be involved.

Triggering a cp_flush before updating the shard metablk for seal_shard will not involve any new logic, and can guarantee that if a shard is sealed, every blob index entry of this shard has been persisted. The only cost is one more cp flush before sealing a shard, which is very fast on NVMe SSDs.

The implementation is weird, unless we can flush just the data around that chunk. It is similar to closing a file: it is reasonable to flush the file, but calling echo 3 > /proc/sys/vm/drop_caches is weird.

How about we persist the DC_LSN? That should be enough to serve as the consistency point. And the interface is already there; just call repl_dev::flush_durable_commit_lsn (probably making it public).

If gc_start_lsn is 100, a rollback happens and rolls the log back to lsn 90, new logs are committed up to lsn 94, and then emergent gc is triggered at lsn 95: since the current commit_lsn is 94 (less than gc_start_lsn, which is 100), when we submit an emergent gc task this PG cannot be GCed and the emergent gc fails. This follower is stuck.

I think EGC can be safely excluded.
