start gc only after all the appended log before crash are replayed #389

Open
JacksonYao287 wants to merge 5 commits into eBay:main from JacksonYao287:issue-388

Conversation

@JacksonYao287 JacksonYao287 commented Feb 9, 2026

We only start GC after all the appended logs are replayed; otherwise, the log replay of put_blob might lead to a data error issue like the following:

Let's say cp_lsn and dc_lsn are both 10, lsn 11 is put_blob (blob -> pba-chunk-1), and lsn 12 is seal_shard(shard-1, chunk-1).

1. Before the crash, lsn 11 and lsn 12 are both committed. As a result, we have blob -> pba-chunk-1 in the wbcache of the index table and a persisted superblk of shard-1 with state sealed.

2. A crash happens. After restart, blob -> pba-chunk-1 is lost since it only existed in the wbcache and was not flushed to disk, but shard-1 has a state of sealed since the shard superblk was persisted before the crash. Now, since there is no open shard in chunk-1, chunk-1 is selected for GC, all the blobs of shard-1 are moved to chunk-2, and chunk-1 becomes a reserved chunk.

3. Since dc_lsn is 10, after log replay we start committing lsn 11. Since blob -> pba-chunk-1 does not exist in the pg-index-table, on_blob_put_commit will insert a new item blob -> pba-chunk-1 into the pg-index-table. This is where the issue happens: the blob belongs to shard-1, which has been moved to chunk-2, but on_blob_put_commit adds the blob to the index table with a stale pba belonging to chunk-1, which is now a reserved chunk and will be purged later.

#388
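
To make the intended gating concrete, here is a minimal sketch of the idea; the class and member names (GcGate, on_log_replay_done, can_start_gc) are illustrative assumptions, not the actual homeobject code. GC stays disabled until the commit LSN catches up with the last LSN that was appended before the crash.

```cpp
// Illustrative sketch only -- names and structure are assumptions, not the
// actual homeobject implementation.
#include <atomic>
#include <cstdint>
#include <limits>

using repl_lsn_t = int64_t;

class GcGate {
public:
    // Called once log replay finishes: remember the last LSN that was
    // appended before the crash.
    void on_log_replay_done(repl_lsn_t last_appended_lsn) {
        replay_target_lsn_.store(last_appended_lsn);
    }

    // Called whenever the commit LSN advances (e.g. from a periodic
    // notify_committed_lsn callback).
    void notify_committed_lsn(repl_lsn_t committed_lsn) {
        if (!gc_enabled_.load() && committed_lsn >= replay_target_lsn_.load()) {
            gc_enabled_.store(true); // every pre-crash log has been re-committed
        }
    }

    // The GC scheduler consults this before selecting chunks; until it
    // returns true, no chunk (e.g. chunk-1 above) can be picked for GC.
    bool can_start_gc() const { return gc_enabled_.load(); }

private:
    // Start at max so GC stays disabled until replay reports its target.
    std::atomic<repl_lsn_t> replay_target_lsn_{std::numeric_limits<repl_lsn_t>::max()};
    std::atomic<bool> gc_enabled_{false};
};
```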

codecov-commenter commented Feb 9, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.52%. Comparing base (1746bcc) to head (e0e4c21).
⚠️ Report is 149 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #389      +/-   ##
==========================================
- Coverage   63.15%   54.52%   -8.64%     
==========================================
  Files          32       35       +3     
  Lines        1900     4950    +3050     
  Branches      204      627     +423     
==========================================
+ Hits         1200     2699    +1499     
- Misses        600     1966    +1366     
- Partials      100      285     +185     

☔ View full report in Codecov by Sentry.

@xiaoxichen xiaoxichen left a comment


lgtm

@JacksonYao287
Copy link
Collaborator Author

@xiaoxichen please give an approval if it looks good to you, thanks.

xiaoxichen previously approved these changes Feb 9, 2026
@xiaoxichen xiaoxichen left a comment


Sorry, I think you made some changes... please let me review again.

// issue as following:
/*
T1: shard=1, blob=100 committed at LSN 100.
T2: SM crashed with cp_lsn=90 and dc_lsn=110.

What if cp_lsn=1 and dc_lsn=1? All of the logs (up to lsn 100) will be committed after GC starts.

@JacksonYao287 JacksonYao287 Feb 9, 2026


Good question! Like I said here, #388 (comment), the right solutions are:

1. postpone persisting the shard meta superblk to cp flush, or
2. before committing seal_shard, trigger gc to make sure the cp lsn is up to the one prior to seal_shard, or
3. before gc, wait until all the appended logs are committed.

In any case, we should guarantee that if the state of a shard is sealed (which means it can be scheduled for gc), no put_blob will be committed to this shard.


@xiaoxichen I updated the PR using solution 3, which makes no changes to the current write path.

@JacksonYao287 JacksonYao287 force-pushed the issue-388 branch 2 times, most recently from 58fd345 to e3a6b02 on February 10, 2026 00:00
@xiaoxichen xiaoxichen left a comment


General Feedback:

[Comments in code]
Can you use AI to polish the comments, and/or let AI create a better doc (as an MD file in GitHub) for the corner cases?

[Testing]

  1. Can we add some UTs to test the behavior? Not necessarily the GC itself but more the coordination of GC, i.e. when it could start. I have a low confidence level in our implementation being fully correct, based on the number of times bugs were found in the solution.
  2. In most of the GC UTs, we check integrity by reading back data; can we check the index? Because without being overwritten, even if the index points to the old chunk (the move_from chunk), it can still be read.

[Feature]

  1. Can we update active_blob_count after GC to fix the counting issue?

         // active_blob_count is updated by put/delete blob, not change it here.
    

as well cc @yuwmao

[Correctness]

  1. If the PG went through a BR after restarting, notify_committed_lsn will be called on the next log commit; is that OK? If this is fine, do we need to take care of the rollback?

[Code style]

  1. notify_committed_lsn has two consumers, EGC and enable_gc_flag, and their logic is very similar: basically each is a CB that wants to be executed when a given LSN is committed. Can we have a more general implementation for this, like a map[lsn] -> lambda?
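
A rough sketch of what such a generalization could look like; this is purely illustrative, and the registry class and its methods are assumptions, not existing homeobject code:

```cpp
// Illustrative sketch of the suggested map[lsn] -> lambda mechanism.
// Names are assumptions, not the actual homeobject code.
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <vector>

using repl_lsn_t = int64_t;

class LsnCallbackRegistry {
public:
    // Register a callback to run once the commit LSN reaches target_lsn.
    // Both the EGC handling and the enable-gc flag could be registered this way.
    void run_when_committed(repl_lsn_t target_lsn, std::function<void()> cb) {
        std::lock_guard lock(mtx_);
        callbacks_.emplace(target_lsn, std::move(cb));
    }

    // Called from notify_committed_lsn: fire every callback whose target
    // LSN is now covered by the committed LSN.
    void notify_committed_lsn(repl_lsn_t committed_lsn) {
        std::vector<std::function<void()>> ready;
        {
            std::lock_guard lock(mtx_);
            auto end = callbacks_.upper_bound(committed_lsn);
            for (auto it = callbacks_.begin(); it != end; ++it) {
                ready.push_back(std::move(it->second));
            }
            callbacks_.erase(callbacks_.begin(), end);
        }
        for (auto& cb : ready) { cb(); } // run outside the lock
    }

private:
    std::mutex mtx_;
    std::multimap<repl_lsn_t, std::function<void()>> callbacks_;
};
```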

const auto [target_lsn, chunk_id] = get_no_space_left_error_info();
if (std::numeric_limits< homestore::repl_lsn_t >::max() == target_lsn) {
    // no pending no_space_left error to handle
    return;

If there is no EGC task, it returns here, so the newly added code would not be executed.

@JacksonYao287 JacksonYao287 changed the title from "start gc only after log replay done" to "start gc only after all the appended log before crash are replayed" on Feb 11, 2026
@JacksonYao287
Collaborator Author

[Comments in code] Can you use AI to polish the comments, and/or let AI create a better doc (as an MD file in GitHub) for the corner cases?

Sure, will try to polish the comments. It's a good idea to collect all the corner cases (not only GC related) into an MD file; I believe this will be very helpful for AI to deeply understand how homeobject works.

[Testing]

  1. Can we add some UTs to test the behavior? Not necessarily the GC itself but more the coordination of GC, i.e. when it could start. I have a low confidence level in our implementation being fully correct, based on the number of times bugs were found in the solution.
  2. In most of the GC UTs, we check integrity by reading back data; can we check the index? Because without being overwritten, even if the index points to the old chunk (the move_from chunk), it can still be read.

1. The coordination part of GC is hard to precisely control and write into a UT. What's more, without a specific issue case, it is hard to imagine from scratch what corner cases could bring issues; if we do not know a specific scenario, we cannot write the UT code to cover it accordingly. I can use a separate PR to try to add UTs for the two found issues with the help of AI (the build-pdev-heap case and the log replay case), not in this PR.

2. I get your point. We can change the start command of the GC UT to use only one reserved chunk for each pdev, so that all the GC tasks will share only one reserved chunk, and thus this reserved chunk (the move_from_chunk) will definitely be overwritten in the next GC.

[Feature]

  1. Can we update active_blob_count after GC to fix the counting issue?
         // active_blob_count is updated by put/delete blob, not change it here.
    

as well cc @yuwmao

I'm not quite clear about this suggestion; what counting issue do you mean? GC should never change the active blob count. Am I missing anything? cc @yuwmao

[Correctness]

  1. If the PG went through a BR after restarting, notify_committed_lsn will be called on the next log commit; is that OK? If this is fine, do we need to take care of the rollback?

1. notify_committed_lsn is always called periodically (in a homestore timer), and will not stop being called even if the PG is in BR.
2. The rollback case has already been handled in this PR for the normal case (not BR).
3. If BR happens, a later rollback will not move back past snapshot_lsn, since no rollback goes to an lsn before the snapshot lsn.
So this is fine.

[Code style]

  1. notify_committed_lsn has two consumers, EGC and enable_gc_flag, and their logic is very similar: basically each is a CB that wants to be executed when a given LSN is committed. Can we have a more general implementation for this, like a map[lsn] -> lambda?

There is a nuance between enable_gc_flag and the EGC-related no_space_error_info: enable_gc_flag is PG related, while no_space_error_info is state-machine related. Will think about it later.

Actually, the current code is just a workaround (and also seems a little complicated), but it does not solve this issue thoroughly. The root cause is that seal_shard breaks the assumption that when we commit lsn n (even in log replay), none of the changes made by logs with lsn > n should have been made or be visible. I am very concerned about this potential problem, and maybe it will bring other issues later. Should we drop this workaround and postpone the superblk update of the shard to cp_flush, like I said in #388 (comment)?

@JacksonYao287
Collaborator Author

I think a simple solution is to force a cp_flush before updating the shard metablk (changing the persisted shard state to sealed) in on_commit_seal_shard, so that we can make sure no put_blob log will be re-committed for a sealed shard in the log replay phase after restart. @Besroy @xiaoxichen also cc @yuwmao
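
A minimal sketch of what this proposal could look like; all names here (ShardState, trigger_cp_flush_and_wait, persist_shard_superblk, on_commit_seal_shard) are hypothetical stand-ins, not the actual homeobject/homestore API:

```cpp
// Hypothetical sketch of the "flush before sealing" idea; not the real code.
#include <cstdint>

enum class ShardState { OPEN, SEALED };

struct ShardInfo {
    uint64_t id{0};
    ShardState state{ShardState::OPEN};
};

// Stand-ins for homestore's CP flush and the shard superblk write.
void trigger_cp_flush_and_wait() { /* block until the checkpoint flush completes */ }
void persist_shard_superblk(const ShardInfo&) { /* write the shard meta blk */ }

void on_commit_seal_shard(ShardInfo& shard) {
    // Flush the checkpoint first, so every blob-index entry of this shard
    // (written at lower LSNs) is durable before the sealed state is.
    trigger_cp_flush_and_wait();

    // Only then persist the sealed state. After a crash, replay cannot
    // re-commit a put_blob with a stale pba for a shard whose durable state
    // is sealed, because everything up to this LSN is already on disk.
    shard.state = ShardState::SEALED;
    persist_shard_superblk(shard);
}
```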


xiaoxichen commented Feb 11, 2026

assumption that when we commit lsn n (even in log replay), none of the changes made by logs with lsn > n should have been made or be visible.

This assumption is never true even without seal_shard: it is possible to see the put_blob of LSN Y but not the put_blob of LSN Y-1, if their index was partially flushed.

GC is changing the state machine outside of raft; this is somewhere we "break" the state machine, and we believe we don't break it based on the observation of LSN X. However, if the SM has not yet reached LSN X, then the change (GC) might be problematic. GC relies on the assumption you mentioned above, but that assumption is never true in our SM implementation.

If we believe we need it, we can add a seal_lsn into shard_info; shard_info has a version number, which makes it easier to make incompatible changes.

@xiaoxichen
Collaborator

The coordination part of GC is hard to precisely control and write into a UT. What's more, without a specific issue case, it is hard to imagine from scratch what corner cases could bring issues.

The issues we hit, for example this one: can it be turned from a randomly reproduced SH test into a reproducible HO test case? See, we have had a wrong implementation for this fix in this PR several times, and none of them was caught by UT; this is a serious problem.

@xiaoxichen
Collaborator

Can we, in on_log_replay_done, store the last_log_lsn as gc_start_lsn in the PG, and then add a can_gc() interface to the PG which simply compares the committed_lsn vs gc_start_lsn?

Ignoring rollback --- normal GC can start later, or even never start; if there is no new update to the PG, it is fine that GC in this PG never starts.

Ignoring snapshot recovery --- the lsn is increasing monotonically.
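
Sketched out, the suggestion might look roughly like this; the PG struct and its members here are assumptions for illustration, not the real homeobject PG class:

```cpp
// Illustrative sketch of the suggested per-PG gate; not the actual PG class.
#include <cstdint>

using repl_lsn_t = int64_t;

struct PG {
    // Set once in on_log_replay_done(): the last LSN appended before the crash.
    repl_lsn_t gc_start_lsn{0};
    // Assumed to track the repl_dev's current committed LSN.
    repl_lsn_t committed_lsn{0};

    // GC may only pick chunks of this PG once every pre-crash log has been
    // committed again.
    bool can_gc() const { return committed_lsn >= gc_start_lsn; }
};
```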


JacksonYao287 commented Feb 12, 2026

Ignoring rollback --- normal GC can start later, or even never start; if there is no new update to the PG, it is fine that GC in this PG never starts.

If gc_start_lsn is 100, a rollback happens and rolls the log back to lsn 90, new logs are committed up to lsn 94, and then emergent gc is triggered at lsn 95: since the current commit_lsn is 94 (less than gc_start_lsn, which is 100), when we submit an emergent gc task this PG cannot be GCed and the emergent gc fails. This follower is stuck.

I have rethought the gc_start_lsn-related solution many times, and IMO too many corner cases would probably be involved.

Triggering a cp_flush before updating the shard metablk for seal_shard will not involve any new logic, and can guarantee that if a shard is sealed, every blob index entry of this shard has been persisted. The only cost is one more cp flush before sealing a shard, which is very fast on NVMe SSDs.

@JacksonYao287
Collaborator Author

none of them was caught by UT; this is a serious problem.

Yes, I agree. We should spend more time on this to get stronger UTs.


xiaoxichen commented Feb 12, 2026

Ignoring rollback --- normal GC can start later, or even never start; if there is no new update to the PG, it is fine that GC in this PG never starts.

If gc_start_lsn is 100, a rollback happens and rolls the log back to lsn 90, new logs are committed up to lsn 94, and then emergent gc is triggered at lsn 95: since the current commit_lsn is 94 (less than gc_start_lsn, which is 100), when we submit an emergent gc task this PG cannot be GCed and the emergent gc fails. This follower is stuck.

I have rethought the gc_start_lsn-related solution many times, and IMO too many corner cases would probably be involved.

Triggering a cp_flush before updating the shard metablk for seal_shard will not involve any new logic, and can guarantee that if a shard is sealed, every blob index entry of this shard has been persisted. The only cost is one more cp flush before sealing a shard, which is very fast on NVMe SSDs.

The implementation is weird, unless we can flush just the data around that chunk. It is similar to closing a file: it is reasonable to flush the file, but calling echo 3 > /proc/sys/vm/drop_caches is weird.

How about we persist the DC_LSN? That should be enough to serve as the consistency point. And the interface is already there; just call repl_dev::flush_durable_commit_lsn (probably making it public).

If gc_start_lsn is 100, a rollback happens and rolls the log back to lsn 90, new logs are committed up to lsn 94, and then emergent gc is triggered at lsn 95: since the current commit_lsn is 94 (less than gc_start_lsn, which is 100), when we submit an emergent gc task this PG cannot be GCed and the emergent gc fails. This follower is stuck.

I think EGC can be safely excluded.
