start gc only after all the appended log before crash are replayed #389
JacksonYao287 wants to merge 5 commits into eBay:main from
Conversation
f6ceeeb to 8fc47f5 (force-pushed)
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main     #389      +/-   ##
==========================================
- Coverage   63.15%   54.52%    -8.64%
==========================================
  Files          32       35        +3
  Lines        1900     4950     +3050
  Branches      204      627      +423
==========================================
+ Hits         1200     2699     +1499
- Misses        600     1966     +1366
- Partials      100      285      +185

View full report in Codecov by Sentry.
2128452 to e0e4c21 (force-pushed)
@xiaoxichen pls give an approval if it looks good to you, thanks
xiaoxichen
left a comment
sorry, I think you made some changes... pls let me review again...
// issue as following:
/*
T1: shard=1, blob=100 committed at LSN 100.
T2: SM crashed with cp_lsn=90 and dc_lsn=110.
what if cp_lsn=1 and dc_lsn=1? all of the logs (up to lsn 100) will be committed after GC starts
good question! like I said here, #388 (comment), the right solutions are:
1. postpone persisting the shard meta superblk to cp flush, or
2. before committing seal_shard, trigger gc to make sure the cp lsn is up to the one prior to seal_shard, or
3. before gc, wait until all the appended logs are committed.
anyway, we should guarantee that if the state of a shard is sealed (which means it can be scheduled for gc), no put_blob will be committed to this shard.
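Solution 3 above amounts to a gate that only opens once the commit cursor has caught up with everything appended before the crash. A minimal sketch, modeled in Python; `GcGate` and its method names are illustrative, not HomeObject's actual API:

```python
class GcGate:
    """Hypothetical gate: GC may only start once every log appended
    before the crash has been replayed and committed."""

    def __init__(self, last_appended_lsn):
        # Snapshot the highest appended LSN at recovery time.
        self.gc_start_lsn = last_appended_lsn
        self.gc_enabled = False

    def on_lsn_committed(self, committed_lsn):
        # Called as the commit cursor advances, e.g. from a periodic
        # notify_committed_lsn-style timer.
        if not self.gc_enabled and committed_lsn >= self.gc_start_lsn:
            self.gc_enabled = True

    def can_gc(self):
        return self.gc_enabled


# lsn 11 (put_blob) and lsn 12 (seal_shard) were appended before crash
gate = GcGate(last_appended_lsn=12)
gate.on_lsn_committed(10)
assert not gate.can_gc()   # put_blob at lsn 11 not yet replayed
gate.on_lsn_committed(12)
assert gate.can_gc()       # all pre-crash appends replayed; GC may start
```

The key property is that the snapshot is taken from the *appended* LSN, not the committed one, so replay of lsn 11's put_blob always lands in the index before any chunk can be collected.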
@xiaoxichen I updated the PR using solution 3, which makes no changes to the current write path.
58fd345 to e3a6b02 (force-pushed)
e3a6b02 to 9340b6a (force-pushed)
General Feedback:
[Comments in code]
Can you use AI to polish the comments, and/or let AI create a better doc (as an MD file in GitHub) for the corner cases?
[Testing]
- can we add some UTs to test the behavior? Not necessarily the GC itself, but more the coordination of GC, i.e. when it may start. I have a low confidence level in our implementation being fully correct, based on the number of times bugs were found in the solution.
- In most of the GC UTs we check integrity by reading back data; can we also check the index? Because without being overwritten, even if the index is pointing to the old chunk (the move_from chunk), it can still be read.
[Feature]
- can we update active_blob_count after GC to fix the counting issue?
// active_blob_count is updated by put/delete blob, not change it here.
as well cc @yuwmao
[Correctness]
- If the PG went through a BR after restarting, notify_committed_lsn will be called on the next log commit; is that OK? If this is fine, do we need to take care of rollback?
[Code style]
- notify_committed_lsn has two consumers, EGC and enable_gc_flag, and their logic is very similar: basically each is a CB that wants to be executed when a given LSN is committed. Can we have a more general implementation for this, like a map<lsn, lambda>?
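The map<lsn, lambda> idea could look roughly like the following. A toy sketch in Python for illustration; `LsnCallbackRegistry` and the consumer names are assumptions, not existing code:

```python
class LsnCallbackRegistry:
    """Hypothetical generalization of the two notify_committed_lsn
    consumers (EGC retry and enable_gc_flag): register a callback that
    fires once a given LSN is committed."""

    def __init__(self):
        self._pending = {}  # target lsn -> list of callbacks

    def run_at_lsn(self, lsn, cb):
        self._pending.setdefault(lsn, []).append(cb)

    def notify_committed_lsn(self, committed_lsn):
        # Fire and drop every callback whose target LSN is now committed.
        ready = sorted(l for l in self._pending if l <= committed_lsn)
        for lsn in ready:
            for cb in self._pending.pop(lsn):
                cb(lsn)


fired = []
reg = LsnCallbackRegistry()
reg.run_at_lsn(100, lambda l: fired.append(("enable_gc", l)))
reg.run_at_lsn(95, lambda l: fired.append(("egc_retry", l)))
reg.notify_committed_lsn(97)
assert fired == [("egc_retry", 95)]
reg.notify_committed_lsn(100)
assert fired == [("egc_retry", 95), ("enable_gc", 100)]
```

Both existing consumers then become one-line registrations, and the periodic timer only has to call `notify_committed_lsn` once.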
const auto [target_lsn, chunk_id] = get_no_space_left_error_info();
if (std::numeric_limits< homestore::repl_lsn_t >::max() == target_lsn) {
    // no pending no_space_left error to handle
    return;
if there is no EGC task, it returns here, and the newly added code would not be executed
sure, will try to polish the comments. it's a good idea to collect all the corner cases (not only gc related) into an MD file; I believe this will be very helpful for AI to deeply understand how homeobject works.
1. the coordination part of GC is hard to precisely control and write into a UT. what's more, without a specific issue case, it is hard to imagine from scratch what corner case could bring issues; if we do not know a specific scenario, we cannot write UT code to cover it accordingly. I can use a separate PR to try to add UTs for the two found issues with the help of AI (build pdev heap and log replay case), not in this PR.
2. I get your point. we can change the start command of the gc UT to use only one reserved chunk for each pdev, so that all the gc tasks will share only one reserved chunk, and thus this reserved chunk (move_from_chunk) will definitely be overwritten in the next gc.
I'm not quite clear about this suggestion; what counting issue do you mean? gc should never change the active blob count. do I miss anything? cc @yuwmao
1. notify_committed_lsn is always called periodically (in a timer of homestore), and will not stop being called even if it is in BR.
2. there is a nuance between enable_gc_flag and the egc-related no_space_error_info: enable_gc_flag is pg related, no_space_error_info is statemachine related. will think about it later.

actually, the current code is just a workaround (and also seems a little complicated), and does not solve this issue thoroughly. the root cause is that seal_shard breaks the assumption that when we commit lsn n (even in log replay), the changes made by logs with lsn > n should not have been made or be visible. I am very concerned about this potential problem, and maybe it will bring other issues later. should we drop this workaround and postpone the superblk update of the shard to cp_flush, like I said here?
I think a simple solution is to force-trigger a cp_flush before updating the shard metablk (changing the persisted shard state to sealed) in on_commit_seal_shard, so that we can make sure no put_blob log will be re-committed for a sealed shard in the log replay phase after restart. @Besroy @xiaoxichen also cc @yuwmao
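The proposed ordering (flush the checkpoint, then persist the sealed state) can be modeled with a toy store. Everything here is an illustrative sketch in Python; `FakeStore` and its fields stand in for homestore's cp/commit cursors and are not the real API:

```python
class FakeStore:
    """Toy model: index entries with lsn <= cp_lsn are on disk;
    entries above it live only in the wbcache."""

    def __init__(self):
        self.cp_lsn = 0
        self.commit_lsn = 0
        self.shard_state = "open"

    def commit_put_blob(self, lsn):
        self.commit_lsn = lsn  # index update lives only in the wbcache

    def cp_flush(self):
        self.cp_lsn = self.commit_lsn  # wbcache flushed to disk

    def on_commit_seal_shard(self, lsn):
        self.commit_lsn = lsn
        self.cp_flush()               # proposed: flush first ...
        self.shard_state = "sealed"   # ... then persist the sealed state


store = FakeStore()
store.commit_put_blob(11)        # lsn 11: put_blob
store.on_commit_seal_shard(12)   # lsn 12: seal_shard

# Invariant this ordering buys: if the shard is sealed on disk, every
# blob index entry of the shard is persisted, so a crash + replay can
# never resurrect a stale pba for a sealed (gc-eligible) shard.
assert store.shard_state == "sealed" and store.cp_lsn >= 11
```

If a crash lands between `cp_flush()` and the superblk write, the shard simply comes back as open, which is safe: GC will not touch its chunk.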
This assumption is never true even without seal_shard: it is possible to see the put_blob of LSN Y but not the put_blob of LSN Y-1, if their index entries were partially flushed.

The GC is changing the state machine outside of raft; this is where we "break" the state machine, and we believe we don't break it based on the observation of LSN X. However, if the SM has not yet reached LSN X, then the change (GC) might be problematic. GC used the assumption you mentioned above, but that assumption is never true in our SM implementation.

If we believe we need it, we can add seal_lsn into shard_info; shard_info has a version number, which makes incompatible changes easier.
The issues we hit, for example this one: can it be turned from a randomly reproduced SH test into a reproducible HO test case? See, we have had a wrong implementation of this fix in this PR several times, and none of them were caught by UT; this is a serious problem.
Can we simplify? Ignoring rollback --- the normal GC can start later, or even never start; if there is no new update to the PG, it is fine that the GC in this PG never starts. Ignoring snapshot recovery --- lsn is increasing monotonically.
if gc_start_lsn is 100, a rollback happens and rolls the log back to lsn 90, new logs are committed up to lsn 94, and then emergent gc is triggered at lsn 95: since the current commit_lsn is 94 (less than gc_start_lsn, which is 100), when we submit an emergent gc task, this pg is unable to be gc'ed and the emergent gc fails. this follower is stuck.

I have rethought the gc_start_lsn related solution many times, and IMO too many corner cases will probably be involved. triggering a cp_flush before updating the shard metablk for seal_shard will not involve any new logic, and can guarantee that if a shard is sealed, all blob index entries of this shard have been persisted. the only cost is one more cp flush before seal_shard, which is very fast on nvme ssd.
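The stuck-follower corner case can be written down in a few lines (numbers taken from the comment; the `commit_lsn >= gc_start_lsn` predicate is the hypothetical gating condition being criticized, not existing code):

```python
# Rollback corner case for the gc_start_lsn approach, as a toy timeline.

gc_start_lsn = 100   # recorded at restart, from the appended log tail
commit_lsn = 100     # logs replayed up to lsn 100

commit_lsn = 90      # raft rollback truncates the log back to lsn 90
commit_lsn = 94      # new leader's entries committed up to lsn 94

# emergent gc is triggered at lsn 95; the gate compares against the
# stale gc_start_lsn, but lsns 95..100 were rolled back and will never
# be committed again on this follower
egc_allowed = commit_lsn >= gc_start_lsn
assert not egc_allowed   # emergent gc fails; the follower is stuck
```

This is why the thread converges on a different fix: after rollback, a snapshot of the old appended LSN is simply unreachable, whereas a cp_flush before seal_shard does not depend on any LSN comparison.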
yes, I agree. we should spend more time on this to get a stronger UT.
The implementation is weird, unless we can flush just the data around that chunk. It is similar to closing a file: it is reasonable to flush that file, but not to flush everything else. How about we persist the DC_LSN instead? That should be enough to serve as the consistent point, and the interface is already there; just call repl_dev::flush_durable_commit_lsn (probably make it public).
I think EGC can be safely excluded.
we only start gc after all the logs are replayed; otherwise, the log replay of put_blob might lead to a data error issue as follows:
let's say cp_lsn and dc_lsn are 10, lsn 11 is put_blob (blob -> pba-chunk-1), and lsn 12 is seal_shard (shard-1, chunk-1).
1. before the crash, lsn 11 and lsn 12 are both committed. as a result, we have blob -> pba-chunk-1 in the wbcache of the index table and a persisted superblk of shard-1 with state sealed.
2. a crash happens. after restart, blob -> pba-chunk-1 is lost, since it only existed in the wbcache and was not flushed to disk; but shard-1 has a state of sealed, since the shard superblk was persisted before the crash. now, since there is no open shard in chunk-1, chunk-1 is selected for gc, all the blobs of shard-1 are moved to chunk-2, and chunk-1 becomes a reserved chunk.
3. since dc_lsn is 10, after log replay we start committing lsn 11. since blob -> pba-chunk-1 does not exist in the pg-index-table, on_blob_put_commit will insert a new item blob -> pba-chunk-1 into the pg-index-table. this is where the issue happens: the blob belongs to shard-1, which has been moved to chunk-2, but on_blob_put_commit adds the blob to the index table with a stale pba belonging to chunk-1, which is now a reserved chunk and will be purged later.
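The three steps above can be replayed as a toy model. All structures here (pg_index, wbcache, chunk states) are illustrative stand-ins in Python, not the real index table:

```python
# Failure sequence: cp_lsn = dc_lsn = 10, lsn 11 = put_blob into
# chunk-1, lsn 12 = seal_shard(shard-1).

pg_index = {}                                    # persisted index (cp_lsn = 10)
wbcache = {("shard-1", "blob"): "pba@chunk-1"}   # lsn 11's entry, unflushed
shard_sealed = True                              # superblk persisted at lsn 12

# step 1->2: crash. the wbcache is lost; the persisted superblk survives.
wbcache.clear()

# step 2: GC runs before replay. shard-1 is sealed, so chunk-1 is
# collected; only blobs *known to the persisted index* are copied to
# chunk-2 (our blob's entry was lost), and chunk-1 becomes reserved.
chunk_1_state = "reserved"

# step 3: replay re-commits lsn 11. the blob is absent from the index,
# so the stale pba (pointing into the now-reserved chunk-1) is inserted.
if ("shard-1", "blob") not in pg_index:
    pg_index[("shard-1", "blob")] = "pba@chunk-1"

# the index now references a reserved chunk that will be purged later:
# a silent data error once chunk-1 is reused.
assert pg_index[("shard-1", "blob")] == "pba@chunk-1"
assert chunk_1_state == "reserved"
```

Gating GC on full log replay (the fix in this PR) breaks the sequence at step 2: chunk-1 cannot be collected until lsn 11 has restored the blob's index entry.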
#388