-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix risk of a partial write txn being applied #18749
Conversation
Hi @shyamjvs. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files
... and 18 files with indirect coverage changes @@ Coverage Diff @@
## main #18749 +/- ##
==========================================
+ Coverage 68.68% 68.72% +0.03%
==========================================
Files 420 420
Lines 35504 35503 -1
==========================================
+ Hits 24385 24398 +13
+ Misses 9685 9667 -18
- Partials 1434 1438 +4 Continue to review full report in Codecov by Sentry.
|
server/etcdserver/txn/txn_test.go
Outdated
// we verify the following things below: | ||
// - server panics after a write txn aply fails (invariant being tested: server should never try to move on from a failed write) | ||
// - no writes from the txn are applied to the backend (invariant being tested: failed write should have no side-effect on DB state besides panic) | ||
defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a need to defer? assert.Panics
should handle panic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not follow @serathius 's comment above.
- The purpose here is to verify the hash of the db did not change after the panicking. It's good!
- When panicking happens, any functions deferred are then executed as usual
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's correct, test function stops executing after assert.panic - so further verification needs to happen either inside a defer or a t.Cleanup()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what @serathius said is that the assert.Panics
is capable to recover the panic. there is no need to use defer.
We should check the hash after assert.Panics
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shyam - if the initial intended further assertion requires opening a bbolt file, it is protected by the advisory flock so it appears that so further verification could be added after assert.Panics
.
However, this advisory lock does not block you from opening the file and perform checksum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I stand corrected - the test indeed continues after the assert (the defer pattern was a remnant from earlier when I was calling Txn() directly without an assert)
however, the test still got stuck after adding the assert because of the defer betesting.Close(t, b)
at the start of the function, because it tries to write to an unbuffered channel here that is no longer read from anywhere
fixed both now, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what @serathius said is that the
assert.Panics
is capable to recover the panic. there is no need to use defer.
thx for the clarification. defer
is still a good pattern for such case, but not a big deal though.
server/etcdserver/txn/txn_test.go
Outdated
// we verify the following things below: | ||
// - server panics after a write txn aply fails (invariant being tested: server should never try to move on from a failed write) | ||
// - no writes from the txn are applied to the backend (invariant being tested: failed write should have no side-effect on DB state besides panic) | ||
defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not follow @serathius 's comment above.
- The purpose here is to verify the hash of the db did not change after the panicking. It's good!
- When panicking happens, any functions deferred are then executed as usual
/ok-to-test |
c3dc737
to
6825379
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @shyamjvs
Ideally it would be great if we could create an e2e or integration test to reproduce the partially committed/persisted issue.
@shyamjvs Please signoff the commit. Please read https://github.com/etcd-io/etcd/pull/18749/checks?check_run_id=31776671478 |
6825379
to
fd78ee4
Compare
done - thanks @ahrtr |
fd78ee4
to
714c57c
Compare
Signed-off-by: Shyam Jeedigunta <[email protected]>
714c57c
to
8a0fd66
Compare
/retest |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahrtr, serathius, shyamjvs The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Sorry for delay, sent out the backport for review. |
Fixes #18679
I was able to confirm this risk exists based on the unit test failing with txnWrite.End() still around. I did have to add a few milliseconds of wait time between the txn End and panic to reliably reproduce it because it depends on whether backend commit happens in that window or not.
/cc @serathius @ahrtr