Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix risk of a partial write txn being applied #18749

Merged
merged 1 commit into from
Oct 24, 2024

Conversation

shyamjvs
Copy link
Contributor

Fixes #18679

I was able to confirm this risk exists based on the unit test failing with txnWrite.End() still around. I did have to add a few milliseconds of wait time between the txn End and panic to reliably reproduce it because it depends on whether backend commit happens in that window or not.

/cc @serathius @ahrtr

@k8s-ci-robot
Copy link

Hi @shyamjvs. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov-commenter
Copy link

codecov-commenter commented Oct 17, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.72%. Comparing base (bb381d4) to head (aa8c785).

Current head aa8c785 differs from pull request most recent head 8a0fd66

Please upload reports for the commit 8a0fd66 to get more accurate results.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
Files with missing lines Coverage Δ
server/etcdserver/txn/txn.go 92.69% <ø> (-0.46%) ⬇️

... and 18 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18749      +/-   ##
==========================================
+ Coverage   68.68%   68.72%   +0.03%     
==========================================
  Files         420      420              
  Lines       35504    35503       -1     
==========================================
+ Hits        24385    24398      +13     
+ Misses       9685     9667      -18     
- Partials     1434     1438       +4     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb381d4...8a0fd66. Read the comment docs.

// we verify the following things below:
// - server panics after a write txn aply fails (invariant being tested: server should never try to move on from a failed write)
// - no writes from the txn are applied to the backend (invariant being tested: failed write should have no side-effect on DB state besides panic)
defer func() {
Copy link
Member

@serathius serathius Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need to defer? assert.Panics should handle panic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not follow @serathius 's comment above.

  • The purpose here is to verify the hash of the db did not change after the panicking. It's good!
  • When panicking happens, any functions deferred are then executed as usual

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's correct, test function stops executing after assert.panic - so further verification needs to happen either inside a defer or a t.Cleanup()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what @serathius said is that the assert.Panics is capable to recover the panic. there is no need to use defer.
We should check the hash after assert.Panics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@chaochn47 chaochn47 Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shyam - if the initial intended further assertion requires opening a bbolt file, it is protected by the advisory flock so it appears that so further verification could be added after assert.Panics.

However, this advisory lock does not block you from opening the file and perform checksum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I stand corrected - the test indeed continues after the assert (the defer pattern was a remnant from earlier when I was calling Txn() directly without an assert)

however, the test still got stuck after adding the assert because of the defer betesting.Close(t, b) at the start of the function, because it tries to write to an unbuffered channel here that is no longer read from anywhere

fixed both now, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what @serathius said is that the assert.Panics is capable to recover the panic. there is no need to use defer.

thx for the clarification. defer is still a good pattern for such case, but not a big deal though.

server/etcdserver/txn/txn_test.go Outdated Show resolved Hide resolved
server/etcdserver/txn/txn_test.go Outdated Show resolved Hide resolved
server/etcdserver/txn/txn_test.go Outdated Show resolved Hide resolved
// we verify the following things below:
// - server panics after a write txn aply fails (invariant being tested: server should never try to move on from a failed write)
// - no writes from the txn are applied to the backend (invariant being tested: failed write should have no side-effect on DB state besides panic)
defer func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not follow @serathius 's comment above.

  • The purpose here is to verify the hash of the db did not change after the panicking. It's good!
  • When panicking happens, any functions deferred are then executed as usual

@ahrtr
Copy link
Member

ahrtr commented Oct 18, 2024

/ok-to-test

@shyamjvs shyamjvs force-pushed the fix-partial-txn-risk branch 3 times, most recently from c3dc737 to 6825379 Compare October 19, 2024 20:56
Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @shyamjvs

Ideally it would be great if we could create an e2e or integration test to reproduce the partially committed/persisted issue.

@ahrtr
Copy link
Member

ahrtr commented Oct 20, 2024

@shyamjvs Please signoff the commit. Please read https://github.com/etcd-io/etcd/pull/18749/checks?check_run_id=31776671478

@shyamjvs
Copy link
Contributor Author

done - thanks @ahrtr

@shyamjvs
Copy link
Contributor Author

/retest

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, serathius, shyamjvs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@serathius serathius merged commit 38c27a4 into etcd-io:main Oct 24, 2024
32 checks passed
@shyamjvs shyamjvs deleted the fix-partial-txn-risk branch October 24, 2024 16:38
@ahrtr
Copy link
Member

ahrtr commented Oct 26, 2024

@shyamjvs do you have bandwidth to backport the PR to 3.5?

FYI. The PR (issue) was also backported to 3.5 in #14178

@shyamjvs
Copy link
Contributor Author

Sorry for delay, sent out the backport for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Write txn shouldn't End() on a failure
8 participants