
fix(defrag): handle no space left error #18822

Draft
ghouscht wants to merge 3 commits into main

Conversation

@ghouscht (Contributor) commented Nov 1, 2024

This PR contains an e2e test, a gofail failpoint, and a fix for the issue described in #18810.
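
For background on the failpoint mechanism: gofail failpoints are declared as magic comments in the source and are only compiled in when the tree is processed with `gofail enable`. A minimal sketch of such a declaration (the name `defragBeforeRename` and its placement are illustrative, not necessarily what this PR adds):

```go
func (b *backend) defrag() error {
	// ...
	// gofail: var defragBeforeRename struct{}
	// After `gofail enable`, the comment above becomes an injection point;
	// a test can then trigger it at runtime (e.g. inject a panic or a
	// sleep) through the gofail HTTP endpoint when the server is started
	// with GOFAIL_HTTP set.
	// ...
	return nil
}
```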

Without the fix, the test triggers a nil pointer panic in etcd, as described in the linked issue:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: execute job failed
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x787ad8]

goroutine 136 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x80?, 0x2?, {0xf?, 0x0?, 0x0?})
	go.uber.org/[email protected]/zapcore/entry.go:196 +0x78
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0x40005d81a0, {0x400058a780, 0x2, 0x2})
	go.uber.org/[email protected]/zapcore/entry.go:262 +0x1c4
go.uber.org/zap.(*Logger).Panic(0xcd8746?, {0xce7440?, 0xb54f60?}, {0x400058a780, 0x2, 0x2})
	go.uber.org/[email protected]/logger.go:285 +0x54
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob.func1()
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:202 +0x24c
panic({0xb54f60?, 0x1681ec0?})
	runtime/panic.go:785 +0x124
go.etcd.io/bbolt.(*Tx).Bucket(...)
	go.etcd.io/[email protected]/tx.go:112
go.etcd.io/etcd/server/v3/storage/backend.unsafeForEach(0x0, {0xe9b988?, 0x1695e60?}, 0x4000580460)
	go.etcd.io/etcd/server/v3/storage/backend/batch_tx.go:235 +0x38
go.etcd.io/etcd/server/v3/storage/backend.(*batchTx).UnsafeForEach(...)
	go.etcd.io/etcd/server/v3/storage/backend/batch_tx.go:231
go.etcd.io/etcd/server/v3/storage/backend.unsafeVerifyTxConsistency({0xea55e0, 0x40000bafc0}, {0xe9b988, 0x1695e60})
	go.etcd.io/etcd/server/v3/storage/backend/verify.go:97 +0xa4
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll.VerifyBackendConsistency.func1()
	go.etcd.io/etcd/server/v3/storage/backend/verify.go:90 +0x244
go.etcd.io/etcd/client/pkg/v3/verify.Verify(0x40000efdf8)
	go.etcd.io/etcd/client/pkg/[email protected]/verify/verify.go:71 +0x3c
go.etcd.io/etcd/server/v3/storage/backend.VerifyBackendConsistency(...)
	go.etcd.io/etcd/server/v3/storage/backend/verify.go:75
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll(0x40003a3c08, 0x4000172180, 0x40005da000)
	go.etcd.io/etcd/server/v3/etcdserver/server.go:972 +0xcc
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func6({0x4000196850?, 0x4000494a80?})
	go.etcd.io/etcd/server/v3/etcdserver/server.go:847 +0x28
go.etcd.io/etcd/pkg/v3/schedule.job.Do(...)
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:41
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob(0x40000eff70?, {0xe943f8?, 0x40005b6330?}, 0x0?)
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:206 +0x78
go.etcd.io/etcd/pkg/v3/schedule.(*fifo).run(0x400039a0e0)
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:187 +0x15c
created by go.etcd.io/etcd/pkg/v3/schedule.NewFIFOScheduler in goroutine 155
	go.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:101 +0x178

I think from here we can discuss potential solutions to the problem; @ahrtr already suggested two possible options in the linked issue.

As mentioned in #18822 (comment), the PR now restores the environment and lets etcd continue to run.

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

Signed-off-by: Thomas Gosteli <[email protected]>
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ghouscht
Once this PR has been reviewed and has the lgtm label, please assign jmhbnz for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @ghouscht. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov-commenter commented Nov 1, 2024

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 68.80%. Comparing base (3de0018) to head (f4aa022).
Report is 10 commits behind head on main.

Current head f4aa022 differs from pull request most recent head 28a2e22

Please upload reports for the commit 28a2e22 to get more accurate results.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| server/storage/backend/backend.go | 60.00% | 2 Missing ⚠️ |


Additional details and impacted files

| Files with missing lines | Coverage Δ |
| --- | --- |
| server/storage/backend/backend.go | 82.87% <60.00%> (-0.46%) ⬇️ |

... and 22 files with indirect coverage changes

```diff
@@            Coverage Diff             @@
##             main   #18822      +/-   ##
==========================================
+ Coverage   68.76%   68.80%   +0.04%
==========================================
  Files         420      420
  Lines       35523    35526       +3
==========================================
+ Hits        24426    24445      +19
+ Misses       9665     9660       -5
+ Partials     1432     1421      -11
```
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 3de0018...28a2e22.

@ahrtr (Member) commented Nov 3, 2024

The e2e test looks good.

The proposed solution is to restore the environment (i.e. reopen the bbolt DB) when defragmentation fails, and to panic if the restore fails as well. If the bbolt DB can't be reopened, etcdserver can't serve any requests, so it makes sense to panic. cc @fuweid @ivanvc @jmhbnz @serathius @tjungblu
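
A minimal sketch of that proposal (the helper `reopenBoltDB` is hypothetical, not the PR's actual code; `b.lg` is assumed to be the backend's zap logger):

```go
// Restore-or-panic sketch: if defrag fails, try to bring the backend
// back to a usable state; if that also fails, panic, because etcdserver
// cannot serve any request without an open bbolt DB.
func (b *backend) Defrag() error {
	err := b.defrag()
	if err == nil {
		return nil
	}
	if rerr := b.reopenBoltDB(); rerr != nil { // hypothetical restore helper
		b.lg.Panic(
			"failed to restore backend after defrag failure",
			zap.Error(err),
			zap.NamedError("restore-error", rerr),
		)
	}
	return err
}
```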

@k8s-ci-robot k8s-ci-robot added size/L and removed size/M labels Nov 4, 2024
@ghouscht (Contributor, Author) commented Nov 4, 2024

> The e2e test looks good.
>
> The proposed solution is to restore the environment (i.e. reopen the bbolt DB) when defragmentation fails, and to panic if the restore fails as well. If the bbolt DB can't be reopened, etcdserver can't serve any requests, so it makes sense to panic. cc @fuweid @ivanvc @jmhbnz @serathius @tjungblu

I added a second commit that contains a working implementation of a possible restore operation. I did some manual testing with the failpoint and the e2e test, and it seems to work. However, this opens up a number of other potential problems; I highlighted some of them with TODOs in the code. Feedback appreciated 🙂

```diff
@@ -455,7 +456,61 @@ func (b *backend) Commits() int64 {
 }
 
 func (b *backend) Defrag() error {
-	return b.defrag()
+	err := b.defrag()
+	if err != nil {
```
@ghouscht (Contributor, Author):

Instead of doing the restore here, we could probably place it inside a defer in the defrag() function. However, if we do so, we need to be careful about releasing the locks: the order of deferred function execution matters in that case (a sketch below illustrates the ordering).
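
To illustrate the ordering concern, here is a generic, runnable sketch (not etcd's code): defers run last-in, first-out, so a restore defer registered after the unlock defers executes before them, i.e. while the locks are still held.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// defragLike shows defer ordering: the Unlock defer is registered first
// and therefore runs last; the restore defer is registered last and
// therefore runs first, while the lock is still held.
func defragLike(mu *sync.Mutex) (err error) {
	mu.Lock()
	defer mu.Unlock() // registered first, runs last
	defer func() {    // registered last, runs first (lock still held)
		if err != nil {
			fmt.Println("restore runs here, still under the lock")
		}
	}()
	return errors.New("no space left on device")
}

func main() {
	var mu sync.Mutex
	fmt.Println(defragLike(&mu))
}
```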

A project member replied:

Not sure why we do generic error handling here. Why not identify where b.defrag() generates the error and improve the error handling there?

@ghouscht (Contributor, Author) replied:

Sure, that's an option; I just thought you were looking for generic error handling.

I already know where the error happens, so we can improve that. I'll add another commit.

@ghouscht (Contributor, Author) commented Nov 5, 2024:

Added 1fb2064, which addresses your comment and handles only the specific error; the e2e test is adapted as well. Manual testing and the e2e test seem to confirm that this works.

Please let me know if this is OK and how to continue.
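
For reference, a minimal sketch of how the "no space left" condition can be detected specifically (that the commit matches on syscall.ENOSPC is my assumption; the actual check in 1fb2064 may differ):

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"syscall"
)

// isNoSpaceLeft reports whether err (possibly wrapped, e.g. in a
// *fs.PathError) is "no space left on device", so that only this
// failure triggers the restore path.
func isNoSpaceLeft(err error) bool {
	return errors.Is(err, syscall.ENOSPC)
}

func main() {
	err := &fs.PathError{Op: "write", Path: "/var/lib/etcd/db", Err: syscall.ENOSPC}
	fmt.Println(isNoSpaceLeft(err)) // true: errors.Is unwraps the PathError
}
```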

@k8s-ci-robot k8s-ci-robot added size/M and removed size/L labels Nov 5, 2024
Comment on lines +488 to +490:

```go
// Commit/stop and then reset current transactions (including the readTx)
b.batchTx.unsafeCommit(true)
b.batchTx.tx = nil
```
@ghouscht (Contributor, Author):

Moving this down here ensures that no special error handling is needed in case os.CreateTemp fails (see the sketch below).
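
A simplified sketch of that ordering (hypothetical signature and variable names, not the actual defrag implementation): create the temp file first, so a CreateTemp failure returns early, before any transaction has been committed or reset.

```go
// Ordering sketch: fail fast on CreateTemp before touching the live
// transactions, so the error path has nothing to undo.
func (b *backend) defragSketch(dir string) error {
	tmpFile, err := os.CreateTemp(dir, "db.tmp.*")
	if err != nil {
		return err // nothing has been committed or reset yet
	}
	defer tmpFile.Close()

	// Only now commit/stop and reset the current transactions
	// (including the readTx), as in the snippet above.
	b.batchTx.unsafeCommit(true)
	b.batchTx.tx = nil

	// ... defrag into tmpFile, then swap it in ...
	return nil
}
```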

@ghouscht changed the title from "chore: e2e test defrag no space" to "fix(defrag): handle no space left error" on Nov 5, 2024