Etcd crashes when running out of space during defrag #18810

serathius · 2024-10-30T20:00:42Z

Bug report criteria

This bug report is not security related, security issues should be disclosed privately via etcd maintainers.
This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
You have read the etcd bug reporting guidelines.
Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.

What happened?

Etcd crashes with a nil pointer dereference after defrag fails with "no space left on device". Logs and stack trace:

{"level":"info","ts":"2024-10-30T20:53:52.006800+0100","caller":"v3rpc/maintenance.go:110","msg":"starting defragment"}
{"level":"info","ts":"2024-10-30T20:53:52.008359+0100","caller":"backend/backend.go:509","msg":"defragmenting","path":"limited_disk_64MB/member/snap/db","current-db-size-bytes":2334720,"current-db-size":"2.3 MB","current-db-size-in-use-bytes":2236416,"current-db-size-in-use":"2.2 MB"}
{"level":"warn","ts":"2024-10-30T20:53:52.024275+0100","caller":"v3rpc/maintenance.go:115","msg":"failed to defragment","error":"write limited_disk_64MB/member/snap/db.tmp.172848138: no space left on device"}
{"level":"info","ts":"2024-10-30T20:53:52.024402+0100","caller":"v3rpc/health.go:63","msg":"grpc service status changed","service":"","status":"SERVING"}
{"level":"warn","ts":"2024-10-30T20:53:52.025202+0100","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:65","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00022c5a0/127.0.0.1:2379","method":"/etcdserverpb.Maintenance/Defragment","attempt":0,"error":"rpc error: code = Unknown desc = write limited_disk_64MB/member/snap/db.tmp.172848138: no space left on device"}
Failed to defragment etcd member[127.0.0.1:2379]. took 26.406294ms. (rpc error: code = Unknown desc = write limited_disk_64MB/member/snap/db.tmp.172848138: no space left on device)

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xbf2133]

goroutine 175 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
        go.etcd.io/[email protected]/tx.go:112
go.etcd.io/etcd/server/v3/storage/backend.(*baseReadTx).UnsafeRange(0xc0001bee10, {0x133d608, 0x1b48ba0}, {0x1ae54d0, 0xe, 0xe}, {0x0, 0x0, 0x0}, 0x1)
        go.etcd.io/etcd/server/v3/storage/backend/read_tx.go:103 +0x233
go.etcd.io/etcd/server/v3/storage/schema.UnsafeReadStorageVersion({0x7fbdf4454a78?, 0xc0001bee10?})
        go.etcd.io/etcd/server/v3/storage/schema/version.go:35 +0x5d
go.etcd.io/etcd/server/v3/storage/schema.UnsafeDetectSchemaVersion(0xc000408100, {0x7fbdf4454a78, 0xc0001bee10})
        go.etcd.io/etcd/server/v3/storage/schema/schema.go:94 +0x47
go.etcd.io/etcd/server/v3/storage/schema.DetectSchemaVersion(0xc000408100, {0x133d678, 0xc0001bee10})
        go.etcd.io/etcd/server/v3/storage/schema/schema.go:89 +0xdd
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).StorageVersion(0xc000103808)
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2243 +0xf6
go.etcd.io/etcd/server/v3/etcdserver.(*serverVersionAdapter).GetStorageVersion(0xc00079fee0?)
        go.etcd.io/etcd/server/v3/etcdserver/adapters.go:77 +0x16
go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateStorageVersionIfNeeded(0xc00079ff70)
        go.etcd.io/etcd/server/v3/etcdserver/version/monitor.go:112 +0x5d
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorStorageVersion(0xc000103808)
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2286 +0x93
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1()
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2467 +0x53
created by go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach in goroutine 1
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2465 +0xf3

What did you expect to happen?

Etcd should not crash. It should do one of the following:

[ok] abort defrag if we run out of disk space
[better] preallocate data needed before starting defrag
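As a rough illustration of the "[ok]" option, a pre-check could compare free disk space with the in-use size of the db before defrag starts. This is a minimal sketch, not etcd code: `freeBytes` and `canDefrag` are hypothetical names, the check is Unix-only because it relies on `statfs`, and using the in-use db size as the space estimate is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// freeBytes reports the bytes available to unprivileged users on the
// filesystem that contains dir. Unix-only: it relies on statfs.
func freeBytes(dir string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(dir, &st); err != nil {
		return 0, err
	}
	return uint64(st.Bavail) * uint64(st.Bsize), nil
}

// canDefrag is a hypothetical pre-check. It assumes the temporary copy
// written by defrag needs at most inUse bytes.
func canDefrag(free, inUse uint64) error {
	if free < inUse {
		return fmt.Errorf("defrag aborted: need %d bytes, only %d free", inUse, free)
	}
	return nil
}

func main() {
	dir := "." // pass the data dir as the first argument to check it instead
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	free, err := freeBytes(dir)
	if err != nil {
		fmt.Println("statfs failed:", err)
		return
	}
	// 2236416 is current-db-size-in-use-bytes from the log above.
	if err := canDefrag(free, 2236416); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("enough space to start defrag")
}
```

A check like this is racy, since the WAL or another process can consume the space between the check and the copy. That is why preallocating the temporary file is the stronger option.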

How can we reproduce it (as minimally and precisely as possible)?

Run etcd with its data directory on a 64 MB tmpfs (about 62 MB of it is used by the WAL):

mkdir -p limited_disk_64MB
sudo mount -t tmpfs -o size=64M tmpfs limited_disk_64MB/
./bin/etcd --data-dir limited_disk_64MB

In a separate terminal, load etcd with about 2 MB of data, then run defrag:

for num in {1..20}; do
  ./bin/etcdctl put a `tr -dc A-Za-z0-9 </dev/urandom | head -c 100000`
done
./bin/etcdctl defrag
./bin/etcdctl put a 1

Expect defrag to fail because there is no space left, and etcd to crash the next time it touches the backend. Sometimes an additional put call is needed to make etcd access the db and crash.

Clean up the mount:

sudo umount limited_disk_64MB
rm -rf limited_disk_64MB

Anything else we need to know?

No response

Etcd version (please run commands below)

Reproduced on all latest branches.

Etcd configuration (command line flags or environment variables)

Just ensure that --data-dir points to a directory with limited disk space.

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response


serathius · 2024-10-30T20:01:10Z

cc @ahrtr

ahrtr · 2024-10-30T22:54:50Z

The reason is that defragment closes the backend db (bbolt), but it doesn't reopen it when it fails for whatever reason. So etcdserver panics when other jobs access the backend db.

The immediate solution I can think of is to restore the environment (i.e. reopen the backend db) if defrag fails for whatever reason, and

either panic if restoring the environment fails,
or add protection when accessing the backend, i.e. return an error if the backend has been closed.

It should be very easy to reproduce this issue by adding a failpoint similar to db.go#L490-L491. Note: do not panic in the failpoint; return an error instead.

ghouscht · 2024-11-01T09:12:19Z

/assign

I'll take a look at this. No guarantee that I'm able to fix it but I'll give it a try 🙂

serathius added the type/bug label Oct 30, 2024

serathius added release/v3.6 release/v3.5 release/v3.4 labels Oct 30, 2024

serathius mentioned this issue Oct 30, 2024

Add out of space failpoint to robustness #18811

Open

serathius added the help wanted label Oct 31, 2024

k8s-ci-robot assigned ghouscht Nov 1, 2024

ghouscht mentioned this issue Nov 1, 2024

fix(defrag): handle no space left error #18822

Draft
