Etcd crashes when running out of space during defrag #18810

Open
serathius opened this issue Oct 30, 2024 · 3 comments
serathius commented Oct 30, 2024

Bug report criteria

What happened?

Etcd crashes with the following stack trace:

{"level":"info","ts":"2024-10-30T20:53:52.006800+0100","caller":"v3rpc/maintenance.go:110","msg":"starting defragment"}
{"level":"info","ts":"2024-10-30T20:53:52.008359+0100","caller":"backend/backend.go:509","msg":"defragmenting","path":"limited_disk_64MB/member/snap/db","current-db-size-bytes":2334720,"current-db-size":"2.3 MB","current-db-size-in-use-bytes":2236416,"current-db-size-in-use":"2.2 MB"}
{"level":"warn","ts":"2024-10-30T20:53:52.024275+0100","caller":"v3rpc/maintenance.go:115","msg":"failed to defragment","error":"write limited_disk_64MB/member/snap/db.tmp.172848138: no space left on device"}
{"level":"info","ts":"2024-10-30T20:53:52.024402+0100","caller":"v3rpc/health.go:63","msg":"grpc service status changed","service":"","status":"SERVING"}
{"level":"warn","ts":"2024-10-30T20:53:52.025202+0100","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:65","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00022c5a0/127.0.0.1:2379","method":"/etcdserverpb.Maintenance/Defragment","attempt":0,"error":"rpc error: code = Unknown desc = write limited_disk_64MB/member/snap/db.tmp.172848138: no space left on device"}
Failed to defragment etcd member[127.0.0.1:2379]. took 26.406294ms. (rpc error: code = Unknown desc = write limited_disk_64MB/member/snap/db.tmp.172848138: no space left on device)

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xbf2133]

goroutine 175 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
        go.etcd.io/[email protected]/tx.go:112
go.etcd.io/etcd/server/v3/storage/backend.(*baseReadTx).UnsafeRange(0xc0001bee10, {0x133d608, 0x1b48ba0}, {0x1ae54d0, 0xe, 0xe}, {0x0, 0x0, 0x0}, 0x1)
        go.etcd.io/etcd/server/v3/storage/backend/read_tx.go:103 +0x233
go.etcd.io/etcd/server/v3/storage/schema.UnsafeReadStorageVersion({0x7fbdf4454a78?, 0xc0001bee10?})
        go.etcd.io/etcd/server/v3/storage/schema/version.go:35 +0x5d
go.etcd.io/etcd/server/v3/storage/schema.UnsafeDetectSchemaVersion(0xc000408100, {0x7fbdf4454a78, 0xc0001bee10})
        go.etcd.io/etcd/server/v3/storage/schema/schema.go:94 +0x47
go.etcd.io/etcd/server/v3/storage/schema.DetectSchemaVersion(0xc000408100, {0x133d678, 0xc0001bee10})
        go.etcd.io/etcd/server/v3/storage/schema/schema.go:89 +0xdd
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).StorageVersion(0xc000103808)
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2243 +0xf6
go.etcd.io/etcd/server/v3/etcdserver.(*serverVersionAdapter).GetStorageVersion(0xc00079fee0?)
        go.etcd.io/etcd/server/v3/etcdserver/adapters.go:77 +0x16
go.etcd.io/etcd/server/v3/etcdserver/version.(*Monitor).UpdateStorageVersionIfNeeded(0xc00079ff70)
        go.etcd.io/etcd/server/v3/etcdserver/version/monitor.go:112 +0x5d
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).monitorStorageVersion(0xc000103808)
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2286 +0x93
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1()
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2467 +0x53
created by go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach in goroutine 1
        go.etcd.io/etcd/server/v3/etcdserver/server.go:2465 +0xf3

What did you expect to happen?

Etcd should not crash. It should either:

  • [ok] abort the defrag if it runs out of disk space
  • [better] preallocate the space needed before starting the defrag (see the sketch below)
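
A minimal sketch of the second option, assuming a Linux-only free-space check via golang.org/x/sys/unix; the function name, threshold, and placement are illustrative and not part of etcd:

// Hypothetical pre-flight check: refuse to start a defrag unless the
// filesystem holding the db has enough free space for a full copy.
package defragcheck

import (
    "fmt"

    "golang.org/x/sys/unix"
)

// ensureSpaceForDefrag returns an error if the filesystem containing
// dbPath has less free space than the current db size plus a margin,
// since defrag writes a full copy into a temporary file next to the db.
func ensureSpaceForDefrag(dbPath string, dbSize uint64) error {
    var st unix.Statfs_t
    if err := unix.Statfs(dbPath, &st); err != nil {
        return fmt.Errorf("statfs %s: %w", dbPath, err)
    }
    free := st.Bavail * uint64(st.Bsize)
    needed := dbSize + dbSize/10 // full copy plus ~10% safety margin
    if free < needed {
        return fmt.Errorf("not enough free space for defrag: need %d bytes, have %d", needed, free)
    }
    return nil
}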

How can we reproduce it (as minimally and precisely as possible)?

Run etcd with its data dir on a 64 MB filesystem (62 MB of which is used by the WAL):

mkdir -p limited_disk_64MB
sudo mount -t tmpfs -o size=64M tmpfs limited_disk_64MB/
./bin/etcd --data-dir limited_disk_64MB

In a separate terminal, load etcd with ~2 MB of data and run a defrag:

for num in {1..20}; do
  ./bin/etcdctl put a `tr -dc A-Za-z0-9 </dev/urandom | head -c 100000`
done
./bin/etcdctl defrag
./bin/etcdctl put a 1

Expect the defrag to fail due to lack of space, and etcd to crash the next time it touches the backend. Sometimes an additional put call is needed to make etcd access the db and crash.

Clean up the mount:

sudo umount limited_disk_64MB
rm -rf limited_disk_64MB

Anything else we need to know?

No response

Etcd version (please run commands below)

Reproduced on all latest branches.

Etcd configuration (command line flags or environment variables)

Just ensure that --data-dir points to a directory with limited disk space.

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

@serathius

cc @ahrtr

ahrtr commented Oct 30, 2024

The reason is that defragment closes the backend db (bbolt) but doesn't reopen it when the defragment fails for whatever reason, so etcdserver panics when other jobs access the backend db.

The immediate solution that I can think of is to restore the environment (i.e. reopen the backend db) if defrag fails for whatever reason (see the sketch below), and then:

  • either panic if it fails to restore the environment,
  • or add protection when accessing the backend, i.e. return an error if the backend has been closed.
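
A rough sketch of that idea, assuming a heavily simplified backend type; the names and structure here are illustrative and do not match etcd's actual storage/backend code:

package backendsketch

import (
    "fmt"
    "os"

    bolt "go.etcd.io/bbolt"
)

type backend struct {
    db *bolt.DB
}

// defrag copies the db into a temporary file and swaps it in. If the copy
// fails (e.g. ENOSPC), it restores the environment by reopening the
// original db instead of leaving the backend closed, so that later reads
// and writes don't hit a closed db and panic.
func (b *backend) defrag(path string) error {
    tmpPath := path + ".tmp"

    // The real defrag closes the current db before swapping files.
    if err := b.db.Close(); err != nil {
        return fmt.Errorf("close db: %w", err)
    }

    if err := copyDB(path, tmpPath); err != nil {
        os.Remove(tmpPath)
        // Restore the environment: reopen the original db.
        db, rerr := bolt.Open(path, 0600, &bolt.Options{})
        if rerr != nil {
            // Could not restore; panicking (or marking the backend as
            // unusable so that callers get an error) is what remains.
            panic(fmt.Sprintf("failed to reopen db after failed defrag: %v", rerr))
        }
        b.db = db
        return fmt.Errorf("defrag failed, backend restored: %w", err)
    }

    if err := os.Rename(tmpPath, path); err != nil {
        return fmt.Errorf("rename defragmented db: %w", err)
    }
    db, err := bolt.Open(path, 0600, &bolt.Options{})
    if err != nil {
        return fmt.Errorf("reopen db after defrag: %w", err)
    }
    b.db = db
    return nil
}

// copyDB stands in for the real copy loop; it is assumed to return an
// error such as ENOSPC when the disk fills up.
func copyDB(src, dst string) error { return nil }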

It should be very easy to reproduce this issue by adding a failpoint similar to db.go#L490-L491 (a sketch follows below). Note: don't panic in the failpoint; return an error instead.
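
A minimal sketch of what such a failpoint might look like, using the gofail comment convention etcd already relies on for its other failpoints; the failpoint name, the simplified function, and its placement are illustrative rather than taken from the actual backend code:

package backendsketch

import (
    "fmt"

    bolt "go.etcd.io/bbolt"
)

// defragdb stands in (much simplified) for the helper that copies all
// buckets and keys from the old db into the temporary db during defrag.
// The failpoint name defragErrDuringCopy is hypothetical; only the
// "// gofail: ..." comment convention itself comes from gofail.
func defragdb(odb, tmpdb *bolt.DB) error {
    // gofail: var defragErrDuringCopy string
    // return fmt.Errorf("failpoint injected: %s", defragErrDuringCopy)

    // The real code iterates over every bucket and key in odb and writes
    // it into tmpdb; that is where ENOSPC surfaces when the disk fills up.
    if odb == nil || tmpdb == nil {
        return fmt.Errorf("nil db handle")
    }
    return nil
}

With gofail enabled at build time, the failpoint can then be activated at runtime (for example via the GOFAIL_FAILPOINTS environment variable), so the copy step returns an error instead of panicking, reproducing the failure mode above without actually filling a disk.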

ghouscht commented Nov 1, 2024

/assign

I'll take a look at this. No guarantee that I'll be able to fix it, but I'll give it a try 🙂
