etcd boot loop after update to 1.8.3 #9789

Open
bernardgut opened this issue Nov 24, 2024 · 1 comment

Comments

@bernardgut
Contributor

Bug Report

etcd enters a restart loop after upgrading Talos from 1.8.1 to 1.8.3. The node and the cluster never become ready.

Description

talosctl -n n1 service etcd
NODE                  2a02:XXX:0:ae1f:6bff:fe1e:8a4e
ID                    etcd
STATE                 Running
HEALTH                Fail
LAST HEALTH MESSAGE   service not running
EVENTS                [Running]: Started task etcd (PID 41598) for container etcd (2s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (7s ago)
                      [Running]: Started task etcd (PID 41429) for container etcd (10s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (15s ago)
                      [Running]: Started task etcd (PID 41118) for container etcd (18s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (23s ago)
                      [Running]: Started task etcd (PID 40882) for container etcd (25s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (31s ago)
                      [Running]: Started task etcd (PID 40653) for container etcd (33s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (38s ago)
                      [Running]: Started task etcd (PID 40453) for container etcd (41s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (46s ago)
                      [Running]: Started task etcd (PID 40299) for container etcd (49s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (54s ago)
                      [Running]: Started task etcd (PID 40090) for container etcd (57s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m2s ago)
                      [Running]: Started task etcd (PID 39906) for container etcd (1m5s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m10s ago)
                      [Running]: Started task etcd (PID 39755) for container etcd (1m12s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m18s ago)
                      [Running]: Started task etcd (PID 39585) for container etcd (1m20s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m26s ago)
                      [Running]: Started task etcd (PID 39387) for container etcd (1m28s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m33s ago)
                      [Running]: Started task etcd (PID 39032) for container etcd (1m36s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m42s ago)
                      [Running]: Started task etcd (PID 38771) for container etcd (1m44s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m49s ago)
                      [Running]: Started task etcd (PID 38603) for container etcd (1m52s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m57s ago)
                      [Running]: Started task etcd (PID 38405) for container etcd (1m59s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (2m5s ago)
                      [Running]: Started task etcd (PID 38216) for container etcd (2m7s ago)
                      [Waiting]: Runner Containerd(etcd) exited without error, going to restart it (2m13s ago)
                      [Running]: Started task etcd (PID 38027) for container etcd (2m15s ago)

Logs

The etcd logs repeat the following sequence:

2024-11-24T18:30:35.095127Z
[info] stopped stream reader with remote peer caller=rafthttp/stream.go:442 stream-reader-type=stream MsgApp v2 local-member-id=2c6041e2149212a6 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:35.095164Z
[info] stopped stream reader with remote peer caller=rafthttp/stream.go:442 stream-reader-type=stream Message local-member-id=2c6041e2149212a6 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.238416Z
[info] started to purge file caller=fileutil/purge.go:50 dir=/var/lib/etcd/member/snap suffix=snap max=5 interval=30s
2024-11-24T18:30:42.238322Z
[warn] server error caller=etcdserver/server.go:1154 error=the member has been permanently removed from the cluster
2024-11-24T18:30:42.238580Z
[warn] data-dir used by this member must be removed caller=etcdserver/server.go:1155
2024-11-24T18:30:42.238588Z
[info] started to purge file caller=fileutil/purge.go:50 dir=/var/lib/etcd/member/wal suffix=wal max=5 interval=30s
2024-11-24T18:30:42.238674Z
[warn] failed to publish local member to cluster through raft caller=etcdserver/server.go:2161 local-member-id=2c6041e2149212a6 local-member-attributes={Name:n1 ClientURLs:[https://192.168.1.20:2379 https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2379]} request-path=/0/members/2c6041e2149212a6/attributes publish-timeout=7s error=etcdserver: request cancelled
2024-11-24T18:30:42.238751Z
[warn] failed to publish local member to cluster through raft caller=etcdserver/server.go:2161 local-member-id=2c6041e2149212a6 local-member-attributes={Name:n1 ClientURLs:[https://192.168.1.20:2379 https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2379]} request-path=/0/members/2c6041e2149212a6/attributes publish-timeout=7s error=etcdserver: request cancelled
2024-11-24T18:30:42.238781Z
[warn] failed to publish local member to cluster through raft caller=etcdserver/server.go:2161 local-member-id=2c6041e2149212a6 local-member-attributes={Name:n1 ClientURLs:[https://192.168.1.20:2379 https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2379]} request-path=/0/members/2c6041e2149212a6/attributes publish-timeout=7s error=etcdserver: request cancelled
2024-11-24T18:30:42.238806Z
[warn] stopped publish because server is stopped caller=etcdserver/server.go:2151 local-member-id=2c6041e2149212a6 local-member-attributes={Name:n1 ClientURLs:[https://192.168.1.20:2379 https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2379]} publish-timeout=7s error=etcdserver: server stopped
2024-11-24T18:30:42.238848Z
[info] stopping remote peer caller=rafthttp/peer.go:330 remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.238876Z
[info] stopped TCP streaming connection with remote peer caller=rafthttp/stream.go:294 stream-writer-type=unknown stream remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.238929Z
[info] stopped TCP streaming connection with remote peer caller=rafthttp/stream.go:294 stream-writer-type=unknown stream remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.238962Z
[info] stopped HTTP pipelining with remote peer caller=rafthttp/pipeline.go:85 local-member-id=2c6041e2149212a6 remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.238971Z
[warn] server has stopped; skipping GoAttach caller=etcdserver/server.go:2825
2024-11-24T18:30:42.238987Z
[info] stopped stream reader with remote peer caller=rafthttp/stream.go:442 stream-reader-type=stream MsgApp v2 local-member-id=2c6041e2149212a6 remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.239097Z
[info] stopped stream reader with remote peer caller=rafthttp/stream.go:442 stream-reader-type=stream Message local-member-id=2c6041e2149212a6 remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.239125Z
[info] stopped remote peer caller=rafthttp/peer.go:335 remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.239122Z
[info] grpc service status changed caller=v3rpc/health.go:61 service= status=SERVING
2024-11-24T18:30:42.239139Z
[info] stopping remote peer caller=rafthttp/peer.go:330 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.239155Z
[info] stopped TCP streaming connection with remote peer caller=rafthttp/stream.go:294 stream-writer-type=unknown stream remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.239189Z
[info] stopped TCP streaming connection with remote peer caller=rafthttp/stream.go:294 stream-writer-type=unknown stream remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.239222Z
[info] stopped HTTP pipelining with remote peer caller=rafthttp/pipeline.go:85 local-member-id=2c6041e2149212a6 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.239290Z
[info] stopped stream reader with remote peer caller=rafthttp/stream.go:442 stream-reader-type=stream MsgApp v2 local-member-id=2c6041e2149212a6 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.239334Z
[info] stopped stream reader with remote peer caller=rafthttp/stream.go:442 stream-reader-type=stream Message local-member-id=2c6041e2149212a6 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.239400Z
[info] stopped remote peer caller=rafthttp/peer.go:335 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.241846Z
[warn] server has stopped; skipping GoAttach caller=etcdserver/server.go:2825
2024-11-24T18:30:42.241928Z
[info] grpc service status changed caller=v3rpc/health.go:61 service= status=SERVING
2024-11-24T18:30:42.242638Z
[info] starting with client TLS caller=embed/etcd.go:729 tls-info=cert = /system/secrets/etcd/server.crt, key = /system/secrets/etcd/server.key, client-cert=, client-key=, trusted-ca = /system/secrets/etcd/ca.crt, client-cert-auth = true, crl-file = cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
2024-11-24T18:30:42.242654Z
[info] serving peer traffic caller=embed/etcd.go:600 address=192.168.1.20:2380
2024-11-24T18:30:42.242681Z
[info] cmux::serve caller=embed/etcd.go:572 address=192.168.1.20:2380
2024-11-24T18:30:42.242796Z
[info] serving peer traffic caller=embed/etcd.go:600 address=[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2380
2024-11-24T18:30:42.242828Z
[info] cmux::serve caller=embed/etcd.go:572 address=[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2380
2024-11-24T18:30:42.243204Z
[info] now serving peer/client/metrics caller=embed/etcd.go:280 local-member-id=2c6041e2149212a6 initial-advertise-peer-urls=http://localhost:2380 listen-peer-urls=https://192.168.1.20:2380,https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2380 advertise-client-urls=https://192.168.1.20:2379,https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2379 listen-client-urls=https://192.168.1.20:2379,https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2379,https://[::1]:2379 listen-metrics-urls=http://[::]:2381
2024-11-24T18:30:42.243257Z
[info] serving metrics caller=embed/etcd.go:871 address=http://[::]:2381
2024-11-24T18:30:42.245270Z
[info] notifying init daemon caller=etcdmain/main.go:44
2024-11-24T18:30:42.245295Z
[info] successfully notified init daemon caller=etcdmain/main.go:50
...

Environment

  • Talos version: 1.8.3
  • Kubernetes version: 1.31.1
  • Platform: bare-metal
@smira
Member

smira commented Nov 25, 2024

[warn] server error caller=etcdserver/server.go:1154 error=the member has been permanently removed from the cluster

This is the root cause, I guess. Talos doesn't leave etcd on upgrades in 1.8.x, so the issue with etcd was already there before the upgrade.
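
To spot the same warning in a longer log dump, a small filter can help. This is a minimal Python sketch, assuming the etcd logs were saved as plain text. The function name and constants are illustrative only, and the marker strings are copied from the log excerpt in this issue:

```python
# Marker strings copied from the etcd log excerpt in this issue (assumed
# to be stable across the etcd version in use; adjust if yours differ).
REMOVED_MARKER = "the member has been permanently removed from the cluster"
DATA_DIR_MARKER = "data-dir used by this member must be removed"


def find_removed_member_warnings(log_text):
    """Return the [warn] lines that say this member was removed."""
    hits = []
    for line in log_text.splitlines():
        line = line.strip()
        # Only warnings are of interest; timestamps and [info] lines are skipped.
        if not line.startswith("[warn]"):
            continue
        if REMOVED_MARKER in line or DATA_DIR_MARKER in line:
            hits.append(line)
    return hits


sample = """\
2024-11-24T18:30:42.238322Z
[warn] server error caller=etcdserver/server.go:1154 error=the member has been permanently removed from the cluster
2024-11-24T18:30:42.238580Z
[warn] data-dir used by this member must be removed caller=etcdserver/server.go:1155
2024-11-24T18:30:42.238588Z
[info] started to purge file caller=fileutil/purge.go:50 dir=/var/lib/etcd/member/wal suffix=wal max=5 interval=30s
"""

for hit in find_removed_member_warnings(sample):
    print(hit)
```

If the filter returns any lines, the restart loop is probably a symptom of the removed member rather than of the upgrade itself.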

The documentation covers the full recovery procedure, but with 3+ control plane nodes, first analyze the members with talosctl etcd members and figure out whether quorum is still there. If quorum is OK, you can try to bring back the dead member by wiping its state with talosctl reset --nodes <BAD_MEMBER> --system-labels-to-wipe=EPHEMERAL --reboot --graceful=false.
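
The quorum check can be sketched as simple arithmetic. etcd needs a majority of its members, floor(n/2) + 1, to be healthy. The Python below is only an illustration: the function names are hypothetical, and both counts are assumed to be determined by hand (the member count from the talosctl etcd members output, the healthy count from your own inspection of each node):

```python
def quorum_size(total_members):
    """Minimum number of healthy members etcd needs: floor(n/2) + 1."""
    if total_members < 1:
        raise ValueError("cluster must have at least one member")
    return total_members // 2 + 1


def has_quorum(total_members, healthy_members):
    """True if enough members are healthy to form a majority."""
    return healthy_members >= quorum_size(total_members)


# Three-member cluster with one dead member: quorum still holds.
print(has_quorum(3, 2))  # True
# Three-member cluster with two dead members: quorum is lost.
print(has_quorum(3, 1))  # False
```

Wiping the bad member only makes sense in the first case. If quorum is lost, the full recovery procedure from the documentation applies instead.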
