etcd goes into a boot loop after upgrade from Talos 1.8.1->1.8.3. Node/Cluster never becomes ready.
Description
talosctl -n n1 service etcd (⎈|sekops-omni-p0:default)
NODE 2a02:XXX:0:ae1f:6bff:fe1e:8a4e
ID etcd
STATE Running
HEALTH Fail
LAST HEALTH MESSAGE service not running
EVENTS [Running]: Started task etcd (PID 41598) for container etcd (2s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (7s ago)
[Running]: Started task etcd (PID 41429) for container etcd (10s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (15s ago)
[Running]: Started task etcd (PID 41118) for container etcd (18s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (23s ago)
[Running]: Started task etcd (PID 40882) for container etcd (25s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (31s ago)
[Running]: Started task etcd (PID 40653) for container etcd (33s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (38s ago)
[Running]: Started task etcd (PID 40453) for container etcd (41s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (46s ago)
[Running]: Started task etcd (PID 40299) for container etcd (49s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (54s ago)
[Running]: Started task etcd (PID 40090) for container etcd (57s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m2s ago)
[Running]: Started task etcd (PID 39906) for container etcd (1m5s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m10s ago)
[Running]: Started task etcd (PID 39755) for container etcd (1m12s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m18s ago)
[Running]: Started task etcd (PID 39585) for container etcd (1m20s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m26s ago)
[Running]: Started task etcd (PID 39387) for container etcd (1m28s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m33s ago)
[Running]: Started task etcd (PID 39032) for container etcd (1m36s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m42s ago)
[Running]: Started task etcd (PID 38771) for container etcd (1m44s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m49s ago)
[Running]: Started task etcd (PID 38603) for container etcd (1m52s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (1m57s ago)
[Running]: Started task etcd (PID 38405) for container etcd (1m59s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (2m5s ago)
[Running]: Started task etcd (PID 38216) for container etcd (2m7s ago)
[Waiting]: Runner Containerd(etcd) exited without error, going to restart it (2m13s ago)
[Running]: Started task etcd (PID 38027) for container etcd (2m15s ago)
Logs
etcd logs are a loop of
2024-11-24T18:30:35.095127Z
[info] stopped stream reader with remote peer caller=rafthttp/stream.go:442 stream-reader-type=stream MsgApp v2 local-member-id=2c6041e2149212a6 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:35.095164Z
[info] stopped stream reader with remote peer caller=rafthttp/stream.go:442 stream-reader-type=stream Message local-member-id=2c6041e2149212a6 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.238416Z
[info] started to purge file caller=fileutil/purge.go:50 dir=/var/lib/etcd/member/snap suffix=snap max=5 interval=30s
2024-11-24T18:30:42.238322Z
[warn] server error caller=etcdserver/server.go:1154 error=the member has been permanently removed from the cluster
2024-11-24T18:30:42.238580Z
[warn] data-dir used by this member must be removed caller=etcdserver/server.go:1155
2024-11-24T18:30:42.238588Z
[info] started to purge file caller=fileutil/purge.go:50 dir=/var/lib/etcd/member/wal suffix=wal max=5 interval=30s
2024-11-24T18:30:42.238674Z
[warn] failed to publish local member to cluster through raft caller=etcdserver/server.go:2161 local-member-id=2c6041e2149212a6 local-member-attributes={Name:n1 ClientURLs:[https://192.168.1.20:2379 https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2379]} request-path=/0/members/2c6041e2149212a6/attributes publish-timeout=7s error=etcdserver: request cancelled
2024-11-24T18:30:42.238751Z
[warn] failed to publish local member to cluster through raft caller=etcdserver/server.go:2161 local-member-id=2c6041e2149212a6 local-member-attributes={Name:n1 ClientURLs:[https://192.168.1.20:2379 https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2379]} request-path=/0/members/2c6041e2149212a6/attributes publish-timeout=7s error=etcdserver: request cancelled
2024-11-24T18:30:42.238781Z
[warn] failed to publish local member to cluster through raft caller=etcdserver/server.go:2161 local-member-id=2c6041e2149212a6 local-member-attributes={Name:n1 ClientURLs:[https://192.168.1.20:2379 https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2379]} request-path=/0/members/2c6041e2149212a6/attributes publish-timeout=7s error=etcdserver: request cancelled
2024-11-24T18:30:42.238806Z
[warn] stopped publish because server is stopped caller=etcdserver/server.go:2151 local-member-id=2c6041e2149212a6 local-member-attributes={Name:n1 ClientURLs:[https://192.168.1.20:2379 https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2379]} publish-timeout=7s error=etcdserver: server stopped
2024-11-24T18:30:42.238848Z
[info] stopping remote peer caller=rafthttp/peer.go:330 remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.238876Z
[info] stopped TCP streaming connection with remote peer caller=rafthttp/stream.go:294 stream-writer-type=unknown stream remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.238929Z
[info] stopped TCP streaming connection with remote peer caller=rafthttp/stream.go:294 stream-writer-type=unknown stream remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.238962Z
[info] stopped HTTP pipelining with remote peer caller=rafthttp/pipeline.go:85 local-member-id=2c6041e2149212a6 remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.238971Z
[warn] server has stopped; skipping GoAttach caller=etcdserver/server.go:2825
2024-11-24T18:30:42.238987Z
[info] stopped stream reader with remote peer caller=rafthttp/stream.go:442 stream-reader-type=stream MsgApp v2 local-member-id=2c6041e2149212a6 remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.239097Z
[info] stopped stream reader with remote peer caller=rafthttp/stream.go:442 stream-reader-type=stream Message local-member-id=2c6041e2149212a6 remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.239125Z
[info] stopped remote peer caller=rafthttp/peer.go:335 remote-peer-id=b325a7b80f41bc6f
2024-11-24T18:30:42.239122Z
[info] grpc service status changed caller=v3rpc/health.go:61 service= status=SERVING
2024-11-24T18:30:42.239139Z
[info] stopping remote peer caller=rafthttp/peer.go:330 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.239155Z
[info] stopped TCP streaming connection with remote peer caller=rafthttp/stream.go:294 stream-writer-type=unknown stream remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.239189Z
[info] stopped TCP streaming connection with remote peer caller=rafthttp/stream.go:294 stream-writer-type=unknown stream remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.239222Z
[info] stopped HTTP pipelining with remote peer caller=rafthttp/pipeline.go:85 local-member-id=2c6041e2149212a6 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.239290Z
[info] stopped stream reader with remote peer caller=rafthttp/stream.go:442 stream-reader-type=stream MsgApp v2 local-member-id=2c6041e2149212a6 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.239334Z
[info] stopped stream reader with remote peer caller=rafthttp/stream.go:442 stream-reader-type=stream Message local-member-id=2c6041e2149212a6 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.239400Z
[info] stopped remote peer caller=rafthttp/peer.go:335 remote-peer-id=29bb865707d8c70
2024-11-24T18:30:42.241846Z
[warn] server has stopped; skipping GoAttach caller=etcdserver/server.go:2825
2024-11-24T18:30:42.241928Z
[info] grpc service status changed caller=v3rpc/health.go:61 service= status=SERVING
2024-11-24T18:30:42.242638Z
[info] starting with client TLS caller=embed/etcd.go:729 tls-info=cert = /system/secrets/etcd/server.crt, key = /system/secrets/etcd/server.key, client-cert=, client-key=, trusted-ca = /system/secrets/etcd/ca.crt, client-cert-auth = true, crl-file = cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
2024-11-24T18:30:42.242654Z
[info] serving peer traffic caller=embed/etcd.go:600 address=192.168.1.20:2380
2024-11-24T18:30:42.242681Z
[info] cmux::serve caller=embed/etcd.go:572 address=192.168.1.20:2380
2024-11-24T18:30:42.242796Z
[info] serving peer traffic caller=embed/etcd.go:600 address=[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2380
2024-11-24T18:30:42.242828Z
[info] cmux::serve caller=embed/etcd.go:572 address=[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2380
2024-11-24T18:30:42.243204Z
[info] now serving peer/client/metrics caller=embed/etcd.go:280 local-member-id=2c6041e2149212a6 initial-advertise-peer-urls=http://localhost:2380 listen-peer-urls=https://192.168.1.20:2380,https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2380 advertise-client-urls=https://192.168.1.20:2379,https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2379 listen-client-urls=https://192.168.1.20:2379,https://[2a02:XXX:0:ae1f:6bff:fe1e:8a4e]:2379,https://[::1]:2379 listen-metrics-urls=http://[::]:2381
2024-11-24T18:30:42.243257Z
[info] serving metrics caller=embed/etcd.go:871 address=http://[::]:2381
2024-11-24T18:30:42.245270Z
[info] notifying init daemon caller=etcdmain/main.go:44
2024-11-24T18:30:42.245295Z
[info] successfully notified init daemon caller=etcdmain/main.go:50
...
Environment
Talos version: 1.8.3
Kubernetes version: 1.31.1
Platform: bare-metal
[warn] server error caller=etcdserver/server.go:1154 error=the member has been permanently removed from the cluster
This is the root cause, I guess. Talos doesn't make etcd leave the cluster on upgrades in 1.8.x, so the etcd problem was already there before the upgrade.
The documentation covers full recovery, but with 3 or more control plane nodes, first inspect the members with talosctl etcd members and check whether quorum is still intact. If quorum is OK, you can try to bring the dead member back by wiping its state with talosctl reset --nodes <BAD_MEMBER> --system-labels-to-wipe=EPHEMERAL --reboot --graceful=false.
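To make the quorum check concrete, here is a minimal sketch of the majority rule etcd relies on. The function names are hypothetical, and the member counts are assumed to be read by hand from the talosctl etcd members output; nothing here talks to Talos or etcd.

```python
def quorum_size(total_members: int) -> int:
    """Smallest number of members that forms a majority of the cluster."""
    if total_members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return total_members // 2 + 1


def has_quorum(total_members: int, healthy_members: int) -> bool:
    """True if the healthy members alone can still form a majority.

    Both counts are assumptions supplied by the operator, for example
    taken from the member list and from which nodes are reachable.
    """
    return healthy_members >= quorum_size(total_members)


if __name__ == "__main__":
    # Three control plane nodes with one dead member: two healthy remain.
    print(has_quorum(total_members=3, healthy_members=2))
```

With three members and two healthy, quorum holds, so wiping and rejoining the bad member is a reasonable next step. With only one healthy member out of three, quorum is lost and the full recovery procedure from the documentation applies instead.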