The second node took about 7 seconds to join in a single node cluster #18385
I'll try to reproduce it. Thanks for the detailed instructions, @XiuhuaRuan. /assign
I was able to duplicate this issue with 3.5.15 and the main branch. I used the following Procfile entries for goreman:

etcd1: bin/etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380' --logger=zap --log-outputs=stderr --force-new-cluster
etcd2: bin/etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380' --logger=zap --log-outputs=stderr --initial-cluster-state existing

And the following script (after running rm -rf infra*.etcd to start from clean data directories):
trap 'goreman run stop-all' EXIT
goreman start &
sleep 3
bin/etcdctl --endpoints=127.0.0.1:2379 member list
bin/etcdctl --endpoints=127.0.0.1:2379 member add infra2 --peer-urls=http://127.0.0.1:22380
sleep 1
SECONDS=0
goreman run start etcd2 &
while ! bin/etcdctl --command-timeout=0.1s --endpoints=127.0.0.1:2379 member list | grep infra2; do
sleep 0.5
done
elapsed="$SECONDS"
goreman run stop-all
echo "second node joined in $elapsed seconds" Find the files attached and remove the |
Thanks @ivanvc for reproducing and @XiuhuaRuan for raising this. It looks like this could be related to the downgrade implementation. @siyuanfoundation, is this behaviour expected?
When a new member joins an existing cluster, the new member needs to check whether its version is compatible with the current cluster version, and the new member's version can be lower than the cluster version only when downgrade is enabled. That's the background of needing to check the downgrade status when a member joins.

On the other end, when serving the downgrade status request, the server waits for linearizableReadNotify to make sure it returns up-to-date data.

In the case of adding a new member to a 1-node cluster, after the member add the quorum becomes 2, so linearizableReadNotify on the existing node cannot complete until the new member actually joins. The new member's downgrade status check therefore times out (the "failed to get downgrade enabled status" warning) before it proceeds, which explains the ~7 second delay.

@serathius this seems to be a bug fix. WDYT?
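As an illustration of the quorum effect (my own sketch using the endpoints from the repro above, not something from the original report): after the member add, linearizable requests against the remaining node stall until the second member joins, while serializable reads are still served locally.

# Assumes the single-node cluster from the repro above is running on 127.0.0.1:2379.
# Adding a second voting member makes quorum 2 of 2 while only one member is up.
bin/etcdctl --endpoints=127.0.0.1:2379 member add infra2 --peer-urls=http://127.0.0.1:22380
# Linearizable requests (the default) block until infra2 joins, so this times out:
bin/etcdctl --endpoints=127.0.0.1:2379 --command-timeout=3s get foo
# A serializable read is answered from the local node and still succeeds:
bin/etcdctl --endpoints=127.0.0.1:2379 --consistency=s get foo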
Right, the suggested fix makes sense; however, I would like to better understand why it was designed like this in the first place.
@YoyinZyc I see you wrote this code. Can you help us understand why it was designed like this?
The suggestion is to add the second member as a learner first, and promote it to a voting member later.
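For reference, a learner-first join would look roughly like this with the addresses from the report (the member ID for the promote step comes from the member add / member list output):

# Add the second member as a learner so it does not count towards quorum:
./etcdctl --endpoints=192.168.31.105:12379 member add etcd_02 --learner --peer-urls=http://192.168.31.106:12380
# Start etcd_02 with --initial-cluster-state existing as before, wait for it to
# catch up with the leader, then promote it to a voting member:
./etcdctl --endpoints=192.168.31.105:12379 member promote <MEMBER_ID>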
You mean why we designed it to use linearizableReadNotify? It is just like you said: we use it to get the latest downgrade status. The context timeout issue for the second member sounds like a bug we didn't catch at the early stage.
I'm thinking about why the downgrade status check needs to go through a linearizable read at all.
Bug report criteria
What happened?
I started the first etcd node with the --force-new-cluster option. Then I started the second node to join this new cluster, but it took about 7 seconds for the cluster to be set up.
What did you expect to happen?
The second node should join the cluster quickly.
How can we reproduce it (as minimally and precisely as possible)?
Start the first etcd node with option --force-new-cluster
./etcd --name etcd_01 --data-dir etcd01 --listen-client-urls http://192.168.31.105:12379,http://127.0.0.1:12379 --advertise-client-urls http://192.168.31.105:12379 --initial-advertise-peer-urls http://192.168.31.105:12380 --listen-peer-urls http://192.168.31.105:12380 --initial-cluster-token etcd-cluster --initial-cluster etcd_01=http://192.168.31.105:12380 --force-new-cluster
Check the cluster member list successfully
./etcdctl --endpoints=192.168.31.105:12379 member list
7c0730f7752a5a15, started, etcd_01, http://192.168.31.105:12380, http://192.168.31.105:12379, false
Add the second node to the cluster
./etcdctl --endpoints=192.168.31.105:12379 member add etcd_02 --peer-urls=http://192.168.31.106:12380
Member 40339f82f8313cbb added to cluster 5fb4c1b191df6a5b
ETCD_NAME="etcd_02"
ETCD_INITIAL_CLUSTER="etcd_02=http://192.168.31.106:12380,etcd_01=http://192.168.31.105:12380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.31.106:12380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Start the second etcd node
./etcd --name etcd_02 --data-dir etcd02 --listen-client-urls http://192.168.31.106:12379,http://127.0.0.1:12379 --advertise-client-urls http://192.168.31.106:12379 --initial-advertise-peer-urls http://192.168.31.106:12380 --listen-peer-urls http://192.168.31.106:12380 --initial-cluster-token etcd-cluster-07311158 --initial-cluster etcd_01=http://192.168.31.105:12380,etcd_02=http://192.168.31.106:12380 --initial-cluster-state existing
Checking the cluster member list failed at the beginning
./etcdctl --endpoints=192.168.31.105:12379 member list
{"level":"warn","ts":"2024-07-31T13:31:53.591638+0800","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000362700/192.168.31.105:12379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
After a few seconds, check the cluster member list successfully
./etcdctl --endpoints=192.168.31.105:12379 member list
40339f82f8313cbb, started, etcd_02, http://192.168.31.106:12380, http://192.168.31.106:12379, false
7c0730f7752a5a15, started, etcd_01, http://192.168.31.105:12380, http://192.168.31.105:12379, false
Anything else we need to know?
I tried etcd-v3.5.9 and etcd-v3.5.15; both gave the same result, but with different log output.
With etcd-v3.5.15, the etcd log showed a 7-second gap between 13:32:42 and 13:32:49 when the second node started, and reported two warnings, "failed to reach the peer URL" and "failed to get downgrade enabled status", at 13:32:49.
With etcd-v3.5.9, the etcd log also showed a 7-second gap, between 10:42:15 and 10:42:22, when the second node started, and reported two warnings, "failed to convert response" and "failed to get downgrade enabled status", at 10:42:22.
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
#on the first node
./etcd --name etcd_01 --data-dir etcd01 --listen-client-urls http://192.168.31.105:12379,http://127.0.0.1:12379 --advertise-client-urls http://192.168.31.105:12379 --initial-advertise-peer-urls http://192.168.31.105:12380 --listen-peer-urls http://192.168.31.105:12380 --initial-cluster-token etcd-cluster-07311158 --initial-cluster etcd_01=http://192.168.31.105:12380 --force-new-cluster
#on the second node
./etcdctl --endpoints=192.168.31.105:12379 member add etcd_02 --peer-urls=http://192.168.31.106:12380
./etcd --name etcd_02 --data-dir etcd02 --listen-client-urls http://192.168.31.106:12379,http://127.0.0.1:12379 --advertise-client-urls http://192.168.31.106:12379 --initial-advertise-peer-urls http://192.168.31.106:12380 --listen-peer-urls http://192.168.31.106:12380 --initial-cluster-token etcd-cluster-07311158 --initial-cluster etcd_01=http://192.168.31.105:12380,etcd_02=http://192.168.31.106:12380 --initial-cluster-state existing
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)