Automatic failover vote is not limited by two times the node timeout (#…

…1356) This is a follow of #1305, we now decided to apply the same change to automatic failover as well, that is, move forward with removing it for both automatic and manual failovers. Quote from Ping during the review: Note that we already debounce transient primary failures with node timeout, ensuring failover is only triggered after sustained outages. Election timing is naturally staggered by replica spacing, making the likelihood of simultaneous elections from replicas of the same shard very low. The one-vote-per-epoch rule further throttles retries and ensures orderly elections. On top of that, quorum-based primary failure confirmation, cluster-state convergence, and slot ownership validation are all built into the process. Quote from Madelyn during the review: It against the specific primary. It's to prevent double failovers. If a primary just took over we don't want someone else to try to take over and give the new primary some amount of time to take over. I have not seen this issue though, it might have been over optimizing? The double failure mode, where a node fails and then another node fails within the nodetimeout also doesn't seem that common either though. So the conclusion is that we all agreed to remove it completely, it will make the code a lot simpler. And if there is other specific edge cases we are missing, we will fix it in other way. See discussion #1305 for more information. Signed-off-by: Binbin <[email protected]>
valkey-io · Dec 15, 2024 · ad24220 · ad24220
1 parent 88942c8
commit ad24220
Show file tree

Hide file tree

Showing 3 changed files with 30 additions and 30 deletions.
diff --git a/src/cluster_legacy.c b/src/cluster_legacy.c
@@ -1505,7 +1505,6 @@ clusterNode *createClusterNode(char *nodename, int flags) {
     node->cport = 0;
     node->tls_port = 0;
     node->fail_reports = listCreate();
-    node->voted_time = 0;
     node->orphaned_time = 0;
     node->repl_offset_time = 0;
     node->repl_offset = 0;
@@ -4396,23 +4395,6 @@ void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
         return;
     }
 
-    /* We did not voted for a replica about this primary for two
-     * times the node timeout. This is not strictly needed for correctness
-     * of the algorithm but makes the base case more linear.
-     *
-     * This limitation does not restrict manual failover. If a user initiates
-     * a manual failover, we need to allow it to vote, otherwise the manual
-     * failover may time out. */
-    if (!force_ack && mstime() - node->replicaof->voted_time < server.cluster_node_timeout * 2) {
-        serverLog(LL_WARNING,
-                  "Failover auth denied to %.40s (%s): "
-                  "can't vote for any replica of %.40s (%s) within %lld milliseconds",
-                  node->name, node->human_nodename,
-                  node->replicaof->name, node->replicaof->human_nodename,
-                  (long long)((server.cluster_node_timeout * 2) - (mstime() - node->replicaof->voted_time)));
-        return;
-    }
-
     /* The replica requesting the vote must have a configEpoch for the claimed
      * slots that is >= the one of the primaries currently serving the same
      * slots in the current configuration. */
@@ -4434,7 +4416,6 @@ void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
 
     /* We can vote for this replica. */
     server.cluster->lastVoteEpoch = server.cluster->currentEpoch;
-    if (!force_ack) node->replicaof->voted_time = mstime();
     clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_FSYNC_CONFIG);
     clusterSendFailoverAuth(node);
     serverLog(LL_NOTICE, "Failover auth granted to %.40s (%s) for epoch %llu", node->name, node->human_nodename,

diff --git a/src/cluster_legacy.h b/src/cluster_legacy.h
@@ -341,8 +341,6 @@ struct _clusterNode {
     mstime_t pong_received;                 /* Unix time we received the pong */
     mstime_t data_received;                 /* Unix time we received any data */
     mstime_t fail_time;                     /* Unix time when FAIL flag was set */
-    mstime_t voted_time;                    /* Last time we voted for a replica of this primary in non manual
-                                             * failover scenarios. */
     mstime_t repl_offset_time;              /* Unix time we received offset for this node */
     mstime_t orphaned_time;                 /* Starting time of orphaned primary condition */
     mstime_t inbound_link_freed_time;       /* Last time we freed the inbound link for this node.

diff --git a/tests/unit/cluster/manual-failover.tcl b/tests/unit/cluster/manual-failover.tcl
@@ -189,11 +189,6 @@ start_cluster 3 1 {tags {external:skip cluster} overrides {cluster-ping-interval
         set CLUSTER_PACKET_TYPE_FAILOVER_AUTH_ACK 6
         set CLUSTER_PACKET_TYPE_NONE -1
 
-        # Setting a large timeout to make sure we hit the voted_time limit.
-        R 0 config set cluster-node-timeout 150000
-        R 1 config set cluster-node-timeout 150000
-        R 2 config set cluster-node-timeout 150000
-
         # Let replica drop FAILOVER_AUTH_ACK so that the election won't
         # get the enough votes and the election will time out.
         R 3 debug drop-cluster-packet-filter $CLUSTER_PACKET_TYPE_FAILOVER_AUTH_ACK
@@ -229,10 +224,6 @@ start_cluster 3 1 {tags {external:skip cluster} overrides {cluster-ping-interval
         pause_process [srv 0 pid]
         wait_for_cluster_state fail
 
-        # Setting a large timeout to make sure we hit the voted_time limit.
-        R 1 config set cluster-node-timeout 150000
-        R 2 config set cluster-node-timeout 150000
-
         # R 3 performs an automatic failover and it will work.
         R 3 config set cluster-replica-no-failover no
         wait_for_condition 1000 50 {
@@ -272,6 +263,36 @@ start_cluster 3 1 {tags {external:skip cluster} overrides {cluster-ping-interval
     }
 } ;# start_cluster
 
+start_cluster 3 1 {tags {external:skip cluster} overrides {cluster-ping-interval 1000 cluster-node-timeout 2000}} {
+    test "Automatic failover vote is not limited by two times the node timeout - mixed failover" {
+        R 3 cluster failover
+        wait_for_condition 1000 50 {
+            [s 0 role] eq {slave} &&
+            [s -3 role] eq {master}
+        } else {
+            fail "The first failover does not happen"
+        }
+        wait_for_cluster_propagation
+
+        R 0 cluster failover
+        wait_for_condition 1000 50 {
+            [s 0 role] eq {master} &&
+            [s -3 role] eq {slave}
+        } else {
+            fail "The second failover does not happen"
+        }
+        wait_for_cluster_propagation
+
+        # Let R 3 trigger the automatic failover
+        pause_process [srv 0 pid]
+        wait_for_condition 1000 50 {
+            [s -3 role] eq {master}
+        } else {
+            fail "The third failover does not happen"
+        }
+    }
+} ;# start_cluster
+
 start_cluster 3 1 {tags {external:skip cluster} overrides {cluster-ping-interval 1000 cluster-node-timeout 15000}} {
     test "Manual failover will reset the on-going election" {
         set CLUSTER_PACKET_TYPE_FAILOVER_AUTH_REQUEST 5