Skip to content

Commit

Permalink
Automatic failover vote is not limited by two times the node timeout (#…
Browse files Browse the repository at this point in the history
…1356)

This is a follow of #1305, we now decided to apply the same change
to automatic failover as well, that is, move forward with removing
it for both automatic and manual failovers.

Quote from Ping during the review:
Note that we already debounce transient primary failures with node
timeout, ensuring failover is only triggered after sustained outages.
Election timing is naturally staggered by replica spacing, making the
likelihood of simultaneous elections from replicas of the same shard
very low. The one-vote-per-epoch rule further throttles retries and
ensures orderly elections. On top of that, quorum-based primary failure
confirmation, cluster-state convergence, and slot ownership validation
are all built into the process.

Quote from Madelyn during the review:
It against the specific primary. It's to prevent double failovers.
If a primary just took over we don't want someone else to try to
take over and give the new primary some amount of time to take over.
I have not seen this issue though, it might have been over optimizing?
The double failure mode, where a node fails and then another node fails
within the nodetimeout also doesn't seem that common either though.

So the conclusion is that we all agreed to remove it completely,
it will make the code a lot simpler. And if there is other specific
edge cases we are missing, we will fix it in other way.

See discussion #1305 for more information.

Signed-off-by: Binbin <[email protected]>
  • Loading branch information
enjoy-binbin authored Dec 15, 2024
1 parent 88942c8 commit ad24220
Show file tree
Hide file tree
Showing 3 changed files with 30 additions and 30 deletions.
19 changes: 0 additions & 19 deletions src/cluster_legacy.c
Original file line number Diff line number Diff line change
Expand Up @@ -1505,7 +1505,6 @@ clusterNode *createClusterNode(char *nodename, int flags) {
node->cport = 0;
node->tls_port = 0;
node->fail_reports = listCreate();
node->voted_time = 0;
node->orphaned_time = 0;
node->repl_offset_time = 0;
node->repl_offset = 0;
Expand Down Expand Up @@ -4396,23 +4395,6 @@ void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
return;
}

/* We did not voted for a replica about this primary for two
* times the node timeout. This is not strictly needed for correctness
* of the algorithm but makes the base case more linear.
*
* This limitation does not restrict manual failover. If a user initiates
* a manual failover, we need to allow it to vote, otherwise the manual
* failover may time out. */
if (!force_ack && mstime() - node->replicaof->voted_time < server.cluster_node_timeout * 2) {
serverLog(LL_WARNING,
"Failover auth denied to %.40s (%s): "
"can't vote for any replica of %.40s (%s) within %lld milliseconds",
node->name, node->human_nodename,
node->replicaof->name, node->replicaof->human_nodename,
(long long)((server.cluster_node_timeout * 2) - (mstime() - node->replicaof->voted_time)));
return;
}

/* The replica requesting the vote must have a configEpoch for the claimed
* slots that is >= the one of the primaries currently serving the same
* slots in the current configuration. */
Expand All @@ -4434,7 +4416,6 @@ void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {

/* We can vote for this replica. */
server.cluster->lastVoteEpoch = server.cluster->currentEpoch;
if (!force_ack) node->replicaof->voted_time = mstime();
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_FSYNC_CONFIG);
clusterSendFailoverAuth(node);
serverLog(LL_NOTICE, "Failover auth granted to %.40s (%s) for epoch %llu", node->name, node->human_nodename,
Expand Down
2 changes: 0 additions & 2 deletions src/cluster_legacy.h
Original file line number Diff line number Diff line change
Expand Up @@ -341,8 +341,6 @@ struct _clusterNode {
mstime_t pong_received; /* Unix time we received the pong */
mstime_t data_received; /* Unix time we received any data */
mstime_t fail_time; /* Unix time when FAIL flag was set */
mstime_t voted_time; /* Last time we voted for a replica of this primary in non manual
* failover scenarios. */
mstime_t repl_offset_time; /* Unix time we received offset for this node */
mstime_t orphaned_time; /* Starting time of orphaned primary condition */
mstime_t inbound_link_freed_time; /* Last time we freed the inbound link for this node.
Expand Down
39 changes: 30 additions & 9 deletions tests/unit/cluster/manual-failover.tcl
Original file line number Diff line number Diff line change
Expand Up @@ -189,11 +189,6 @@ start_cluster 3 1 {tags {external:skip cluster} overrides {cluster-ping-interval
set CLUSTER_PACKET_TYPE_FAILOVER_AUTH_ACK 6
set CLUSTER_PACKET_TYPE_NONE -1

# Setting a large timeout to make sure we hit the voted_time limit.
R 0 config set cluster-node-timeout 150000
R 1 config set cluster-node-timeout 150000
R 2 config set cluster-node-timeout 150000

# Let replica drop FAILOVER_AUTH_ACK so that the election won't
# get the enough votes and the election will time out.
R 3 debug drop-cluster-packet-filter $CLUSTER_PACKET_TYPE_FAILOVER_AUTH_ACK
Expand Down Expand Up @@ -229,10 +224,6 @@ start_cluster 3 1 {tags {external:skip cluster} overrides {cluster-ping-interval
pause_process [srv 0 pid]
wait_for_cluster_state fail

# Setting a large timeout to make sure we hit the voted_time limit.
R 1 config set cluster-node-timeout 150000
R 2 config set cluster-node-timeout 150000

# R 3 performs an automatic failover and it will work.
R 3 config set cluster-replica-no-failover no
wait_for_condition 1000 50 {
Expand Down Expand Up @@ -272,6 +263,36 @@ start_cluster 3 1 {tags {external:skip cluster} overrides {cluster-ping-interval
}
} ;# start_cluster

start_cluster 3 1 {tags {external:skip cluster} overrides {cluster-ping-interval 1000 cluster-node-timeout 2000}} {
test "Automatic failover vote is not limited by two times the node timeout - mixed failover" {
R 3 cluster failover
wait_for_condition 1000 50 {
[s 0 role] eq {slave} &&
[s -3 role] eq {master}
} else {
fail "The first failover does not happen"
}
wait_for_cluster_propagation

R 0 cluster failover
wait_for_condition 1000 50 {
[s 0 role] eq {master} &&
[s -3 role] eq {slave}
} else {
fail "The second failover does not happen"
}
wait_for_cluster_propagation

# Let R 3 trigger the automatic failover
pause_process [srv 0 pid]
wait_for_condition 1000 50 {
[s -3 role] eq {master}
} else {
fail "The third failover does not happen"
}
}
} ;# start_cluster

start_cluster 3 1 {tags {external:skip cluster} overrides {cluster-ping-interval 1000 cluster-node-timeout 15000}} {
test "Manual failover will reset the on-going election" {
set CLUSTER_PACKET_TYPE_FAILOVER_AUTH_REQUEST 5
Expand Down

0 comments on commit ad24220

Please sign in to comment.