Trigger manual failover on SIGTERM / shutdown to cluster primary #1091

Open
wants to merge 16 commits into base: unstable

Conversation

@enjoy-binbin (Member) commented Sep 30, 2024

When a primary disappears, its slots are not served until an automatic
failover happens, which takes roughly the node timeout plus a few seconds.
That is too much time for us to not accept writes.

If the host machine is about to shut down for any reason, the processes
typically get a SIGTERM and have some time to shut down gracefully. In
Kubernetes, this is 30 seconds by default.

When a primary receives a SIGTERM or a SHUTDOWN, let it trigger a failover
to one of its replicas as part of the graceful shutdown. This reduces the
unavailability window: normally a replica has to sense the primary failure
(which takes up to the node timeout) before initiating an election, but now
it can initiate the election immediately, win it, and gossip the result.

The primary does this by sending a CLUSTER FAILOVER command to the replica.
We add a REPLICAID argument to CLUSTER FAILOVER; when a replica receives the
command, it checks whether the given node-id matches its own and ignores the
command otherwise. The primary knows each replica's node-id because the
replica sets it via CLIENT SETNAME during the replication handshake.
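
A minimal sketch of that replica-side check in the CLUSTER FAILOVER handler (variable names and the reply choice are assumptions, not necessarily the PR's exact code):

    /* Sketch only: if a REPLICAID was given and it is not our own node-id,
     * the failover request is addressed to another replica, so ignore it. */
    if (replicaid != NULL && memcmp(replicaid, myself->name, CLUSTER_NAMELEN) != 0) {
        addReply(c, shared.ok); /* or reply nothing at all; see the review discussion below */
        return 1;
    }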

New argument for CLUSTER FAILOVER

The format now becomes CLUSTER FAILOVER [FORCE|TAKEOVER] [REPLICAID node-id].
The new argument is not intended for user use, so it will not be added to the
command JSON file.

Replica sends CLIENT SETNAME using its node-id as the name

During the replication handshake, the replica now uses CLIENT SETNAME to
inform the primary of its node-id.
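
A hedged sketch of that handshake step, modeled on the sendCommand() helper used for the REPLCONF commands in replication.c; getMyClusterId() is a stand-in name for however the replica looks up its own node-id:

    /* Sketch only: announce our cluster node-id so the primary can later
     * address us with CLUSTER FAILOVER FORCE REPLICAID <node-id>. */
    if (server.cluster_enabled) {
        err = sendCommand(conn, "CLIENT", "SETNAME", getMyClusterId(), NULL);
        if (err) goto write_error;
    }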

Primary issues CLUSTER FAILOVER

The primary sends CLUSTER FAILOVER FORCE REPLICAID node-id to all replicas,
because the replication buffer is shared, but only the replica with the
matching id will execute it.
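
Conceptually the primary only has to build the command once and append it to the replication stream; a rough sketch, where replica_node_id is the target replica's id and feedReplicationStream() is a hypothetical stand-in for the server's replication-feed function:

    /* Sketch only: push CLUSTER FAILOVER FORCE REPLICAID <node-id> into the
     * shared replication stream. Every replica receives it, but only the one
     * whose own node-id matches acts on it. */
    robj *argv[5];
    argv[0] = createStringObject("CLUSTER", 7);
    argv[1] = createStringObject("FAILOVER", 8);
    argv[2] = createStringObject("FORCE", 5);
    argv[3] = createStringObject("REPLICAID", 9);
    argv[4] = createStringObject(replica_node_id, CLUSTER_NAMELEN);
    feedReplicationStream(argv, 5); /* hypothetical helper */
    for (int j = 0; j < 5; j++) decrRefCount(argv[j]);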

Add a new auto-failover-on-shutdown config

People can disable this feature if they don't like it; the default is 0.
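
A hedged example of what enabling it would look like, assuming the option is registered as a regular yes/no boolean:

    # valkey.conf
    auto-failover-on-shutdown yes

    # or at runtime
    CONFIG SET auto-failover-on-shutdown yes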

This closes #939.

@enjoy-binbin added the run-extra-tests label (Run extra tests on this PR: runs all tests from daily except valgrind and RESP) on Sep 30, 2024

codecov bot commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 94.33962% with 3 lines in your changes missing coverage. Please review.

Project coverage is 71.03%. Comparing base (7fc958d) to head (c8037a1).
Report is 3 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 91.42% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1091      +/-   ##
============================================
+ Coverage     70.80%   71.03%   +0.23%     
============================================
  Files           121      121              
  Lines         65132    65211      +79     
============================================
+ Hits          46118    46324     +206     
+ Misses        19014    18887     -127     
Files with missing lines Coverage Δ
src/config.c 78.39% <ø> (ø)
src/replication.c 87.74% <100.00%> (+0.24%) ⬆️
src/server.c 87.61% <100.00%> (-0.01%) ⬇️
src/server.h 100.00% <ø> (ø)
src/cluster_legacy.c 87.00% <91.42%> (-0.36%) ⬇️

... and 18 files with indirect coverage changes

@zuiderkwast (Contributor) left a comment

Nice! Thanks for doing this.

The PR description can be updated to explain the solution. Now it is just copy-pasted from the issue. :)

I'm thinking that doing failover in finishShutdown() is maybe too late. finishShutdown is only called when all replicas already have replication offset equal to the primary (checked by isReadyToShutdown()), or after timeout (10 seconds). If one replica is very slow, it will delay the failover. I think we can do the manual failover earlier.

This is the sequence:

  1. SHUTDOWN or SIGTERM calls prepareForShutdown(). Here, clients are paused for writing and we start waiting for the replicas' offsets.
  2. In serverCron(), we check isReadyToShutdown(), which checks if all replicas have repl_ack_off == primary_repl_offset. If yes, finishShutdown() is called; otherwise we wait more.
  3. finishShutdown() is called.

I think we can send CLUSTER FAILOVER FORCE to the first replica which has repl_ack_off == primary_repl_offset. We can do it in isReadyToShutdown(), I think. (We can rename it to indicate that it does more than check if ready.) Then, we also wait for it to send the failover auth request and for the primary to vote before isReadyToShutdown() returns true.
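
A minimal sketch of that idea (not the final implementation), assuming the replica client list and offset fields named above (server.replicas, repl_data->repl_ack_off, server.primary_repl_offset) and a hypothetical helper that writes the CLUSTER FAILOVER command to one replica:

    /* Sketch only: while waiting to shut down, hand off to the first replica
     * that has acknowledged our full replication offset. */
    listIter li;
    listNode *ln;
    listRewind(server.replicas, &li);
    while ((ln = listNext(&li)) != NULL) {
        client *replica = listNodeValue(ln);
        if (replica->repl_data->repl_ack_off == server.primary_repl_offset) {
            sendClusterFailoverToReplica(replica); /* hypothetical helper */
            break;
        }
    }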

What do you think?

@enjoy-binbin (Member, Author)

The PR description can be updated to explain the solution. Now it is just copy-pasted from the issue. :)

The issue description is good and very detailed, so I copied it; I will update it later.

I'm thinking that doing failover in finishShutdown() is maybe too late. finishShutdown is only called when all replicas already have replication offset equal to the primary (checked by isReadyToShutdown()), or after timeout (10 seconds). If one replica is very slow, it will delay the failover. I think we can do the manual failover earlier.

Yeah, a failover as soon as possible is good, but isn't it true that the primary is only down after it actually exits? So in this case, if a replica is slow and does not get the chance to catch up with the primary, and another replica triggers the failover, the slow replica will need a full sync when it does the reconfiguration.

I think we can send CLUSTER FAILOVER FORCE to the first replica which has repl_ack_off == primary_repl_offset. We can do it in isReadyToShutdown(), I think. (We can rename it to indicate that it does more than check if ready.) Then, we also wait for it to send the failover auth request and for the primary to vote before isReadyToShutdown() returns true.

So let me sort it out again: you are suggesting that if one replica has already caught up with the offset, we should trigger a failover immediately?

I guess that also makes sense in this case.

@zuiderkwast (Contributor)

if a replica is slow and does not get the chance to catch up with the primary, and another replica triggers the failover, the slow replica will need a full sync when it does the reconfiguration.

I didn't think about this. The replica can't do psync to the new primary after failover? If it can't, then maybe you're right that the primary should wait for all replicas, at least for some time, to avoid full sync.

So, wait for all, then trigger manual failover. If you want, we can add another wait after that (after "finish shutdown"), so the primary can vote for the replica before exit. Wdyt?

@enjoy-binbin (Member, Author) commented Oct 18, 2024

Sorry for the late reply, I somehow missed this thread.

I didn't think about this. The replica can't do psync to the new primary after failover? If it can't, then maybe you're right that the primary should wait for all replicas, at least for some time, to avoid full sync.

Yes, I think this may happen: if the primary does not flush its output buffer to the slow replica, then when doing the reconfiguration the slow replica may use an old offset to PSYNC with the new primary, which will cause a full sync. The probability should be small, though, since the primary calls flushReplicasOutputBuffers to write as much as possible before shutdown.

So, wait for all, then trigger manual failover. If you want, we can add another wait after that (after "finish shutdown"), so the primary can vote for the replica before exit. Wdyt?

Waiting for the vote: I think both are OK. Even if we don't wait, I think the replica will have enough votes. If we really want to, we can even wait until the replica successfully becomes the primary before exiting... Do you have a final decision? I will do whatever you think is right.

@zuiderkwast (Contributor)

Waiting for the vote: I think both are OK. Even if we don't wait, I think the replica will have enough votes. If we really want to, we can even wait until the replica successfully becomes the primary before exiting... Do you have a final decision? I will do whatever you think is right.

I'm wondering if there are any corner cases, like if the cluster is too small to have quorum without the shutting-down primary...

If it is simple, I prefer to let the primary wait and vote. Then we can avoid the server.cluster->mf_is_primary_failover variable. I don't like this variable and special case. :)

But if the implementation to wait for the vote turns out to be too complex, then let's just skip the vote. I think that's also fine. Without this feature, we wait for the automatic failover, which will also not have the vote from the already-shut-down primary.

@enjoy-binbin (Member, Author)

But if the implementation to wait for the vote turns out to be too complex, then let's just skip the vote. I think that's also fine. Without this feature, we wait for the automatic failover, which will also not have the vote from the already-shut-down primary.

I am going to skip the vote for now; I tried a bit, and it did not seem easy or clean to finish. Maybe I'll have a better idea later; I will keep it in mind.

@zuiderkwast (Contributor) left a comment

I am going to skip the vote for now; I tried a bit, and it did not seem easy or clean to finish. Maybe I'll have a better idea later; I will keep it in mind.

I understand. Simple is better.

But possible data loss is not good. See comments below.

Signed-off-by: Binbin <[email protected]>
@PingXie (Member) commented Oct 28, 2024

This is an interesting idea. I like the direction we are going in but I agree with @zuiderkwast that potential data loss is not appealing.

We can do both, though IMO triggering a (graceful) failover as part of CLUSTER FORGET is more valuable than making it part of shutdown, because it is cleaner to forget a node prior to shutting it down in any production environment.

Today, we can't forget "myself" nor "my primary" (with the latter being a dynamic state). This adds operational complexity. Imagine that the admin could just send CLUSTER FORGET to any node in the cluster and the server would do the right thing: fail over the primaryship to one of its replicas, if applicable, and then broadcast the forget message to the cluster.

@zuiderkwast (Contributor)

We can do both, though IMO triggering a (graceful) failover as part of CLUSTER FORGET is more valuable than making it part of shutdown, because it is cleaner to forget a node prior to shutting it down in any production environment.

@PingXie Yes, it's a good idea, but this PR is about the scenario where the machine is taken down outside the control of the Valkey admin. For example, in Kubernetes, when a worker is shut down, SIGTERM is sent to all processes and it waits for 30 seconds by default. When you shut down your laptop, I believe it's similar: each application gets SIGTERM and has some time to do a graceful shutdown.

@zuiderkwast (Contributor) left a comment

Very nice! I see you updated the top comment, but I think it should explain the new commands and behaviour in some more detail:

  • New argument for CLUSTER FAILOVER
  • Replica sends CLIENT SETNAME using its node-id as the name
  • Primary sends CLUSTER FAILOVER FORCE REPLICAID node-id to all replicas, because it's a shared replication buffer, but only the replica with the matching id will execute it
  • CLUSTER FAILOVER REPLICAID node-id can also be sent to a primary now? It replicates it and the replica will do the failover, right?
  • New config? Maybe someone wants to disable it because they reboot faster than a failover and the node is still a primary when it comes back?

I don't understand why you wrote "To avoid polluting the replication stream" in the top comment. What are we avoiding here?

We can add a test with an old replica in the cluster. I have merged #1371, so now it is easy to add a test like that. WDYT?

Comment on lines +1244 to +1247
/* This is done only when the replica offset is caught up, to avoid data loss.
* And 0x800ff is 8.0.255, we only support new versions for this feature. */
if (replica->repl_data->repl_state == REPLICA_STATE_ONLINE &&
// replica->repl_data->replica_version > 0x800ff &&
Contributor

Forgot to uncomment the check?

In the comment, maybe we should not say "new", because in a few years this will not be new anymore.

Why not check >= 0x80100 instead of > 0x800ff? It's the same but maybe easier to read?

Member Author

I took this from:

             * And 0x702ff is 7.2.255, we only support new versions in this case. */
            if (r->repl_data->repl_state == REPLICA_STATE_ONLINE && r->repl_data->replica_version > 0x702ff) {
                num_eligible_replicas++;
            }

Forgot to uncomment the check?

Just an easy way for me to test locally.

      return 1;
  } else if (myself->replicaof == NULL) {
-     addReplyError(c, "I'm a replica but my master is unknown to me");
+     if (replicaid == NULL) addReplyError(c, "I'm a replica but my master is unknown to me");
Contributor

Why add this? If the primary is unknown, the failover can't succeed, so I think we need to return an error even if REPLICAID is sent.

Member Author

I was wondering if there could be some races where a replica returns an error to the primary. Or maybe we should always return OK if REPLICAID is passed.

Contributor

Replicas don't send replies to primaries. The only problem is the config to panic on replication errors. But I can't see any races. Can you?

We can return OK if you want. Maybe we should check that c is the primary fake client and forbid this option for normal clients?

Member Author

Since we write it to the backlog, I am worried that after some down and up, the PSYNC will get the command and return an error. Though I haven't verified this specifically.

      return 1;
  } else if (!force && (nodeFailed(myself->replicaof) || myself->replicaof->link == NULL)) {
-     addReplyError(c, "Master is down or failed, "
-                      "please use CLUSTER FAILOVER FORCE");
+     if (replicaid == NULL) addReplyError(c, "Master is down or failed, please use CLUSTER FAILOVER FORCE");
Contributor

Why add this? I think it should be an error even with REPLICAID.

@zuiderkwast added the major-decision-pending label (Major decision pending by TSC team) on Jan 23, 2025
Signed-off-by: Binbin <[email protected]>
@enjoy-binbin (Member, Author)

CLUSTER FAILOVER REPLICAID node-id can also be sent to a primary now? It replicates it and the replica will do the failover, right?

No, we won't replicate this command.

New config? Maybe someone wants to disable it because they reboot faster than a failover and the node is still a primary when it comes back?

Yes, I think a new config may allow people to make a better transition if there are any problems.

I don't understand why you wrote "To avoid polluting the replication stream" in the top comment. What are we avoiding here?

I will remove it; it is old stuff, I guess. In the old version the code had that issue.

We can add a test with an old replica in the cluster. I have merged #1371, so now it is easy to add a test like that. WDYT?

Sounds good, I will take a look.

Labels
major-decision-pending (Major decision pending by TSC team), run-extra-tests (Run extra tests on this PR: runs all tests from daily except valgrind and RESP)

Successfully merging this pull request may close these issues.

[NEW] Trigger manual failover on SIGTERM to primary (cluster)