[PLAT-10706][dr] Support retry-ability of failover and switchover
Summary:
This diff does the following:
  - Adds support for retrying the switchover and failover tasks upon failure, for both txn DR and db-scoped DR.
  - Adds support for aborting switchover.
  - Improves the performance of the PITR restore task.
  - Improves the failover/switchover task execution time by creating the PITR configs during DR creation.
  - Fixes the methods that create backup and restore subtasks so that they do not overwrite the parent task's params in the xCluster case.
  - Adds support for ignoring errors in the ChangeXClusterRole subtask. Previously, force-deleting an xCluster/DR config could fail if the target universe was not responsive.
  - Deletes the extra xCluster configs of a DR config during restart.
  - Fixes an issue where creating a db-scoped DR config could fail if a PITR config had been created before the DR config.

Test Plan:
- Made sure the PITR configs are created as part of the DR config create task on both universes.
- Verified that retry works as expected for failover and switchover, for both txn and db-scoped xCluster config types, with one or two DBs in replication:
  - Added fault injection in the middle of some critical subtasks, then removed it and retried the task.
  - Randomly aborted the switchover task and made sure the retry works.
- Made sure the PITR restore works and is faster.

Reviewers: #yba-api-review!, cwang, vbansal, jmak, sanketh

Reviewed By: cwang, vbansal

Subscribers: cwang, sanketh, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D37310
shahrooz1997 committed Sep 11, 2024
1 parent 0d53558 commit ffa537e
Showing 53 changed files with 1,765 additions and 1,094 deletions.
5 changes: 4 additions & 1 deletion managed/RUNTIME-FLAGS.md
@@ -232,6 +232,8 @@
| "Network Load balancer health check paths" | "yb.universe.network_load_balancer.custom_health_check_paths" | "UNIVERSE" | "Paths probed by HTTP/HTTPS health checks performed by the network load balancer. Paths are mapped one-to-one with the custom health check ports runtime configuration." | "String List" |
| "Validate filepath for local release" | "yb.universe.validate_local_release" | "UNIVERSE" | "For certain tasks validates the existence of local filepath for the universe software version." | "Boolean" |
| "The delay before the next poll of the PITR config creation status" | "yb.pitr.create_poll_delay" | "UNIVERSE" | "It is the delay after which the create PITR config subtask rechecks the status of the PITR config creation in each iteration" | "Duration" |
| "The delay before the next poll of the PITR config restore status" | "yb.pitr.restore_poll_delay" | "UNIVERSE" | "It is the delay after which the restore PITR config subtask rechecks the status of the restore operation" | "Duration" |
| "The timeout for restoring a universe using a PITR config" | "yb.pitr.restore_timeout" | "UNIVERSE" | "It is the maximum time that the restore PITR config subtask waits for the restore operation using PITR to be completed; otherwise, it will fail the operation" | "Duration" |
| "The timeout for creating a PITR config" | "yb.pitr.create_timeout" | "UNIVERSE" | "It is the maximum time that the create PITR config subtask waits for the PITR config to be created; otherwise, it will fail the operation" | "Duration" |
| "Default PITR retention period for txn xCluster" | "yb.xcluster.transactional.pitr.default_retention_period" | "UNIVERSE" | "The default retention period used to create PITR configs for transactional xCluster replication; it will be used when there is no existing PITR configs and it is not specified in the task parameters" | "Duration" |
| "Default PITR snapshot interval for txn xCluster" | "yb.xcluster.transactional.pitr.default_snapshot_interval" | "UNIVERSE" | "The default snapshot interval used to create PITR configs for transactional xCluster replication; it will be used when there is no existing PITR configs and it is not specified in the task parameters" | "Duration" |
@@ -241,7 +243,8 @@
| "Sync user-groups between the Universe DB nodes and LDAP Server" | "yb.security.ldap.ldap_universe_sync" | "UNIVERSE" | "If configured, this feature allows users to synchronise user groups configured on the upstream LDAP Server with user roles in YBDB nodes associated with the universe." | "Boolean" |
| "Cluster membership check timeout" | "yb.checks.cluster_membership.timeout" | "UNIVERSE" | "Controls the max time to check that there are no tablets assigned to the node" | "Duration" |
| "Verify current cluster state (from db perspective) before running task" | "yb.task.verify_cluster_state" | "UNIVERSE" | "Verify current cluster state (from db perspective) before running task" | "Boolean" |
| "Wait time for xcluster/DR replication setup and edit RPCs." | "yb.xcluster.operation_timeout" | "UNIVERSE" | "Wait time for xcluster/DR replication setup and edit RPCs." | "Duration" |
| "Wait time for xcluster/DR replication setup and edit RPCs" | "yb.xcluster.operation_timeout" | "UNIVERSE" | "Wait time for xcluster/DR replication setup and edit RPCs." | "Duration" |
| "Maximum timeout for xCluster bootstrap producer RPC call" | "yb.xcluster.bootstrap_producer_timeout" | "UNIVERSE" | "If the RPC call to create the bootstrap streams on the source universe does not return before this timeout, the task will retry with exponential backoff until it fails." | "Duration" |
| "Leaderless tablets check enabled" | "yb.checks.leaderless_tablets.enabled" | "UNIVERSE" | " Whether to run CheckLeaderlessTablets subtask before running universe tasks" | "Boolean" |
| "Leaderless tablets check timeout" | "yb.checks.leaderless_tablets.timeout" | "UNIVERSE" | "Controls the max time out when performing the CheckLeaderlessTablets subtask" | "Duration" |
| "Enable Clock Sync check" | "yb.wait_for_clock_sync.enabled" | "UNIVERSE" | "Enable Clock Sync check" | "Boolean" |
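For context, the three new PITR keys above ("yb.pitr.restore_poll_delay", "yb.pitr.restore_timeout", "yb.pitr.create_timeout") are Duration-typed flags. Below is a minimal, standalone sketch of how such Duration keys parse, using the Typesafe Config library (the same com.typesafe.config.Config that appears in the imports further down); it is not how YBA resolves scoped runtime flags, and the values shown are illustrative assumptions rather than YBA defaults.

```java
// Hedged sketch: parsing Duration-typed keys with Typesafe Config.
// The key names come from the table above; the values are made up for illustration.
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
import java.time.Duration;

public class PitrFlagSketch {
  public static void main(String[] args) {
    Config config =
        ConfigFactory.parseString(
            "yb.pitr.restore_poll_delay = 10s\n"
                + "yb.pitr.restore_timeout = 30m\n"
                + "yb.pitr.create_timeout = 10m");

    Duration pollDelay = config.getDuration("yb.pitr.restore_poll_delay");
    Duration restoreTimeout = config.getDuration("yb.pitr.restore_timeout");
    Duration createTimeout = config.getDuration("yb.pitr.create_timeout");

    // e.g. poll=PT10S, restoreTimeout=PT30M, createTimeout=PT10M
    System.out.printf(
        "poll=%s, restoreTimeout=%s, createTimeout=%s%n", pollDelay, restoreTimeout, createTimeout);
  }
}
```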
@@ -5,6 +5,7 @@
import static com.yugabyte.yw.common.PlatformExecutorFactory.SHUTDOWN_TIMEOUT_MINUTES;

import com.fasterxml.jackson.databind.JsonNode;
import com.google.api.client.util.Throwables;
import com.google.common.util.concurrent.MoreExecutors;
import com.google.common.util.concurrent.ThreadFactoryBuilder;
import com.typesafe.config.Config;
@@ -22,6 +23,7 @@
import com.yugabyte.yw.common.ShellResponse;
import com.yugabyte.yw.common.TableManager;
import com.yugabyte.yw.common.TableManagerYb;
import com.yugabyte.yw.common.UnrecoverableException;
import com.yugabyte.yw.common.Util;
import com.yugabyte.yw.common.YsqlQueryExecutor;
import com.yugabyte.yw.common.alerts.AlertConfigurationService;
@@ -298,40 +300,42 @@ protected boolean doWithConstTimeout(long delayMs, long totalDelayMs, Supplier<B
}

/**
* This function is used to retry a function with a delay between retries. The delay is
* modifiable. The function will be retried on exceptions until the total delay has passed or the
* function returns.
* This function retries a function with a modifiable delay between retries. The function will
* retry on any exception but will not retry if the exception is an UnrecoverableException.
*
* @param delayFunct Function to calculate the delay between retries
* @param totalDelayMs Total delay to wait before giving up
* @param funct Function to retry; must abide by the Runnable interface
* @throws RuntimeException If the function does not return before the total delay
* @throws RuntimeException If the function does not succeed before the total delay, or if an
* UnrecoverableException is thrown.
*/
protected void doWithModifyingTimeout(
Function<Long, Long> delayFunct, long totalDelayMs, Runnable funct) throws RuntimeException {
long currentDelayMs = 0;
long startTime = System.currentTimeMillis();
while (System.currentTimeMillis() < startTime + totalDelayMs - currentDelayMs) {
while (true) {
currentDelayMs = delayFunct.apply(currentDelayMs);
try {
funct.run();
return;
} catch (UnrecoverableException e) {
log.error(
"Won't retry; Unrecoverable error while running the function: {}", e.getMessage());
throw e;
} catch (Exception e) {
log.warn("Will retry; Error while running the function: {}", e.getMessage());
if (System.currentTimeMillis() < startTime + totalDelayMs - currentDelayMs) {
log.warn("Will retry; Error while running the function: {}", e.getMessage());
} else {
log.error("Retry timed out; Error while running the function: {}", e.getMessage());
Throwables.propagate(e);
}
}
currentDelayMs = delayFunct.apply(currentDelayMs);
log.debug(
"Waiting for {} ms between retry, total delay remaining {} ms",
currentDelayMs,
(startTime + totalDelayMs - System.currentTimeMillis()));
totalDelayMs - (System.currentTimeMillis() - startTime));
waitFor(Duration.ofMillis(currentDelayMs));
}
// Retry for the last time and then throw the exception that funct raised.
try {
funct.run();
} catch (Exception e) {
log.error("Retry timed out; Error while running the function: {}", e.getMessage());
throw new RuntimeException(e);
}
}

protected void doWithConstTimeout(long delayMs, long totalDelayMs, Runnable funct) {
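For readers unfamiliar with the retry contract documented above, here is a self-contained sketch that mirrors it: retry with a caller-supplied delay function until the total delay elapses, but stop immediately on an unrecoverable failure. The names here (RetrySketch, Unrecoverable) are hypothetical stand-ins, not the YBA classes, and the exponential-backoff parameters are made up for illustration.

```java
// Hedged sketch of the retry-with-modifiable-delay pattern described in the Javadoc above.
import java.time.Duration;
import java.util.function.Function;

public class RetrySketch {
  /** Stand-in for com.yugabyte.yw.common.UnrecoverableException. */
  static class Unrecoverable extends RuntimeException {
    Unrecoverable(String msg) {
      super(msg);
    }
  }

  static void doWithModifyingTimeout(
      Function<Long, Long> delayFunct, long totalDelayMs, Runnable funct) {
    long currentDelayMs = 0;
    long startTime = System.currentTimeMillis();
    while (true) {
      currentDelayMs = delayFunct.apply(currentDelayMs);
      try {
        funct.run();
        return;
      } catch (Unrecoverable e) {
        throw e; // Do not retry unrecoverable failures.
      } catch (Exception e) {
        if (System.currentTimeMillis() >= startTime + totalDelayMs - currentDelayMs) {
          throw new RuntimeException("Retry timed out", e);
        }
      }
      try {
        Thread.sleep(currentDelayMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(ie);
      }
    }
  }

  public static void main(String[] args) {
    // Exponential backoff: 1s, 2s, 4s, then capped at 8s, for at most 30s in total.
    Function<Long, Long> backoff = prev -> prev == 0 ? 1000L : Math.min(prev * 2, 8000L);
    final int[] attempts = {0};
    doWithModifyingTimeout(
        backoff,
        Duration.ofSeconds(30).toMillis(),
        () -> {
          // Fails transiently on the first two attempts; a real subtask would poll
          // the PITR restore status here instead.
          if (++attempts[0] < 3) {
            throw new RuntimeException("transient failure, attempt " + attempts[0]);
          }
          System.out.println("succeeded on attempt " + attempts[0]);
        });
  }
}
```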
@@ -138,6 +138,7 @@ private Set<String> getTableIdsToAdd(
}

private Set<String> getTableIdsToRemove(
XClusterConfig xClusterConfig,
Set<String> tableIdsInReplication,
Set<String> tableIdsInYbaXClusterConfig,
Set<String> sourceUniverseTableIds,
@@ -160,7 +161,7 @@ private Set<String> getTableIdsToRemove(
// Exclude tables that have no associated xClusterTableConfig or have a null
// streamId.
Optional<XClusterTableConfig> xClusterTableConfig =
XClusterTableConfig.maybeGetByTableId(tableId);
xClusterConfig.maybeGetTableById(tableId);
if (xClusterTableConfig.isEmpty()
|| xClusterTableConfig.get().getStreamId() == null) {
return false;
@@ -255,6 +256,7 @@ public void compareTablesAndSyncXClusterConfig(XClusterConfig config) {

Set<String> tableIdsToRemove =
getTableIdsToRemove(
config,
tableIdsInReplication,
tableIdsInYbaXClusterConfig,
sourceUniverseTableIds,
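The change above swaps a static lookup (XClusterTableConfig.maybeGetByTableId) for a per-config lookup (xClusterConfig.maybeGetTableById), presumably so the result is unambiguous when the same table ID is tracked by more than one xCluster config. The sketch below illustrates the filtering idea with hypothetical stand-in types (TableRemovalSketch, XClusterCfg, TableCfg); the real method also takes the replication and source-universe table-ID sets into account.

```java
// Hedged sketch: only table IDs that this particular config tracks with a non-null
// stream ID are candidates for removal from replication.
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

public class TableRemovalSketch {
  record TableCfg(String tableId, String streamId) {}

  static class XClusterCfg {
    private final Map<String, TableCfg> tables;

    XClusterCfg(Map<String, TableCfg> tables) {
      this.tables = tables;
    }

    // Mirrors XClusterConfig#maybeGetTableById from the diff: look up within THIS config only.
    Optional<TableCfg> maybeGetTableById(String tableId) {
      return Optional.ofNullable(tables.get(tableId));
    }
  }

  static Set<String> tableIdsToRemove(
      XClusterCfg config, Set<String> tableIdsInYbaConfig, Set<String> tableIdsInReplication) {
    return tableIdsInYbaConfig.stream()
        .filter(id -> !tableIdsInReplication.contains(id))
        // Exclude tables with no per-config entry or with a null stream ID.
        .filter(id -> config.maybeGetTableById(id).map(t -> t.streamId() != null).orElse(false))
        .collect(Collectors.toSet());
  }

  public static void main(String[] args) {
    XClusterCfg config =
        new XClusterCfg(
            Map.of(
                "t1", new TableCfg("t1", "stream-1"),
                "t2", new TableCfg("t2", null)));
    // t1 is tracked with a stream but no longer replicated, so it is scheduled for removal;
    // t2 has no stream yet, so it is left alone. Prints: [t1]
    System.out.println(tableIdsToRemove(config, Set.of("t1", "t2"), Set.of()));
  }
}
```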
@@ -44,11 +44,19 @@ protected CreatePitrConfigParams taskParams() {
@Override
public String getName() {
return String.format(
"%s(universeUuid=%s,tableType=%s,keyspaceName=%s)",
"%s(universeUuid=%s, customerUuid=%s, name=%s, keyspaceName=%s, tableType=%s,"
+ " retentionPeriodInSeconds=%s, intervalInSeconds=%d, xClusterConfig=%s,"
+ " createdForDr=%s)",
super.getName(),
taskParams().getUniverseUUID(),
taskParams().customerUUID,
taskParams().name,
taskParams().keyspaceName,
taskParams().tableType,
taskParams().keyspaceName);
taskParams().retentionPeriodInSeconds,
taskParams().intervalInSeconds,
taskParams().xClusterConfig,
taskParams().createdForDr);
}

@Override