
Availability Zone Fault Tolerance - Documentation Proposal #17120

Open
aruraghuwanshi opened this issue Sep 19, 2024 · 0 comments
aruraghuwanshi (Contributor) commented Sep 19, 2024

Description

As of today, Druid doesn't provide availability zone (AZ) fault tolerance out of the box. When an AZ goes down, it takes down every Historical running in that AZ, which can result in segment unavailability. Having multiple replicas of a segment distributed across Historicals also doesn't help if all of those replicas live on Historicals in the same AZ (for example, within the same Historical StatefulSet).

Suppose there is a complete AZ failure in us-east-2a and all pods there go down. Routers and Brokers are stateless, so querying would continue to work. A new Coordinator/Overlord would be elected from either us-east-2b or us-east-2c. MiddleManagers may incur some ingestion latency, but that should be remediated once the new Coordinator/Overlord is elected and directs the MiddleManagers in us-east-2b and us-east-2c to consume from the affected partitions (assuming there is capacity). The problem we're tackling here is how to achieve high data availability for the segments stored on Historicals in us-east-2a. Although one can configure multiple replicas for each segment, the only guarantee Druid provides is that no two replicas are loaded onto the same Historical server. This means that Historicals in the same AZ could hold all the replicas, and during an AZ failure that data would be unavailable for querying. To solve this problem we need to ensure that each replica is stored on a Historical in a different AZ.
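For context, replication is driven by the Coordinator's load (retention) rules, which only say how many copies of a segment go to each tier; they say nothing about which AZ the chosen Historicals sit in. A minimal sketch of the single-tier case (this mirrors the default loadForever rule; the replica count is illustrative):

```json
[
  {
    "type": "loadForever",
    "tieredReplicants": {
      "_default_tier": 2
    }
  }
]
```

With a single tier like this, both replicas can land on Historicals that happen to be in the same AZ.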

Proposal

To counter the issue with Historicals and replicas, the following changes need to be made in the deployment strategy (a configuration sketch follows the list):

  1. Create 2 or more tiers of Historical StatefulSets.
  2. Assign replica1 to tier1, replica2 to tier2, ... replicaN to tierN.
  3. Create a common PodDisruptionBudget covering the N Historical tiers/StatefulSets, with maxUnavailable set to N-1, where N is the number of segment replicas.
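A rough configuration sketch follows. The tier names, labels, and StatefulSet names are illustrative assumptions, and each tier's StatefulSet is assumed to be pinned to its own AZ (for example via node affinity on the zone topology label). Each tier's Historicals declare their tier in runtime.properties:

```properties
# Historicals in the StatefulSet pinned to us-east-2a (tier name is illustrative)
druid.server.tier=tier-az-a
```

A load rule then places one replica in each tier, forcing the replicas into different AZs:

```json
[
  {
    "type": "loadForever",
    "tieredReplicants": {
      "tier-az-a": 1,
      "tier-az-b": 1
    }
  }
]
```

Finally, a single PodDisruptionBudget covers the Historical pods of all tiers (here N=2 replicas, so maxUnavailable is 1; the labels are assumed to be shared by every Historical tier):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: druid-historicals-pdb
spec:
  maxUnavailable: 1          # N-1, where N = number of segment replicas
  selector:
    matchLabels:
      app: druid
      component: historical  # label assumed common to all Historical tiers
```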

Another use case this kind of deployment addresses is the ability to roll out changes much faster.
Issue:

  • Historical StatefulSets usually take the longest when restarting or updating the Druid image.
  • To maintain availability, a rolling restart is usually recommended; on bigger clusters this makes a change take many hours to take effect.

With the proposed deployment setup:

  • It's possible to restart an entire Historical StatefulSet in one go, and then move on to the next.
  • This means the pods belonging to that StatefulSet restart in parallel, cutting the time it takes to update a large chunk of Historicals.
  • No segment unavailability, despite multiple Historicals being pulled down at once (a restart sketch follows below).
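A hedged sketch of a tier-at-a-time restart, assuming the StatefulSets are named druid-historical-tier-a / druid-historical-tier-b, carry a per-tier label, and use podManagementPolicy: Parallel so that recreated pods come up concurrently:

```bash
# Bounce all of tier A's Historicals in one go; the other tier keeps serving every segment.
kubectl delete pod -l app=druid,component=historical,tier=tier-az-a

# Wait for tier A to be fully back before touching tier B
# (in practice you may need a short sleep/retry while the pods are being recreated).
kubectl wait --for=condition=Ready pod \
  -l app=druid,component=historical,tier=tier-az-a --timeout=30m

# Repeat for the next tier.
kubectl delete pod -l app=druid,component=historical,tier=tier-az-b
kubectl wait --for=condition=Ready pod \
  -l app=druid,component=historical,tier=tier-az-b --timeout=30m
```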

Motivation

  • This strategy could give users guidance on how to make their cluster AZ fault tolerant.
  • The cluster would not lose segment availability, keeping query results reliable.
  • It also protects against other scenarios where more than one pod of the same Historical StatefulSet is taken down, which would otherwise result in segment unavailability.
  • Another benefit is giving users the option of faster deployments.

To conclude, I would like to propose adding these details (further enriched) to the High Availability section of the Druid documentation.
