Commit b3ee38d

Merge pull request #1197 from mrlihanbo/failover-policy
add docs: failover policy overview
2 parents b3f3286 + 9fd3dd9 commit b3ee38d

2 files changed, +208 -0 lines changed

docs/README.md

+1
@@ -13,6 +13,7 @@ Refer to [Installing Karmada](./installation/installation.md).
- [Cluster Registration](./userguide/cluster-registration.md)
- [Aggregated Kubernetes API Endpoint](./userguide/aggregated-api-endpoint.md)
- [Cluster Failover](./userguide/failover.md)

## Best Practices

docs/userguide/failover.md

+207
@@ -0,0 +1,207 @@
# Failover Overview

## Monitor the cluster health status

Karmada supports both `Push` and `Pull` modes for managing member clusters.

For more details about cluster registration, refer to [Cluster Registration](./cluster-registration.md#cluster-registration).

### Determining failures

Karmada uses two forms of heartbeat for clusters:
- Updates to the `.status` of a `Cluster`.
- `Lease` objects within the `karmada-cluster` namespace of the Karmada control plane. Each cluster has an associated `Lease` object.

#### Cluster status collection

For `Push` mode clusters, the cluster status controller in the Karmada control plane continually collects each cluster's status at a configured interval.

For `Pull` mode clusters, the `karmada-agent` is responsible for creating and updating the `.status` of the cluster at a configured interval.

The interval for `.status` updates to a `Cluster` can be configured via the `--cluster-status-update-frequency` flag (default: 10 seconds).

A cluster might be set to the `NotReady` state under any of the following conditions:
- The cluster is unreachable (the check is retried 4 times within 2 seconds).
- The cluster's health endpoint responded with something other than `ok`.
- Collecting the cluster status failed, including the Kubernetes version, installed APIs, resource usage, and so on.
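
The reachability check in the first condition can be pictured as a bounded retry loop. The following Python sketch is illustrative only and is not Karmada's implementation: the function name `is_reachable`, the `probe` callback, and the even spacing of attempts across the 2-second window are all assumptions made for this example.

```python
import time
from typing import Callable


def is_reachable(
    probe: Callable[[], bool],
    attempts: int = 4,
    window_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as one probe succeeds, False if every attempt fails.

    `probe` is a hypothetical callback that contacts the cluster's health
    endpoint and returns True when the cluster answers `ok`.
    """
    # Assumption: attempts are spread evenly over the retry window.
    interval = window_seconds / attempts
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            # A connection error counts as a failed attempt.
            pass
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

A caller would treat a `False` result as one reason to report the cluster as `NotReady`.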
#### Lease updates
Karmada creates a `Lease` object and a lease controller for each cluster when the cluster joins.

Each lease controller is responsible for updating its related `Lease`. The lease renewal interval can be configured via the `--cluster-lease-duration` and `--cluster-lease-renew-interval-fraction` flags (default renewal interval: 10 seconds).

The lease update process is independent of the cluster status update process, since the cluster's `.status` field is maintained by the cluster status controller.

The cluster controller in the Karmada control plane checks the state of each cluster every `--cluster-monitor-period` (default: 5 seconds).

The cluster's `Ready` condition is changed to `Unknown` when the cluster controller has not heard from the cluster within the last `--cluster-monitor-grace-period` (default: 40 seconds).
39+
### Check cluster status
40+
You can use `kubectl` to check a Cluster's status and other details:
41+
```
42+
kubectl describe cluster <cluster-name>
43+
```
44+
45+
The `Ready` condition in `Status` field indicates the cluster is healthy and ready to accept workloads.
46+
It will be set to `False` if the cluster is not healthy and is not accepting workloads, and `Unknown` if the cluster controller has not heard from the cluster in the last `cluster-monitor-grace-period`.
47+
48+
The following example describes an unhealthy cluster:
49+
```
50+
kubectl describe cluster member1
51+
52+
Name: member1
53+
Namespace:
54+
Labels: <none>
55+
Annotations: <none>
56+
API Version: cluster.karmada.io/v1alpha1
57+
Kind: Cluster
58+
Metadata:
59+
Creation Timestamp: 2021-12-29T08:49:35Z
60+
Finalizers:
61+
karmada.io/cluster-controller
62+
Resource Version: 152047
63+
UID: 53c133ab-264e-4e8e-ab63-a21611f7fae8
64+
Spec:
65+
API Endpoint: https://172.23.0.7:6443
66+
Impersonator Secret Ref:
67+
Name: member1-impersonator
68+
Namespace: karmada-cluster
69+
Secret Ref:
70+
Name: member1
71+
Namespace: karmada-cluster
72+
Sync Mode: Push
73+
Status:
74+
Conditions:
75+
Last Transition Time: 2021-12-31T03:36:08Z
76+
Message: cluster is not reachable
77+
Reason: ClusterNotReachable
78+
Status: False
79+
Type: Ready
80+
Events: <none>
81+
```

## Failover feature of Karmada
The failover feature is controlled by the `Failover` feature gate. To use it, enable the `Failover` feature gate on the Karmada scheduler:
```
--feature-gates=Failover=true
```

### Concept

When member clusters are determined to be unhealthy, the Karmada scheduler reschedules the affected applications.
There are several constraints:
- Each rescheduled application must still satisfy the restrictions of its PropagationPolicy, such as ClusterAffinity or SpreadConstraints.
- Applications placed on clusters that are still ready after the initial scheduling remain in place during failover scheduling.

#### Duplicated schedule type
For the `Duplicated` schedule policy, when the number of candidate clusters that meet the PropagationPolicy restrictions is not less than the number of failed clusters, the application is rescheduled to as many candidate clusters as there are failed clusters. Otherwise, no rescheduling takes place.

Take a `Deployment` as an example:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
---
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
        - member3
        - member5
    spreadConstraints:
      - maxGroups: 2
        minGroups: 2
    replicaScheduling:
      replicaSchedulingType: Duplicated
```

Suppose there are 5 member clusters, and the initial scheduling result is member1 and member2. When member2 fails, rescheduling is triggered.

Note that rescheduling does not delete the application on the ready cluster member1. Of the remaining 3 clusters, only member3 and member5 match the `clusterAffinity` policy.

Because of the `spreadConstraints` limits, the final result can be [member1, member3] or [member1, member5].
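
The `Duplicated` rule can be traced with a short Python sketch. This is a simplified model, not the Karmada scheduler: the function `reschedule_duplicated` is hypothetical, it assumes the five clusters are named member1 through member5, and it always picks candidates in affinity-list order, whereas the real scheduler may choose either valid candidate.

```python
from typing import List, Optional, Set


def reschedule_duplicated(
    scheduled: List[str],
    failed: Set[str],
    ready: Set[str],
    affinity: List[str],
) -> Optional[List[str]]:
    """Return the new target clusters, or None when no rescheduling happens."""
    # Clusters that are still ready keep their copy of the application.
    kept = [c for c in scheduled if c not in failed]
    lost = len(scheduled) - len(kept)
    # Candidates must match the affinity, be ready, and not already be used.
    candidates = [c for c in affinity if c in ready and c not in scheduled]
    if len(candidates) < lost:
        return None  # Fewer candidates than failed clusters: no rescheduling.
    # Replacing one-for-one keeps the number of target clusters unchanged,
    # so a spread constraint satisfied before the failure stays satisfied.
    return kept + candidates[:lost]


result = reschedule_duplicated(
    scheduled=["member1", "member2"],
    failed={"member2"},
    ready={"member1", "member3", "member4", "member5"},
    affinity=["member1", "member2", "member3", "member5"],
)
print(result)  # ['member1', 'member3']
```

In this model member4 is ready but never chosen because it is not listed in `clusterNames`, which mirrors the example above.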

#### Divided schedule type
For the `Divided` schedule policy, the Karmada scheduler tries to migrate replicas to the other healthy clusters.

Take a `Deployment` as an example:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
---
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        staticWeightList:
          - targetCluster:
              clusterNames:
                - member1
            weight: 1
          - targetCluster:
              clusterNames:
                - member2
            weight: 2
```

The Karmada scheduler divides the replicas according to the `weightPreference`. The initial scheduling result is member1 with 1 replica and member2 with 2 replicas.

When member1 fails, rescheduling is triggered. The Karmada scheduler tries to migrate the replicas to the other healthy clusters. The final result is member2 with 3 replicas.
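
The arithmetic behind this example can be sketched in Python. This is an illustration under stated assumptions, not Karmada's algorithm: the helpers `divide_replicas` and `failover_divided` are hypothetical, leftover replicas are handed out by largest remainder, and after a failure the replicas are simply re-divided among the healthy clusters by their static weights.

```python
from typing import Dict, Set


def divide_replicas(total: int, weights: Dict[str, int]) -> Dict[str, int]:
    """Split `total` replicas across clusters in proportion to their weights."""
    weight_sum = sum(weights.values())
    if weight_sum == 0:
        return {}
    # Whole-number share for each cluster.
    shares = {name: total * w // weight_sum for name, w in weights.items()}
    leftover = total - sum(shares.values())
    # Assumption: remaining replicas go to the largest fractional remainders,
    # with ties broken by cluster name.
    by_remainder = sorted(
        weights, key=lambda n: (-(total * weights[n] % weight_sum), n)
    )
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


def failover_divided(
    total: int, weights: Dict[str, int], failed: Set[str]
) -> Dict[str, int]:
    """Re-divide all replicas among the clusters that are still healthy."""
    healthy = {n: w for n, w in weights.items() if n not in failed}
    return divide_replicas(total, healthy)


weights = {"member1": 1, "member2": 2}
print(divide_replicas(3, weights))                # {'member1': 1, 'member2': 2}
print(failover_divided(3, weights, {"member1"}))  # {'member2': 3}
```

The two printed results correspond to the initial scheduling result and the result after member1 fails, as described above.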
