Starting with OpenShift version 4.4.8, clusters can be stopped and started by simply shutting down their machines and starting them back up. The only additional requirement when bringing a cluster back up from a stopped state is that CSRs must be approved for the cluster's nodes if certificates have expired since the cluster was last active. This typically happens if the cluster is hibernated within its first 24 hours, before the initial certificate rotation takes place. After this phase, certificates are valid for 30 days.
Hive can automate stopping and starting clusters through its API by allowing the user to set the desired state of the cluster in the ClusterDeployment spec. Both API and controller changes are required to support this feature.
$ oc patch cd mycluster --type='merge' -p $'spec:\n powerState: Hibernating'
$ oc patch cd mycluster --type='merge' -p $'spec:\n powerState: Running'
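The same change can be expressed as a JSON merge patch, which is easier to build from a program than the YAML form. The following is a minimal sketch that only constructs the patch body; the helper name `powerStatePatch` is hypothetical, and only the `spec.powerState` path is taken from the commands above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// powerStatePatch (hypothetical helper) builds a JSON merge patch that sets
// spec.powerState on a ClusterDeployment to the given value.
func powerStatePatch(state string) (string, error) {
	patch := map[string]map[string]string{
		"spec": {"powerState": state},
	}
	b, err := json.Marshal(patch)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	for _, state := range []string{"Hibernating", "Running"} {
		p, err := powerStatePatch(state)
		if err != nil {
			panic(err)
		}
		// Print an equivalent oc invocation using the JSON form of the patch.
		fmt.Printf("oc patch cd mycluster --type=merge -p '%s'\n", p)
	}
}
```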
The ClusterDeploymentStatus defines the observed state of a ClusterDeployment.
Two noteworthy fields of ClusterDeploymentStatus are:
- PowerState - indicates the observed ClusterPowerState of the cluster
- Conditions - provides further details regarding the status of the cluster
type ClusterDeploymentStatus struct {
// ... other fields
// PowerState indicates the powerstate of cluster
// +optional
PowerState ClusterPowerState `json:"powerState,omitempty"`
// Conditions includes more detailed status for the cluster deployment
// +optional
Conditions []ClusterDeploymentCondition `json:"conditions,omitempty"`
}
The following are stable ClusterPowerStates:
type ClusterPowerState string
const (
// ClusterPowerStateRunning indicates that all machines are running.
// This is the default state of a cluster after installation has completed
ClusterPowerStateRunning ClusterPowerState = "Running"
// ClusterPowerStateHibernating indicates that the machines belonging to a cluster
// are stopped.
ClusterPowerStateHibernating ClusterPowerState = "Hibernating"
)
The following indicate a terminal error with the power state of the cluster:
type ClusterPowerState string
const (
// ClusterPowerStateFailedToStartMachines indicates a failure to start cloud VMs
ClusterPowerStateFailedToStartMachines ClusterPowerState = "FailedToStartMachines"
// ClusterPowerStateFailedToStop indicates that stopping machines for hibernation
// resulted in an error
ClusterPowerStateFailedToStop ClusterPowerState = "FailedToStop"
)
Finally, the following denote transitionary or waiting states that can occur before a stable state is reached:
type ClusterPowerState string
const (
// ClusterPowerStateSyncSetsNotApplied indicates SyncSets have not yet been applied
// for the cluster. This state is derived from ClusterSync.Status.FirstSuccessTime
ClusterPowerStateSyncSetsNotApplied ClusterPowerState = "SyncSetsNotApplied"
// ClusterPowerStateStartingMachines indicates an attempt to list and start cloud VMs
ClusterPowerStateStartingMachines ClusterPowerState = "StartingMachines"
// ClusterPowerStateStopping indicates the cluster is transitioning
// from a Running state to a Hibernating state.
ClusterPowerStateStopping ClusterPowerState = "Stopping"
// ClusterPowerStateWaitingForMachinesToStop is used when waiting for cloud VMs to stop
ClusterPowerStateWaitingForMachinesToStop ClusterPowerState = "WaitingForMachinesToStop"
// ClusterPowerStateWaitingForMachines is used when waiting for cloud VMs to start
ClusterPowerStateWaitingForMachines ClusterPowerState = "WaitingForMachines"
// ClusterPowerStateWaitingForNodes is used when waiting for nodes to become Ready
ClusterPowerStateWaitingForNodes ClusterPowerState = "WaitingForNodes"
// ClusterPowerStatePausingForClusterOperatorsToSettle is used when pausing to let ClusterOperators start and post new status before we check it.
ClusterPowerStatePausingForClusterOperatorsToSettle ClusterPowerState = "PausingForClusterOperatorsToSettle"
// ClusterPowerStateWaitingForClusterOperators is used when waiting for ClusterOperators to
// get to a good state. (Available=True, Progressing=False, Degraded=False)
ClusterPowerStateWaitingForClusterOperators ClusterPowerState = "WaitingForClusterOperators"
// ClusterPowerStateUnknown indicates that we can't/won't discover the state of the cluster's cloud machines.
ClusterPowerStateUnknown = "Unknown"
)
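The three groups above can be told apart mechanically, which is handy when polling a cluster's status. The following is a minimal sketch, assuming the grouping exactly as listed in this document; the helper `powerStateKind` is hypothetical and is not part of the Hive API. Any value that is neither stable nor a listed error, including Unknown, falls into the transitional group.

```go
package main

import "fmt"

// ClusterPowerState mirrors the string type shown above.
type ClusterPowerState string

// powerStateKind (hypothetical helper) groups a power state the way this
// document does: stable, error, or transitional.
func powerStateKind(s ClusterPowerState) string {
	switch s {
	case "Running", "Hibernating":
		return "stable"
	case "FailedToStartMachines", "FailedToStop":
		return "error"
	default:
		// Everything else is treated as a transitional or waiting state.
		return "transitional"
	}
}

func main() {
	for _, s := range []ClusterPowerState{"Running", "FailedToStop", "WaitingForNodes"} {
		fmt.Printf("%s: %s\n", s, powerStateKind(s))
	}
}
```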
The ClusterDeploymentCondition details additional information about the state of the cluster, and has the following typical structure:
// ClusterDeploymentCondition contains details for the current condition of a cluster deployment
type ClusterDeploymentCondition struct {
// ... other fields
// Type is the type of the condition.
Type ClusterDeploymentConditionType `json:"type"`
// Status is the status of the condition.
Status corev1.ConditionStatus `json:"status"`
// Reason is a unique, one-word, CamelCase reason for the condition's last transition.
// +optional
Reason string `json:"reason,omitempty"`
}
The Type combined with the Status is the primary indicator of the state.
Example:
status:
...
conditions:
- ...
status: "False"
type: Hibernating
When the Hibernating condition is False it means that the cluster is not hibernating.
The Ready and Hibernating status conditions should be considered in combination to determine the state of a cluster. If the Status of either is True, the cluster is in a stable state (either hibernated or running). When both are False, the cluster is transitioning between stable states, and the Reason and Message fields will provide more details.
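The rule above can be sketched as a small function. This is an illustration only: the `condition` struct is a simplified stand-in for ClusterDeploymentCondition, and `summarize` is a hypothetical helper. When both conditions are False, it reports the Reason from whichever condition carries the detail, following the ResumingOrRunning reason described later in this document. It does not distinguish failure reasons from ordinary transitions.

```go
package main

import "fmt"

// condition is a simplified stand-in for ClusterDeploymentCondition.
type condition struct {
	Type   string
	Status string
	Reason string
}

// findCondition returns the condition of the given type, or nil if absent.
func findCondition(conds []condition, condType string) *condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// summarize (hypothetical helper) combines the Ready and Hibernating conditions.
func summarize(conds []condition) string {
	ready := findCondition(conds, "Ready")
	hib := findCondition(conds, "Hibernating")
	switch {
	case hib != nil && hib.Status == "True":
		return "hibernating"
	case ready != nil && ready.Status == "True":
		return "running"
	case ready != nil && hib != nil && ready.Status == "False" && hib.Status == "False":
		// ResumingOrRunning on Hibernating means the detail is on Ready.
		reason := hib.Reason
		if reason == "ResumingOrRunning" {
			reason = ready.Reason
		}
		return "transitioning (" + reason + ")"
	default:
		return "unknown"
	}
}

func main() {
	conds := []condition{
		{Type: "Hibernating", Status: "False", Reason: "ResumingOrRunning"},
		{Type: "Ready", Status: "False", Reason: "WaitingForNodes"},
	}
	fmt.Println(summarize(conds))
}
```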
The following values can be used for the Reason field of ClusterDeploymentCondition:
// Cluster hibernating and ready reasons
const (
// HibernatingReasonResumingOrRunning is used as the reason for the Hibernating condition when the cluster
// is resuming or running. Precise details are available in the Ready condition.
HibernatingReasonResumingOrRunning = "ResumingOrRunning"
// HibernatingReasonStopping is used as the reason when the cluster is transitioning
// from a Running state to a Hibernating state.
HibernatingReasonStopping = string(ClusterPowerStateStopping)
// HibernatingReasonWaitingForMachinesToStop is used on the Hibernating condition when waiting for cloud VMs to stop
HibernatingReasonWaitingForMachinesToStop = string(ClusterPowerStateWaitingForMachinesToStop)
// HibernatingReasonHibernating is used as the reason when the cluster is in a
// Hibernating state.
HibernatingReasonHibernating = string(ClusterPowerStateHibernating)
// HibernatingReasonUnsupported is used as the reason when the cluster spec
// specifies that the cluster be moved to a Hibernating state, but the cloud provider of
// the cluster is not supported.
HibernatingReasonUnsupported = "Unsupported"
// HibernatingReasonFailedToStop is used when there was an error stopping machines
// to enter hibernation
HibernatingReasonFailedToStop = string(ClusterPowerStateFailedToStop)
// HibernatingReasonSyncSetsNotApplied is used as the reason when SyncSets have not yet been applied
// for the cluster based on ClusterSync.Status.FirstSuccessTime
HibernatingReasonSyncSetsNotApplied = string(ClusterPowerStateSyncSetsNotApplied)
// HibernatingReasonSyncSetsApplied means SyncSets have been successfully applied at some point.
// (It does not necessarily mean they are currently copacetic -- check ClusterSync status
// for that.)
HibernatingReasonSyncSetsApplied = "SyncSetsApplied"
// HibernatingReasonPowerStatePaused indicates that we can't/won't discover the state of the
// cluster's cloud machines because the powerstate-paused annotation is set.
HibernatingReasonPowerStatePaused = "PowerStatePaused"
// HibernatingReasonClusterDeploymentDeleted indicates that a Cluster Deployment has been deleted
// and that the cluster is deprovisioning unless preserveOnDelete is set to true.
HibernatingReasonClusterDeploymentDeleted = "ClusterDeploymentDeleted"
// ReadyReasonStoppingOrHibernating is used as the reason for the Ready condition when the cluster
// is stopping or hibernating. Precise details are available in the Hibernating condition.
ReadyReasonStoppingOrHibernating = "StoppingOrHibernating"
// ReadyReasonStartingMachines is used to reflect attempt to list and start cloud VMs
ReadyReasonStartingMachines = string(ClusterPowerStateStartingMachines)
// ReadyReasonFailedToStartMachines is used when there was an error starting machines
// to leave hibernation
ReadyReasonFailedToStartMachines = string(ClusterPowerStateFailedToStartMachines)
// ReadyReasonWaitingForMachines is used on the Ready condition when waiting for cloud VMs to start.
ReadyReasonWaitingForMachines = string(ClusterPowerStateWaitingForMachines)
// ReadyReasonWaitingForNodes is used on the Ready condition when waiting for nodes to become Ready.
ReadyReasonWaitingForNodes = string(ClusterPowerStateWaitingForNodes)
// ReadyReasonPausingForClusterOperatorsToSettle is used on the Ready condition when pausing to let ClusterOperators start and post new status before we check it.
ReadyReasonPausingForClusterOperatorsToSettle = string(ClusterPowerStatePausingForClusterOperatorsToSettle)
// ReadyReasonWaitingForClusterOperators is used on the Ready condition when waiting for ClusterOperators to
// get to a good state. (Available=True, Progressing=False, Degraded=False)
ReadyReasonWaitingForClusterOperators = string(ClusterPowerStateWaitingForClusterOperators)
// ReadyReasonRunning is used on the Ready condition as the reason when the cluster is running and ready
ReadyReasonRunning = string(ClusterPowerStateRunning)
// ReadyReasonPowerStatePaused indicates that we can't/won't discover the state of the
// cluster's cloud machines because the powerstate-paused annotation is set.
ReadyReasonPowerStatePaused = "PowerStatePaused"
// ReadyReasonClusterDeploymentDeleted indicates that a Cluster Deployment has been deleted
// and that the cluster is deprovisioning unless preserveOnDelete is set to true.
ReadyReasonClusterDeploymentDeleted = "ClusterDeploymentDeleted"
)
type ConditionStatus string
// These are valid condition statuses. "ConditionTrue" means a resource is in the condition.
// "ConditionFalse" means a resource is not in the condition. "ConditionUnknown" means kubernetes
// can't decide if a resource is in the condition or not. In the future, we could add other
// intermediate conditions, e.g. ConditionDegraded.
const (
ConditionTrue ConditionStatus = "True"
ConditionFalse ConditionStatus = "False"
ConditionUnknown ConditionStatus = "Unknown"
)
As opposed to the ClusterDeploymentStatus, which indicates the current state of the ClusterDeployment, the ClusterDeploymentSpec defines the desired state of the ClusterDeployment.
ClusterDeploymentSpec has two noteworthy fields that are relevant to hibernation:
// ClusterDeploymentSpec defines the desired state of ClusterDeployment
type ClusterDeploymentSpec struct {
// ... other fields
// PowerState indicates whether a cluster should be running or hibernating. When omitted,
// PowerState defaults to the Running state.
PowerState ClusterPowerState `json:"powerState,omitempty"`
// HibernateAfter will transition a cluster to hibernating power state after it has been running for the
// given duration. The time that a cluster has been running is the time since the cluster was installed or
// the time since the cluster last came out of hibernation.
HibernateAfter *metav1.Duration `json:"hibernateAfter,omitempty"`
}
The PowerState indicates the desired power state of the cluster.
- To hibernate a running cluster, set this field to "Hibernating".
- To resume a hibernating cluster, set this field to "Running".
The HibernateAfter field hibernates a cluster after a given duration (specified as a metav1.Duration value). Set this field to the length of time the cluster should stay in the Running state. Once this time has passed, the cluster transitions to the Hibernating state.
NOTE: While this parameter is set, each time the cluster is resumed post-hibernation, it will only remain running for the given duration.
The cluster hibernation controller watches ClusterDeployments and ensures that the target cluster's machines reflect the power state specified in the ClusterDeployment spec.
High level flow of controller logic:
The controller uses the actuator pattern to power up and power down machine instances as well as determining machine state (running or stopped).
Option 1 (Preferred): The hibernation controller relies on the actuator to select the machines used by the cluster. Given a ClusterDeployment's InfraID, the actuator selects machines using a method appropriate to the cloud provider (tags, name prefix, or resource group).
Option 2: The hibernation controller uses the machine API on the target cluster to determine which machines belong to the cluster. It then stores the machine IDs in the clusterdeployment (or a separate CR), then uses those machine IDs to start the cluster again.
The hibernation controller relies on an actuator to work with cloud provider machines. This is the interface for the actuator:
type HibernationActuator interface {
// CanHandle returns true if the actuator can handle a particular ClusterDeployment
CanHandle(cd *hivev1.ClusterDeployment) bool
// StopMachines will stop machines belonging to the given ClusterDeployment
StopMachines(logger log.FieldLogger, cd *hivev1.ClusterDeployment, hiveClient client.Client) error
// StartMachines will start machines belonging to the given ClusterDeployment
StartMachines(logger log.FieldLogger, cd *hivev1.ClusterDeployment, hiveClient client.Client) error
// MachinesRunning will return true if the machines associated with the given
// ClusterDeployment are in a running state.
MachinesRunning(logger log.FieldLogger, cd *hivev1.ClusterDeployment, hiveClient client.Client) (bool, error)
// MachinesStopped will return true if the machines associated with the given
// ClusterDeployment are in a stopped state.
MachinesStopped(logger log.FieldLogger, cd *hivev1.ClusterDeployment, hiveClient client.Client) (bool, error)
}
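To show how a controller might drive such an actuator, here is a minimal, self-contained sketch. Everything in it is a simplification for illustration: `cluster` stands in for hivev1.ClusterDeployment, the `actuator` interface drops the logger and client arguments, `fakeActuator` changes machine state instantly, and `reconcileOnce` is a hypothetical single pass that covers only the machine-level steps (not node readiness or ClusterOperators). It is not the Hive controller's actual logic.

```go
package main

import "fmt"

// cluster is a simplified stand-in for hivev1.ClusterDeployment.
type cluster struct {
	Platform          string
	DesiredPowerState string // "Running" or "Hibernating"
}

// actuator is a trimmed version of HibernationActuator for this sketch.
type actuator interface {
	CanHandle(c cluster) bool
	StopMachines(c cluster) error
	StartMachines(c cluster) error
	MachinesRunning(c cluster) (bool, error)
	MachinesStopped(c cluster) (bool, error)
}

// fakeActuator pretends that machines change state instantly.
type fakeActuator struct {
	platform string
	running  bool
}

func (f *fakeActuator) CanHandle(c cluster) bool                { return c.Platform == f.platform }
func (f *fakeActuator) MachinesRunning(cluster) (bool, error)   { return f.running, nil }
func (f *fakeActuator) MachinesStopped(cluster) (bool, error)   { return !f.running, nil }
func (f *fakeActuator) StopMachines(cluster) error {
	f.running = false
	return nil
}
func (f *fakeActuator) StartMachines(cluster) error {
	f.running = true
	return nil
}

// reconcileOnce (hypothetical) performs one pass and returns a reason string
// taken from the power states listed in this document.
func reconcileOnce(c cluster, actuators []actuator) (string, error) {
	var act actuator
	for _, a := range actuators {
		if a.CanHandle(c) {
			act = a
			break
		}
	}
	if act == nil {
		return "Unsupported", nil
	}
	if c.DesiredPowerState == "Hibernating" {
		stopped, err := act.MachinesStopped(c)
		if err != nil {
			return "", err
		}
		if stopped {
			return "Hibernating", nil
		}
		if err := act.StopMachines(c); err != nil {
			return "FailedToStop", err
		}
		return "WaitingForMachinesToStop", nil
	}
	running, err := act.MachinesRunning(c)
	if err != nil {
		return "", err
	}
	if running {
		// Machines are up; later stages (nodes, operators) are out of scope here.
		return "WaitingForNodes", nil
	}
	if err := act.StartMachines(c); err != nil {
		return "FailedToStartMachines", err
	}
	return "WaitingForMachines", nil
}

func main() {
	act := &fakeActuator{platform: "aws", running: true}
	cd := cluster{Platform: "aws", DesiredPowerState: "Hibernating"}
	for i := 0; i < 2; i++ {
		reason, err := reconcileOnce(cd, []actuator{act})
		fmt.Println(reason, err)
	}
}
```

Selecting the actuator through CanHandle keeps provider-specific machine selection out of the controller, which matches the preferred option described above.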
The hibernation controller will set the Hibernating condition to False and set the reason to Unsupported if the cluster's cloud provider is not currently supported.
When CSRs must be approved for a cluster whose certificates expired while it was hibernating, we should perform checks similar to those of the cluster machine approver to ensure we are not introducing an additional security exposure.
The unreachable controller tracks the reachability of a cluster and is responsible for updating the Unreachable status condition accordingly.
Once a cluster hibernates and stops responding, the unreachable controller sets the Unreachable condition's Status to True.
Once the cluster resumes, the unreachable controller sets the condition to False and the hive controllers are free to resume syncing.