[non-official Guide] Recover from an erroneous machine config changing kernel options (in this case, cgroups) #2056
glowing-axolotl
started this conversation in
Show and tell
Use this guide at your own risk!
You might have come across this machineconfig:
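A manifest matching this description, reconstructed from the name and kernel arguments mentioned later in this guide, would look roughly like the sketch below. The apiVersion, the worker role label and the file name worker-kargs.yaml are assumptions, so compare it with whatever you actually applied.

```shell
# Write the manifest to a local file for inspection only; do NOT apply it.
# The name and the two kernelArguments come from this guide; everything else is assumed.
cat > worker-kargs.yaml <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-openshift-machineconfig-worker-kargs
spec:
  kernelArguments:
    - systemd.unified_cgroup_hierarchy=0
    - systemd.legacy_systemd_cgroup_controller=1
EOF
```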
In versions before 4.13, this config changed cgroups support from v2 to v1.
When this configuration is applied in an OKD 4.15 cluster, the node that receives it gets stuck:
No containers will start on the node, and journalctl will keep showing the following error:
If we SSH into the node and switch to root:
And show the kernel arguments:
We will find the following options:
Since OKD 4.13, these options have been added as kernel arguments (I am inferring this from https://access.redhat.com/solutions/7049418 ).
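To spot the two options quickly on a long command line, a small filter can help. This is a minimal sketch: cgroup_kargs is a hypothetical helper name, and it assumes the running kernel's arguments are readable from /proc/cmdline.

```shell
# Print only the cgroup-related systemd options found in a kernel command line.
# With no argument, read the running kernel's command line from /proc/cmdline.
cgroup_kargs() {
  cmdline="${1:-$(cat /proc/cmdline)}"
  printf '%s\n' "$cmdline" | tr -s ' ' '\n' \
    | grep -E '^systemd\.(unified_cgroup_hierarchy|legacy_systemd_cgroup_controller)='
}
```

Run cgroup_kargs as root on the node. It prints one option per line and returns a non-zero status when neither option is present.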
To recover the node and the machine config pool, you should delete the machineconfig:
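A minimal sketch of that step, assuming the machineconfig name used in this guide and a workstation where oc is logged in with cluster-admin rights. delete_bad_machineconfig is a hypothetical wrapper, not something from the cluster tooling.

```shell
# Delete the offending MachineConfig, then list the pools so you can watch
# the affected pool move back to its previous rendered config.
# Usage: delete_bad_machineconfig [machineconfig-name]
delete_bad_machineconfig() {
  mc_name="${1:-99-openshift-machineconfig-worker-kargs}"
  oc delete machineconfig "$mc_name" && oc get machineconfigpool
}
```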
The stuck node will not recover automatically, because CRI-O did not start and so the node's machine-config-daemon-**** pod could not start either:
You must therefore modify the kernel arguments by hand on the node, logged in over SSH as root:
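To check the daemon pod for one specific node, something like the following sketch can be used. It assumes the usual openshift-machine-config-operator namespace and the k8s-app=machine-config-daemon pod label; mcd_pod_for_node is a hypothetical helper name.

```shell
# List the machine-config-daemon pod scheduled on a given node.
# Usage: mcd_pod_for_node <node-name>
mcd_pod_for_node() {
  oc get pods -n openshift-machine-config-operator \
    -l k8s-app=machine-config-daemon \
    --field-selector "spec.nodeName=$1" -o wide
}
```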
You will get something like:
You must edit it to remove systemd.unified_cgroup_hierarchy=0 and systemd.legacy_systemd_cgroup_controller=1, like so:
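As a non-interactive alternative to editing by hand, and assuming the node's kernel arguments are managed with rpm-ostree (as on Fedora CoreOS), the removal can be sketched as below. strip_cgroup_v1_kargs is a hypothetical helper that only previews the resulting command line; remove_cgroup_v1_kargs is the part that actually changes the node.

```shell
# Preview: print a kernel command line with the two cgroup v1 options removed.
# Usage: strip_cgroup_v1_kargs "$(cat /proc/cmdline)"
strip_cgroup_v1_kargs() {
  printf '%s\n' "$1" | tr -s ' ' '\n' \
    | grep -v -x -F \
        -e 'systemd.unified_cgroup_hierarchy=0' \
        -e 'systemd.legacy_systemd_cgroup_controller=1' \
    | paste -sd ' ' -
}

# Apply: ask rpm-ostree to drop both arguments (run as root on the node).
# The change only takes effect after the next reboot.
remove_cgroup_v1_kargs() {
  rpm-ostree kargs \
    --delete=systemd.unified_cgroup_hierarchy=0 \
    --delete=systemd.legacy_systemd_cgroup_controller=1
}
```

The preview function touches nothing on disk, so it is safe to run first and compare against what you expect before calling remove_cgroup_v1_kargs.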
Reboot the node with systemctl:
On reboot, the logs of the config daemon will be visible and will show a failure:
Drift detection will scold us for modifying the node manually. At the same time, if we add the kernel parameters back, the Machine Config Daemon cannot roll back our erroneous change, since CRI-O does not start and no container runs for the daemon.
To fix this, we will manually roll back the desiredConfig for the node.
You can use the following one-liner to get the correct/incorrect machineconfig names:
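One way to get that overview is to read the machineconfiguration.openshift.io annotations that the Machine Config Operator keeps on each node. The sketch below is not necessarily the one-liner originally used here, and node_config_overview is a hypothetical name.

```shell
# Show each node's current and desired rendered MachineConfig plus the daemon state.
# All three values are read from machineconfiguration.openshift.io/* node annotations.
node_config_overview() {
  oc get nodes -o custom-columns='NAME:.metadata.name,CURRENT:.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig,DESIRED:.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig,STATE:.metadata.annotations.machineconfiguration\.openshift\.io/state'
}
```

On the stuck node, the CURRENT column should show the last good rendered config and DESIRED the one produced by the bad machineconfig.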
In our case, we want the node to believe it has already updated, while making it use the previous configuration.
Make sure the config the machine config pool is moving the nodes to is the correct one:
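A minimal sketch of that check, assuming the pool is called worker in your cluster. mcp_configs is a hypothetical helper that prints the rendered config the pool is targeting (spec) next to the one it last finished rolling out (status).

```shell
# Print the rendered config a pool is rolling out and the one it last completed.
# Usage: mcp_configs <pool-name>   e.g. mcp_configs worker
mcp_configs() {
  oc get machineconfigpool "$1" \
    -o jsonpath='{"target:  "}{.spec.configuration.name}{"\n"}{"current: "}{.status.configuration.name}{"\n"}'
}
```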
If it isn't, make sure you deleted the machineconfig you originally applied (in our case, 99-openshift-machineconfig-worker-kargs). Otherwise something else may be broken, since in recent versions deleting a bad machine config should automatically restore the previous rendered config, as per https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html-single/machine_configuration/index#checking-mco-status_machine-config-overview :
"If something goes wrong with a machine config that you apply, you can always back out that change. For example, if you had run oc create -f ./myconfig.yaml to apply a machine config, you could remove that machine config"
Once verified, patch it onto the node's annotations:
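The patch can be sketched as a JSON merge patch on the node's desiredConfig annotation. Both function names are hypothetical, and the rendered config name you pass in must be the one you verified on the pool in the previous step.

```shell
# Build a JSON merge patch that sets the node's desiredConfig annotation.
desired_config_patch() {
  printf '{"metadata":{"annotations":{"machineconfiguration.openshift.io/desiredConfig":"%s"}}}\n' "$1"
}

# Apply the patch to one node.
# Usage: set_desired_config <node-name> <rendered-config-name>
set_desired_config() {
  oc patch node "$1" --type merge -p "$(desired_config_patch "$2")"
}
```

Printing the output of desired_config_patch first is a cheap way to confirm the payload before it reaches the cluster.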
The error in the daemon will now change to:
We still need to change the currentConfig saved in /etc/machine-config-daemon/currentconfig .
We can update it like this. Make a backup of the original file first, and change MCP_NAME to match your configuration. Be extremely careful to use the correct MCP name, as the command below will still produce output even if the name is wrong:
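Because a wrong pool name fails silently, it may be worth guarding the lookup and taking the backup with small helpers like the ones sketched below. This does not perform the update itself. Both function names are hypothetical, and the check assumes rendered configs are named rendered-<pool>-<hash>.

```shell
# Resolve the rendered config a pool points to and refuse obviously wrong answers.
# Usage: rendered_config_for_pool <pool-name>
rendered_config_for_pool() {
  pool="$1"
  name="$(oc get machineconfigpool "$pool" -o jsonpath='{.spec.configuration.name}' 2>/dev/null)"
  case "$name" in
    "rendered-$pool-"?*) printf '%s\n' "$name" ;;
    *) echo "pool '$pool' did not yield a rendered-$pool-* config (got: '$name')" >&2
       return 1 ;;
  esac
}

# Keep a timestamped copy of the on-disk file before touching it (run as root on the node).
# Usage: backup_currentconfig [path]
backup_currentconfig() {
  file="${1:-/etc/machine-config-daemon/currentconfig}"
  cp -p "$file" "$file.bak.$(date +%Y%m%d%H%M%S)"
}
```

If rendered_config_for_pool returns a non-zero status, stop and recheck the pool name before writing anything to the node.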
After about a minute, the machine config daemon pod should tell you the on-disk state is now valid:
Do try a reboot of the node to verify that everything works correctly once again.
Drain the node:
As root on the node, force it to re-validate its configuration template:
Once the node returns ready and all the pods are running:
Verify the file was deleted, then launch another reboot (better safe than sorry):
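The drain, re-validation and verification steps above can be sketched as follows. This assumes the file in question is /run/machine-config-daemon-force, the path mentioned further down in this guide, and that creating it is what triggers the re-validation. All function names are hypothetical.

```shell
# Run from a workstation with cluster-admin access.
drain_node() {
  oc adm drain "$1" --ignore-daemonsets --delete-emptydir-data --force
}
uncordon_node() {
  oc adm uncordon "$1"
}

# Run as root on the node. The default path is the one named in this guide.
FORCE_FILE="${FORCE_FILE:-/run/machine-config-daemon-force}"

# Ask the daemon to re-validate by creating the force file.
request_revalidation() { touch "$FORCE_FILE"; }

# Succeeds once the force file no longer exists.
force_file_consumed() { [ ! -e "$FORCE_FILE" ]; }
```

A typical order would be drain_node, request_revalidation, reboot, wait for the node to become Ready, uncordon_node, then force_file_consumed before the final reboot.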
Your node should once again work.
You might also have to do some additional steps, since I initially experimented with the node's desiredConfig/currentConfig annotations and with forcing an update through /run/machine-config-daemon-force . That said, I tested the procedure above on nodes of another MCP and it restored them.
Hope this helps someone with the same problem!
This issue was similar to openshift/machine-config-operator#1443 and openshift/machine-config-operator#2705 .
Edit:
This exact issue was reported on https://access.redhat.com/solutions/7069660 and https://issues.redhat.com/browse/OCPBUGS-19352
The correct solution seems to be the one in the documentation at https://docs.okd.io/4.15/nodes/clusters/nodes-cluster-cgroups-2.html , namely to modify nodes.config/cluster by adding spec.cgroupMode: "v1" . There is also an OpenShift version here https://docs.openshift.com/container-platform/4.15/installing/install_config/enabling-cgroup-v1.html .
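The documented approach edits the resource interactively. A non-interactive sketch of the same change, assuming a merge patch on the cluster-scoped nodes.config.openshift.io object named cluster, could look like this (set_cgroup_mode is a hypothetical wrapper):

```shell
# Set spec.cgroupMode on the cluster-wide node configuration.
# Usage: set_cgroup_mode v1|v2
set_cgroup_mode() {
  case "$1" in
    v1|v2)
      oc patch nodes.config.openshift.io cluster --type merge \
        -p "{\"spec\":{\"cgroupMode\":\"$1\"}}" ;;
    *)
      echo "cgroup mode must be v1 or v2" >&2
      return 1 ;;
  esac
}
```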
Do make sure to test this before applying it in your cluster. So far I have not been able to downgrade to cgroups v1 this way, possibly because I am doing something wrong or because some other configuration is interfering.