Ramping load average on node noticed in 4.12 - 4.13 #1791
Replies: 1 comment
-
I have closed this as the issue has not resurfaced since the last set of upgrades was applied.
-
Hello,
I've been experiencing a ramping load average over a period of ~3 weeks on one of the nodes in my cluster. It was first noticed in 4.12; the cluster has since been updated to 4.13 and the problem persists.
After logging into the node, it looks like crio is the culprit. Rebooting the node (which runs the same pods when it comes back online) drops the load average, but over the following few weeks it ramps back up to an unsustainable level. Here is a graph for reference, as well as a link to the must-gather that was run just prior to a reboot:
https://drive.google.com/file/d/1I3mB3MN8rZtCEplgmT9dfxxvFRiqfxYZ/view?usp=sharing
I don't think there is anything wrong with the cluster as a whole, since the other nodes don't exhibit the same symptoms (though they are smaller and have a minimal rate of change).
I have a feeling it may be due to a garbage collection issue, with the load ramping up exponentially because of constant iteration over objects that no longer exist.
Ultimately, I'm wondering whether anybody else has experienced a similar issue and could shed some light on anything I should be on the lookout for, or statistics I should be monitoring to keep a closer eye on it.
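As a starting point for the kind of monitoring asked about above, here is a minimal Python sketch that parses the standard Linux `/proc/loadavg` format and estimates how fast the load is climbing. The names `LoadSample`, `parse_loadavg` and `slope_per_day` are hypothetical helpers made up for this illustration, and the sketch assumes samples are collected on the node itself at some regular interval; it is not taken from any OpenShift or CRI-O tooling.

```python
from typing import NamedTuple


class LoadSample(NamedTuple):
    one_min: float
    five_min: float
    fifteen_min: float
    runnable: int  # currently runnable scheduling entities
    total: int     # total scheduling entities (processes and threads)


def parse_loadavg(text: str) -> LoadSample:
    """Parse the contents of /proc/loadavg, e.g. '0.50 0.40 0.30 2/1234 5678'."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return LoadSample(
        float(fields[0]), float(fields[1]), float(fields[2]),
        int(runnable), int(total),
    )


def slope_per_day(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (unix_time, value) pairs, in units per day."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    # A zero denominator means all samples share one timestamp: no trend.
    return (num / den) * 86400 if den else 0.0
```

On the node, a sample could be taken with `parse_loadavg(open("/proc/loadavg").read())` and stored alongside `time.time()`. Tracking the `total` field next to the load average may help show whether the thread or process count grows in step with the load, which would be one way to test the garbage collection theory.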
Regards,
Andrew