
Consider adding a 2nd cluster-autoscaler with a small footprint for the operator node group only #2345

Open
RobertLucian opened this issue Jul 15, 2021 · 0 comments
Labels
performance A performance improvement

Comments

@RobertLucian
Member

Description

The cluster-autoscaler can consume a lot of memory when it has to track many nodes and/or add many nodes at once. It has been observed using up to 1.4GiB of memory, while our current limit is set to 1GiB.

If the autoscaler gets evicted for exceeding its memory limit, there is no way to scale up the operator node group for the autoscaler's own subsequent pending pod, because no autoscaler is left to do the job. The suggestion is to run a second cluster-autoscaler with a minimal resource footprint that is responsible only for scaling the operator node group. Because this autoscaler watches a single node group (capped at 25 nodes), and because we would set the node addition rate limit to a small value (i.e. 1-2 nodes/min), its resource utilization is guaranteed to stay small. This autoscaler would scale up the node group if the primary cluster-autoscaler gets evicted.
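A minimal sketch of what this fallback deployment could look like, assuming AWS as the cloud provider. The names (`cluster-autoscaler-operator`, `cx-operator-asg`), the image tag, and the resource values are all hypothetical; the flags shown (`--cloud-provider`, `--nodes`, `--max-nodes-total`) are standard cluster-autoscaler flags:

```yaml
# Sketch only — names and values are illustrative, not the actual Cortex config.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler-operator
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler-operator
  template:
    metadata:
      labels:
        app: cluster-autoscaler-operator
    spec:
      # schedule ahead of everything else on the operator node group
      priorityClassName: system-cluster-critical
      containers:
        - name: cluster-autoscaler
          image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0  # hypothetical tag
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            # watch only the operator node group, capped at 25 nodes
            - --nodes=1:25:cx-operator-asg
            - --max-nodes-total=25
          resources:
            requests:
              cpu: 50m
              memory: 100Mi
            limits:
              memory: 300Mi
```

Keeping the memory limit small is safe here precisely because the watched node group is bounded at 25 nodes.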

This cluster-autoscaler deployment should also have a higher priority than anything else on the operator node group, so that it is always scheduled and can in turn ensure every other Cortex pod receives a node:
https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
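The linked page suggests using the built-in `system-cluster-critical` class; alternatively, a dedicated `PriorityClass` could be defined. A sketch, with a hypothetical name and an illustrative value (custom classes may go up to 1000000000; only the built-in system classes go higher):

```yaml
# Sketch only — the class name and value are assumptions, not existing config.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: operator-autoscaler-critical
value: 1000000000
globalDefault: false
description: "Schedules the fallback cluster-autoscaler ahead of other pods on the operator node group."
```

The fallback autoscaler's pod spec would then reference this class via `priorityClassName`.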

@RobertLucian RobertLucian added bug Something isn't working provisioning Something related to cluster provisioning performance A performance improvement and removed bug Something isn't working provisioning Something related to cluster provisioning labels Jul 15, 2021
@miguelvr miguelvr removed the bug Something isn't working label Jul 16, 2021