Waiting for k8s nodes to reach count #94
By default CLUSTER_HEALTH_RETRY=1, so it fails quickly (this is documented in the README). You need to increase this value, e.g. export CLUSTER_HEALTH_RETRY=10. That way it checks cluster health up to 10 times instead of only once, giving the cluster enough time to pass the check.
It helps, but it doesn't necessarily solve the issue, as sometimes another ASG in the cluster is resized in the meantime and the node count never reaches the wanted value. I've modified the code so that it simply gives up after a few tries and makes the changes anyway. That fits our case, since the whole process is monitored by a human anyway.
@thorro could you please share the code you modified? Maybe I can apply it to my case as well.
Just a simple hack, hardcoded to 5 tries.

diff --git a/eksrollup/lib/k8s.py b/eksrollup/lib/k8s.py
index ad286cf..9cf73f7 100644
--- a/eksrollup/lib/k8s.py
+++ b/eksrollup/lib/k8s.py
@@ -226,6 +226,9 @@ def k8s_nodes_count(desired_node_count, max_retry=app_config['GLOBAL_MAX_RETRY']
     """
     logger.info('Checking k8s expected nodes are online after asg scaled up...')
     retry_count = 1
+    retry_count2 = 0
+    retry_count2_max = 5
+
     nodes_online = False
     while retry_count < max_retry:
         nodes_online = True
@@ -233,10 +236,16 @@ def k8s_nodes_count(desired_node_count, max_retry=app_config['GLOBAL_MAX_RETRY']
         nodes = get_k8s_nodes()
         logger.info('Current k8s node count is {}'.format(len(nodes)))
         if len(nodes) != desired_node_count:
+            retry_count2 += 1
+            if retry_count2 >= retry_count2_max:
+                logger.info('Not waiting for k8s nodes to reach count {} anymore, continuing anyway '.format(desired_node_count))
+                break
+
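For readers of the thread, here is a self-contained sketch of the same approach (give up after a fixed number of node-count mismatches and carry on). It is not the project's actual function: the name wait_for_node_count, the default values, and the get_node_count callable (e.g. lambda: len(get_k8s_nodes())) are assumptions for illustration.

```python
import logging
import time

logger = logging.getLogger(__name__)


def wait_for_node_count(get_node_count, desired_node_count,
                        max_retry=12, wait=20, give_up_after=5):
    """Wait for the cluster to report desired_node_count nodes.

    Instead of failing hard, give up after `give_up_after` mismatches and
    let the rollout continue, which suits busy clusters where other ASGs
    are resized while the update is running.
    """
    mismatches = 0
    for attempt in range(1, max_retry + 1):
        current = get_node_count()
        logger.info('Attempt %d: current k8s node count is %d', attempt, current)
        if current == desired_node_count:
            return True
        mismatches += 1
        if mismatches >= give_up_after:
            logger.info('Not waiting for node count %d any more, continuing anyway',
                        desired_node_count)
            return False
        time.sleep(wait)
    return False
```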
@thorro - are you using any other flags or features? Currently, I'm using

EDIT: Looks like there is an issue with
Not using it anymore as we switched to EKS managed node groups. |
Noticed the problem with this is caused by: |
We have a big and busy EKS cluster, with nodes joining and leaving many times a day (spot instances failing or being replaced). We try to update each ASG separately with the ASG_NAMES setting. The problem is that eks-rolling-update always checks the node count of the whole cluster, and it frequently fails because the count does not match the expected value.
It should only monitor the selected ASG(s) for the expected instance count.
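A rough sketch of the per-ASG counting requested here, assuming boto3 and the official kubernetes Python client are available and that nodes can be matched to ASG members via the instance ID embedded in each node's spec.providerID. The function count_nodes_in_asgs and the default region are made up for illustration, not part of eks-rolling-update.

```python
import boto3
from kubernetes import client, config


def count_nodes_in_asgs(asg_names, region='eu-west-1'):
    """Count only the k8s nodes whose EC2 instance belongs to the given ASGs."""
    autoscaling = boto3.client('autoscaling', region_name=region)
    resp = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=asg_names)
    asg_instance_ids = {
        instance['InstanceId']
        for group in resp['AutoScalingGroups']
        for instance in group['Instances']
    }

    config.load_kube_config()
    nodes = client.CoreV1Api().list_node().items
    # A node's providerID looks like aws:///eu-west-1a/i-0123456789abcdef0
    matched = [
        node for node in nodes
        if node.spec.provider_id
        and node.spec.provider_id.rsplit('/', 1)[-1] in asg_instance_ids
    ]
    return len(matched)
```

The health check would then compare this count against the desired capacity of the selected ASGs only, instead of the node count of the whole cluster.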