Nginx controller with HPA reports 502 error whenever HPA triggers a scale-down. #12157
This issue is currently awaiting triage. If Ingress contributors determine this is a relevant issue, they will accept it by applying the appropriate triage label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hello, is Ingress NGINX fronted by a load balancer? 502 is a Bad Gateway, so I assume your load balancer still has an Ingress NGINX Controller pod in its target list while that pod is already shutting down. We recently had a discussion about this here: #11890. Even though that case is AWS, it might still apply to your setup. Normally it helps to increase the shutdown delay of Ingress NGINX by the amount of time your cloud load balancer takes to completely deregister a target (see the sketch below). Regards
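With the Helm chart, one way to add such a delay is via the controller's `--shutdown-grace-period` flag. A minimal sketch, assuming the standard chart keys; the 90s value is illustrative, not a measured deregistration time:

```yaml
# values.yaml snippet for the ingress-nginx Helm chart (sketch)
controller:
  extraArgs:
    # Keep serving after SIGTERM so the load balancer can deregister the pod first.
    shutdown-grace-period: "90"   # seconds, illustrative
  # Must be comfortably larger than the shutdown grace period plus nginx's own drain time.
  terminationGracePeriodSeconds: 120
```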
/remove-kind bug
I'm assuming you're fronting your Nginx pods with an Azure Load Balancer. As already mentioned, you need to ensure that AKS is able to detect that your pod is about to shut down before it actually does. I'm currently working with something similar in AKS, and we increased the shutdown timer in Nginx to 90s in addition to adjusting the various timeouts in the Azure Load Balancer. If increasing the shutdown time does not work, let us know.
@toredash How will we ensure that the Azure Load Balancer knows that the nginx pod is going to be terminated? Does nginx have any HTTP endpoint that will tell the ALB that the pod is getting killed? IMO, marking the health endpoint as unhealthy before the actual shutdown procedure is initiated, and then proceeding with the normal shutdown, might help deregister the targets well before the pod is killed. (Sidenote: haven't looked deeper at how tied the health endpoint is to the shutdown process.)
Once the Pod is instructed to terminate, its IP will be removed from the Endpoints list. Helm chart values that I've set specifically to tackle this problem:
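Something along these lines (a sketch with illustrative values, assuming the chart's documented keys; not necessarily the exact values used here):

```yaml
controller:
  # Allow enough time for connection draining after SIGTERM.
  terminationGracePeriodSeconds: 120
  extraArgs:
    shutdown-grace-period: "90"
  # Treat newly started pods as available only after a delay, smoothing scale events.
  minReadySeconds: 30
  # The chart's default preStop hook, shown explicitly: it waits for nginx to drain.
  lifecycle:
    preStop:
      exec:
        command: ["/wait-shutdown"]
```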
With this, I am able to restart my nginx deployment without causing any downtime |
We are using the Nginx Ingress chart version v4.11.2 along with an HPA configuration in our production environment. We have observed that whenever the HPA triggers a scale-down operation, many of our applications report 502 errors. Even though the number of failures is small, we would like to know whether this behaviour is expected with HPA.
Approx number of failures is 600 requests/minute
Total number of request is 180K requests/minute
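For context, an HPA enabled through the chart typically looks like the sketch below (replica counts and target are illustrative, not the actual production values):

```yaml
controller:
  autoscaling:
    enabled: true
    minReplicas: 3                      # illustrative
    maxReplicas: 10                     # illustrative
    targetCPUUtilizationPercentage: 70  # illustrative
```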
What happened:
Observing failures in application logs with 502 errors.
What you expected to happen:
There should not be any errors/failures during an nginx pod scale-down, as we believe nginx performs a graceful shutdown of pods whenever a scale-down operation kicks in.
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):
NGINX Ingress controller
Release: v1.11.2
Build: 46e76e5
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.25.5
Kubernetes version (use kubectl version):
$ kubectl version
Client Version: v1.28.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4
Environment:
Azure AKS:
Linux (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
How was the ingress-nginx-controller installed: Using helmrelease with flux
$ helm ls -A | grep -i ingress-nginx
private-ingress ingress-nginx 11 2024-08-19 10:00:05.820248075 +0000 UTC deployed ingress-nginx-4.11.2 1.11.2
Current State of the controller:
Everything looks fine from the ingress pod perspective.
How to reproduce this issue:
We don't see this issue in UAT, maybe because of the lower number of requests. I think it's easy to reproduce this issue with a higher number of requests and a standard HPA config.