Nginx controller with HPA reports 502 error whenever HPA triggers a scale-down. #12157
This issue is currently awaiting triage. If Ingress contributors determine this is a relevant issue, they will accept it by applying the appropriate triage label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hello, is Ingress NGINX fronted by a load balancer? 502 is a Bad Gateway, so I assume your load balancer still has an Ingress NGINX Controller pod in its target list while that pod is already shutting down. We recently had a discussion about this here: #11890. Even though that case is AWS, it might still apply to your setup. Normally it helps to increase the shutdown delay of Ingress NGINX by the amount of time your cloud load balancer takes to completely deregister a target (see the sketch below). Regards
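With the Helm chart, one way to add such a delay is via the controller's `--shutdown-grace-period` flag. A minimal sketch, assuming the standard chart keys; the 90s value is illustrative, not a measured deregistration time:

```yaml
# values.yaml snippet for the ingress-nginx Helm chart (sketch)
controller:
  extraArgs:
    # Keep serving after SIGTERM so the load balancer can deregister the pod first.
    shutdown-grace-period: "90"   # seconds, illustrative
  # Must be comfortably larger than the shutdown grace period plus nginx's own drain time.
  terminationGracePeriodSeconds: 120
```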
/remove-kind bug
I'm assuming you're fronting your Nginx pods with an Azure Load Balancer. As already mentioned, you need to ensure that AKS is able to detect that your pod is about to shut down before it actually does. I'm currently working with something similar in AKS, and we increased the shutdown timer in Nginx to 90s in addition to adjusting the various timeouts in the Azure Load Balancer. If increasing the shutdown time does not work, let us know.
@toredash How will we ensure that the Azure Load Balancer knows that the nginx pod is going to be terminated? Does nginx have any HTTP endpoint that will tell the ALB that the pod is getting killed? IMO, marking the health endpoint as unhealthy before the actual shutdown procedure is initiated, and then proceeding with the normal shutdown, might help deregister the targets well before the pod is killed. (Sidenote: haven't looked deeper at how tied the health endpoint is to the shutdown process.)
Once the Pod is instructed to terminate, its IP will be removed from the Endpoints list. Helm chart values that I've set specifically to tackle this problem:
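Something along these lines (a sketch with illustrative values, assuming the chart's documented keys; not necessarily the exact values used here):

```yaml
controller:
  # Allow enough time for connection draining after SIGTERM.
  terminationGracePeriodSeconds: 120
  extraArgs:
    shutdown-grace-period: "90"
  # Treat newly started pods as available only after a delay, smoothing scale events.
  minReadySeconds: 30
  # The chart's default preStop hook, shown explicitly: it waits for nginx to drain.
  lifecycle:
    preStop:
      exec:
        command: ["/wait-shutdown"]
```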
With this, I am able to restart my nginx deployment without causing any downtime |
We are using the Nginx Ingress chart version v4.11.2 along with an HPA configuration in our production environment. We have observed that whenever the HPA triggers a scale-down operation, many of our applications report 502 errors. Even though the number of failures is small, we would like to know whether this behaviour is expected with HPA.
Approx number of failures is 600 requests/minute
Total number of request is 180K requests/minute
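For context, an HPA enabled through the chart typically looks like the sketch below (replica counts and target are illustrative, not the actual production values):

```yaml
controller:
  autoscaling:
    enabled: true
    minReplicas: 3                      # illustrative
    maxReplicas: 10                     # illustrative
    targetCPUUtilizationPercentage: 70  # illustrative
```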
What happened:
Observing failures in application logs with 502 errors.
What you expected to happen:
There should not be any errors/failures during an nginx pod scale-down, as we believe nginx performs a graceful shutdown of pods whenever a scale-down operation kicks in.
NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):
NGINX Ingress controller
Release: v1.11.2
Build: 46e76e5
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.25.5
Kubernetes version (use kubectl version):
$ kubectl version
Client Version: v1.28.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4
Environment:
Azure AKS:
Linux (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
How was the ingress-nginx-controller installed: Using helmrelease with flux
$ helm ls -A | grep -i ingress-nginx
private-ingress ingress-nginx 11 2024-08-19 10:00:05.820248075 +0000 UTC deployed ingress-nginx-4.11.2 1.11.2
Current State of the controller:
Everything looks fine from the ingress pod perspective.
How to reproduce this issue:
We don't see this issue in UAT, maybe because of the lower number of requests. I think it's easy to reproduce this issue with a higher number of requests and a standard HPA config.