Description
Summary
What change needs making?
Verify ALB listener rule weight propagation before scaling down stable ReplicaSet with dynamicStableScale
Use Cases
When would you use this?
When using Argo Rollouts with AWS ALB traffic routing and dynamicStableScale: true, we experience 503 errors during canary deployments. The errors occur on the OLD (stable) target group when traffic weight changes (e.g., 99% → 100%).
With dynamicStableScale: false and scaleDownDelaySeconds set to 120, the errors do not reproduce, but this could be a problem for large-scale workloads because it doubles the capacity.
Environment
Argo Rollouts with Canary strategy
AWS ALB (Application Load Balancer) with IP target mode
Ping-Pong feature enabled for zero-downtime updates
Target Group IP Verification enabled (--aws-verify-target-group)
AWS LB Controller Pod Readiness Gates enabled
Observed Behavior
Rollout reaches final step (99% → 100%)
Argo Rollouts updates the ALB Ingress annotation to route 100% traffic to canary target group, 0% to stable target group
AWS LB Controller submits ModifyRule API call to AWS ALB
Argo Rollouts begins scaling down stable ReplicaSet (due to dynamicStableScale: true)
ALB has NOT yet propagated the 0% weight internally
During propagation window, some requests still arrive at stable target group
Stable target group has no healthy targets (pods terminated) → 503 errors
Proposed Solution
Add ALB Listener Rule Weight Verification before scaling down stable ReplicaSet when using dynamicStableScale: true.
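To make the proposal concrete, here is a rough sketch of what such a check could look like. It is not how Argo Rollouts implements anything today. It assumes the rule is read in the shape returned by the ELBv2 DescribeRules API (Actions → ForwardConfig → TargetGroups → Weight). The function names, the polling interval, and the timeout are all hypothetical. Note that DescribeRules reports the configured weight, so this check alone may not prove that every ALB node has already applied the change.

```python
import time
from typing import Callable, Dict


def forward_weights(rule: dict) -> Dict[str, int]:
    """Collect target-group weights from the forward actions of one rule.

    `rule` is assumed to look like one element of the "Rules" list in an
    ELBv2 DescribeRules response.
    """
    weights: Dict[str, int] = {}
    for action in rule.get("Actions", []):
        if action.get("Type") != "forward":
            continue
        for tg in action.get("ForwardConfig", {}).get("TargetGroups", []):
            # Assumption: a missing Weight is treated as non-zero, so the
            # check errs on the side of NOT scaling down.
            weights[tg["TargetGroupArn"]] = tg.get("Weight", 1)
    return weights


def stable_is_drained(rule: dict, stable_tg_arn: str) -> bool:
    """True when the stable target group has weight 0 or is absent."""
    return forward_weights(rule).get(stable_tg_arn, 0) == 0


def wait_until_drained(
    fetch_rule: Callable[[], dict],   # hypothetical: returns the current rule
    stable_tg_arn: str,
    timeout_s: float = 60.0,          # hypothetical default
    interval_s: float = 2.0,          # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the stable target group shows weight 0, or time out.

    Returns True if the scale-down may proceed, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if stable_is_drained(fetch_rule(), stable_tg_arn):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In this sketch the controller would call wait_until_drained before reducing the stable ReplicaSet, and would keep the stable pods running if it returns False. The `fetch_rule` callable is injected so the check can be exercised without AWS access.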
Message from the maintainers:
Need this enhancement? Give it a 👍. We prioritize the issues with the most 👍.