-
Notifications
You must be signed in to change notification settings - Fork 888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Guardrail to avoid downtime #3878
base: master
Are you sure you want to change the base?
Conversation
Published E2E Test Results 4 files 4 suites 3h 7m 38s ⏱️ Results for commit 39be2b5. ♻️ This comment has been updated with latest results. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #3878 +/- ##
==========================================
+ Coverage 82.69% 82.73% +0.03%
==========================================
Files 163 163
Lines 22895 22911 +16
==========================================
+ Hits 18934 18956 +22
+ Misses 3087 3083 -4
+ Partials 874 872 -2 ☔ View full report in Codecov by Sentry. |
Published Unit Test Results2 294 tests 2 294 ✅ 3m 0s ⏱️ Results for commit 39be2b5. ♻️ This comment has been updated with latest results. |
b337b49
to
34878b6
Compare
7d87a76
to
fda1eba
Compare
Signed-off-by: Abhishek Bansal <[email protected]>
39898f7
to
3ef7a60
Compare
Quality Gate passedIssues Measures |
Very similar logic is already defined via argo-rollouts/rollout/service.go Line 292 in 5f59344
|
@zachaller but this check is not preventing downtime if we trigger rollout in between the deployment. |
Just wanna double down that this is super critical for us to ensure that if a rollout occurred in the middle of an already ongoing one, we must ensure that the traffic won't get shifted to the stable replica set before it being fully scaled back to the desired amount specially with dynamicStableScale (talking about late stages in the canary process where stable might be at 25% and suddenly gets all 100% traffic if you triggered a new one). Hope it makes 1.8 to unblock us going into prod. |
@zachaller Is this planned for 1.8? |
This would be a bug fix so we could put it into 1.8 but I need time to go through the PR and give it a good review. |
Quality Gate passedIssues Measures |
This PR introduces a guardrail in the rollouts controller to help prevent unexpected downtime during rollouts. The guardrail adds an additional check to ensure that the number of replicas in the stable ReplicaSet is sufficient before any traffic switch occurs.
For example, if the desired weight for canary replicas is set to 60%, and there are 10 replicas in total, the code will verify that, before diverting 60% of the traffic to the canary replicas, at least 40% of the replicas (i.e., 4 in this case) are available in the stable ReplicaSet.
This check is crucial to prevent potential downtime. A common scenario where downtime can occur is when a rollout is already in progress, and a new deployment is triggered. In such cases, if the stable replicas are insufficient, it could lead to service disruption.
Resolves #3372
Checklist:
"fix(controller): Updates such and such. Fixes #1234"
.