fix(controller): fix waiting for stable rs to be fully scaled before canary scale down #3899
base: master
Published E2E Test Results: 4 files, 4 suites, 3h 8m 13s. For more details on these failures, see this check. Results for commit 01eaf42.
Published Unit Test Results: 2 293 tests, 2 293 passed, 2m 59s. Results for commit 01eaf42.
Codecov Report: All modified and coverable lines are covered by tests.
Coverage diff:
            master    #3899     +/-
Coverage    82.69%    82.70%
Files       163       163
Lines       22895     22902     +7
Hits        18934     18940     +6
Misses      3087      3088      +1
Partials    874       874
Force-pushed from 1b5655c to 958336c.
Quality Gate passed.
Can we get a test for this? It would probably also help me understand the exact issue a bit better. It sounds like the controller will just take a really long time to complete the rollout due to scaling events etc.?
Force-pushed from ec44f8d to 42cec92.
Force-pushed from 42cec92 to 3fe25d3.
Yep, exactly. As a principle, if we had nothing to do with scaling down the stable RS, why are we waiting for its full availability? To me, that self-evidently shouldn't be the case. [In my case, it would force the cluster autoscaler to consolidate on the current state (stable and canary pods), only for us to terminate the canary pods once it finishes, which changes that state and starts a new round of consolidation. A wasted effort.]
When we do have something to do with it, however, I don't think there is a way around it. I thought of waiting for non-violation of the stable RS's PDB (if one exists) instead of full availability, but I don't think that would be correct or expected behavior, especially since the scaling up/down of the stable/canary RS in that case is incremental. So I left that part out of the PR. If you have any thoughts on it, I'd love to hear them.
I'm not sure what kind of test to write for this bit, though.
@y-rabie I want to give this PR a good review; we will still get this into a 1.8 release.
…canary scale down Signed-off-by: Youssef Rabie <[email protected]>
Force-pushed from 3fe25d3 to 01eaf42.
As described in the code comment, when used on a large scale with a cluster autoscaler that can disrupt nodes and evict pods, the canary RS stays scaled up for a while until the stable RS is fully scaled. This makes sense if the controller scaled down the stable RS during the rollout (using `dynamicStableScale`), but it doesn't make sense if it didn't.
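
To make the intended behavior concrete, here is a minimal sketch of the decision described above. It is not the controller's actual code: `replicaSetStatus` and `canScaleDownCanary` are hypothetical names made up for illustration, and the only assumption taken from this PR is that the wait on the stable RS should apply only when `dynamicStableScale` is in use.

```go
package main

import "fmt"

// replicaSetStatus is a hypothetical, simplified stand-in for the ReplicaSet
// fields that matter here. It is not the controller's real type.
type replicaSetStatus struct {
	DesiredReplicas   int32
	AvailableReplicas int32
}

// fullyAvailable reports whether every desired replica is available.
func (rs replicaSetStatus) fullyAvailable() bool {
	return rs.AvailableReplicas >= rs.DesiredReplicas
}

// canScaleDownCanary is a hypothetical helper illustrating the rule:
//   - without dynamicStableScale the controller never scaled the stable RS
//     down, so there is nothing to wait for;
//   - with dynamicStableScale the stable RS was scaled down during the
//     rollout, so the canary stays up until the stable RS is fully available.
func canScaleDownCanary(dynamicStableScale bool, stable replicaSetStatus) bool {
	if !dynamicStableScale {
		return true
	}
	return stable.fullyAvailable()
}

func main() {
	// A stable RS that is temporarily short of pods, e.g. after node disruption.
	stable := replicaSetStatus{DesiredReplicas: 10, AvailableReplicas: 7}

	fmt.Println(canScaleDownCanary(false, stable)) // true: no wait needed
	fmt.Println(canScaleDownCanary(true, stable))  // false: keep waiting
}
```

With `dynamicStableScale` off, pod evictions on the stable RS no longer hold the canary RS up, which is the case this PR targets.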