"Zombie" jobs after canceling Azure Pipelines job during pod scheduling #4723
Unanswered
skycaptain asked this question in Q&A / Need Help
Replies: 1 comment
-
We have the same problem. However, the "lingering pod" can still pick up new jobs, so it's not really lingering.
-
I am attempting to autoscale Azure Pipelines agents on an AKS cluster using ScaledJobs, as described in this article, but I am experiencing some unexpected behaviour. Consider the following scenario: a pipeline with 8 matrix jobs is started. KEDA then creates 8 Kubernetes Jobs, but due to resource limitations (such as exceeding resource quotas or the need to scale out another node, which takes 2-3 minutes), only 2 pods can actually be launched. As a result, only 2 pipeline jobs are running, and the other 6 are waiting for their pods to be launched. So far so good; the remaining jobs will be completed once the pods are finally launched.
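For reference, here is a minimal sketch of the kind of ScaledJob setup described above. It is not our exact configuration: the resource name, image, secret name, pool ID, resource requests, and maxReplicaCount are placeholder assumptions. Only the azure-pipelines trigger and minReplicaCount: 1 reflect what is described in this post.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: azp-agent-scaledjob            # hypothetical name
spec:
  minReplicaCount: 1                   # always keep one agent job waiting
  maxReplicaCount: 10                  # assumed upper bound
  pollingInterval: 30                  # seconds between queue checks (assumed)
  jobTargetRef:
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: azp-agent
            image: example.azurecr.io/azp-agent:latest   # hypothetical image
            envFrom:
              - secretRef:
                  name: azp-agent-secrets   # hypothetical secret holding AZP_URL and AZP_TOKEN
            resources:
              requests:                # requests keep capacity reserved while the pod is Pending
                cpu: "2"
                memory: 4Gi
  triggers:
    - type: azure-pipelines
      metadata:
        poolID: "1"                    # placeholder agent pool ID
        organizationURLFromEnv: "AZP_URL"
        personalAccessTokenFromEnv: "AZP_TOKEN"
```

With a setup like this, each queued pipeline job results in one Kubernetes Job, and each Job's pod carries the resource requests that block the cluster from scaling down while it is unscheduled.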
The issue arises when a developer cancels the pipeline before the remaining 6 pods are launched. Usually, if an agent is running a job and receives a cancel request, the agent will gracefully exit and thus complete the job. However, in this case, the pod and its agent have not even started yet. This results in "zombie" jobs that linger and block resources (as there are resource requests defined), which also prevents the cluster from scaling down again.
Since we have defined minReplicaCount: 1, ideally, there should be only one job waiting. However, due to the situation described earlier, we actually have 7 jobs running: the 6 that were delayed and 1 created by KEDA to satisfy the minReplicaCount.
Is it possible to cancel scaled jobs when the pipeline is canceled during pod creation?
What I also find counterintuitive is that the jobs have the status "Running" even when the pods are not ready or even scheduled yet. Should there be a "Pending" state until the pod is ready, or is the current behavior to be expected?