[jobs] make status updates robust when controller dies #4602

cg505 · 2025-01-21T22:07:11Z

Jobs can no longer get stuck in CANCELLING: it is no longer treated as a "terminal" status.
We use schedule_state rather than job status to determine whether a controller has exited cleanly. This allows us to reliably see if the controller crashed and simplifies some of the checking logic.
Even if jobs are in a terminal status (including SUCCEEDED), we can still set them to FAILED_CONTROLLER if the controller died abnormally, e.g. during cleanup.

In combination with #4552 and #4562, the internal state machine for job status and schedule state should be much more robust and likely to eventually get to a consistent state, even under high load.
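As a rough illustration of the idea above (not the actual SkyPilot code), the check can be sketched like this. The enum members, the two-value `ScheduleState`, and the `reconcile` helper are all hypothetical simplifications invented for this example; the real definitions live in sky/jobs/state.py and sky/jobs/utils.py.

```python
import enum


class JobStatus(enum.Enum):
    """Simplified, hypothetical stand-in for the managed job status."""
    RUNNING = 'RUNNING'
    CANCELLING = 'CANCELLING'
    CANCELLED = 'CANCELLED'
    SUCCEEDED = 'SUCCEEDED'
    FAILED_CONTROLLER = 'FAILED_CONTROLLER'

    def is_terminal(self) -> bool:
        # CANCELLING is deliberately absent: cleanup is still in progress,
        # so a controller crash at this point must still be detectable.
        return self in (JobStatus.CANCELLED, JobStatus.SUCCEEDED,
                        JobStatus.FAILED_CONTROLLER)


class ScheduleState(enum.Enum):
    """Hypothetical two-value simplification of the schedule state."""
    ALIVE = 'ALIVE'  # The controller process is expected to be running.
    DONE = 'DONE'    # The controller reported a clean exit.


def reconcile(status: JobStatus, schedule_state: ScheduleState,
              controller_alive: bool) -> JobStatus:
    """Return the status a job should have after a health check."""
    if schedule_state is ScheduleState.DONE:
        # Clean exit: trust whatever status the controller recorded.
        return status
    if controller_alive:
        # The controller is still working; nothing to fix.
        return status
    # The controller died without marking itself DONE. Even a terminal
    # status such as SUCCEEDED is overridden, because cleanup may not
    # have finished.
    return JobStatus.FAILED_CONTROLLER


result = reconcile(JobStatus.SUCCEEDED, ScheduleState.ALIVE,
                   controller_alive=False)
print(result.value)
```

The point of keying on the schedule state is that the decision does not depend on which job statuses count as terminal, so a crash during cleanup is caught in the same way as a crash mid-run.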

Tested (run the relevant ones):

Code formatting: bash format.sh
Manual load test on AWS with r6i.24xlarge controller and ~1400 jobs cancelled.
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

This discrepancy caused issues, such as jobs getting stuck as CANCELLING when the job controller process crashes during cleanup.

cg505 · 2025-01-21T22:07:19Z

/quicktest-core

cg505 · 2025-01-21T22:14:36Z

/quicktest-core

Michaelvll

Thanks @cg505! This PR looks mostly good to me.

Michaelvll

Thanks @cg505! It looks good to me.

Michaelvll · 2025-01-23T07:26:47Z

sky/jobs/utils.py

-        'Waiting for controller process to be RUNNING') + '{status_str}'
-    status_display = rich_utils.safe_status(status_msg.format(status_str=''))
+
+    def should_keep_logging(status: managed_job_state.ManagedJobStatus) -> bool:


minor: Should we allow this function to take Optional[managed_job_state.ManagedJobStatus], so that we don't have to assert before every invocation of this function?

cg505 · 2025-01-23

I don't really like any of the alternatives:

Function accepts Optional[managed_job_state.ManagedJobStatus] but throws if it gets None - feels like the type is just wrong in that case.

Function accepts Optional[managed_job_state.ManagedJobStatus] and returns True (or maybe False?) if it gets None - feels inaccurate.

Wrap get_latest_task_id_status to assert non-None status - feels overengineered.

Change all the status-fetching functions to throw if the job does not exist - maybe correct, but out of scope for this PR.

So I guess I'd prefer to just keep this - it's a bit verbose but it's pretty simple and understandable.
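For readers following the thread, here is a minimal sketch of the pattern being kept: the helper takes a non-optional status, and each call site asserts before calling. Only the `should_keep_logging` signature comes from the diff above; its body, the `ManagedJobStatus` members, the `get_status` helper, and the `_FAKE_DB` table are placeholders made up for this example.

```python
import enum
from typing import Dict, Optional


class ManagedJobStatus(enum.Enum):
    """Hypothetical, trimmed-down status enum for illustration."""
    RUNNING = 'RUNNING'
    CANCELLING = 'CANCELLING'
    SUCCEEDED = 'SUCCEEDED'


# Hypothetical in-memory stand-in for the jobs database.
_FAKE_DB: Dict[int, ManagedJobStatus] = {1: ManagedJobStatus.RUNNING}


def get_status(job_id: int) -> Optional[ManagedJobStatus]:
    """Placeholder status fetcher: returns None if the job does not exist."""
    return _FAKE_DB.get(job_id)


def should_keep_logging(status: ManagedJobStatus) -> bool:
    # Placeholder body (an assumption): keep streaming while the job has
    # not reached a final status.
    return status in (ManagedJobStatus.RUNNING, ManagedJobStatus.CANCELLING)


# Call-site pattern kept in the PR: narrow the Optional with an assert,
# then call a function whose parameter type is honest about its input.
status = get_status(1)
assert status is not None, 'job 1 does not exist'
keep = should_keep_logging(status)
print(keep)
```

The assert is slightly verbose, but it keeps the "job does not exist" case visible at the call site instead of hiding it behind an arbitrary True or False inside the helper.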

cg505 · 2025-01-23T21:15:12Z

/quicktest-core

cg505 · 2025-01-23T21:16:13Z

/smoke-test managed_jobs

cg505 · 2025-01-24T17:42:05Z

smoke test failure is unrelated, should be fixed in #4548

cg505 added 2 commits January 18, 2025 17:10

[jobs] CANCELLING is not terminal

28d1ec8

This discrepancy caused issues, such as jobs getting stuck as CANCELLING when the job controller process crashes during cleanup.

revamp nonterminal status checking

e9af1ca

cg505 requested a review from Michaelvll January 21, 2025 22:07

lint

dfb405d

Michaelvll reviewed Jan 22, 2025

cg505 added 3 commits January 22, 2025 16:21

fix stream_logs_by_id

6f56941

remove set_failed_controller

06864e2

address PR comments

838d6a7

cg505 requested a review from Michaelvll January 23, 2025 02:16

Michaelvll approved these changes Jan 23, 2025

Michaelvll added this to the v0.8.0 milestone Jan 23, 2025

cg505 and others added 2 commits January 23, 2025 13:08

Apply suggestions from code review

9a8e4c8

Co-authored-by: Zhanghao Wu <[email protected]>

address PR review

5b8a0a4

cg505 merged commit 485b1cd into skypilot-org:master Jan 24, 2025
19 of 20 checks passed
