[jobs] make status updates robust when controller dies #4602

Merged
cg505 merged 8 commits into skypilot-org:master on Jan 24, 2025

Conversation

cg505
Collaborator

@cg505 cg505 commented Jan 21, 2025

  1. Jobs cannot get stuck in CANCELLING - it's no longer "terminal".
  2. We use schedule_state rather than job status to determine whether a controller has exited cleanly. This allows us to reliably see if the controller crashed and simplifies some of the checking logic.
  3. Even if jobs are in a terminal status (including SUCCEEDED), we can still set them to FAILED_CONTROLLER if the controller died abnormally, e.g. during cleanup.
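
The three points above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `JobStatus`, `ScheduleState`, `reconcile_status`, and the `controller_alive` flag are hypothetical, trimmed-down stand-ins, and it assumes that a schedule state of `DONE` is what marks a clean controller exit.

```python
import enum


class JobStatus(enum.Enum):
    """Hypothetical, trimmed-down job statuses (not the real enum)."""
    RUNNING = 'RUNNING'
    CANCELLING = 'CANCELLING'
    CANCELLED = 'CANCELLED'
    SUCCEEDED = 'SUCCEEDED'
    FAILED_CONTROLLER = 'FAILED_CONTROLLER'

    def is_terminal(self) -> bool:
        # Point 1: CANCELLING is intentionally absent, so a job cannot
        # come to rest in that status.
        return self in (JobStatus.CANCELLED, JobStatus.SUCCEEDED,
                        JobStatus.FAILED_CONTROLLER)


class ScheduleState(enum.Enum):
    """Hypothetical schedule states; DONE is assumed to mean a clean exit."""
    ALIVE = 'ALIVE'
    DONE = 'DONE'


def reconcile_status(status: JobStatus, schedule_state: ScheduleState,
                     controller_alive: bool) -> JobStatus:
    """Return the status a job should have after a liveness check."""
    # Point 2: the schedule state, not the job status, decides whether
    # the controller exited cleanly.
    if controller_alive or schedule_state is ScheduleState.DONE:
        return status
    # Point 3: the controller died without reaching DONE, so even a
    # terminal status such as SUCCEEDED is overridden.
    return JobStatus.FAILED_CONTROLLER
```

Under these assumptions, a job recorded as SUCCEEDED whose controller died before the schedule state reached `DONE` is reported as FAILED_CONTROLLER, while the same job with a `DONE` schedule state keeps its status.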

In combination with #4552 and #4562, the internal state machine for job status and schedule state should be much more robust and likely to eventually get to a consistent state, even under high load.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual load test on AWS with r6i.24xlarge controller and ~1400 jobs cancelled.
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

cg505 added 2 commits January 18, 2025 17:10
This discrepancy caused issues, such as jobs getting stuck as
CANCELLING when the job controller process crashes during cleanup.
@cg505 cg505 requested a review from Michaelvll January 21, 2025 22:07
@cg505
Collaborator Author

cg505 commented Jan 21, 2025

/quicktest-core

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @cg505! This PR looks mostly good to me.

@cg505 cg505 requested a review from Michaelvll January 23, 2025 02:16
Collaborator

@Michaelvll Michaelvll left a comment

Thanks @cg505! It looks good to me.

sky/jobs/utils.py (outdated)
'Waiting for controller process to be RUNNING') + '{status_str}'
status_display = rich_utils.safe_status(status_msg.format(status_str=''))

def should_keep_logging(status: managed_job_state.ManagedJobStatus) -> bool:
Collaborator

minor: Should we allow this function to take Optional[managed_job_state.ManagedJobStatus], so that we don't have to assert before every invocation of this function?

Collaborator Author

I don't really like any of the alternatives:

  1. Function accepts Optional[managed_job_state.ManagedJobStatus] but throws if it gets None - feels like the type is just wrong in that case.
  2. Function accepts Optional[managed_job_state.ManagedJobStatus] and returns True (or maybe False?) if it gets None - feels inaccurate.
  3. Wrap get_latest_task_id_status to assert non-None status - feels overengineered.
  4. Change all the status-fetching functions to throw if the job does not exist - maybe correct, but out of scope for this PR.

So I guess I'd prefer to just keep this - it's a bit verbose but it's pretty simple and understandable.
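
For readers following the thread, the pattern being kept looks roughly like the sketch below. It is an illustration under assumptions, not the code in `sky/jobs/utils.py`: `Status`, `get_status`, `follow_logs_once`, the in-memory `_JOBS` table, and the body of `should_keep_logging` are all hypothetical stand-ins. The point is only that the status lookup returns an `Optional`, the caller asserts it is not `None`, and the predicate keeps a non-optional parameter type.

```python
import enum
from typing import Dict, Optional


class Status(enum.Enum):
    """Hypothetical stand-in for the managed job status enum."""
    RUNNING = 'RUNNING'
    CANCELLING = 'CANCELLING'
    SUCCEEDED = 'SUCCEEDED'


# Hypothetical in-memory table standing in for the jobs database.
_JOBS: Dict[int, Status] = {1: Status.RUNNING, 2: Status.SUCCEEDED}


def get_status(job_id: int) -> Optional[Status]:
    """Return the job's status, or None if the job does not exist."""
    return _JOBS.get(job_id)


def should_keep_logging(status: Status) -> bool:
    """Predicate with a non-optional parameter; the body is illustrative."""
    return status is Status.RUNNING


def follow_logs_once(job_id: int) -> bool:
    status = get_status(job_id)
    # The assert narrows Optional[Status] to Status for the type checker
    # and fails loudly if the job is missing.
    assert status is not None, f'Job {job_id} not found'
    return should_keep_logging(status)
```

The cost is one assert per call site; the benefit is that `should_keep_logging` never has to decide what `None` means.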

@Michaelvll Michaelvll added this to the v0.8.0 milestone Jan 23, 2025
@cg505
Collaborator Author

cg505 commented Jan 23, 2025

/quicktest-core

@cg505
Collaborator Author

cg505 commented Jan 23, 2025

/smoke-test managed_jobs

@cg505
Collaborator Author

cg505 commented Jan 24, 2025

smoke test failure is unrelated, should be fixed in #4548

@cg505 cg505 merged commit 485b1cd into skypilot-org:master Jan 24, 2025
19 of 20 checks passed
2 participants