[jobs] make status updates robust when controller dies #4602
This discrepancy caused issues, such as jobs getting stuck as CANCELLING when the job controller process crashes during cleanup.
/quicktest-core
/quicktest-core
Thanks @cg505! This PR looks mostly good to me.
Thanks @cg505! It looks good to me.
'Waiting for controller process to be RUNNING') + '{status_str}'
status_display = rich_utils.safe_status(status_msg.format(status_str=''))

def should_keep_logging(status: managed_job_state.ManagedJobStatus) -> bool:
minor: Should we allow this function to take Optional[managed_job_state.ManagedJobStatus], so that we don't have to assert before every invocation of this function?
I don't really like any of the alternatives:
- Function accepts Optional[managed_job_state.ManagedJobStatus] but throws if it gets None - feels like the type is just wrong in that case.
- Function accepts Optional[managed_job_state.ManagedJobStatus] and returns True (or maybe False?) if it gets None - feels inaccurate.
- Wrap get_latest_task_id_status to assert non-None status - feels overengineered.
- Change all the status-fetching functions to throw if the job does not exist - maybe correct, but out of scope for this PR.
So I guess I'd prefer to just keep this - it's a bit verbose but it's pretty simple and understandable.
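For readers following the thread, here is a minimal sketch of the pattern being kept: the status lookup returns an Optional, and each caller asserts non-None before passing it to a function that takes the plain status type. Everything here is a hypothetical stand-in for illustration (the enum members, the in-memory table, get_status, keep_logging_for, and the rule inside should_keep_logging); it is not the PR's actual code.

```python
import enum
from typing import Dict, Optional


class ManagedJobStatus(enum.Enum):
    """Hypothetical stand-in for managed_job_state.ManagedJobStatus."""
    RUNNING = 'RUNNING'
    CANCELLING = 'CANCELLING'
    SUCCEEDED = 'SUCCEEDED'


# Hypothetical in-memory table standing in for the real job state store.
_JOB_STATUSES: Dict[int, ManagedJobStatus] = {
    1: ManagedJobStatus.RUNNING,
    2: ManagedJobStatus.SUCCEEDED,
}


def get_status(job_id: int) -> Optional[ManagedJobStatus]:
    """Returns None when the job does not exist (assumed behavior)."""
    return _JOB_STATUSES.get(job_id)


def should_keep_logging(status: ManagedJobStatus) -> bool:
    # The parameter is deliberately not Optional: None has no sensible answer.
    # Assumed rule for illustration only: keep logging while RUNNING.
    return status is ManagedJobStatus.RUNNING


def keep_logging_for(job_id: int) -> bool:
    status = get_status(job_id)
    # The caller rules out the missing-job case before the call, which also
    # narrows Optional[ManagedJobStatus] to ManagedJobStatus for type checkers.
    assert status is not None, f'Job {job_id} not found'
    return should_keep_logging(status)
```

The cost is one assert per call site, but the signature of should_keep_logging stays honest about what it can handle.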
Co-authored-by: Zhanghao Wu <[email protected]>
/quicktest-core
/smoke-test managed_jobs
smoke test failure is unrelated, should be fixed in #4548
In combination with #4552 and #4562, the internal state machine for job status and schedule state should be much more robust and likely to eventually get to a consistent state, even under high load.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh