Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP fix: make a task engine job stoppable #444

Draft
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

chisholm
Copy link
Collaborator

This PR shows an approach using process parallelism to stop jobs: the experiment is run in a child process with a SIGTERM handler which sets a flag in the task engine to cause it to stop in between steps. It can't be stopped in the middle of a long-running step that way though. So the parent process will send SIGTERM first, and if the child doesn't stop quickly enough, then a SIGKILL. The parent process can poll an endpoint for an instruction to stop the job, though we don't have such an endpoint now. There is some commented out code where that would happen.

Forcible stopping is not testable as-is on this branch, since we don't have endpoints to poll yet. There are no additional unit tests right now; as it is written, the child process is transparent to the existing test_run_task_engine unit test. So, much of the code is exercised already. Perhaps a test could be added which stops an experiment early rather than letting it run to completion, to test that part of it. That could be trickier to do...?

One thing I discovered is that mlflow has a special RunStatus value that seems applicable to a forcibly terminated run, "killed". But dioptra has no corresponding job status value. For now, it uses "failed". We might want to add one.

You also can't test it using the legacy mlflow job submission system. You must use the newTaskEngine endpoint. The MLproject system runs the commandline run-experiment tool, which I did not change. It could technically be changed to use the same child process and polling, but that tool is supposed to be able to run independently of the dioptra containers, so it wouldn't make sense for it to poll the restapi.

The python standard library has a "multiprocessing" module, which makes starting and monitoring a subprocess pretty nice and easy. So that's what I used for this.

@chisholm chisholm self-assigned this Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant