Skip to content

sdexec: add stop timeout to handle unkillable processes#6666

Merged
mergify[bot] merged 11 commits intoflux-framework:masterfrom
garlick:sdexec_notify
Feb 28, 2025
Merged

sdexec: add stop timeout to handle unkillable processes#6666
mergify[bot] merged 11 commits intoflux-framework:masterfrom
garlick:sdexec_notify

Conversation

@garlick
Copy link
Member

@garlick garlick commented Feb 26, 2025

Problem: in a system instance using sdexec, jobs may remain in R state when the shell exits with unkillable processes.

This is because the IMP doesn't exit until the unit cgroup is empty, and the IMP must exit for sdexec to collect the shell's exit status.

This PR, in conjunction with flux-framework/flux-security#201, arranges for the unit to enter deactivating state once the shell exits. An sdexec timeout begins at that point. If after a configurable period, the unit has not stopped, SIGUSR1 (IMP proxy for SIGKILL) is sent to the unit. If after another period the unit still has not stopped, it is abandoned and the subprocess protocol is terminated so the job can advance and release resources.

The abandoned unit remains in deactivating state indefinitely, and can be picked up by the sdmon montoring service proposed in #6616

Problem: job-exec may need to set TimeoutStopUSec=infinity to
disable systemd's stop timeout that clobbers the IMP with a SIGKILL.

Since this property has a non-string value, it needs to be special
cased in sdexec_start_transient_unit() so it can be represented using
the proper dbus type.

Add it, with "infinity" support.
Problem: there is no test that ensures sdexec can communicate
TimeoutStopUSec to systemd.

Add some tests.
Problem: Type is a regular systemd property, but libsdexec
requires it to be set in a different way than other properties.

Drop the type parameter from sdexec_start_transient_unit().
If SDEXEC_PROP_Type is not explicitly set in the command options,
set Type=exec, which was the value hard coded by sdexec.

Update sdexec.
Update unit test.
Problem: there is no test for sdexec Type property handling.

Add some tests.
@garlick garlick changed the title WIP: run sdexec imp-shell jobs as Type=notify with sdexec stop tiimeout sdexec: add stop timeout to handle unkillable processes Feb 27, 2025
@garlick
Copy link
Member Author

garlick commented Feb 27, 2025

Ugh I just realized this needs to drain any node that ends up with an abandoned stuck unit.

Problem: an sdexec imp-shell unit can run into the following problem:
- flux-shell is killed/terminates
- there are unkillable children of flux-shell
- the IMP won't exit until the cgroup is empty
- the job remains in R state

This adds a configurable stop timer to sdexec that is triggered when
the unit enters deactivating state.  Disabled by default, the timer
is configured via the following subprocess command options:

SDEXEC_STOP_TIMER_SEC
  Specify the timeout value in seconds.  If non-negative, this enables
  the stop timer.
SDEXEC_STOP_TIMER_SIGNAL
  Specify a signal to send to the unit after the timeout.  By default,
  SIGKILL is used.

The behavior of the stop timer is follows:
- The timer is activated when the unit enters "deactivating" state.
- After STOP_TIMER_SEC seconds, STOP_TIMER_SIGNAL is sent to the unit.
- After another STOP_TIMER_SEC seconds, the unit is abandonded and
  subprocess exec RPC is terminated with an EDEADLK error.

To solve the stated problem, the stop timer must be used with job-exec
changes to run the unit with Type=notify, in conjunction with changes
to the IMP to call sd_notify() STOPPING=1 when the shell exits.
The job-exec changes are coming in a future commit.
Problem: jobs remain in R state when the flux-shell exits with
unkillable processes.

Run imp-shell units with Type=notify and the new sdexec stop timer.
Disable systemd's stop timer by setting TimeoutStopUsec=infinity.
This assumes the IMP has been modified to call sd_notify(3) at
appropriate transitions.

The stop timer, which is enabled by default with a timeout of 30s
and signal of SIGUSR1, may be configured or disabled via the TOML
[exec] table.

Fixes flux-framework#6656
@garlick
Copy link
Member Author

garlick commented Feb 27, 2025

OK, pushed a change that handles the stop timeout speficially rather than just letting job-exec treat it like a generic failure. Here is a test where the sdexec-stop-timer-signal was set to 0 and the shell was killed with -9 to simulate a stuck process.

$ flux run -t 10m nohup sleep inf
flux-job: ƒBd6cBBSFcK started                                          00:00:22
53.647s: job.exception ƒBd6cBBSFcK type=exec severity=0 shell exited with unkillable processes on picl0 (rank 0)
flux-job: task(s) exited with exit code 1

and flux resource drain reports

TIME         STATE    REASON                         NODELIST
Feb27 10:51  drained  unkillable processes from job+ picl0

The unit is still running

● imp-shell-0-fBd5swxvjPd.service - User workload
     Loaded: loaded (/run/user/500/systemd/transient/imp-shell-0-fBd5swxvjPd.service; transient)
  Transient: yes
     Active: deactivating (stop-sigterm) since Thu 2025-02-27 10:27:07 PST; 32min ago
   Main PID: 157097 (flux-imp)
     Status: "IMP child Killed, waiting for cgroup"
      Tasks: 2 (limit: 9562)
     Memory: 1.6M (max: 6.2G available: 6.2G)
        CPU: 111ms
     CGroup: /user.slice/user-500.slice/user@500.service/app.slice/imp-shell-0-fBd5swxvjPd.service
             ├─157097 /usr/lib/aarch64-linux-gnu/flux/flux-imp exec /usr/lib/aarch64-linux-gnu/flux/flux-shell 78897629785751552
             └─157102 sleep inf

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM! Nice improvement.

I will get a PR to update flux-security to v0.14.0 in CI.

Comment on lines 41 to 46
sdexec-stop-timer-sec
(optional) Configure the length of time in seconds after a unit enters
deactivating state when it will be sent the sdexec-stop-timer-signal.
After the same length of time, if the unit hasn't terminated, for example
due to unkillable processes, the unit will be abandoned and the exec RPC
is terminated. Default 30.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Remind the reader here what triggers the unit entering deactivating state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed that and forced a push.

I forgot that I didn't add tests for the SDEXEC_TIMER stuff. However, if you are OK with merging this as is I'm fine with that and time permitting, I can follow on with some test improvements in another PR.

@grondo
Copy link
Contributor

grondo commented Feb 27, 2025

I will get a PR to update flux-security to v0.14.0 in CI.

Actually that's just a one liner in docker-run-checks.sh, and debian/control and configure.ac should also be updated when v0.14.0 is required, so I'll just let you do that in whatever PR is appropriate.

Problem: the sdexec-stop-timer config options are undocumented.

Add sdexec-stop-timer-signal and sdexec-stop-timer-sec options
to the flux-config-exec(5) man page.
Problem: when sdexec's stop timer terminates a subprocess exec RPC,
the node is not drained and a subsequent job could be co-located
with the unkillable processes.

Add specific handling for an EDEADLK error that drains the node
and then behaves the same as if a EHOSTUNREACH error were received.
@garlick
Copy link
Member Author

garlick commented Feb 27, 2025

Actually that's just a one liner in docker-run-checks.sh, and debian/control and configure.ac should also be updated when v0.14.0 is required, so I'll just let you do that in whatever PR is appropriate.

That should probably go here actually. I 'll tack that on.

Problem: sdexec now launches work in transient Type=notify
containers, but for that to work, a flux-security version
that calls sd_notify() is required.

Bump the minimum required version of flux-security.
Problem: flux-security 0.14 or greater is required now
but the debian control file require 0.13.

Bump the version.
Problem: flux-security 0.13.0 is installed by default in CI
but flux-core now requires 0.14.0 or greater.

Bump the version.
@grondo
Copy link
Contributor

grondo commented Feb 28, 2025

Yeah, it kind of seems reasonable to merge this one as is to keep things moving along.

@garlick
Copy link
Member Author

garlick commented Feb 28, 2025

Great, thanks! Setting MWP then.

@mergify mergify bot merged commit 26e39f2 into flux-framework:master Feb 28, 2025
34 of 35 checks passed
@garlick garlick deleted the sdexec_notify branch February 28, 2025 14:33
@codecov
Copy link

codecov bot commented May 29, 2025

Codecov Report

Attention: Patch coverage is 37.77778% with 56 lines in your changes missing coverage. Please review.

Project coverage is 83.86%. Comparing base (f0eb393) to head (a84adc0).
Report is 340 commits behind head on master.

Files with missing lines Patch % Lines
src/modules/sdexec/sdexec.c 28.26% 33 Missing ⚠️
src/modules/job-exec/exec.c 16.66% 10 Missing ⚠️
src/modules/job-exec/exec_config.c 28.57% 10 Missing ⚠️
src/common/libsdexec/start.c 83.33% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6666      +/-   ##
==========================================
- Coverage   83.90%   83.86%   -0.04%     
==========================================
  Files         533      533              
  Lines       88593    88678      +85     
==========================================
+ Hits        74332    74374      +42     
- Misses      14261    14304      +43     
Files with missing lines Coverage Δ
src/common/libsdexec/start.c 79.53% <83.33%> (+0.13%) ⬆️
src/modules/job-exec/exec.c 78.38% <16.66%> (-2.27%) ⬇️
src/modules/job-exec/exec_config.c 71.68% <28.57%> (-6.32%) ⬇️
src/modules/sdexec/sdexec.c 67.91% <28.26%> (-3.27%) ⬇️

... and 11 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants