sdexec: add stop timeout to handle unkillable processes by garlick · Pull Request #6666 · flux-framework/flux-core

garlick · 2025-02-26T19:46:18Z

Problem: in a system instance using sdexec, jobs may remain in R state when the shell exits with unkillable processes.

This is because the IMP doesn't exit until the unit cgroup is empty, and the IMP must exit for sdexec to collect the shell's exit status.

This PR, in conjunction with flux-framework/flux-security#201, arranges for the unit to enter deactivating state once the shell exits. An sdexec timeout begins at that point. If after a configurable period, the unit has not stopped, SIGUSR1 (IMP proxy for SIGKILL) is sent to the unit. If after another period the unit still has not stopped, it is abandoned and the subprocess protocol is terminated so the job can advance and release resources.

The abandoned unit remains in deactivating state indefinitely, and can be picked up by the sdmon montoring service proposed in #6616

Problem: job-exec may need to set TimeoutStopUSec=infinity to disable systemd's stop timeout that clobbers the IMP with a SIGKILL. Since this property has a non-string value, it needs to be special cased in sdexec_start_transient_unit() so it can be represented using the proper dbus type. Add it, with "infinity" support.

Problem: there is no test that ensures sdexec can communicate TimeoutStopUSec to systemd. Add some tests.

Problem: Type is a regular systemd property, but libsdexec requires it to be set in a different way than other properties. Drop the type parameter from sdexec_start_transient_unit(). If SDEXEC_PROP_Type is not explicitly set in the command options, set Type=exec, which was the value hard coded by sdexec. Update sdexec. Update unit test.

Problem: there is no test for sdexec Type property handling. Add some tests.

garlick · 2025-02-27T17:24:30Z

Ugh I just realized this needs to drain any node that ends up with an abandoned stuck unit.

Problem: an sdexec imp-shell unit can run into the following problem: - flux-shell is killed/terminates - there are unkillable children of flux-shell - the IMP won't exit until the cgroup is empty - the job remains in R state This adds a configurable stop timer to sdexec that is triggered when the unit enters deactivating state. Disabled by default, the timer is configured via the following subprocess command options: SDEXEC_STOP_TIMER_SEC Specify the timeout value in seconds. If non-negative, this enables the stop timer. SDEXEC_STOP_TIMER_SIGNAL Specify a signal to send to the unit after the timeout. By default, SIGKILL is used. The behavior of the stop timer is follows: - The timer is activated when the unit enters "deactivating" state. - After STOP_TIMER_SEC seconds, STOP_TIMER_SIGNAL is sent to the unit. - After another STOP_TIMER_SEC seconds, the unit is abandonded and subprocess exec RPC is terminated with an EDEADLK error. To solve the stated problem, the stop timer must be used with job-exec changes to run the unit with Type=notify, in conjunction with changes to the IMP to call sd_notify() STOPPING=1 when the shell exits. The job-exec changes are coming in a future commit.

Problem: jobs remain in R state when the flux-shell exits with unkillable processes. Run imp-shell units with Type=notify and the new sdexec stop timer. Disable systemd's stop timer by setting TimeoutStopUsec=infinity. This assumes the IMP has been modified to call sd_notify(3) at appropriate transitions. The stop timer, which is enabled by default with a timeout of 30s and signal of SIGUSR1, may be configured or disabled via the TOML [exec] table. Fixes flux-framework#6656

garlick · 2025-02-27T19:00:16Z

OK, pushed a change that handles the stop timeout speficially rather than just letting job-exec treat it like a generic failure. Here is a test where the sdexec-stop-timer-signal was set to 0 and the shell was killed with -9 to simulate a stuck process.

$ flux run -t 10m nohup sleep inf
flux-job: ƒBd6cBBSFcK started                                          00:00:22
53.647s: job.exception ƒBd6cBBSFcK type=exec severity=0 shell exited with unkillable processes on picl0 (rank 0)
flux-job: task(s) exited with exit code 1

and flux resource drain reports

TIME         STATE    REASON                         NODELIST
Feb27 10:51  drained  unkillable processes from job+ picl0

The unit is still running

● imp-shell-0-fBd5swxvjPd.service - User workload
     Loaded: loaded (/run/user/500/systemd/transient/imp-shell-0-fBd5swxvjPd.service; transient)
  Transient: yes
     Active: deactivating (stop-sigterm) since Thu 2025-02-27 10:27:07 PST; 32min ago
   Main PID: 157097 (flux-imp)
     Status: "IMP child Killed, waiting for cgroup"
      Tasks: 2 (limit: 9562)
     Memory: 1.6M (max: 6.2G available: 6.2G)
        CPU: 111ms
     CGroup: /user.slice/user-500.slice/user@500.service/app.slice/imp-shell-0-fBd5swxvjPd.service
             ├─157097 /usr/lib/aarch64-linux-gnu/flux/flux-imp exec /usr/lib/aarch64-linux-gnu/flux/flux-shell 78897629785751552
             └─157102 sleep inf

grondo

This LGTM! Nice improvement.

I will get a PR to update flux-security to v0.14.0 in CI.

grondo · 2025-02-27T22:26:41Z

doc/man5/flux-config-exec.rst

+sdexec-stop-timer-sec
+   (optional) Configure the length of time in seconds after a unit enters
+   deactivating state when it will be sent the sdexec-stop-timer-signal.
+   After the same length of time, if the unit hasn't terminated, for example
+   due to unkillable processes, the unit will be abandoned and the exec RPC
+   is terminated.  Default 30.


Suggestion: Remind the reader here what triggers the unit entering deactivating state.

thanks, fixed that and forced a push.

I forgot that I didn't add tests for the SDEXEC_TIMER stuff. However, if you are OK with merging this as is I'm fine with that and time permitting, I can follow on with some test improvements in another PR.

grondo · 2025-02-27T22:54:08Z

I will get a PR to update flux-security to v0.14.0 in CI.

Actually that's just a one liner in docker-run-checks.sh, and debian/control and configure.ac should also be updated when v0.14.0 is required, so I'll just let you do that in whatever PR is appropriate.

Problem: the sdexec-stop-timer config options are undocumented. Add sdexec-stop-timer-signal and sdexec-stop-timer-sec options to the flux-config-exec(5) man page.

Problem: when sdexec's stop timer terminates a subprocess exec RPC, the node is not drained and a subsequent job could be co-located with the unkillable processes. Add specific handling for an EDEADLK error that drains the node and then behaves the same as if a EHOSTUNREACH error were received.

garlick · 2025-02-27T23:49:45Z

Actually that's just a one liner in docker-run-checks.sh, and debian/control and configure.ac should also be updated when v0.14.0 is required, so I'll just let you do that in whatever PR is appropriate.

That should probably go here actually. I 'll tack that on.

Problem: sdexec now launches work in transient Type=notify containers, but for that to work, a flux-security version that calls sd_notify() is required. Bump the minimum required version of flux-security.

Problem: flux-security 0.14 or greater is required now but the debian control file require 0.13. Bump the version.

Problem: flux-security 0.13.0 is installed by default in CI but flux-core now requires 0.14.0 or greater. Bump the version.

grondo · 2025-02-28T00:17:59Z

Yeah, it kind of seems reasonable to merge this one as is to keep things moving along.

garlick · 2025-02-28T03:38:00Z

Great, thanks! Setting MWP then.

codecov · 2025-05-29T04:00:12Z

Codecov Report

Attention: Patch coverage is 37.77778% with 56 lines in your changes missing coverage. Please review.

Project coverage is 83.86%. Comparing base (f0eb393) to head (a84adc0).
Report is 340 commits behind head on master.

Files with missing lines	Patch %	Lines
src/modules/sdexec/sdexec.c	28.26%	33 Missing ⚠️
src/modules/job-exec/exec.c	16.66%	10 Missing ⚠️
src/modules/job-exec/exec_config.c	28.57%	10 Missing ⚠️
src/common/libsdexec/start.c	83.33%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6666      +/-   ##
==========================================
- Coverage   83.90%   83.86%   -0.04%     
==========================================
  Files         533      533              
  Lines       88593    88678      +85     
==========================================
+ Hits        74332    74374      +42     
- Misses      14261    14304      +43

Files with missing lines	Coverage Δ
src/common/libsdexec/start.c	`79.53% <83.33%> (+0.13%)`	⬆️
src/modules/job-exec/exec.c	`78.38% <16.66%> (-2.27%)`	⬇️
src/modules/job-exec/exec_config.c	`71.68% <28.57%> (-6.32%)`	⬇️
src/modules/sdexec/sdexec.c	`67.91% <28.26%> (-3.27%)`	⬇️

... and 11 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

garlick mentioned this pull request Feb 26, 2025

imp exec: call sd_notify() flux-framework/flux-security#201

Merged

garlick added 4 commits February 27, 2025 07:19

testsuite: test TimeoutStopUSec property

8eb4b4e

Problem: there is no test that ensures sdexec can communicate TimeoutStopUSec to systemd. Add some tests.

testsuite: test Type property

34233f3

Problem: there is no test for sdexec Type property handling. Add some tests.

garlick force-pushed the sdexec_notify branch from d875964 to 3fa2c84 Compare February 27, 2025 15:22

garlick changed the title ~~WIP: run sdexec imp-shell jobs as Type=notify with sdexec stop tiimeout~~ sdexec: add stop timeout to handle unkillable processes Feb 27, 2025

garlick added 2 commits February 27, 2025 10:34

garlick force-pushed the sdexec_notify branch from 3fa2c84 to ade2b54 Compare February 27, 2025 18:56

grondo approved these changes Feb 27, 2025

View reviewed changes

garlick added 2 commits February 27, 2025 15:43

flux-config-exec(5): add sdexec-stop-timer stuff

a7b550d

Problem: the sdexec-stop-timer config options are undocumented. Add sdexec-stop-timer-signal and sdexec-stop-timer-sec options to the flux-config-exec(5) man page.

garlick force-pushed the sdexec_notify branch from ade2b54 to 6ef0d5b Compare February 27, 2025 23:43

garlick added 3 commits February 27, 2025 15:50

configure: require flux-security >= 0.14.0

af9115a

Problem: sdexec now launches work in transient Type=notify containers, but for that to work, a flux-security version that calls sd_notify() is required. Bump the minimum required version of flux-security.

deb: require flux-security >= 0.14.0

120ed73

Problem: flux-security 0.14 or greater is required now but the debian control file require 0.13. Bump the version.

ci: bump default flux-security version

a84adc0

Problem: flux-security 0.13.0 is installed by default in CI but flux-core now requires 0.14.0 or greater. Bump the version.

garlick added the merge-when-passing label Feb 28, 2025

mergify bot merged commit 26e39f2 into flux-framework:master Feb 28, 2025
34 of 35 checks passed

garlick deleted the sdexec_notify branch February 28, 2025 14:33

grondo mentioned this pull request Feb 28, 2025

flux-core can't be built with flux-security < 0.14.0, but this is only necessary for a system instance #6672

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sdexec: add stop timeout to handle unkillable processes#6666

sdexec: add stop timeout to handle unkillable processes#6666
mergify[bot] merged 11 commits intoflux-framework:masterfrom
garlick:sdexec_notify

garlick commented Feb 26, 2025 •

edited

Loading

Uh oh!

garlick commented Feb 27, 2025

Uh oh!

garlick commented Feb 27, 2025

Uh oh!

grondo left a comment

Uh oh!

grondo Feb 27, 2025

Uh oh!

garlick Feb 27, 2025

Uh oh!

grondo commented Feb 27, 2025

Uh oh!

garlick commented Feb 27, 2025

Uh oh!

grondo commented Feb 28, 2025

Uh oh!

garlick commented Feb 28, 2025

Uh oh!

Uh oh!

codecov bot commented May 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

garlick commented Feb 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

garlick commented Feb 27, 2025

Uh oh!

garlick commented Feb 27, 2025

Uh oh!

grondo left a comment

Choose a reason for hiding this comment

Uh oh!

grondo Feb 27, 2025

Choose a reason for hiding this comment

Uh oh!

garlick Feb 27, 2025

Choose a reason for hiding this comment

Uh oh!

grondo commented Feb 27, 2025

Uh oh!

garlick commented Feb 27, 2025

Uh oh!

grondo commented Feb 28, 2025

Uh oh!

garlick commented Feb 28, 2025

Uh oh!

Uh oh!

codecov bot commented May 29, 2025

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

garlick commented Feb 26, 2025 •

edited

Loading