Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect signal handling #666

Open
ciandonovan opened this issue Oct 8, 2022 · 2 comments · May be fixed by #712
Open

Incorrect signal handling #666

ciandonovan opened this issue Oct 8, 2022 · 2 comments · May be fixed by #712
Labels
backlog bug Something isn't working

Comments

@ciandonovan
Copy link

ciandonovan commented Oct 8, 2022

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04 & Docker Hub ROS2 Humble images
  • Installation type:
    • binaries
  • Version or commit hash:
    • Humble
  • DDS implementation:
    • Fast-RTPS & Cyclone
  • Client library (if applicable):
    • rclcpp & rclpy

Steps to reproduce issue

Start a ROS2 Launch process which spawns at least one node, such as

ros2 launch robot_state_publisher rsp-launch-urdf-inline.xml --noninteractive

Send a SIGTERM to the ROS2 launch process with

kill -TERM $ros2_launch_PID

or

Use the official ROS2 OCI compliant containers from Docker Hub and run with Docker or Podman

podman run --rm --name ros_signal_test ros:humble ros2 launch robot_state_publisher rsp-launch-urdf-inline.xml

And then stop the container with

podman stop ros_signal_test

Expected behavior

ROS2 Launch should catch the SIGTERM and propagate that signal to each of its child nodes.

Actual behavior

SIGTERM is ignored in the first instance, and in the second containerized example SIGTERM is still ignored but all processes are SIGKILL'd after the podman stop timeout.

Additional information

The ROS2 Launch design article says

"if someone attempts to SIGTERM the launch system, they probably did so out of impatience after sending SIGINT to the launch system, and therefore the launch system should attempt to exit quickly."

This is not an accurate assumption.

The Linux signal man page defines, according to the POSIX.1-1990 standard, "SIGINT" as an "Interrupt from keyboard" and "SIGTERM" as a "Termination signal".

The GNU manual entry on termination signals goes further, defining "SIGINT" as being "sent when the user types the INTR character (normally C-c)" and "SIGTERM" as "a generic signal used to cause program termination".

Neither of these sources make reference to any hierarchy or precedence of these two signals.

Users of the ROS2 Launch system are likely to first run into this issue when try to automate the running of their ROS2 nodes on embedded systems, in systemd services, podman/docker containers, or both!

By default systemd services send SIGTERMs to their processes on shutdown, followed by SIGKILLs after a timeout.

Similarly, Podman/Docker containers get sent a SIGTERM to PID1 when running, for example, podman stop ros_signal_test.

The issue with containers cropped up here #473, but it was only resolved for SIGINTs.
An issue of SIGTERMs not being handled was mentioned here #545 (comment), but I've not been able to find any progress on it.

My current workaround for clean termination of an embedded ROS2 Launch system is to set KillMode=SIGINT for systemd services, or run Podman/Docker containers with the --stop-signal=SIGINT flag or set STOPSIGNAL SIGINT in the Dockerfile.
This only works when ROS2 Launch either detects it's running in non-interactive mode or that is set explicitly with --noninteractive #475.

However, it's oxymoronic to have to both be in "non-interactive" mode and to send a "keyboard interrupt" signal (SIGINT) for clean termination!


Feature request

Feature description

Have the ROS2 Launch system catch SIGTERMs and forward the same signal to their child nodes.

Implementation considerations

Windows??

@clalancette clalancette added bug Something isn't working backlog labels Oct 27, 2022
@ciandonovan
Copy link
Author

ciandonovan commented Jun 13, 2023

I'm working on resolving this myself, but can't figure out where the POSIX signal handlers are actually installed.

It looks like there are OS-agnostic handlers referenced around the line below, but the SIGTERM handler is never called because it doesn't install a POSIX handler for it, so the entire process gets automatically terminated by the kernel on receipt.

https://github.com/ros2/launch/blob/rolling/launch/launch/launch_service.py#L216

There is a POSIX signal handler for SIGINT installed somewhere, since its receipt doesn't terminate the process and instead correctly calls _on_sigint(signum).

The __handlers dict in signal_management.py stores mappings from signals to handler function addresses from what I can make out, but would appreciate help in identifying where the real handler is set at the OS level so I can stop a SIGTERM from killing everything.

@wjwwood mentioned in the TODO

@fujitatomoya
Copy link

ros2/ros2cli#895 is similar problem with ros2run package.

philipp-caspers added a commit to vorausrobotik/voraus-ros-bridge that referenced this issue Oct 18, 2024
This is a workaround for the `special` signal handling of ros2 launch.
See ros2/launch#666 for details.
philipp-caspers added a commit to vorausrobotik/voraus-ros-bridge that referenced this issue Oct 18, 2024
This is a workaround for the `special` signal handling of ros2 launch.
To stop a process docker sends a SIGTERM on default.
The ros2 launch wrapper however ignores SIGTERM (only SIGINT is
considered).
Docker then sends a SIGKILL after a timeout.
As a result, the shutdown times of the container are at least as long as the
timeout.
This change makes docker send SIGINT instead of SIGTERM, resulting in
proper fast shutdown.
See ros2/launch#666 for details.
philipp-caspers added a commit to vorausrobotik/voraus-ros-bridge that referenced this issue Oct 18, 2024
This is a workaround for the `special` signal handling of ros2 launch.
To stop a process docker sends a SIGTERM on default.
The ros2 launch wrapper however ignores SIGTERM (only SIGINT is
considered).
Docker then sends a SIGKILL after a timeout.
As a result, the shutdown times of the container are at least as long as the
timeout.
This change makes docker send SIGINT instead of SIGTERM, resulting in
proper fast shutdown.
See ros2/launch#666 for details.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlog bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants