Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Handle terminate signals in init script #1390

Merged
merged 1 commit into from
Dec 26, 2024

Conversation

mikhailv
Copy link
Contributor

tini (added in PR) does not wait for child processes to exit gracefully, so all processes will be killed immediately. This is why the container stops so fast.

Possible tini issue: krallin/tini#180

This fix makes the init script listen and handle terminate signals. It also ensures that child processes are shut down in reverse order with proper waiting for completion.

How to reproduce:

To see exit codes of processes I used exitsnoop-bpfcc tool from this toolset. Not sure maybe it is possible in easier way.

This command started in terminal: sudo exitsnoop-bpfcc | egrep '(gunicorn|valkey|nginx|python3)'

Before the fix

romm  | INFO:     [RomM][watcher][2024-12-26 02:04:50] Watching /romm/library for changes
romm exited with code 0
romm  | INFO:     [init][2024-12-26 02:07:04] Starting up, please wait...
romm  | INFO:     [init][2024-12-26 02:07:09] starting valkey-server

As you may see, in the logs there are only exited with code 0 and Starting up, no any word about shutting down

Output of exitsnoop-bpfcc:

python3          3865231 3865103 3865231 0.08    0
nginx            3862396 3861987 3862396 134.66  signal 9 (KILL)
valkey-server    3862375 3861987 3862375 136.99  signal 9 (KILL)
gunicorn         3862388 3861987 3862388 135.68  signal 9 (KILL)
python3          3862416 3861987 3862416 134.66  signal 9 (KILL)
python3          3862424 3861987 3862424 134.67  signal 9 (KILL)
python3          3862408 3861987 3862408 134.68  signal 9 (KILL)
gunicorn         3862392 3861987 3862392 135.49  signal 9 (KILL)
gunicorn         3862391 3861987 3862391 135.58  signal 9 (KILL)
nginx            3862403 3861987 3862403 134.69  signal 9 (KILL)
nginx            3862412 3861987 3862412 134.69  signal 9 (KILL)
nginx            3862423 3861987 3862423 134.69  signal 9 (KILL)
nginx            3862407 3861987 3862407 134.69  signal 9 (KILL)
nginx            3862420 3861987 3862420 134.69  signal 9 (KILL)
nginx            3862405 3861987 3862405 134.70  signal 9 (KILL)
nginx            3862409 3861987 3862409 134.70  signal 9 (KILL)
nginx            3862397 3861987 3862397 134.70  signal 9 (KILL)
nginx            3862417 3861987 3862417 134.70  signal 9 (KILL)
nginx            3862427 3861987 3862427 134.69  signal 9 (KILL)
nginx            3862400 3861987 3862400 134.70  signal 9 (KILL)
nginx            3862421 3861987 3862421 134.70  signal 9 (KILL)
nginx            3862414 3861987 3862414 134.70  signal 9 (KILL)
nginx            3862425 3861987 3862425 134.70  signal 9 (KILL)
nginx            3862404 3861987 3862404 134.70  signal 9 (KILL)
nginx            3862415 3861987 3862415 134.70  signal 9 (KILL)
nginx            3862422 3861987 3862422 134.70  signal 9 (KILL)
nginx            3862418 3861987 3862418 134.70  signal 9 (KILL)
nginx            3862402 3861987 3862402 134.70  signal 9 (KILL)
nginx            3862399 3861987 3862399 134.70  signal 9 (KILL)
nginx            3862398 3861987 3862398 134.70  signal 9 (KILL)
nginx            3862410 3861987 3862410 134.70  signal 9 (KILL)
nginx            3862406 3861987 3862406 134.70  signal 9 (KILL)
nginx            3862413 3861987 3862413 134.70  signal 9 (KILL)
python3          3865836 3865711 3865836 0.08    0
nginx            3866197 3865702 3866197 0.01    0

Almost all processes killed and this is bad because some of these processes may write some data to the disk and operation will be interrupted, data can be written incomplete.

After the fix

romm  | [2024-12-26 02:09:29 +0000] [25] [INFO] Application startup complete.
romm  | INFO:     [init][2024-12-26 02:09:48] stopping watcher
romm  | INFO:     [init][2024-12-26 02:09:48] stopping worker
romm  | 02:09:48 Worker a2e75a570ff74c5b8ae556b32ccf1ff7 [PID 49]: warm shut down requested
romm  | 02:09:48 Unsubscribing from channel rq:pubsub:a2e75a570ff74c5b8ae556b32ccf1ff7
romm  | INFO:     [init][2024-12-26 02:09:48] stopping nginx
romm  | INFO:     [init][2024-12-26 02:09:49] stopping gunicorn
romm  | [2024-12-26 02:09:49 +0000] [21] [INFO] Handling signal: term
romm  | [2024-12-26 02:09:49 +0000] [24] [INFO] Shutting down
romm  | [2024-12-26 02:09:49 +0000] [25] [INFO] Shutting down
romm  | [2024-12-26 02:09:49 +0000] [24] [INFO] Waiting for application shutdown.
romm  | [2024-12-26 02:09:49 +0000] [25] [INFO] Waiting for application shutdown.
romm  | [2024-12-26 02:09:49 +0000] [25] [INFO] Application shutdown complete.
romm  | [2024-12-26 02:09:49 +0000] [25] [INFO] Finished server process [25]
romm  | [2024-12-26 02:09:49 +0000] [24] [INFO] Application shutdown complete.
romm  | [2024-12-26 02:09:49 +0000] [24] [INFO] Finished server process [24]
romm  | [2024-12-26 02:09:49 +0000] [21] [ERROR] Worker (pid:25) was sent SIGTERM!
romm  | [2024-12-26 02:09:49 +0000] [21] [ERROR] Worker (pid:24) was sent SIGTERM!
romm  | [2024-12-26 02:09:49 +0000] [21] [INFO] Shutting down: Master
romm  | INFO:     [init][2024-12-26 02:09:49] stopping valkey-server
romm  | 14:signal-handler (1735178989) Received SIGTERM scheduling shutdown...
romm  | 14:M 26 Dec 2024 02:09:49.582 * User requested shutdown...
romm  | 14:M 26 Dec 2024 02:09:49.582 * Saving the final RDB snapshot before exiting.
romm  | 14:M 26 Dec 2024 02:09:49.763 * DB saved on disk
romm  | 14:M 26 Dec 2024 02:09:49.763 # Valkey is now ready to exit, bye bye...
romm exited with code 0
romm  | INFO:     [init][2024-12-26 02:09:50] Starting up, please wait...
romm  | INFO:     [init][2024-12-26 02:09:50] starting valkey-server

You may notice shutting down logs appeared in logs and shutdown handlers executed.

Output of exitsnoop-bpfcc:

python3          3870358 3870032 3870358 75.15   signal 15 (TERM)
python3          3870366 3870032 3870366 75.30   0
nginx            3870371 3870346 3870371 75.37   0
nginx            3870367 3870346 3870367 75.38   0
nginx            3870357 3870346 3870357 75.38   0
nginx            3870353 3870346 3870353 75.38   0
nginx            3870351 3870346 3870351 75.38   0
nginx            3870350 3870346 3870350 75.38   0
nginx            3870349 3870346 3870349 75.38   0
nginx            3870362 3870346 3870362 75.38   0
nginx            3870354 3870346 3870354 75.38   0
nginx            3870365 3870346 3870365 75.38   0
nginx            3870375 3870346 3870375 75.38   0
nginx            3870363 3870346 3870363 75.38   0
nginx            3870348 3870346 3870348 75.38   0
nginx            3870360 3870346 3870360 75.38   0
nginx            3870377 3870346 3870377 75.38   0
nginx            3870356 3870346 3870356 75.38   0
nginx            3870347 3870346 3870347 75.38   0
nginx            3870364 3870346 3870364 75.38   0
nginx            3870359 3870346 3870359 75.38   0
nginx            3870369 3870346 3870369 75.38   0
nginx            3870376 3870346 3870376 75.38   0
nginx            3870373 3870346 3870373 75.38   0
nginx            3870355 3870346 3870355 75.39   0
nginx            3870370 3870346 3870370 75.38   0
nginx            3870346 3870032 3870346 75.39   0
gunicorn         3870316 3870313 3870316 76.48   signal 15 (TERM)
gunicorn         3870317 3870313 3870317 76.50   signal 15 (TERM)
gunicorn         3870313 3870032 3870313 76.79   0
valkey-server    3870079 3870032 3870079 78.35   0
python3          3870372 3870032 3870372 76.07   signal 9 (KILL)

You may see almost all processes exited with 0 code or TERM which is also good I think. Only only process is finishing with KILL every time, not sure why.

Comment on lines 193 to 194
# check for died processes every 5 seconds
sleep 5
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved sleep to the end of the loop because it does not make sense to wait for 5 seconds when script is started instead of starting of all child processes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call!

@adamantike adamantike self-requested a review December 26, 2024 14:26
Copy link
Collaborator

@adamantike adamantike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change overall, thanks for looking into it!
I can test it locally after the typo for scheduler is fixed.

Comment on lines 119 to 126
PROCESS=$1
if [[ -f "/tmp/${PROCESS}.pid" ]]; then
PID=$(cat "/tmp/${PROCESS}.pid") || true
if [[ -d "/proc/${PID}" ]]; then
info_log "stopping ${PROCESS}"
kill "${PID}" || true
# wait for process exit
while [ -e "/proc/${PID}" ]; do sleep 0.1; done
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Seems indentation is broken within this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, converted to tabs


shutdown() {
# shutdown in reverse order
stop_process_pid sheduler
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: sheduler instead of scheduler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch, thanks, this is why last one python3 process been killed, now termination exit codes looks like

python3          163192 162602 163192 15.15   0
python3          163184 162602 163184 15.39   0
python3          163176 162602 163176 15.45   signal 15 (TERM)
nginx            163193 163163 163193 15.58   0
nginx            163190 163163 163190 15.58   0
nginx            163173 163163 163173 15.58   0
nginx            163189 163163 163189 15.58   0
nginx            163169 163163 163169 15.58   0
nginx            163178 163163 163178 15.58   0
nginx            163180 163163 163180 15.58   0
nginx            163172 163163 163172 15.58   0
nginx            163165 163163 163165 15.59   0
nginx            163191 163163 163191 15.58   0
nginx            163168 163163 163168 15.59   0
nginx            163186 163163 163186 15.59   0
nginx            163174 163163 163174 15.59   0
nginx            163177 163163 163177 15.59   0
nginx            163182 163163 163182 15.59   0
nginx            163166 163163 163166 15.59   0
nginx            163164 163163 163164 15.59   0
nginx            163183 163163 163183 15.59   0
nginx            163167 163163 163167 15.59   0
nginx            163170 163163 163170 15.59   0
nginx            163181 163163 163181 15.59   0
nginx            163187 163163 163187 15.59   0
nginx            163175 163163 163175 15.59   0
nginx            163185 163163 163185 15.59   0
nginx            163163 162602 163163 15.60   0
gunicorn         163154 163151 163154 16.65   signal 15 (TERM)
gunicorn         163155 163151 163155 16.63   signal 15 (TERM)
gunicorn         163151 162602 163151 16.90   0
valkey-server    162649 162602 162649 21.35   0
python3          163862 163735 163862 0.08    0

Comment on lines 193 to 194
# check for died processes every 5 seconds
sleep 5
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call!

@adamantike
Copy link
Collaborator

Great improvement, and it works as expected based on my local testing. RomM requires signing all commits before we can merge this PR.

`tini` does not wait for child processes to close, so all processes will be killed immediately. This is why the container stops so fast.

This fix makes the `init` script listen and handle terminate signals. It also ensures that child processes are shut down in reverse order with proper waiting for completion.
@mikhailv mikhailv force-pushed the fix/graceful-shutdown branch from 448071c to f5bd75c Compare December 26, 2024 22:02
@mikhailv
Copy link
Contributor Author

Done, I squashed two commits into one and signed it

@zurdi15 zurdi15 merged commit eea41fc into rommapp:master Dec 26, 2024
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants