
Fix nvidia hang when a sidecar dies#594

Draft
dsocolobsky wants to merge 8 commits into main from dy/fix-nvidia-hang

Conversation

@dsocolobsky
Contributor

Sometimes (most of the time), if a sidecar dies or is killed, the GPU belonging to that sidecar hangs at 100% utilization and never quits, probably because some training thread gets stuck.

This PR should fix this: when a sidecar dies we detect it and kill the client altogether, since it doesn't make sense to keep going without one of the GPUs.
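The detect-and-shut-down behavior can be sketched as polling the sidecar child process and tearing the client down as soon as it exits. This is a minimal illustrative sketch, not psyche's actual code: the function name, polling interval, and the use of blocking std (rather than whatever async supervision the client really uses) are all assumptions.

```rust
use std::process::{Child, Command};
use std::thread;
use std::time::Duration;

/// Hypothetical sketch: block until the sidecar process exits and return its
/// exit code. In the patched client, the caller would then shut the whole
/// client down so the remaining GPUs are released instead of spinning at 100%.
fn wait_for_sidecar_exit(sidecar: &mut Child) -> i32 {
    loop {
        match sidecar.try_wait().expect("failed to poll sidecar") {
            // Sidecar exited (or was killed): report its exit code.
            Some(status) => return status.code().unwrap_or(-1),
            // Still running; poll again shortly.
            None => thread::sleep(Duration::from_millis(100)),
        }
    }
}

fn main() {
    // Stand-in for the sidecar: a short-lived child process.
    let mut sidecar = Command::new("sleep")
        .arg("1")
        .spawn()
        .expect("failed to spawn sidecar stand-in");
    let code = wait_for_sidecar_exit(&mut sidecar);
    // This is the point where the real fix would kill the client altogether.
    eprintln!("sidecar exited with code {code}; shutting down client");
}
```

The key design choice the PR describes is failing fast: once any sidecar is gone, the client exits rather than continuing with a subset of GPUs.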

How to test

  1. Start a run with the HfAuto model and a large enough batch size; 32 or 64 should be enough.
  2. Start a client with 2 or more GPUs, e.g. CUDA_VISIBLE_DEVICES=0,1 DP=2 BATCH_SIZE=1 just dev start-training-localnet-light-client
  3. Monitor GPU usage via nvtop in a separate terminal.
  4. Find the sidecar PID with something like ps aux | grep psyche.sidecar and, once the node is training, kill the process.
  • Before this patch: one GPU stays stuck at 100% utilization and the client doesn't quit.
  • After this patch: the client quits and GPU usage goes back to 0%.
