
Fix nvidia hang when a sidecar dies#594

Draft
dsocolobsky wants to merge 8 commits into main from dy/fix-nvidia-hang

Conversation

@dsocolobsky
Contributor

Sometimes (most of the time), if a sidecar dies or is killed, the GPU belonging to that sidecar hangs at 100% utilization and never quits, probably because some training thread gets stuck.

This PR should fix this: when a sidecar dies we detect it and kill the client altogether, since it doesn't make sense to keep going without one of the GPUs.
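The detect-and-shut-down behavior can be sketched as polling the sidecar child process and tearing the client down as soon as it exits. This is a minimal illustrative sketch, not psyche's actual code: the function name, polling interval, and the use of blocking std (rather than whatever async supervision the client really uses) are all assumptions.

```rust
use std::process::{Child, Command};
use std::thread;
use std::time::Duration;

/// Hypothetical sketch: block until the sidecar process exits and return its
/// exit code. In the patched client, the caller would then shut the whole
/// client down so the remaining GPUs are released instead of spinning at 100%.
fn wait_for_sidecar_exit(sidecar: &mut Child) -> i32 {
    loop {
        match sidecar.try_wait().expect("failed to poll sidecar") {
            // Sidecar exited (or was killed): report its exit code.
            Some(status) => return status.code().unwrap_or(-1),
            // Still running; poll again shortly.
            None => thread::sleep(Duration::from_millis(100)),
        }
    }
}

fn main() {
    // Stand-in for the sidecar: a short-lived child process.
    let mut sidecar = Command::new("sleep")
        .arg("1")
        .spawn()
        .expect("failed to spawn sidecar stand-in");
    let code = wait_for_sidecar_exit(&mut sidecar);
    // This is the point where the real fix would kill the client altogether.
    eprintln!("sidecar exited with code {code}; shutting down client");
}
```

The key design choice the PR describes is failing fast: once any sidecar is gone, the client exits rather than continuing with a subset of GPUs.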

How to test

  1. Start a run with the HfAuto model and a large enough batch size; 32 or 64 should be enough.
  2. Start a client with 2 or more GPUs, e.g. CUDA_VISIBLE_DEVICES=0,1 DP=2 BATCH_SIZE=1 just dev start-training-localnet-light-client
  3. Monitor GPU usage via nvtop in a separate terminal.
  4. Find the sidecar PID with something like ps aux | grep psyche.sidecar and, once the node is training, kill the process.
  • Before this patch: one GPU stays stuck at 100% utilization and the client doesn't quit.
  • After this patch: the client quits and GPU usage goes back to 0%.
