This guide helps you check that the GPUs in your cluster can communicate with each other across nodes.
Suppose your cluster has 8 nodes and you want to use 2 of them (NODE3 and NODE7), each with 8 GPUs.
Each '*' is one GPU:
NODE3: ********
NODE7: ********
The master node is NODE3; use its IP address for master_addr (usually the node name resolves to the node's IP).
nnodes=2
master_addr=NODE3
master_port=12345
nproc_per_node=<number of GPUs per node>
Each node has its own node_rank (0, 1, 2, ...). An example job script is below.
- Create an SH file or run the commands directly in your terminal.
- export NCCL_DEBUG=INFO
  (displays the NCCL debugging log)
- export NCCL_SOCKET_IFNAME=<your network interface name>
  (e.g., eth0 or ib0; run "ip addr" on each node to list its interfaces)
- On NODE3 (node_rank=0, the master node):
torchrun --nproc_per_node=8 \
    --master_port 12345 --nnodes=2 \
    --node_rank=0 --master_addr=NODE3 \
    ./ddp.py
- On NODE7 (node_rank=1):
torchrun --nproc_per_node=8 \
    --master_port 12345 --nnodes=2 \
    --node_rank=1 --master_addr=NODE3 \
    ./ddp.py
"torch run" or "torch.distributed.launch"
- In your terminal, execute the corresponding command on each node.
Before running the script, check that each node can see its GPUs with: python -c "import torch; print(torch.cuda.is_available())"
- If you see the "Send" and "Received" messages from each GPU,
you can conclude that the nodes in your cluster communicate with each other correctly.
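
The ./ddp.py script itself is not shown in this guide. Below is a minimal sketch of what such a communication check could look like, assuming it uses torch.distributed point-to-point send/recv over NCCL and prints "Send"/"Received" lines; the exact messages and logic in your own script may differ.

# Minimal sketch of a ./ddp.py communication check (assumption: the real
# script is not included in this guide). Rank 0 sends a tensor to every
# other rank; each receiver prints a confirmation.
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # Pin each process to one GPU on its node, then join the NCCL group
    # (MASTER_ADDR / MASTER_PORT are read from the environment set by torchrun).
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    tensor = torch.ones(1, device="cuda")
    if rank == 0:
        # Rank 0 sends the tensor to every other rank, one at a time.
        for dst in range(1, world_size):
            dist.send(tensor, dst=dst)
            print(f"Send: rank 0 -> rank {dst}")
    else:
        dist.recv(tensor, src=0)
        print(f"Received: rank {rank} <- rank 0 (value={tensor.item()})")

    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If every rank prints its line and the job exits cleanly instead of hanging, NCCL point-to-point communication between the two nodes is working.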