This guide helps you check that the GPUs in your cluster can communicate with each other across nodes.
Suppose your cluster has 8 nodes and you want to use 2 of them (NODE3 and NODE7), each with 8 GPUs.
Each '*' is one GPU:
NODE3: ********
NODE7: ********
The master node is NODE3; use its IP address for master_addr (usually the node name resolves to the node's IP).
nnodes=2
master_addr=NODE3
master_port=12345
nproc_per_node=<number of GPUs per node>
Each node has its own node_rank (0, 1, 2, ...). An example job script is below.
- Create an SH file or run the commands directly in your terminal.
- export NCCL_DEBUG=INFO
  (displays the NCCL debugging log)
- export NCCL_SOCKET_IFNAME=<your network interface name>
  (e.g., eth0 or ib0; run "ip addr" on each node to list its interfaces)
- On NODE3 (node_rank=0, the master node):
torchrun --nproc_per_node=8 \
    --master_port 12345 --nnodes=2 \
    --node_rank=0 --master_addr=NODE3 \
    ./ddp.py
- On NODE7 (node_rank=1):
torchrun --nproc_per_node=8 \
    --master_port 12345 --nnodes=2 \
    --node_rank=1 --master_addr=NODE3 \
    ./ddp.py
"torch run" or "torch.distributed.launch"
- In your terminal, execute the corresponding command on each node.
Before running the script, check that each node can see its GPUs with: python -c "import torch; print(torch.cuda.is_available())"
- If you see the "Send" and "Received" messages from each GPU,
you can conclude that the nodes in your cluster communicate with each other correctly.
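
The ./ddp.py script itself is not shown in this guide. Below is a minimal sketch of what such a communication check could look like, assuming it uses torch.distributed point-to-point send/recv over NCCL and prints "Send"/"Received" lines; the exact messages and logic in your own script may differ.

# Minimal sketch of a ./ddp.py communication check (assumption: the real
# script is not included in this guide). Rank 0 sends a tensor to every
# other rank; each receiver prints a confirmation.
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # Pin each process to one GPU on its node, then join the NCCL group
    # (MASTER_ADDR / MASTER_PORT are read from the environment set by torchrun).
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    tensor = torch.ones(1, device="cuda")
    if rank == 0:
        # Rank 0 sends the tensor to every other rank, one at a time.
        for dst in range(1, world_size):
            dist.send(tensor, dst=dst)
            print(f"Send: rank 0 -> rank {dst}")
    else:
        dist.recv(tensor, src=0)
        print(f"Received: rank {rank} <- rank 0 (value={tensor.item()})")

    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If every rank prints its line and the job exits cleanly instead of hanging, NCCL point-to-point communication between the two nodes is working.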