**IMPORTANT:** This method currently has performance issues. Use the PyTorch patch instead.
NCCL provides a way to override its collective communication functions through plugins. Before it starts, any NCCL installation looks for a `libnccl-net.so` library (the library is the plugin). If it is available, NCCL loads it at runtime and uses it. This allows us to seamlessly integrate with any framework that uses NCCL.
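As a quick sanity check, you can ask the dynamic loader whether it can find such a plugin (a minimal sketch, assuming the plugin has already been registered with the loader as described later in this section):

```bash
# List the shared libraries known to the dynamic loader and look for the plugin.
# If nothing is printed, NCCL silently falls back to its built-in implementations.
ldconfig -p | grep libnccl-net
```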
Currently tested frameworks:
- PyTorch
- TensorFlow through Horovod
The first thing that we need to do is modify NCCL so that our plugin can implement only the functions that are relevant to SwitchML and fall back to the existing algorithms for the others.
We are working towards avoiding the need to rebuild a modified NCCL, but it is required at the moment.
- Make sure CUDA is installed.
- Clone the NCCL repository:
```bash
git clone https://github.com/NVIDIA/nccl.git
```
- Check out the specific version that we modified:
```bash
cd nccl
git checkout 195232
```
- Apply the NCCL patch `nccl_collectives.patch`:
```bash
git apply p4app-switchml/frameworks_integration/nccl_plugin/nccl_collectives.patch
```
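If you want to inspect the patch before applying it, standard `git apply` options can help:

```bash
# Show which files the patch touches without applying anything.
git apply --stat p4app-switchml/frameworks_integration/nccl_plugin/nccl_collectives.patch
# Dry run: verify that the patch applies cleanly.
git apply --check p4app-switchml/frameworks_integration/nccl_plugin/nccl_collectives.patch
```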
- Compile the patched NCCL (if CUDA is not installed in the default path, provide its path using `CUDA_HOME=`):
```bash
make -j src.build
```
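For example, if CUDA lives under a versioned prefix (the path below is an assumption; adjust it to your system):

```bash
# Build against a non-default CUDA installation.
make -j src.build CUDA_HOME=/usr/local/cuda-11.3
```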
- Register our custom NCCL in the system.
This step is required unless you have installed the patched NCCL directly on the system.
Run:
```bash
sudo bash -c 'echo path_to_patched_nccl_repo/build/lib > /etc/ld.so.conf.d/00_switchml.conf'
sudo ldconfig
```
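To confirm that the registration took effect, you can query the loader cache; the patched build directory should appear in the output:

```bash
# The entry for libnccl should now point at the patched build directory.
ldconfig -p | grep libnccl
```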
To remove this registration so that you can use your normal version of NCCL, simply undo what you did:
```bash
sudo rm /etc/ld.so.conf.d/00_switchml.conf
sudo ldconfig
```
- Make sure that the client library is built and ready. Refer to the client_lib folder for details on how to build it.
- Build the NCCL plugin.
From the nccl_plugin directory run:
```bash
make NCCL_HOME=<path_to_patched_nccl_repo/build/> CUDA_HOME=<path to cuda installation>
```
All variables that can be passed to the NCCL plugin makefile:

| Variable | Type | Default | Usage |
|---|---|---|---|
| DEBUG | boolean | 0 | Disable optimizations, add debug symbols, and enable detailed debugging messages. |
| DPDK | boolean | 0 | Add DPDK backend specific compiler/linker options. |
| MLX5 | boolean | 0 | Add DPDK backend ConnectX-5/ConnectX-4 specific compiler/linker options. |
| MLX4 | boolean | 0 | Add DPDK backend ConnectX-3 specific compiler/linker options. |
| RDMA | boolean | 0 | Add RDMA backend specific compiler/linker options. |
| BUILDDIR | path | dev_root/build | Where to store generated objects and the plugin. |
| GRPC_HOME | path | dev_root/third_party/grpc/build | Where to look for the gRPC installation. |
| DPDK_HOME | path | dev_root/third_party/dpdk/build | Where to look for the DPDK installation. |
| CUDA_HOME | path | /usr/local/cuda | Where to look for the CUDA installation. |
| NCCL_HOME | path | /usr/local | Where to look for the patched NCCL installation. |
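For example, a debug build of the plugin with the RDMA backend against non-default NCCL and CUDA locations (the paths are placeholders, not recommendations):

```bash
# Debug build using the RDMA backend; substitute your own paths.
make RDMA=1 DEBUG=1 \
     NCCL_HOME=path_to_patched_nccl_repo/build \
     CUDA_HOME=/usr/local/cuda-11.3
```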
- Register the NCCL plugin.
At this point you should confirm that you have a `libnccl-net.so` shared library in the build directory; this means that the NCCL plugin is ready. But NCCL must now be made aware that this library/plugin exists.
Run:
```bash
sudo bash -c 'echo path_to_switchml_repo/build/lib >> /etc/ld.so.conf.d/1switchml.conf'
sudo ldconfig
```
Now we just need to set some environment variables to force NCCL to use our plugin:
```bash
sudo bash -c 'echo NCCL_COLLNET_ENABLE=1 > /etc/nccl.conf'
sudo bash -c 'echo NCCL_ALGO=CollNet >> /etc/nccl.conf'
sudo bash -c 'echo NCCL_CHECKS_DISABLE=1 >> /etc/nccl.conf' # Can improve performance
sudo bash -c 'echo NCCL_IB_DISABLE=1 >> /etc/nccl.conf'
```
You can always skip this and set these environment variables in your scripts or conda environments instead.
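For example, a per-session alternative to `/etc/nccl.conf` (a sketch mirroring the system-wide settings above):

```bash
# Set the same options for the current shell session only.
export NCCL_COLLNET_ENABLE=1
export NCCL_ALGO=CollNet
export NCCL_CHECKS_DISABLE=1  # Can improve performance
export NCCL_IB_DISABLE=1
```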
Now, to verify that everything is working up to this point, we can build and run the NCCL tests.
- Make sure you have a working MPI installation.
- Clone the NCCL tests repository:
```bash
git clone https://github.com/NVIDIA/nccl-tests.git
```
- Build the NCCL tests:
```bash
cd nccl-tests
make MPI=1 NCCL_HOME=path_to_patched_nccl_repo/build/ MPI_HOME=path_to_mpi_installation
```
- Set up the SwitchML configuration in the `nccl-tests/build` directory.
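Assuming the client library reads its configuration from the working directory and that the repository ships a default configuration file (both assumptions; check the client_lib documentation for the exact name and contents), this step might look like:

```bash
# Copy the default SwitchML configuration next to the test binaries.
# File name and location are assumptions; see the client_lib docs.
cp path_to_switchml_repo/switchml.cfg nccl-tests/build/
```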
- Test allreduce:
```bash
cd build
mpirun -np <num_processes> -host <list_of_host_ips> ./all_reduce_perf --op sum --datatype float --iters 10 --warmup_iters 5
```
Useful NCCL variables for debugging: `NCCL_DEBUG=INFO`, `NCCL_DEBUG_SUBSYS=ALL`, and `NCCL_CHECKS_DISABLE=0`.
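For example, a one-off verbose run (assuming Open MPI, whose `-x` flag forwards environment variables to the launched ranks; the host list is a placeholder):

```bash
# Debug run with verbose NCCL logging and checks re-enabled.
mpirun -np 2 -host node1,node2 \
       -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=ALL -x NCCL_CHECKS_DISABLE=0 \
       ./all_reduce_perf --op sum --datatype float --iters 10 --warmup_iters 5
```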
The important thing to keep in mind when running with PyTorch is that you want PyTorch to use our patched NCCL, not the NCCL module that it links statically. So depending on where you got your PyTorch binary or how you compiled it, you may need to recompile PyTorch so that it links, dynamically or statically, against our patched NCCL.
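A quick way to check how your PyTorch build links NCCL (a sketch; it assumes a pip/conda layout with the CUDA libraries under `torch/lib`, and the library name varies across PyTorch versions):

```bash
# Look for a dynamic libnccl dependency in PyTorch's CUDA library.
TORCH_LIB=$(python -c "import torch, os; print(os.path.join(os.path.dirname(torch.__file__), 'lib'))")
ldd "$TORCH_LIB"/libtorch_cuda.so | grep -i nccl \
    || echo "No dynamic libnccl dependency found; NCCL is likely linked statically."
```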
There are some linking problems at the moment. Instructions will come soon.