**IMPORTANT:** This method currently has performance issues. Use the PyTorch patch instead.
NCCL provides a way to override its collective communication functions through plugins. Before it starts, any NCCL installation looks for a `libnccl-net.so` library (the library is the plugin). If it is available, NCCL loads it at runtime and uses it. This allows us to seamlessly integrate with any framework that uses NCCL.
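As a quick sanity check, you can ask the dynamic loader whether it can find such a plugin (a minimal sketch, assuming the plugin has already been registered with the loader as described later in this section):

```bash
# List the shared libraries known to the dynamic loader and look for the plugin.
# If nothing is printed, NCCL silently falls back to its built-in implementations.
ldconfig -p | grep libnccl-net
```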
Currently tested frameworks:
- PyTorch
- TensorFlow through Horovod
The first thing that we need to do is modify NCCL so that our plugin can implement only the functions that are relevant to SwitchML and fall back to the existing algorithms for the others.
We are working towards avoiding the need to rebuild a modified NCCL, but it is required at the moment.
- Make sure CUDA is installed.
- Clone the NCCL repository:
```bash
git clone https://github.com/NVIDIA/nccl.git
```
- Check out the specific version that we modified:
```bash
cd nccl
git checkout 195232
```
- Apply the NCCL patch `nccl_collectives.patch`:
```bash
git apply p4app-switchml/frameworks_integration/nccl_plugin/nccl_collectives.patch
```
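If you want to inspect the patch before applying it, standard `git apply` options can help:

```bash
# Show which files the patch touches without applying anything.
git apply --stat p4app-switchml/frameworks_integration/nccl_plugin/nccl_collectives.patch
# Dry run: verify that the patch applies cleanly.
git apply --check p4app-switchml/frameworks_integration/nccl_plugin/nccl_collectives.patch
```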
- Compile the patched NCCL (if CUDA is not installed in the default path, provide its path using `CUDA_HOME=`):
```bash
make -j src.build
```
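For example, if CUDA lives under a versioned prefix (the path below is an assumption; adjust it to your system):

```bash
# Build against a non-default CUDA installation.
make -j src.build CUDA_HOME=/usr/local/cuda-11.3
```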
- Register our custom NCCL in the system.
This step is required unless you have installed the patched NCCL directly on the system.
Run:
```bash
sudo bash -c 'echo path_to_patched_nccl_repo/build/lib > /etc/ld.so.conf.d/00_switchml.conf'
sudo ldconfig
```
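To confirm that the registration took effect, you can query the loader cache; the patched build directory should appear in the output:

```bash
# The entry for libnccl should now point at the patched build directory.
ldconfig -p | grep libnccl
```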
To remove this registration so that you can use your normal version of NCCL, simply undo what you did:
```bash
sudo rm /etc/ld.so.conf.d/00_switchml.conf
sudo ldconfig
```
- Make sure that the client library is built and ready. Refer to the client_lib folder for details on how to build it.
- Build the NCCL plugin.
From the nccl_plugin directory run:
```bash
make NCCL_HOME=<path_to_patched_nccl_repo/build/> CUDA_HOME=<path to cuda installation>
```
All variables that can be passed to the NCCL plugin makefile:

| Variable | Type | Default | Usage |
|---|---|---|---|
| DEBUG | boolean | 0 | Disable optimizations, add debug symbols, and enable detailed debugging messages. |
| DPDK | boolean | 0 | Add DPDK backend specific compiler/linker options. |
| MLX5 | boolean | 0 | Add DPDK backend ConnectX-5/ConnectX-4 specific compiler/linker options. |
| MLX4 | boolean | 0 | Add DPDK backend ConnectX-3 specific compiler/linker options. |
| RDMA | boolean | 0 | Add RDMA backend specific compiler/linker options. |
| BUILDDIR | path | dev_root/build | Where to store generated objects and the plugin. |
| GRPC_HOME | path | dev_root/third_party/grpc/build | Where to look for the gRPC installation. |
| DPDK_HOME | path | dev_root/third_party/dpdk/build | Where to look for the DPDK installation. |
| CUDA_HOME | path | /usr/local/cuda | Where to look for the CUDA installation. |
| NCCL_HOME | path | /usr/local | Where to look for the patched NCCL installation. |
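For example, a debug build of the plugin with the RDMA backend against non-default NCCL and CUDA locations (the paths are placeholders, not recommendations):

```bash
# Debug build using the RDMA backend; substitute your own paths.
make RDMA=1 DEBUG=1 \
     NCCL_HOME=path_to_patched_nccl_repo/build \
     CUDA_HOME=/usr/local/cuda-11.3
```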
- Register the NCCL plugin.
At this point you should confirm that you have a `libnccl-net.so` shared library in the build directory; this means that the NCCL plugin is ready. But NCCL must now be made aware that this library/plugin exists.
Run:
```bash
sudo bash -c 'echo path_to_switchml_repo/build/lib >> /etc/ld.so.conf.d/1switchml.conf'
sudo ldconfig
```
Now we just need to set some environment variables to force NCCL to use our plugin:
```bash
sudo bash -c 'echo NCCL_COLLNET_ENABLE=1 > /etc/nccl.conf'
sudo bash -c 'echo NCCL_ALGO=CollNet >> /etc/nccl.conf'
sudo bash -c 'echo NCCL_CHECKS_DISABLE=1 >> /etc/nccl.conf' # Can improve performance
sudo bash -c 'echo NCCL_IB_DISABLE=1 >> /etc/nccl.conf'
```
You can always skip this and set these environment variables in your scripts or conda environments instead.
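For example, a per-session alternative to `/etc/nccl.conf` (a sketch mirroring the system-wide settings above):

```bash
# Set the same options for the current shell session only.
export NCCL_COLLNET_ENABLE=1
export NCCL_ALGO=CollNet
export NCCL_CHECKS_DISABLE=1  # Can improve performance
export NCCL_IB_DISABLE=1
```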
Now, to verify that everything is working up to this point, we can build and run the NCCL tests.
- Make sure you have a working MPI installation.
- Clone the NCCL tests repository:
```bash
git clone https://github.com/NVIDIA/nccl-tests.git
```
- Build the NCCL tests:
```bash
cd nccl-tests
make MPI=1 NCCL_HOME=path_to_patched_nccl_repo/build/ MPI_HOME=path_to_mpi_installation
```
- Set up the SwitchML configuration in the `nccl-tests/build` directory.
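Assuming the client library reads its configuration from the working directory and that the repository ships a default configuration file (both assumptions; check the client_lib documentation for the exact name and contents), this step might look like:

```bash
# Copy the default SwitchML configuration next to the test binaries.
# File name and location are assumptions; see the client_lib docs.
cp path_to_switchml_repo/switchml.cfg nccl-tests/build/
```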
- Test allreduce:
```bash
cd build
mpirun -np <num_processes> -host <list_of_host_ips> ./all_reduce_perf --op sum --datatype float --iters 10 --warmup_iters 5
```
Useful NCCL variables for debugging: `NCCL_DEBUG=INFO`, `NCCL_DEBUG_SUBSYS=ALL`, and `NCCL_CHECKS_DISABLE=0`.
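For example, a one-off verbose run (assuming Open MPI, whose `-x` flag forwards environment variables to the launched ranks; the host list is a placeholder):

```bash
# Debug run with verbose NCCL logging and checks re-enabled.
mpirun -np 2 -host node1,node2 \
       -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=ALL -x NCCL_CHECKS_DISABLE=0 \
       ./all_reduce_perf --op sum --datatype float --iters 10 --warmup_iters 5
```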
The important thing to keep in mind when running with PyTorch is that you want PyTorch to use our patched NCCL, not the NCCL module that it links statically. So depending on where you got your PyTorch binary or how you compiled it, you may need to recompile PyTorch so that it links, dynamically or statically, against our patched NCCL.
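A quick way to check how your PyTorch build links NCCL (a sketch; it assumes a pip/conda layout with the CUDA libraries under `torch/lib`, and the library name varies across PyTorch versions):

```bash
# Look for a dynamic libnccl dependency in PyTorch's CUDA library.
TORCH_LIB=$(python -c "import torch, os; print(os.path.join(os.path.dirname(torch.__file__), 'lib'))")
ldd "$TORCH_LIB"/libtorch_cuda.so | grep -i nccl \
    || echo "No dynamic libnccl dependency found; NCCL is likely linked statically."
```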
There are some linking problems at the moment. Instructions will come soon.