This submodule contains the DLRM model used to generate TopoOpt's testbed DLRM results. It was adapted from Meta's implementation of DLRM with model parallelism. Please check the original README file here for a more detailed description.
The scripts are designed to run distributed model-parallel training of DLRM on MIT's 12-node cluster testbed. Please check this document for how to set up RDMA forwarding and use the hacked version of NCCL.
The scripts use the PyTorch version of this repository.
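How the hacked NCCL build is hooked in depends on the setup described in the document above; the sketch below is only an assumed, typical configuration, with the library path and interface/device names as placeholders.

```bash
# Sketch only: the NCCL path and NIC names below are placeholders, not the actual cluster values.
export LD_LIBRARY_PATH=/path/to/hacked-nccl/build/lib:$LD_LIBRARY_PATH  # load the hacked NCCL instead of the stock one
export NCCL_IB_HCA=mlx5_0            # RDMA device NCCL should use
export NCCL_SOCKET_IFNAME=ens1f0     # interface used for NCCL bootstrap
export NCCL_DEBUG=INFO               # check the log to confirm the intended NCCL build and NIC were picked up
```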
`run_a100_fattree.sh`
will run DLRM training on the Mellanox ConnectX-5 NICs, which are connected to a single switch.
`run_a100_topoopt.sh`
will run DLRM training on the HPE NICs, which are connected to the patch panel. Please make sure the forwarding is set up properly before running this test; check the document above for how to set up RDMA forwarding.
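Before launching the TopoOpt run, it can help to verify on each worker that the HPE NIC is visible and that its RDMA link through the patch panel is up; a quick sanity check using standard RDMA tools (device names will differ):

```bash
# Quick RDMA sanity check on each worker before the TopoOpt run.
ibv_devinfo        # list RDMA devices; the HPE NIC's port should report PORT_ACTIVE
rdma link show     # confirm the RDMA link through the patch panel is ACTIVE
```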
To execute the program, adjust the parameters in the scripts and run the script on ALL of the workers. The script automatically picks the master and logs the training output on the master machine.
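For example, assuming password-less SSH and the repository checked out at the same path on every worker (the hostnames and path below are placeholders), the script can be started on all workers from a single node:

```bash
# Start the same script on every worker; hostnames and repository path are placeholders.
WORKERS="node01 node02 node03 node04 node05 node06 node07 node08 node09 node10 node11 node12"
for h in $WORKERS; do
  ssh "$h" "cd /path/to/TopoOpt/dlrm && nohup bash run_a100_topoopt.sh > run.log 2>&1 &" &
done
wait
# The training log itself appears on whichever worker the scripts elect as master.
```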