Distributed Data Parallel (DDP) Training with PyTorch

This project demonstrates how to perform distributed training using PyTorch's DistributedDataParallel (DDP) on multiple machines. It includes a simple convolutional neural network trained on the CIFAR-10 dataset.

Project Structure

.
├── ddp_launch.py
├── model.py
├── Dockerfile
├── requirements.txt
└── README.md

Prerequisites

  • Docker installed on all machines
  • All participating machines must be able to reach one another over the network
  • Basic understanding of PyTorch and distributed training concepts

Setup and Installation

  1. Clone this repository to all machines that will participate in the training:

    git clone <repository-url>
    cd <repository-directory>
    
  2. Build the Docker image on all machines:

    docker build -t ddp-training .
    

Running the Distributed Training

  1. On the master node (replace <MASTER_IP> with the actual IP address of the master node):

    docker run --rm -it \
      -e MASTER_ADDR=<MASTER_IP> \
      -e MASTER_PORT=29500 \
      -e WORLD_SIZE=2 \
      -e RANK=0 \
      -p 29500:29500 \
      ddp-training
    
  2. On the worker node:

    docker run --rm -it \
      -e MASTER_ADDR=<MASTER_IP> \
      -e MASTER_PORT=29500 \
      -e WORLD_SIZE=2 \
      -e RANK=1 \
      ddp-training
    

Make sure to start the master node (RANK=0) first, then start the worker node(s).

Code Overview

  • ddp_launch.py: Main script that sets up and runs the distributed training (a minimal sketch of its core setup follows this list).
  • model.py: Contains the definition of the neural network model.
  • Dockerfile: Defines the Docker image for the training environment.
  • requirements.txt: Lists the Python dependencies.
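
For orientation, here is a minimal sketch of the setup that ddp_launch.py performs. The environment variables match the docker run commands above; the backend choice and the SimpleModel constructor are assumptions and may differ from the actual script:

    import os
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    from model import SimpleModel  # defined in this repo's model.py

    def main():
        # RANK and WORLD_SIZE are set by the docker run commands above;
        # the default "env://" rendezvous also reads MASTER_ADDR/MASTER_PORT.
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])

        # "gloo" works on CPU-only machines; use "nccl" if every node has GPUs.
        dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

        model = DDP(SimpleModel())
        # ... training loop ...

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()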

Key Features

  • Distributed training using PyTorch's DistributedDataParallel
  • CIFAR-10 dataset used for demonstration
  • Simple CNN model architecture
  • Proper data distribution and parallelization across nodes (see the sampler sketch after this list)
  • Logging of training progress for each node
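
The per-node data split mentioned above is typically handled by PyTorch's DistributedSampler, which gives each rank a disjoint shard of the dataset. A hedged sketch, not the repo's exact code; the batch size and epoch count are illustrative, and it assumes the process group is already initialized as in the setup sketch:

    import torchvision
    import torchvision.transforms as transforms
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    train_set = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True,
        transform=transforms.ToTensor())
    sampler = DistributedSampler(train_set)  # reads rank/world size from the process group
    loader = DataLoader(train_set, batch_size=32, sampler=sampler)

    for epoch in range(2):        # epoch count is illustrative
        sampler.set_epoch(epoch)  # reshuffles the shards each epoch
        for inputs, labels in loader:
            ...                   # forward/backward/step as usual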

Customization

To adapt this project for your own use:

  1. Modify the SimpleModel in model.py to change the neural network architecture.
  2. Adjust training parameters in ddp_launch.py (e.g., number of epochs, learning rate, etc.).
  3. Replace CIFAR-10 with your own dataset by modifying the dataset loading in ddp_launch.py, as sketched after this list.
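
To illustrate step 3, here is one way the CIFAR-10 loading could be swapped for a custom image dataset. The path, image size, and batch size are placeholders, not part of this repo:

    import torchvision.transforms as transforms
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler
    from torchvision.datasets import ImageFolder

    transform = transforms.Compose([
        transforms.Resize((32, 32)),  # match the input size the model expects
        transforms.ToTensor(),
    ])
    train_set = ImageFolder("/path/to/your/data", transform=transform)
    sampler = DistributedSampler(train_set)  # keep the sampler so shards stay disjoint
    loader = DataLoader(train_set, batch_size=32, sampler=sampler)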

Troubleshooting

  • Ensure all machines can communicate over port 29500 (or the port you've specified); the snippet after this list shows one quick way to check.
  • Verify that MASTER_ADDR is set to the IP address of the master node on every machine.
  • Check that WORLD_SIZE matches the total number of nodes participating in the training.
  • If you're using GPUs, ensure CUDA is properly set up on all machines and that the containers are launched with GPU access (e.g. docker run --gpus all ...).
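
For the first point, a quick way to test connectivity from a worker machine is a plain TCP probe. The address below is a placeholder, and note that the port only accepts connections once the master process is listening:

    import socket

    MASTER_ADDR = "192.0.2.10"  # placeholder; use your master's IP
    MASTER_PORT = 29500

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(5)
        try:
            s.connect((MASTER_ADDR, MASTER_PORT))
            print("master rendezvous port is reachable")
        except OSError as exc:
            print(f"cannot reach master: {exc}")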

Viewing Training Progress

The training progress is logged to the console. You should see output from both the master and worker nodes showing the loss for each epoch and batch.

Saving and Using the Trained Model

After training, the model is saved as cifar_net.pth on the master node. To use it for inference or further training, load it with PyTorch's torch.load() function, as in the sketch below.
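
A minimal loading sketch, assuming the script saves the model's state_dict (if it saves the whole module instead, torch.load() alone returns the model):

    import torch
    from model import SimpleModel  # defined in this repo's model.py

    model = SimpleModel()
    # map_location lets a GPU-trained checkpoint load on a CPU-only machine.
    state = torch.load("cifar_net.pth", map_location="cpu")
    # If the keys carry a "module." prefix (saved from the DDP wrapper), strip it:
    # state = {k.removeprefix("module."): v for k, v in state.items()}
    model.load_state_dict(state)
    model.eval()  # disable dropout/batch-norm updates for inference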

Contributing

Feel free to open issues or submit pull requests if you have suggestions for improvements or encounter any problems.

License

AGPL-3.0 license

Acknowledgments

  • PyTorch team for their excellent distributed training documentation
  • CIFAR-10 dataset creators
