addded torchrun

wonchul-kim · Jan 9, 2024 · 2bc5a02 · 2bc5a02
1 parent d240089
Showing 1 changed file with 149 additions and 0 deletions.
diff --git a/_posts/pytorch/2024-01-09-torchrun.md b/_posts/pytorch/2024-01-09-torchrun.md
@@ -0,0 +1,149 @@
+---
+layout: post
+title: Torchrun to Execute Distributed Training
+category: Pytorch
+tag: distributed training
+---
+
+# Distributed Training with PyTorch
+
+### Running
+
+```
+torchrun --nproc_per_node=<number of GPUs per node> train.py ...
+```
+
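+* `...`: the arguments required by `train.py`
+
+* `nproc_per_node`: the number of processes to launch on each `node` (server), normally one per GPU
+    > What if the nodes have different numbers of GPUs...?
+
+* `train.py`: the script to run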
+
+
+When the script is launched this way, `os.environ` contains the following information.
+
+```python
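+import os
+
+# torchrun exports these as strings, so convert them to int before use
+rank = int(os.environ['RANK'])              # global index of this process
+world_size = int(os.environ['WORLD_SIZE'])  # total number of processes
+gpu = int(os.environ['LOCAL_RANK'])         # index of this process within its node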
+```
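+
+These variables are missing when the script is started with plain `python` instead of `torchrun`. The sketch below shows one way to read them with single-process defaults. `read_dist_env` is a hypothetical helper introduced for illustration and is not part of PyTorch.
+
+```python
+import os
+
+
+def read_dist_env(environ=os.environ):
+    """Return (rank, world_size, local_rank) as ints.
+
+    Hypothetical helper, not part of PyTorch. Falls back to a
+    single-process setup when torchrun did not set the variables.
+    """
+    rank = int(environ.get("RANK", 0))
+    world_size = int(environ.get("WORLD_SIZE", 1))
+    local_rank = int(environ.get("LOCAL_RANK", 0))
+    return rank, world_size, local_rank
+
+
+if __name__ == "__main__":
+    print(read_dist_env())
+```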
+
+### `torch.distributed`
+
+`torch.distributed` has to be initialized before use, which is done with `torch.distributed.init_process_group`.
+
+> torch.distributed.init_process_group(backend=None, init_method=None, timeout=datetime.timedelta(seconds=1800), world_size=-1, rank=-1, store=None, group_name='', pg_options=None)
+
+- `backend`: the communication backend, typically `NCCL` or `Gloo`
+    - `nccl`: distributed training on GPUs
+    - `gloo`: distributed training on CPUs
+
+    > `nccl` is only available on Linux (e.g. Ubuntu)!
+
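+- `init_method`: the URL used to communicate with the other `node`s
+    > Given the IP address of the rank-0 process and a reachable port, initialization can go over TCP: every worker connects to the rank-0 process and they exchange information with each other. For **one-node multi-gpu** training, the address can therefore be the loopback address `127.0.0.1` (or `0.0.0.0`), together with a free port such as `23456`, e.g. `tcp://127.0.0.1:23456`.
+
+- `world_size`: the total number of processes
+
+- `rank`: the ID of the current process, from `0` to `world_size - 1`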
+
+
+This initialization runs once per device (GPU), that is, once per process. For example, on a single node with 2 GPUs it runs for each GPU, so it happens twice.
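+
+To make the relation between `rank`, `world_size` and the local GPU index concrete, here is a small sketch that enumerates the processes of a job. It assumes every node runs the same number of processes and that global ranks are assigned node by node, which is the usual layout. `rank_layout` is a hypothetical helper, not a PyTorch function.
+
+```python
+def rank_layout(nnodes, nproc_per_node):
+    """List (node_rank, local_rank, global_rank) for every process.
+
+    Assumption: identical nproc_per_node on all nodes and ranks
+    assigned node by node.
+    """
+    layout = []
+    for node_rank in range(nnodes):
+        for local_rank in range(nproc_per_node):
+            global_rank = node_rank * nproc_per_node + local_rank
+            layout.append((node_rank, local_rank, global_rank))
+    return layout
+
+
+if __name__ == "__main__":
+    # One node with 2 GPUs: two processes, so initialization runs twice.
+    for node_rank, local_rank, global_rank in rank_layout(nnodes=1, nproc_per_node=2):
+        print(f"node {node_rank} | LOCAL_RANK {local_rank} | RANK {global_rank}")
+```
+
+Here `world_size` is simply `len(rank_layout(...))`.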
+
+```python
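+import os
+import torch
+
+# Wrapped in a function (the name is illustrative) so that `return` is valid.
+# `args` is expected to carry `dist_url`, e.g. "env://" or "tcp://127.0.0.1:23456".
+def init_distributed_mode(args):
+    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
+        # case 1: launched with torchrun
+        print("case 1")
+        args.rank = int(os.environ["RANK"])
+        args.world_size = int(os.environ["WORLD_SIZE"])
+        args.gpu = int(os.environ["LOCAL_RANK"])
+    elif "SLURM_PROCID" in os.environ:
+        # case 2: launched by SLURM (args.world_size must already be set)
+        print("case 2")
+        args.rank = int(os.environ["SLURM_PROCID"])
+        args.gpu = args.rank % torch.cuda.device_count()
+    elif hasattr(args, "rank"):
+        # case 3: rank (and gpu, world_size) were passed in manually
+        print("case 3")
+    else:
+        # case 4: plain single-process run
+        print("case 4")
+        print("Not using distributed mode")
+        args.distributed = False
+        return
+
+    args.distributed = True
+
+    # bind this process to its own GPU before creating the process group
+    torch.cuda.set_device(args.gpu)
+    args.dist_backend = "nccl"
+    print(f"| distributed init (rank {args.rank}): {args.dist_url}", flush=True)
+    torch.distributed.init_process_group(
+        backend=args.dist_backend, init_method=args.dist_url, world_size=args.world_size, rank=args.rank
+    )
+    # wait until every process has finished initialization
+    torch.distributed.barrier()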
+```
+
+### Dataset
+
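+#### Use a `DistributedSampler`, as in the code below
+
+- If the `sampler` shuffles, do not shuffle in the `DataLoader` as well (`sampler` and `shuffle=True` are mutually exclusive there).
+- Divide `batch_size` and `num_workers` by the number of GPUs.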
+
+
+```python
+from torch.utils.data import DataLoader 
+from torch.utils.data.distributed import DistributedSampler
+
+train_sampler = DistributedSampler(dataset=train_dataset, shuffle=True)
+val_sampler = DistributedSampler(dataset=val_dataset, shuffle=False)
+
+train_dataloader = DataLoader(dataset=train_dataset,
+                              batch_size=args.batch_size // args.world_size,
+                              shuffle=False,
+                              num_workers=len(args.device_ids) * 4 // args.world_size,
+                              sampler=train_sampler,
+                              pin_memory=True)
+
+val_dataloader = DataLoader(dataset=val_dataset,
+                            batch_size=args.batch_size // args.world_size,
+                            shuffle=False,
+                            num_workers=len(args.device_ids) * 4 // args.world_size,
+                            sampler=val_sampler,
+                            pin_memory=True)
+```
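+
+To see what the sampler does with the dataset, the following pure-Python sketch emulates how indices are split across processes when `shuffle=False` and `drop_last=False`: the index list is padded so that it divides evenly, then each rank takes every `world_size`-th index. `shard_indices` is a hypothetical name used only for illustration. The real logic lives in `DistributedSampler`.
+
+```python
+import math
+
+
+def shard_indices(dataset_len, world_size, rank):
+    """Indices one rank would read (emulation, assumes dataset_len > 0)."""
+    per_rank = math.ceil(dataset_len / world_size)
+    total = per_rank * world_size
+    indices = list(range(dataset_len))
+    # Pad by repeating indices from the start so every rank gets the same count.
+    indices = (indices * math.ceil(total / dataset_len))[:total]
+    return indices[rank:total:world_size]
+
+
+if __name__ == "__main__":
+    global_batch_size, world_size = 32, 2
+    print("per-process batch size:", global_batch_size // world_size)
+    for rank in range(world_size):
+        print(rank, shard_indices(dataset_len=7, world_size=world_size, rank=rank))
+```
+
+With 7 samples and 2 processes, one sample is repeated so that both processes see 4 samples each.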
+
+
+In the code above, `num_workers` is set to 4 times the number of GPUs, following this advice:
+> Hars Vardhan (harsv) writes in [Guidelines for assigning num_workers to DataLoader](https://discuss.pytorch.org/t/guidelines-for-assigning-num-workers-to-dataloader/813/4):
+>
+> I experimented with this a bit. I found that we should use the formula:
+> num_worker = 4 * num_GPU .
+> Though a factor of 2 and 8 also work good but lower factor (<2) significantly reduces overall performance. Here, worker has no impact on GPU memory allocation. Also, nowadays there are many CPU cores in a machine with few GPUs (<8), so the above formula is practical.
+
+
+#### On Linux, the number of logical CPU cores can be checked as follows; it is a practical upper bound for the total `num_workers`
+```bash
+cat /proc/cpuinfo | grep processor
+```
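+
+The same check can be done from Python, where `os.cpu_count()` should report the same number of logical cores. The sketch below combines it with the `4 * num_GPU` rule quoted above and then splits the result per process. `workers_per_process` is a hypothetical helper, and capping at the core count is an assumption of this example rather than a PyTorch requirement.
+
+```python
+import os
+
+
+def workers_per_process(num_gpus, world_size, factor=4):
+    """num_workers = factor * num_gpus, capped at the CPU count, split per process."""
+    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
+    total_workers = min(factor * num_gpus, cpu_count)
+    # Keep at least one worker per process.
+    return max(1, total_workers // world_size)
+
+
+if __name__ == "__main__":
+    print("logical cores:", os.cpu_count())
+    print("num_workers per process:", workers_per_process(num_gpus=2, world_size=2))
+```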
+
+### Model
+
+#### `DistributedDataParallel`
+```python
+from torch.nn.parallel import DistributedDataParallel as DDP 
+# use the local rank (args.gpu): the global rank can exceed the GPU count on multi-node runs
+model = model.cuda(args.gpu)
+model = DDP(module=model, device_ids=[args.gpu])
+```
+
+### Train
+
+During training, call the `sampler`'s `set_epoch()` at the start of every epoch so that `shuffle` takes effect. Otherwise the same ordering is reused in every epoch.
+```python
+train_sampler.set_epoch(epoch)
+```
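+
+The reason is that the sampler seeds its shuffle with `seed + epoch`. The sketch below emulates this with the standard library to stay dependency-free, so the concrete permutations differ from the ones PyTorch produces. `epoch_order` is a hypothetical name used only here.
+
+```python
+import random
+
+
+def epoch_order(dataset_len, epoch, seed=0):
+    """Emulate a shuffle seeded with seed + epoch (not PyTorch's actual permutation)."""
+    indices = list(range(dataset_len))
+    random.Random(seed + epoch).shuffle(indices)
+    return indices
+
+
+if __name__ == "__main__":
+    # Without set_epoch() the epoch stays at 0, so the order repeats.
+    print(epoch_order(8, epoch=0))
+    print(epoch_order(8, epoch=0))
+    # With set_epoch(epoch) each epoch is seeded differently,
+    # while all ranks still agree on the same permutation.
+    print(epoch_order(8, epoch=1))
+```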
+
+## References
+
+- [Guidelines for assigning num_workers to DataLoader](https://discuss.pytorch.org/t/guidelines-for-assigning-num-workers-to-dataloader/813/4)
+
+- [pytorch examples for ddp](https://github.com/pytorch/examples/blob/main/distributed/ddp/README.md)
+
+- [💥 Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255)