Commit 6e1aa80: added test code
wonchul committed Jan 10, 2024 (1 parent: 2a0ce2a)
1 changed file: _posts/pytorch/2024-01-09-torchrun.md (20 additions, 19 deletions)
```python
torch.distributed.init_process_group(
    ...
)
torch.distributed.barrier()
```
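For reference, a typical `torchrun`-style initialization might look like the sketch below; it assumes the `nccl` backend and relies on the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that `torchrun` sets, which is not necessarily the post's exact setup.

```python
import os

import torch

# torchrun sets these environment variables for every worker process.
local_rank = int(os.environ["LOCAL_RANK"])
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Bind this process to its own GPU before creating the process group.
torch.cuda.set_device(local_rank)

torch.distributed.init_process_group(
    backend="nccl",  # assumption: NCCL for multi-GPU training
    rank=rank,
    world_size=world_size,
)
torch.distributed.barrier()  # wait until every process has finished initializing
```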

------------------------------------------------------------------
### Dataset

#### You must use `DistributedSampler`, as in the code below.
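A minimal sketch of the usual pattern, assuming the names `train_dataset`, `args.world_size`, `args.rank`, `args.batch_size`, and `args.num_workers` (these are illustrative, not taken from the post):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Each process iterates over a different, non-overlapping shard of the dataset.
train_sampler = DistributedSampler(
    train_dataset,
    num_replicas=args.world_size,  # total number of processes
    rank=args.rank,                # index of this process
    shuffle=True,
)

# Do not pass shuffle=True to the DataLoader; the sampler already shuffles.
train_loader = DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    sampler=train_sampler,
    num_workers=args.num_workers,
    pin_memory=True,
)
```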
On choosing `num_workers`, the PyTorch forum thread linked in the references suggests:

I experimented with this a bit. I found that we should use the formula:
`num_workers = 4 * num_GPUs`.
A factor of 2 or 8 also works well, but a lower factor (<2) significantly reduces overall performance. Here, the number of workers has no impact on GPU memory allocation. Also, machines nowadays have many CPU cores and only a few GPUs (<8), so the formula above is practical.


#### On Linux, the number of processors can be checked as follows, and `num_workers` can be set at most to that number.
```cmd
cat /proc/cpuinfo | grep processor
```
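As a rough illustration of combining this heuristic with the processor count, here is a small sketch (not code from the post; `os.cpu_count()` is used instead of parsing `/proc/cpuinfo`, and `pick_num_workers` is a hypothetical helper):

```python
import os

import torch


def pick_num_workers() -> int:
    """Heuristic from the quote above: 4 workers per GPU, capped by the CPU count."""
    num_gpus = max(torch.cuda.device_count(), 1)
    num_cpus = os.cpu_count() or 1  # same count that /proc/cpuinfo reports
    return min(4 * num_gpus, num_cpus)


print(pick_num_workers())
```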

### DEMO
```python
# dataset.py
class SimpleDataset(torch.utils.data.Dataset):
    ...
```

```
...
cost 4.337230443954468 ms
```
As shown above, GPU0 and GPU1 each build their own batches from the data values 1 through 20, and because `shuffle` is applied to `train_dataset`, the order differs every epoch.
The full code is available on [github](https://github.com/wonchul-kim/distributed_training).
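To see this per-rank split without launching multiple GPUs, `DistributedSampler` can be inspected directly; the standalone sketch below is not the repository's demo, just an illustration of the same behavior.

```python
from torch.utils.data.distributed import DistributedSampler

data = list(range(1, 21))  # the values 1..20 used in the demo

for epoch in range(2):
    for rank in range(2):  # pretend to be GPU0 and GPU1
        sampler = DistributedSampler(data, num_replicas=2, rank=rank, shuffle=True)
        sampler.set_epoch(epoch)  # changes the shuffle order each epoch
        print(f"epoch {epoch} rank {rank}:", [data[i] for i in sampler])
```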

------------------------------------------------------------------
### Model

#### `DistributedDataParallel`
```python
from torch.nn.parallel import DistributedDataParallel as DDP
model = model.cuda(args.rank)
model = DDP(module=model, device_ids=[args.rank])
```
------------------------------------------------------------------
### Train

While training, you must call the sampler's `set_epoch()` at the start of every epoch so that `shuffle` works as intended; without it, every epoch reuses the same ordering.
```python
train_sampler.set_epoch(epoch)
```
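Putting the pieces together, a skeleton of the training loop might look like the sketch below; `args.epochs`, `train_loader`, and the loop body are placeholders rather than the post's exact code.

```python
for epoch in range(args.epochs):
    # Re-seed the sampler so each epoch gets a different shuffle
    # that is still identical across all processes.
    train_sampler.set_epoch(epoch)

    for batch in train_loader:
        ...  # forward pass, loss, backward pass, optimizer step
```

The script is then launched with `torchrun`, e.g. for two GPUs on a single node (`train.py` is a placeholder name):

```cmd
torchrun --nproc_per_node=2 train.py
```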


## references:

- [Guidelines for assigning num_workers to DataLoader](https://discuss.pytorch.org/t/guidelines-for-assigning-num-workers-to-dataloader/813/4)