add llama benchmarking

wdbaruni committed Jan 16, 2025
1 parent 87c7a74 commit c28bc05
Showing 13 changed files with 1,585 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -13,6 +13,7 @@ bacalhau
**/Dockerfile
!tools/**/Dockerfile
!scale-tester/**/Dockerfile
!llama-benchmarking/**/Dockerfile
**/.py
.vscode
js-helloworld/outputs/*
13 changes: 13 additions & 0 deletions llama-benchmarking/Dockerfile
@@ -0,0 +1,13 @@
FROM nvcr.io/nvidia/nemo:24.12

WORKDIR /workspace

# Create config directory and copy configs
RUN mkdir -p /workspace/cfg
COPY llama3.1_24.11.1/llama3.1_*.yaml /workspace/cfg/

# Copy training script
COPY run_training.sh /workspace/
RUN chmod +x /workspace/run_training.sh

ENTRYPOINT ["/bin/bash"]
154 changes: 154 additions & 0 deletions llama-benchmarking/llama3.1_24.11.1/README_405b.md
@@ -0,0 +1,154 @@
# Overview

This recipe contains information and scripts to produce performance results for the Llama 3.1 training workload. The scripts help perform environment setup, dataset setup, and launch benchmark jobs.
This variant of the workload is best suited for GPU clusters with:

* At least 576 GPUs with at least 80 GB memory each. Training of this 405-billion parameter variant of the workload will not fit on fewer GPUs with less memory.
* A minimum of 32 GPUs with at least 80 GB memory each when running proxy configs (<576 GPUs).
* H100 GPUs. This workload runs with BF16 or FP8, which are both supported by H100 GPUs.

# Expected Performance

Performance for Llama 3.1 training is measured in seconds per iteration, i.e. seconds per training step. This metric is logged for every training step in a `.out` file generated inside the `$STAGE_PATH/results/$GSW_VERSION/$DTYPE/405b/$JOB_TOTAL_GPUS` folder.

Since performance fluctuates significantly at the beginning of training, the timing of the last training step is used to obtain the throughput value.

```shell
grep train_step_timing results/*.out
Epoch 0: : 100%|██████████| 50/50 [16:26<00:00, v_num=gjbq, reduced_train_loss=11.70, global_step=49.00, consumed_samples=12600.0, train_step_timing in s=12.80]
```

To obtain throughput as a tokens per second measurement, follow this formula:
```shell
(sequence length) * (global batch size) / (training_step_timing) = (throughput in tokens per second)
```

E.g. 8192 * 252 / 12.84 = 160778

To calculate time to train estimate:
```shell
(total tokens) / (throughput in tokens per second) / (number of seconds in a day) = (time to train in days)
```
E.g. 1e12 / 160778 / 86400 = 71.99 days
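As a quick sanity check, both formulas can be evaluated with `awk`; the values below are the example figures from this section (sequence length 8192, GBS 252, step time 12.84 s, 1e12 total tokens), not measured results.

```shell
# Throughput and time-to-train for the example above; substitute your own train_step_timing.
awk 'BEGIN {
  seq_len = 8192; gbs = 252; step_time = 12.84; total_tokens = 1e12;
  tps = seq_len * gbs / step_time;                  # tokens per second
  printf "throughput: %.0f tokens/s\n", tps;        # 160778
  printf "time to train: %.2f days\n", total_tokens / tps / 86400;   # 71.99
}'
```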


To calculate the model flops utilization (MFU):
```shell
MFU = (global batch size) * (model flops) / (training step time) / (number of GPUs) / (peak GPU FLOPS)
```

The peak theoretical throughput for H100 FP8 is 1979 TFLOPS and for H100 BF16 is 989 TFLOPS.

The model flops for Llama 3.1 405b for GBS=1 is 2.17E+16. Calculation shown [here](#notes).

E.g. Llama 3.1 405b FP8 on 576x H100 GPUs (GBS=252)
```shell
peak FP8 FLOPS for H100 = 1979 TFLOPS
training step time = 11.24
model flops = 2.17E+16
MFU = 252 * 2.17E+16 / 11.24 / 576 / 1979E+12 = 42.71%
```
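The same MFU arithmetic can be reproduced with `awk` (numbers are the FP8, 576-GPU example above; the small difference from 42.71% comes from the rounded model flops value):

```shell
# MFU for the FP8 576-GPU example above.
awk 'BEGIN {
  gbs = 252; model_flops = 2.17e16; step_time = 11.24;
  n_gpus = 576; peak_flops = 1979e12;   # H100 FP8 peak
  printf "MFU: %.2f%%\n", gbs * model_flops / step_time / n_gpus / peak_flops * 100;  # ~42.68%
}'
```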

| Llama 3.1 405b 24.09 BF16 (TP=8, PP=9, CP=2, VP=7, MBS=1, GA=63) | Throughput on 32x H100 GPUs | Throughput on 96x H100 GPUs | Throughput on 192x H100 GPUs | Throughput on 576x H100 GPUs | Throughput on 1152x H100 GPUs | Throughput on 2304x H100 GPUs |
|---|---|---|---|---|---|---|
| Layers | 7 | 21 | 42 | 126 | 126 | 126 |
| GBS | 126 | 126 | 252 | 252 | 504 | 1008 |
| PP | 1 | 3 | 3 | 9 | 9 | 9 |
| VP | n/a | 7 | 7 | 7 | 7 | 7 |
| Training step time (seconds per step) | 8.85 | 8.75 | 16.93 | 17.20 | 17.52 | 17.62 |
| Throughput in tokens per second | 116632 | 117965 | 121936 | 120022 | 235660 | 468646 |
| Model flops utilization | 58.63% | 56.17% | 57.25% | 55.78% | 54.76% | 54.45% |
| Time to train 1T tokens in days | n/a | n/a | n/a | 96.43 | 49.11 | 24.7 |

| Llama 3.1 405b 24.09 FP8 (TP=8, PP=9, CP=2, VP=7, MBS=1, GA=63) | Throughput on 32x H100 GPUs | Throughput on 96x H100 GPUs | Throughput on 192x H100 GPUs | Throughput on 576x H100 GPUs | Throughput on 1152x H100 GPUs | Throughput on 2304x H100 GPUs |
|---|---|---|---|---|---|---|
| Layers | 7 | 21 | 42 | 126 | 126 | 126 |
| GBS | 126 | 126 | 252 | 252 | 504 | 1008 |
| PP | 1 | 3 | 3 | 9 | 9 | 9 |
| VP | n/a | 7 | 7 | 7 | 7 | 7 |
| Training step time (seconds per step) | 5.80 | 5.71 | 11.00 | 11.24 | 11.31 | 12.35 |
| Throughput in tokens per second | 178118 | 180674 | 187740 | 183664 | 365055 | 668626 |
| Model flops utilization | 44.77% | 43.01% | 44.07% | 42.71% | 42.45% | 38.87% |
| Time to train 1T tokens in days | n/a | n/a | n/a | 63.02 | 31.71 | 17.31 |

For proxy configs (scales below 576 GPUs) we do not provide time-to-train estimates to avoid misleading conclusions. Proxy configs are not realistic; they were created to allow the Llama model to fit on a smaller number of GPUs than intended.

# Prerequisites

This recipe requires access to Llama 3.1. Instructions are below if needed.

# Request Access
A HuggingFace account is required and you will need to [create a HuggingFace access token](https://huggingface.co/settings/tokens) in order to run the training script. Add the generated token to your environment via `export HF_TOKEN=<your token>`.

Access to Llama 3.1 must be requested through [Meta's website](https://llama.meta.com/llama-downloads/) then requested on the [HuggingFace Llama](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B) page. The approval process is not automatic and could take a day or more.

# Prepare Environment

Create a staging area by running the attached `setup.sh`. The script converts the Docker image `nvcr.io/nvidia/nemo:24.09` to the `nvidia+nemo+24.09.sqsh` file under the `$STAGE_PATH` folder and copies the NeMo Launcher code from the container.

```shell
# Set the path where all artifacts will be downloaded (e.g. /lustre/myproject/nemo)
export STAGE_PATH=<path to your shared file system folder>
# Set the Slurm partition to launch against
export SLURM_PARTITION="batch"
# Set the Slurm account to launch against
export SLURM_ACCOUNT="account_name"
# Set the number of GPUs per node according to Slurm's gres, this is usually 8 or null - https://slurm.schedmd.com/gres.html
export SLURM_GPUS_PER_NODE=null
# Set HuggingFace token
export HF_TOKEN=<your token>

# Run the setup
bash ./setup.sh
```
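Once `setup.sh` finishes, a simple optional check that the staging area was populated is to look for the squashed container image mentioned above; the exact contents beyond the `.sqsh` file may vary:

```shell
# Sanity check: the converted container image should exist under $STAGE_PATH
ls -lh "$STAGE_PATH"/nvidia+nemo+24.09.sqsh
```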

# Prepare Dataset
Llama 3.1 405B uses synthetic data for training, so a dataset does not need to be prepared. Note that Llama 3.1 405B uses the GPT2BPETokenizer as a proxy.

# Run Training

NeMo Launcher uses the Hydra framework to process command-line arguments and pass them down as hyperparameters to a multi-node job that performs the training.

Training runs for the first 50 steps and then stops. Log files and results are located under the `$STAGE_PATH/results/$GSW_VERSION/$DTYPE/405b/$JOB_TOTAL_GPUS` folder.

Below is a command template for launching Llama 3.1 405b model training.
```shell
DTYPE=<fp8/bf16> MODEL_SIZE=405b sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} ./launch.sh
```
Where:
- `DTYPE` and `MODEL_SIZE` are **required** environment variables.
- `DTYPE` can be either `fp8` or `bf16`.
- `MODEL_SIZE` should be `405b` in this case.
- `NUM_NODES` can be calculated as `N_GPUS / N_GPUS_PER_NODE`; `N_GPUS_PER_NODE` is 8 for DGX H100, so at the 576-GPU scale `NUM_NODES` should be `576 / 8 = 72` (see the example after the note below).

**Note:** on some clusters it might be necessary to pass `--gres=gpu:8` to `sbatch` if you encounter errors such as "GPU not found". See https://slurm.schedmd.com/gres.html.
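For illustration, a full-scale FP8 launch at 576 GPUs (72 nodes of 8 GPUs each) could look like the following; the dtype and scale here are example choices, adjust them to your cluster:

```shell
# Example: FP8 at the full 576-GPU scale (72 nodes x 8 GPUs per node).
NUM_NODES=$(( 576 / 8 ))
DTYPE=fp8 MODEL_SIZE=405b sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} ./launch.sh
# Add --gres=gpu:8 to the sbatch arguments if your cluster requires it (see the note above).
```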

The following applies only to the full model scales (576, 1152, and 2304 GPUs); configurations and the global batch size change for proxy configs (<576 GPUs).

It is important to maintain these values for the model parallelism settings in order to accurately assess performance results for completed jobs against the expected baseline for the non-proxy 405b configurations:
* `training.model.tensor_model_parallel_size=8`
* `training.model.pipeline_model_parallel_size=9`
* `training.model.virtual_pipeline_model_parallel_size=7`
* `training.model.context_parallel_size=2`

The global batch size (`training.model.global_batch_size`) should scale with the total number of GPUs. The starting global batch size for 576 GPUs is 252, so it should be set to `<number of total gpus> * 252 / 576`, as in the example below.
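For example, evaluating this scaling rule directly in the shell for the 1152-GPU case reproduces the GBS=504 column from the tables above:

```shell
# Global batch size for full-model (non-proxy) runs: GBS = total_gpus * 252 / 576
TOTAL_GPUS=1152
echo $(( TOTAL_GPUS * 252 / 576 ))   # prints 504
```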
# Notes

```shell
model flops = (sequence length) * ((attention flops) + (mlp flops) + (embedding flops))

model flops breakdown:
attention flops = 12 * (number of layers) * (hidden size)^2 * (1 + (number of query groups)/(number of attention heads) + (sequence length)/(hidden size))
mlp flops = 18 * (number of layers) * (FFN size) * (hidden size)
embedding flops = 6 * (vocab size) * (hidden size)

Llama 3.1 405b calculation:
sequence length = 8192
attention flops = 12 * 126 * 16384^2 * (1 + 16/128 + 8192/16384) = 659,545,915,392
mlp flops = 18 * 126 * 53248 * 16384 = 1,978,637,746,176
embedding flops = 6 * 128256 * 16384 = 12,608,077,824

model flops = 8192 * (659,545,915,392 + 1,978,637,746,176 + 12,608,077,824) = 2.17E16
```
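The breakdown above can be recomputed with `awk`; the model dimensions are the Llama 3.1 405b values used in the calculation:

```shell
# Recompute Llama 3.1 405b model flops (GBS=1) from the breakdown above.
awk 'BEGIN {
  seq = 8192; layers = 126; hidden = 16384; ffn = 53248;
  query_groups = 16; heads = 128; vocab = 128256;
  attn = 12 * layers * hidden^2 * (1 + query_groups/heads + seq/hidden);
  mlp  = 18 * layers * ffn * hidden;
  emb  = 6 * vocab * hidden;
  printf "model flops = %.3e\n", seq * (attn + mlp + emb);   # ~2.17e+16
}'
```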
163 changes: 163 additions & 0 deletions llama-benchmarking/llama3.1_24.11.1/README_70b.md
@@ -0,0 +1,163 @@
# Overview

This recipe contains information and scripts to produce performance results for the Llama 3.1 training workload. The scripts help perform environment setup, dataset setup, and launch benchmark jobs.
This variant of the workload is best suited for GPU clusters with:

* At least 64 GPUs with at least 80 GB memory each. Training of this 70-billion parameter variant of the workload will not fit on fewer GPUs with less memory.
* H100 GPUs. This workload runs with BF16 or FP8, which are both supported by H100 GPUs.

# Expected Performance

Performance for Llama 3.1 training is measured in seconds per iteration, i.e. seconds per training step. This metric is logged for every training step in a `.out` file generated inside the `$STAGE_PATH/results/$GSW_VERSION/${DTYPE}/70b/$JOB_TOTAL_GPUS` folder.

Since performance fluctuates significantly at the beginning of training, the timing of the last training step is used to obtain the throughput value.

```shell
grep train_step_timing results/*.out
Epoch 0: : 100%|██████████| 100/100 [20:15<00:00, reduced_train_loss=6.370, global_step=99.00, consumed_samples=12800.0, train_step_timing in s=11.20]
```

To obtain throughput as a tokens per second measurement, follow this formula:
```shell
(sequence length) * (global batch size) / (training_step_timing) = (throughput in tokens per second)
```

E.g. 8192 * 128 / 11.31 = 92712

To calculate time to train estimate:
```shell
(total tokens) / (throughput in tokens per second) / (number of seconds in a day) = (time to train in days)
```
E.g. 1e12 / 92712 / 86400 = 124.84 days
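As with the 405b recipe, this arithmetic can be checked with `awk` using the example figures above (sequence length 8192, GBS 128, step time 11.31 s):

```shell
# Throughput and time-to-train for the 70b example above; substitute your own train_step_timing.
awk 'BEGIN {
  seq_len = 8192; gbs = 128; step_time = 11.31; total_tokens = 1e12;
  tps = seq_len * gbs / step_time;                  # tokens per second
  printf "throughput: %.0f tokens/s\n", tps;        # 92712
  printf "time to train: %.2f days\n", total_tokens / tps / 86400;   # 124.84
}'
```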


To calculate the model flops utilization (MFU):
```shell
MFU = (global batch size) * (model flops) / (training step time) / (number of GPUs) / (peak GPU FLOPS)
```

The peak theoretical throughput for H100 FP8 is 1979 TFLOPS and for H100 BF16 is 989 TFLOPS.

The model flops for Llama 3.1 70b for GBS=1 is 3.94E+15. Calculation shown [here](#notes).

E.g. Llama 3.1 70b FP8 on 64x H100 GPUs (GBS=128)
```shell
peak FP8 FLOPS for H100 = 1979 TFLOPS
training step time = 11.31
model flops = 3.94E+15
MFU = 128 * 3.94E+15 / 11.31 / 64 / 1979E+12 = 35.21%
```
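The MFU arithmetic above can likewise be reproduced with `awk` (FP8, 64-GPU example values):

```shell
# MFU for the FP8 64-GPU example above.
awk 'BEGIN {
  gbs = 128; model_flops = 3.94e15; step_time = 11.31;
  n_gpus = 64; peak_flops = 1979e12;   # H100 FP8 peak
  printf "MFU: %.2f%%\n", gbs * model_flops / step_time / n_gpus / peak_flops * 100;  # ~35.21%
}'
```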

| Llama 3.1 70b 24.09 BF16 (TP=4, PP=4, CP=2, VP=5, MBS=1, GA=64) | Throughput on 64x H100 GPUs (GBS=128) | Throughput on 128x H100 GPUs (GBS=256) | Throughput on 256x H100 GPUs (GBS=512) | Throughput on 512x H100 GPUs (GBS=1024) | Throughput on 1024x H100 GPUs (GBS=2048) | Throughput on 2048x H100 GPUs (GBS=4096)
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| Training step time (seconds per step) | 14.72 | 14.73 | 14.8 | 14.89 | 14.92 | 14.98
| Throughput in tokens per second | 71235 | 142373 | 283399 | 563372 | 1124478 | 2248957
| Model flops utilization | 54.10% | 54.06% | 53.81% | 53.48% | 53.38% | 53.38%
| Time to train 1T tokens in days | 162.48 | 81.29 | 40.84 | 20.54 | 10.29 | 5.15

| Llama 3.1 70b 24.09 FP8 (TP=4, PP=4, CP=2, VP=5, MBS=1, GA=64) | Throughput on 64x H100 GPUs (GBS=128) | Throughput on 128x H100 GPUs (GBS=256) | Throughput on 256x H100 GPUs (GBS=512) | Throughput on 512x H100 GPUs (GBS=1024) | Throughput on 1024x H100 GPUs (GBS=2048) | Throughput on 2048x H100 GPUs (GBS=4096)
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| Training step time (seconds per step) | 11.01 | 10.93 | 11.16 | 11.18 | 11.28 | 11.39
| Throughput in tokens per second | 95239 | 191871 | 375834 | 750323 | 1487342 | 2945955
| Model flops utilization | 36.17% | 36.43% | 35.68% | 35.62% | 35.30% | 34.96%
| Time to train 1T tokens in days | 121.53 | 60.32 | 30.80 | 15.43 | 7.78 | 3.93

# Prerequisites

This recipe requires access to Llama 3.1. Instructions are below if needed.

# Request Access
A HuggingFace account is required and you will need to [create a HuggingFace access token](https://huggingface.co/settings/tokens) in order to run the training script. Add the generated token to your environment via `export HF_TOKEN=<your token>`.

Access to Llama 3.1 must be requested through [Meta's website](https://llama.meta.com/llama-downloads/) then requested on the [HuggingFace Llama](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B) page. The approval process is not automatic and could take a day or more.

# Prepare Environment

Create a staging area by running the attached `setup.sh`. The script converts the Docker image `nvcr.io/nvidia/nemo:24.09` to the `nvidia+nemo+24.09.sqsh` file under the `$STAGE_PATH` folder and copies the NeMo Launcher code from the container. The setup script also downloads Llama 3 tokenizer-related files from the HuggingFace [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) repo using the `HF_TOKEN` obtained in the previous step. **Note:** Llama 3.1 8B and 70B use the same tokenizer, and this recipe uses the Llama 3 tokenizer.

```shell
# Set the path where all artifacts will be downloaded (e.g. /lustre/myproject/nemo)
export STAGE_PATH=<path to your shared file system folder>
# Set the Slurm partition to launch against
export SLURM_PARTITION="batch"
# Set the Slurm account to launch against
export SLURM_ACCOUNT="account_name"
# Set the number of GPUs per node according to Slurm's gres, this is usually 8 or null - https://slurm.schedmd.com/gres.html
export SLURM_GPUS_PER_NODE=null
# Set HuggingFace token
export HF_TOKEN=<your token>

# Run the setup
bash ./setup.sh
```

# Prepare Dataset
Pre-training Llama 3.1 requires a text-based dataset to be downloaded and pre-processed for the NeMo Framework to ingest the data optimally. [The Pile](https://huggingface.co/datasets/monology/pile-uncopyrighted) is often used as the dataset for pre-training models. The NeMo Framework contains helper scripts to download and pre-process the dataset. The following steps outline how to download and pre-process the dataset on DGX Cloud, with an explanation of key points afterwards.

Make sure `$STAGE_PATH/llama3.1-dataset/llama` contains the tokenizer files downloaded in the previous step.

Run the `generate_dataset.sh` script. The script launches several Slurm jobs that download the dataset from The Pile, pre-process it, and save it in a form suitable for subsequent training. The resulting dataset files are saved under the `$STAGE_PATH/llama3.1-dataset` folder. The dataset creation may use up to 100 GB of disk space, so make sure sufficient space is available.


```shell
bash ./generate_dataset.sh
```

If the dataset generation step was successful, there should be two `.idx` and two `.bin` files in the `$STAGE_PATH/llama3.1-dataset` folder.

```shell
my-llama_00_text_document.bin
my-llama_00_text_document.idx
my-llama_01_text_document.bin
my-llama_01_text_document.idx
```

If that is not the case, check the log files in `$STAGE_PATH/results.data_preparation`.


# Run Training

NeMo Launcher uses the Hydra framework to process command-line arguments and pass them down as hyperparameters to a multi-node job that performs the training.

Training runs for the first 50 steps and then stops. Log files and results are located under the `$STAGE_PATH/results/$GSW_VERSION/${DTYPE}/70b/$JOB_TOTAL_GPUS` folder.

Below is a command template for launching Llama 3.1 70b model training.
```shell
DTYPE=<fp8/bf16> MODEL_SIZE=70b sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} ./launch.sh
```

Where:
- `DTYPE` and `MODEL_SIZE` are **required** environment variables.
- `DTYPE` can be either `fp8` or `bf16`.
- `MODEL_SIZE` should be `70b` in this case.
- `NUM_NODES` can be calculated as `N_GPUS / N_GPUS_PER_NODE`; `N_GPUS_PER_NODE` is 8 for DGX H100, so at the 128-GPU scale `NUM_NODES` should be `128 / 8 = 16` (see the example after the note below).

**Note:** on some clusters it might be necessary to pass `--gres=gpu:8` to `sbatch` if you encounter errors such as "GPU not found". See https://slurm.schedmd.com/gres.html.
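For illustration, an FP8 launch at the 128-GPU scale (16 nodes of 8 GPUs each) could look like the following; the dtype and scale here are example choices, adjust them to your cluster:

```shell
# Example: FP8 at the 128-GPU scale (16 nodes x 8 GPUs per node).
NUM_NODES=$(( 128 / 8 ))
DTYPE=fp8 MODEL_SIZE=70b sbatch -A ${SLURM_ACCOUNT} -p ${SLURM_PARTITION} -N ${NUM_NODES} ./launch.sh
# Add --gres=gpu:8 to the sbatch arguments if your cluster requires it (see the note above).
```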

It is important to maintain these values for the model parallelism settings in order to accurately assess performance results for completed jobs against the expected baseline:
* `training.model.tensor_model_parallel_size=4`
* `training.model.pipeline_model_parallel_size=4`
* `training.model.virtual_pipeline_model_parallel_size=5`
* `training.model.context_parallel_size=2`

The global batch size (`training.model.global_batch_size`) should be set to `<number of nodes> * 16`, e.g. 16 * 16 = 256 in the example above (see the calculation below).
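For example, evaluating this rule in the shell for the 16-node case gives the GBS=256 value used in the tables above:

```shell
# Global batch size for the 70b recipe: GBS = number_of_nodes * 16
NUM_NODES=16
echo $(( NUM_NODES * 16 ))   # prints 256
```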

# Notes

```shell
model flops = (sequence length) * ((attention flops) + (mlp flops) + (embedding flops))

model flops breakdown:
attention flops = 12 * (number of layers) * (hidden size)^2 * (1 + (number of query groups)/(number of attention heads) + (sequence length)/(hidden size))
mlp flops = 18 * (number of layers) * (FFN size) * (hidden size)
embedding flops = 6 * (vocab size) * (hidden size)

Llama 3.1 70b calculation:
sequence length = 8192
attention flops = 12 * 80 * 8192^2 * (1 + 8/64 + 8192/8192) = 136,902,082,560
mlp flops = 18 * 80 * 28672 * 8192 = 338,228,674,560
embedding flops = 6 * 128256 * 8192 = 6,304,038,912

model flops = 8192 * (136,902,082,560 + 338,228,674,560 + 6,304,038,912) = 3.94E+15
```
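As with the 405b recipe, the breakdown above can be recomputed with `awk` using the Llama 3.1 70b model dimensions from the calculation:

```shell
# Recompute Llama 3.1 70b model flops (GBS=1) from the breakdown above.
awk 'BEGIN {
  seq = 8192; layers = 80; hidden = 8192; ffn = 28672;
  query_groups = 8; heads = 64; vocab = 128256;
  attn = 12 * layers * hidden^2 * (1 + query_groups/heads + seq/hidden);
  mlp  = 18 * layers * ffn * hidden;
  emb  = 6 * vocab * hidden;
  printf "model flops = %.3e\n", seq * (attn + mlp + emb);   # ~3.94e+15
}'
```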
