GPU documentation (#2)
* docs(moe): GPU running and slurm docs

* docs: Fixed markup
hXl3s authored Jan 14, 2025
1 parent 5fa2abb commit c28619b
Showing 2 changed files with 79 additions and 5 deletions.
61 changes: 56 additions & 5 deletions mixture_of_experts_pretraining/README.md
@@ -366,18 +366,23 @@
python run_clm.py model.config_path=mixtral80.json eval_frequency=3 n_eval_examp
python run_clm.py model.name_or_path=gpt2 eval_frequency=3 n_eval_examples=100 per_device_train_batch_size=4 gradient_accumulation_steps=2 sched.warmup_ratio=0. max_steps=30
```
# 6. Training Mixtral 8x22B with NeMo on GPU Device
**IMPORTANT** The GPU implementation is a supplementary reference and is not used for RCP
generation. There is a convergence gap between the GPU and TPU references, and the GPU code cannot be
used as a drop-in substitute for the TPU code.
## Docker Image
Build and push the Docker image:
```shell
docker build -t <registry_path_image_name>:<image_tag> -f docker/gpu/Dockerfile .
docker push <registry_path_image_name>:<image_tag>
```
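For illustration only (the registry path and tag below are hypothetical; substitute your own):

```shell
# Hypothetical registry path and tag; replace with your own.
docker build -t gcr.io/my-project/moe-nemo-gpu:latest -f docker/gpu/Dockerfile .
docker push gcr.io/my-project/moe-nemo-gpu:latest
```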
## Kubernetes workflow
### Run workflow
For this workflow to function, a **_select-configuration.yaml_** file must exist in the ```helm-context``` directory.
@@ -388,7 +393,7 @@
Package and schedule the job. An example job name could be "nemo-gpt3-175b-nemo-16gp
helm install <username_workload_job_name> helm-context/
```
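For example, with a hypothetical workload name:

```shell
# Hypothetical release name; use something that identifies you and the job.
helm install alice-mixtral-8x22b-16gpus helm-context/
```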
### Monitor workflow
Check pod status (use this to find the name of the pod you want logs from)
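The exact command is not shown in this diff hunk; a typical way to list pods and find the name is:

```shell
kubectl get pods
```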
@@ -413,6 +418,52 @@
Get logs (using the pod name from earlier)
kubectl logs "<pod_name>"
```
## Slurm/Pyxis workflow
### Preprocessing
For the GPU implementation, both the dataset and the checkpoint have to be preprocessed. This can be done
once, before experimentation, and the results saved. **IMPORTANT** The saved checkpoint and dataset must be
accessible by all nodes in the system.
To get the preprocessed checkpoint, run the checkpoint_download.py script:
```shell
python scripts/gpu/checkpoint_download.py --checkpoint_id mistralai/Mixtral-8x22B-v0.1 \
--output_dir <path to save checkpoint> --hf_token <your token to HF repository>
```
This script will download the specified checkpoint from the Hugging Face repository, preprocess it, and save it
into the specified directory.
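For illustration (the output path and token below are hypothetical):

```shell
# Hypothetical output directory and token; the directory should be on storage visible to all nodes.
python scripts/gpu/checkpoint_download.py --checkpoint_id mistralai/Mixtral-8x22B-v0.1 \
    --output_dir /shared/checkpoints/mixtral-8x22b --hf_token hf_xxxxxxxxxxxx
```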
To preprocess the dataset, use the dataset_preprocessing.py script:
```shell
python scripts/gpu/dataset_preprocessing.py --input-tokenizer <path to tokenizer from checkpoint> \
--workdir <working directory>
```
After preprocessing, the dataset will be saved into `<working directory>/output`.
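For illustration (paths are hypothetical, and the exact tokenizer location inside the downloaded checkpoint may differ):

```shell
# Hypothetical paths; point --input-tokenizer at the tokenizer saved with the preprocessed checkpoint.
python scripts/gpu/dataset_preprocessing.py --input-tokenizer /shared/checkpoints/mixtral-8x22b/tokenizer \
    --workdir /shared/data/mixtral_dataset
# The preprocessed dataset then ends up under /shared/data/mixtral_dataset/output
```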
### Running
By default, the Slurm workflow loads the config /config/config.yaml. Make sure the correct config is specified, or
modify the script to mount the correct config into /app/training/config/config.yaml.
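One possible way to do that, assuming the CONT_MOUNTS variable built in scripts/gpu/run.sub (shown below) can simply be extended, is:

```shell
# Assumption: appending an extra mount to CONT_MOUNTS in scripts/gpu/run.sub overlays a custom config.
CONT_MOUNTS="${CONT_MOUNTS},/path/to/my_config.yaml:/app/training/config/config.yaml:ro"
```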
To run the job, specify the required input environment variables:
```shell
export CONT=<registry_path_image_name>:<image_tag>
export DATA=<path to preprocessed dataset>
export CKPT=<path to preprocessed checkpoint>
export NODES=<number of nodes to run on>
export OUTPUT=<output directory>
```
After that, run the sbatch command using scripts/gpu/run.sub:
```shell
sbatch -N${NODES} <vendor specific information> scripts/gpu/run.sub
```
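Putting it together, a hypothetical submission could look like this (all values and extra Slurm options are placeholders):

```shell
# All values below are hypothetical; substitute your own image, paths, and Slurm options.
export CONT=gcr.io/my-project/moe-nemo-gpu:latest
export DATA=/shared/data/mixtral_dataset/output
export CKPT=/shared/checkpoints/mixtral-8x22b
export NODES=16
export OUTPUT=/shared/results/mixtral-8x22b-run1
sbatch -N${NODES} --partition=gpu --time=04:00:00 scripts/gpu/run.sub
```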
# 7. Reference
* [MLPerf Training: MoE Benchmark Proposal from Nvidia](https://docs.google.com/document/d/1NOJ_vt-o2WHFXmisLRk6Mn7Ki2CeB5UNeTkFrYHoE1I/edit?usp=sharing)
* [Mixtral of Experts](https://arxiv.org/pdf/2401.04088)
@@ -461,4 +512,4 @@
rclone copy mlc-training:mlcommons-training-wg-public/mixtral_8x22b/checkpoints/
```
mkdir -p docker-images
rclone copy mlc-training:mlcommons-training-wg-public/mixtral_8x22b/docker-images ./docker-images -P
```
23 changes: 23 additions & 0 deletions mixture_of_experts_pretraining/scripts/gpu/run.sub
@@ -0,0 +1,23 @@
#!/bin/bash

: "${CONT:?Base Container image is not set, please specify CONT envvar}"
: "${DATA:?Data directory is not set, please specify DATA envvar}"
: "${CKPT:?Checkpoint directory is not set, please specify CKPT envvar}"
: "${NODES:?Number of nodes is not set, please specify NODES envvar}"
: "${OUTPUT:?Output directory is not set, please specify OUTPUT envvar}"

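# Dataset and checkpoint are mounted read-only into the container; results are written to /results.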
CONT_MOUNTS="${DATA}:/app/dataset:ro,${CKPT}:/app/checkpoints:ro,${OUTPUT}:/results"

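# torch.distributed rendezvous settings: default port 29500, master is the first node in the Slurm allocation.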
: "${MASTER_PORT:=29500}"
export MASTER_PORT
export MASTER_ADDR="$(scontrol show hostnames "${SLURM_JOB_NODELIST-}" | head -n1)"

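# Launch one task per GPU (default 8 per node) on every node, inside the container image.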
srun -l --kill-on-bad-exit=0 --mpi="${SLURM_MPI_TYPE:-pmix}" \
--ntasks="$(( NODES * ${GPUS:-8} ))" \
--ntasks-per-node="${GPUS:-8}" \
--container-image="${CONT}" \
--container-mounts="${CONT_MOUNTS}" \
--container-env=MASTER_PORT,MASTER_ADDR \
slurm2pytorch python /app/training/run_clm.py output_dir=/results \
dataset.train_dataset_path=/app/dataset dataset.eval_dataset_path=/app/dataset \
