GPU documentation (#2)
* docs(moe): GPU running and slurm docs

* docs: Fixed markup
hXl3s authored Jan 14, 2025
1 parent 5fa2abb commit c28619b
Showing 2 changed files with 79 additions and 5 deletions.
61 changes: 56 additions & 5 deletions mixture_of_experts_pretraining/README.md
@@ -366,18 +366,23 @@
python run_clm.py model.config_path=mixtral80.json eval_frequency=3 n_eval_examp
python run_clm.py model.name_or_path=gpt2 eval_frequency=3 n_eval_examples=100 per_device_train_batch_size=4 gradient_accumulation_steps=2 sched.warmup_ratio=0. max_steps=30
```
# 6. Training Mixtral 8x22B with NeMo on GPU Device
**IMPORTANT** The GPU implementation is a supplementary reference and is not used for RCP
generation. There is a convergence gap between the GPU and TPU references, and the GPU code cannot be
used as a drop-in substitute for the TPU code.
## Docker Image
Build and push the Docker image:
```shell
docker build -t <registry_path_image_name>:<image_tag> -f docker/gpu/Dockerfile .
docker push <registry_path_image_name>:<image_tag>
```
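For illustration only (the registry path and tag below are hypothetical; substitute your own):

```shell
# Hypothetical registry path and tag; replace with your own.
docker build -t gcr.io/my-project/moe-nemo-gpu:latest -f docker/gpu/Dockerfile .
docker push gcr.io/my-project/moe-nemo-gpu:latest
```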
## Kubernetes workflow
### Run workflow
For this workflow to function, a **_select-configuration.yaml_** file must exist in the ```helm-context``` directory.
@@ -388,7 +393,7 @@
Package and schedule the job. An example job name could be "nemo-gpt3-175b-nemo-16gp
helm install <username_workload_job_name> helm-context/
```
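For example, with a hypothetical workload name:

```shell
# Hypothetical release name; use something that identifies you and the job.
helm install alice-mixtral-8x22b-16gpus helm-context/
```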
### Monitor workflow
Check pod status (use this to find the name of the pod you want logs from)
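The exact command is not shown in this diff hunk; a typical way to list pods and find the name is:

```shell
kubectl get pods
```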
@@ -413,6 +418,52 @@
Get logs (using the pod name from earlier)
kubectl logs "<pod_name>"
```
## Slurm/Pyxis workflow
### Preprocessing
For the GPU implementation, both the dataset and the checkpoint have to be preprocessed. This can be done
once, before experimentation, and the results saved. **IMPORTANT** The saved checkpoint and dataset must be
accessible by all nodes in the system.
To get the preprocessed checkpoint, run the checkpoint_download.py script:
```shell
python scripts/gpu/checkpoint_download.py --checkpoint_id mistralai/Mixtral-8x22B-v0.1 \
--output_dir <path to save checkpoint> --hf_token <your token to HF repository>
```
This script will download the specified checkpoint from the Hugging Face repository, preprocess it, and save it
into the specified directory.
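For illustration (the output path and token below are hypothetical):

```shell
# Hypothetical output directory and token; the directory should be on storage visible to all nodes.
python scripts/gpu/checkpoint_download.py --checkpoint_id mistralai/Mixtral-8x22B-v0.1 \
    --output_dir /shared/checkpoints/mixtral-8x22b --hf_token hf_xxxxxxxxxxxx
```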
To preprocess the dataset, use the dataset_preprocessing.py script:
```shell
python scripts/gpu/dataset_preprocessing.py --input-tokenizer <path to tokenizer from checkpoint> \
--workdir <working directory>
```
After preprocessing, the dataset will be saved into `<working directory>/output`.
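For illustration (paths are hypothetical, and the exact tokenizer location inside the downloaded checkpoint may differ):

```shell
# Hypothetical paths; point --input-tokenizer at the tokenizer saved with the preprocessed checkpoint.
python scripts/gpu/dataset_preprocessing.py --input-tokenizer /shared/checkpoints/mixtral-8x22b/tokenizer \
    --workdir /shared/data/mixtral_dataset
# The preprocessed dataset then ends up under /shared/data/mixtral_dataset/output
```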
### Running
By default, the Slurm workflow loads the config /config/config.yaml. Make sure the correct config is specified, or
modify the script to mount the correct config into /app/training/config/config.yaml.
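One possible way to do that, assuming the CONT_MOUNTS variable built in scripts/gpu/run.sub (shown below) can simply be extended, is:

```shell
# Assumption: appending an extra mount to CONT_MOUNTS in scripts/gpu/run.sub overlays a custom config.
CONT_MOUNTS="${CONT_MOUNTS},/path/to/my_config.yaml:/app/training/config/config.yaml:ro"
```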
To run the job, specify the required input environment variables:
```shell
export CONT=<registry_path_image_name>:<image_tag>
export DATA=<path to preprocessed dataset>
export CKPT=<path to preprocessed checkpoint>
export NODES=<number of nodes to run on>
export OUTPUT=<output directory>
```
After that, run the sbatch command using scripts/gpu/run.sub:
```shell
sbatch -N${NODES} <vendor specific information> scripts/gpu/run.sub
```
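Putting it together, a hypothetical submission could look like this (all values and extra Slurm options are placeholders):

```shell
# All values below are hypothetical; substitute your own image, paths, and Slurm options.
export CONT=gcr.io/my-project/moe-nemo-gpu:latest
export DATA=/shared/data/mixtral_dataset/output
export CKPT=/shared/checkpoints/mixtral-8x22b
export NODES=16
export OUTPUT=/shared/results/mixtral-8x22b-run1
sbatch -N${NODES} --partition=gpu --time=04:00:00 scripts/gpu/run.sub
```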
# 7. Reference
* [MLPerf Training: MoE Benchmark Proposal from Nvidia](https://docs.google.com/document/d/1NOJ_vt-o2WHFXmisLRk6Mn7Ki2CeB5UNeTkFrYHoE1I/edit?usp=sharing)
* [Mixtral of Experts](https://arxiv.org/pdf/2401.04088)
@@ -461,4 +512,4 @@
rclone copy mlc-training:mlcommons-training-wg-public/mixtral_8x22b/checkpoints/
```
mkdir -p docker-images
rclone copy mlc-training:mlcommons-training-wg-public/mixtral_8x22b/docker-images ./docker-images -P
```
23 changes: 23 additions & 0 deletions mixture_of_experts_pretraining/scripts/gpu/run.sub
@@ -0,0 +1,23 @@
#!/bin/bash

: "${CONT:?Base Container image is not set, please specify CONT envvar}"
: "${DATA:?Data directory is not set, please specify DATA envvar}"
: "${CKPT:?Checkpoint directory is not set, please specify CKPT envvar}"
: "${NODES:?Number of nodes is not set, please specify NODES envvar}"
: "${OUTPUT:?Output directory is not set, please specify OUTPUT envvar}"

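# Dataset and checkpoint are mounted read-only into the container; results are written to /results.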
CONT_MOUNTS="${DATA}:/app/dataset:ro,${CKPT}:/app/checkpoints:ro,${OUTPUT}:/results"

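# torch.distributed rendezvous settings: default port 29500, master is the first node in the Slurm allocation.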
: "${MASTER_PORT:=29500}"
export MASTER_PORT
export MASTER_ADDR="$(scontrol show hostnames "${SLURM_JOB_NODELIST-}" | head -n1)"

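# Launch one task per GPU (default 8 per node) on every node, inside the container image.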
srun -l --kill-on-bad-exit=0 --mpi="${SLURM_MPI_TYPE:-pmix}" \
--ntasks="$(( NODES * ${GPUS:-8} ))" \
--ntasks-per-node="${GPUS:-8}" \
--container-image="${CONT}" \
--container-mounts="${CONT_MOUNTS}" \
--container-env=MASTER_PORT,MASTER_ADDR \
slurm2pytorch python /app/training/run_clm.py output_dir=/results \
dataset.train_dataset_path=/app/dataset dataset.eval_dataset_path=/app/dataset \
