[sakura]Add v3 172B exp2 scripts #13

Merged 11 commits on Aug 24, 2024

Changes from 9 commits
11 changes: 4 additions & 7 deletions pretrain/scripts/v3-172b-exp2-sakura/README.md
@@ -2,7 +2,7 @@

Scripts for running the LLM-jp v3 172B exp2 training on the Sakura cluster.

Experiment: https://github.com/llm-jp/experiments/issues/9
Experiment: https://github.com/llm-jp/experiments/issues/14
k141303 marked this conversation as resolved.

## Specifications

@@ -11,19 +11,16 @@ Experiment: https://github.com/llm-jp/experiments/issues/9

## How to run

It is assumed that the environment has been installed beforehand at `/data/experiments/{exp-id}/environment` with the v3-megatron-sakura installer.
It is assumed that the environment has been installed beforehand at `/home/shared/experiments/{exp-id}/environment` with the v3-megatron-sakura installer.
For `{exp-id}`, specify the ID assigned at registration; to preserve the experiment results, do not specify this experiment's own ID.
It is also assumed that previous checkpoints are stored under `/data/experiments/{exp-id}/checkpoints`.
It is also assumed that previous checkpoints are stored under `/home/shared/experiments/{exp-id}/checkpoints`.

```shell
cd /data/experiments/{exp-id}

# Copy the scripts to the same level as the execution environment
cp {this directory} .

# Directory for storing logs
mkdir outputs

# Run
sbatch sbatch.sh
sbatch scripts/pretrain/scripts/v3-172b-exp2-sakura/sbatch.sh
```
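As a reading aid (not part of this PR): a minimal pre-flight sketch of the setup the README above assumes, with `{exp-id}` left as a placeholder for the registered experiment ID. The directory checks and the final `sbatch` call simply restate the README's prerequisites in runnable form; the `cd` into the experiment directory is an assumption based on the relative script path.

```shell
# Hypothetical pre-flight check; paths follow the README above.
# Replace {exp-id} with the registered experiment ID (not this experiment's ID).
EXP_DIR="/home/shared/experiments/{exp-id}"

# Environment installed by the v3-megatron-sakura installer
test -d "${EXP_DIR}/environment" || echo "environment is missing"

# Previous checkpoints expected to be present
test -d "${EXP_DIR}/checkpoints" || echo "checkpoints are missing"

# Submit from the experiment directory so relative paths resolve
cd "${EXP_DIR}"
sbatch scripts/pretrain/scripts/v3-172b-exp2-sakura/sbatch.sh
```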
5 changes: 3 additions & 2 deletions pretrain/scripts/v3-172b-exp2-sakura/sbatch.sh
@@ -1,9 +1,10 @@
#!/bin/bash
#SBATCH --job-name=llama-2-172b-exp2
#SBATCH --job-name=9_llama-2-172b-exp2
#SBATCH --partition=gpu
#SBATCH --nodes=64
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --output=outputs/%x-%j.out
#SBATCH --error=outputs/%x-%j.err

@@ -36,4 +37,4 @@ mpirun \
-x MASTER_PORT=$MASTER_PORT \
-x NUM_NODES=$NUM_NODES \
-x NUM_GPUS_PER_NODE=$NUM_GPUS_PER_NODE \
bash train.sh
bash scripts/pretrain/scripts/v3-172b-exp2-sakura/train.sh
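For orientation (an assumption-laden sketch, not code from this PR): the header above requests 64 nodes with 8 GPUs each, i.e. 512 GPUs in total, and the `mpirun -x` flags pass rendezvous and topology values through to `train.sh`. The part of `sbatch.sh` collapsed in this diff is not visible, but on Slurm such values are commonly derived along these lines:

```shell
# Hypothetical derivation of the values exported via `mpirun -x`;
# the real definitions live in the part of sbatch.sh collapsed in this diff.
MASTER_ADDR=$(scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -n 1)
MASTER_PORT=$((10000 + (SLURM_JOB_ID % 50000)))  # any free port works
NUM_NODES=${SLURM_JOB_NUM_NODES}                 # 64 per the #SBATCH header
NUM_GPUS_PER_NODE=8                              # matches --gpus-per-node=8
echo "total GPUs: $((NUM_NODES * NUM_GPUS_PER_NODE))"  # 512
```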
12 changes: 5 additions & 7 deletions pretrain/scripts/v3-172b-exp2-sakura/train.sh
@@ -53,8 +53,8 @@ TRAIN_STEPS=$((${LR_WARMUP_STEPS} + ${LR_DECAY_ITERS}))

# model config
TOKENIZER_MODEL=${ENV_DIR}/src/llm-jp-tokenizer/models/ver3.0/llm-jp-tokenizer-100k.ver3.0b1.model
CHECKPOINT_LOAD_DIR=checkpoints/tp${TENSOR_PARALLEL_SIZE}-pp${PIPELINE_PARALLEL_SIZE}-cp${CONTEXT_PARALLEL_SIZE}
CHECKPOINT_SAVE_DIR=checkpoints/tp${TENSOR_PARALLEL_SIZE}-pp${PIPELINE_PARALLEL_SIZE}-cp${CONTEXT_PARALLEL_SIZE}
CHECKPOINT_LOAD_DIR=/home/shared/experiments/9/checkpoints/tp${TENSOR_PARALLEL_SIZE}-pp${PIPELINE_PARALLEL_SIZE}-cp${CONTEXT_PARALLEL_SIZE}
CHECKPOINT_SAVE_DIR=/home/shared/experiments/9/checkpoints/tp${TENSOR_PARALLEL_SIZE}-pp${PIPELINE_PARALLEL_SIZE}-cp${CONTEXT_PARALLEL_SIZE}

mkdir -p ${CHECKPOINT_SAVE_DIR}

@@ -220,7 +220,6 @@ VALID_DATA_PATH="" # Skip validation
JOB_NAME="llama-2-172b-exp2-sakura"

# run
export NVTE_FUSED_ATTN=0
python ${ENV_DIR}/src/Megatron-LM/pretrain_gpt.py \
--tensor-model-parallel-size ${TENSOR_PARALLEL_SIZE} \
--pipeline-model-parallel-size ${PIPELINE_PARALLEL_SIZE} \
@@ -241,11 +240,11 @@ python ${ENV_DIR}/src/Megatron-LM/pretrain_gpt.py \
--train-iters ${TRAIN_STEPS} \
--tokenizer-type Llama2Tokenizer \
--tokenizer-model ${TOKENIZER_MODEL} \
${CHECKPOINT_ARGS} \
--load ${CHECKPOINT_LOAD_DIR} \
--save ${CHECKPOINT_SAVE_DIR} \
--data-path ${TRAIN_DATA_PATH} \
--split 1000,0,0 \
--data-cache-path cache \
--data-cache-path /home/shared/experiments/9/cache \
--distributed-backend nccl \
--init-method-std 0.02 \
--lr ${LR} \
@@ -283,6 +282,5 @@ python ${ENV_DIR}/src/Megatron-LM/pretrain_gpt.py \
--log-throughput \
--wandb-name ${JOB_NAME} \
--wandb-project "Llama-2-175B" \
--wandb-entity "nii-geniac" \
--use-gcp-dynamic-checkpointing \
--wandb-entity "nii-geniac"
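Reviewer note, not part of the PR: with the experiment directory now hard-coded to `/home/shared/experiments/9`, the load and save checkpoint paths in `train.sh` resolve to the same `tp*-pp*-cp*` directory, so the run resumes from and keeps writing into one location, and the data cache likewise moves from a relative `cache` directory to a fixed path under the same experiment directory. A minimal illustration, with placeholder parallelism values (the real `TENSOR/PIPELINE/CONTEXT_PARALLEL_SIZE` are set earlier in `train.sh`, outside the visible hunks):

```shell
# Illustrative values only; the actual parallel sizes are defined earlier in train.sh.
TENSOR_PARALLEL_SIZE=4
PIPELINE_PARALLEL_SIZE=8
CONTEXT_PARALLEL_SIZE=1

CHECKPOINT_DIR=/home/shared/experiments/9/checkpoints/tp${TENSOR_PARALLEL_SIZE}-pp${PIPELINE_PARALLEL_SIZE}-cp${CONTEXT_PARALLEL_SIZE}
echo "${CHECKPOINT_DIR}"
# -> /home/shared/experiments/9/checkpoints/tp4-pp8-cp1
```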