Add scripts for FP8 behavior check #37

Status: Open. Wants to merge 6 commits into base: main.
21 changes: 21 additions & 0 deletions pretrain/scripts/fp8-behavior-check/README.md
@@ -0,0 +1,21 @@
# FP8 check scripts

This directory contains several scripts to check the behavior of FP8 operations on Megatron-LM with existing checkpoints.

## Prerequisites

The following directories and contents must exist before running the scripts in this directory (a minimal pre-flight check is sketched after the list):

* `checkpoints/3.8b/base_{iter:07d}`: Megatron-LM checkpoints of the base models
* `environment`: Training environment created by the installer
* `eval_environment`: Evaluation environment created by the installer
* `outputs`: Slurm log directory
* `scripts`: Clone of this repository
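
A minimal pre-flight check, run from the experiment root, can confirm this layout. The snippet below is a convenience sketch and not part of the scripts; it only verifies that the top-level directories exist.

```bash
# Hypothetical sanity check; checkpoint iterations themselves are not verified.
for d in checkpoints/3.8b environment eval_environment outputs scripts; do
    [ -d "$d" ] || echo "Missing: $d"
done
```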

## Scripts

All scripts must be invoked from the root of the experiment directory; a typical invocation sequence is sketched after the list.

* `run_train.sh`: Runs continued training with several configurations
* `run_convert.sh`: Runs model conversion to Hugging Face format
* `run_eval.sh`: Runs llm-jp-eval 1.3.1 evaluation
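
A typical end-to-end pass looks like the sketch below; all three drivers only submit Slurm jobs and return immediately.

```bash
# Run from the experiment root. run_train.sh ships with every run_job line
# commented out; uncomment the configurations you need before submitting.
bash scripts/pretrain/scripts/fp8-behavior-check/run_train.sh     # continued training
bash scripts/pretrain/scripts/fp8-behavior-check/run_convert.sh   # Megatron-LM -> Hugging Face
bash scripts/pretrain/scripts/fp8-behavior-check/run_eval.sh      # llm-jp-eval 1.3.1
```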
72 changes: 72 additions & 0 deletions pretrain/scripts/fp8-behavior-check/convert_13b.sh
@@ -0,0 +1,72 @@
#!/bin/bash
# Model conversion script for converting Megatron format checkpoints into Hugging Face format
#
# This script needs one node on the `gpu` partition of the cluster.
# A GPU is required only to verify CUDA functionality; no VRAM is actually used.
#
# Usage:
# On a cluster with SLURM:
# Run `sbatch --partition {partition} convert.sh SOURCE_DIR TARGET_DIR`
# On a cluster without SLURM:
# Run `bash convert.sh SOURCE_DIR TARGET_DIR > outputs/convert.out 2> outputs/convert.err`
# - SOURCE_DIR: Megatron checkpoint directory including `iter_NNNNNNN`
# - TARGET_DIR: Output directory for the Hugging Face format
#
# Example:
# sbatch convert.sh /data/experiments/{exp-id}/checkpoints/iter_0001000 /data/experiments/{exp-id}/hf_checkpoints/iter_0001000
#
#SBATCH --job-name=0031_convert
#SBATCH --partition=<FIX_ME>
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=200G
#SBATCH --output=outputs/%x-%j.out
#SBATCH --error=outputs/%x-%j.err

set -e

MEGATRON_CHECKPOINT_DIR=${1%/}
HF_CHECKPOINT_DIR=$2

ENV_DIR=environment

source ${ENV_DIR}/scripts/environment.sh
source ${ENV_DIR}/venv/bin/activate

TOKENIZER_MODEL_DIR=${ENV_DIR}/src/llm-jp-tokenizer/hf/ver3.0/llm-jp-tokenizer-100k.ver3.0b2

TARGET_ITER_DIR=$(basename $MEGATRON_CHECKPOINT_DIR) # iter_NNNNNNN
ITER=$(( 10#$(echo $TARGET_ITER_DIR | sed 's/^iter_//') )) # NNNNNNN as a decimal; the 10# prefix forces base-10 so leading zeros are not parsed as octal
echo ITER=$ITER

if [[ -z "$ITER" || ! "$ITER" =~ ^[0-9]+$ ]]; then # check if directory is valid
>&2 echo "Error: ITER=$ITER is not a valid number. Exiting."
exit 1
fi

# Create a unique temporary working directory to avoid affecting the original directory and
# to allow multiple runs to execute simultaneously.
TMP_DIR=$(mktemp -d "${HOME}/ckpt_convert.XXXXXXXX")
>&2 echo TMP_DIR=$TMP_DIR
ln -s $(readlink -f $MEGATRON_CHECKPOINT_DIR) ${TMP_DIR}/${TARGET_ITER_DIR}
echo $ITER > "${TMP_DIR}/latest_checkpointed_iteration.txt"

echo "Converting $MEGATRON_CHECKPOINT_DIR"

python ${ENV_DIR}/src/Megatron-LM/tools/checkpoint/convert.py \
--model-type GPT \
--loader mcore \
--saver llama2_hf \
--load-dir $TMP_DIR \
--save-dir $HF_CHECKPOINT_DIR \
--hf-tokenizer-path $TOKENIZER_MODEL_DIR \
--save-dtype bfloat16 \
--loader-transformer-impl "transformer_engine" \
--megatron-path ${ENV_DIR}/src/Megatron-LM

cp ${TOKENIZER_MODEL_DIR}/* $HF_CHECKPOINT_DIR

rm -r $TMP_DIR
echo "Done"
51 changes: 51 additions & 0 deletions pretrain/scripts/fp8-behavior-check/convert_3.8b.sh
@@ -0,0 +1,51 @@
#!/bin/bash
# Model conversion script for FP8 experiment.
# Usage:
# sbatch /path/to/convert.sh SRC_DIR DEST_DIR
#
#SBATCH --job-name=0031_convert
#SBATCH --partition=<FIX_ME>
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=200G
#SBATCH --output=outputs/%x-%j.out
#SBATCH --error=outputs/%x-%j.err

set -eu -o pipefail

if [ $# -ne 2 ]; then
>&2 echo "Usage: $0 SRC_DIR DEST_DIR"
exit 1
fi

SRC_DIR=$1; shift
DEST_DIR=$1; shift

if [ -e ${DEST_DIR} ]; then
>&2 echo "DEST_DIR already exists: ${DEST_DIR}"
exit 1
fi

ENV_DIR=environment

source ${ENV_DIR}/scripts/environment.sh
source ${ENV_DIR}/venv/bin/activate

TOKENIZER_MODEL_DIR=${ENV_DIR}/src/llm-jp-tokenizer/hf/ver3.0/llm-jp-tokenizer-100k.ver3.0b2

python ${ENV_DIR}/src/Megatron-LM/tools/checkpoint/convert.py \
--model-type GPT \
--loader mcore \
--saver llama2_hf \
--load-dir ${SRC_DIR} \
--save-dir ${DEST_DIR} \
--hf-tokenizer-path ${TOKENIZER_MODEL_DIR} \
--save-dtype bfloat16 \
--loader-transformer-impl "transformer_engine" \
--megatron-path ${ENV_DIR}/src/Megatron-LM

cp ${TOKENIZER_MODEL_DIR}/* ${DEST_DIR}

echo "Done"
44 changes: 44 additions & 0 deletions pretrain/scripts/fp8-behavior-check/run_convert.sh
@@ -0,0 +1,44 @@
#!/bin/bash
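# Submit Megatron-LM -> Hugging Face conversion jobs for every iter_NNNNNNN
# checkpoint of the configurations listed below, skipping iterations that have
# already been converted.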

mkdir -p checkpoints_hf/{3.8b,13b}

# 3.8B

#for d in $(ls checkpoints/3.8b); do
# echo $d
# sbatch \
# --partition=gpu-small \
# scripts/pretrain/scripts/fp8-behavior-check/convert_3.8b.sh \
# checkpoints/3.8b/$d \
# checkpoints_hf/3.8b/$d
#done

# 13B

CONFIGS=(
contd_0000000.fp8.hybrid.m0.i1.h1.most_recent.wgrad
contd_0239000.fp8.hybrid.m0.i1.h1.most_recent.wgrad
)

SRC_ROOT=/home/shared/experiments/0031_fp8-behavior/checkpoints/13b
DEST_ROOT=/home/shared/experiments/0031_fp8-behavior/checkpoints_hf/13b

for c in "${CONFIGS[@]}"; do
s=${SRC_ROOT}/$c
d=${DEST_ROOT}/$c

for i in $(ls $s | grep -E '^iter_.{7}$'); do
if [ -e $d/$i ]; then
echo "Exists: $s/$i"
continue
fi

echo "Converting: $s/$i"
sbatch \
--job-name=0031_convert \
--partition=gpu-small-lp \
scripts/pretrain/scripts/fp8-behavior-check/convert_13b.sh \
$s/$i \
$d/$i
done
done
16 changes: 16 additions & 0 deletions pretrain/scripts/fp8-behavior-check/run_eval.sh
@@ -0,0 +1,16 @@
#!/bin/bash
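# For each converted Hugging Face checkpoint (identified by its config.json),
# submit one evaluation job. An empty marker file under processed/ records the
# submission so that re-running this script does not resubmit finished configs.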

for cfg_file in $(find checkpoints_hf -name config.json | sort); do
cfg=$(dirname $cfg_file | sed 's/checkpoints_hf\///')
if [ -e processed/$cfg ]; then
echo "Already processed: $cfg"
continue
fi

sbatch \
--partition=gpu-small-lp \
scripts/pretrain/scripts/fp8-behavior-check/run_llm-jp-eval.sh checkpoints_hf/$cfg $cfg

mkdir -p $(dirname processed/$cfg) && touch processed/$cfg
echo "Started: $cfg"
done
52 changes: 52 additions & 0 deletions pretrain/scripts/fp8-behavior-check/run_llm-jp-eval.sh
@@ -0,0 +1,52 @@
#!/bin/bash
#SBATCH --job-name=0031_eval
#SBATCH --partition=<partition>
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --mem=200G
#SBATCH --output=outputs/%x-%j.out
#SBATCH --error=outputs/%x-%j.err

set -eu -o pipefail

# Open file limit
ulimit -n 65536 1048576

EXPERIMENT_DIR=eval_environment

ENV_DIR=${EXPERIMENT_DIR}/environment
source ${ENV_DIR}/scripts/environment.sh
source ${ENV_DIR}/venv/bin/activate

# Arguments
MODEL=$1
WANDB_RUN_NAME=$2

# Semi-fixed vars
CONFIG_TEMPLATE=${EXPERIMENT_DIR}/resources/config_base.yaml
TOKENIZER=$MODEL
WANDB_ENTITY=llm-jp-eval
WANDB_PROJECT=0031_fp8-behavior-check

# Fixed vars
CONFIG_DIR=${ENV_DIR}/src/llm-jp-eval/configs
SCRIPT_PATH=${ENV_DIR}/src/llm-jp-eval/scripts/evaluate_llm.py
DATASET_DIR=${ENV_DIR}/data/llm-jp-eval/${LLM_JP_EVAL_TAG}/evaluation/dev

# Config settings
NEW_CONFIG=${CONFIG_DIR}/config.${WANDB_PROJECT}.$(echo ${WANDB_RUN_NAME} | tr '/' '_').yaml
REPLACE_VARS=("MODEL" "TOKENIZER" "DATASET_DIR" "WANDB_ENTITY" "WANDB_PROJECT" "WANDB_RUN_NAME")

# Create a new config file to save the config file of each run
cp $CONFIG_TEMPLATE $NEW_CONFIG

# Replace variables
for VAR in "${REPLACE_VARS[@]}"; do
VALUE=${!VAR}  # indirect expansion; equivalent to the previous eval-based lookup
sed -i "s|<<${VAR}>>|${VALUE}|g" $NEW_CONFIG
done
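# For illustration: given a hypothetical template line in config_base.yaml such as
#   model_path: <<MODEL>>
# (the placeholder syntax matches the sed pattern above; the key name is an
# assumption), the generated config would read
#   model_path: checkpoints_hf/<config>/iter_NNNNNNN
# i.e. the first positional argument of this script.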

# Run llm-jp-eval
python $SCRIPT_PATH -cn $(basename $NEW_CONFIG)

echo "Done"
64 changes: 64 additions & 0 deletions pretrain/scripts/fp8-behavior-check/run_train.sh
@@ -0,0 +1,64 @@
#!/bin/bash

run_job() {
echo $@
PARAM_SIZE=$1; shift
sbatch \
--partition=gpu-small \
--nodes=8 \
scripts/pretrain/scripts/fp8-behavior-check/sbatch_${PARAM_SIZE}.sh \
$@
}

# run_job arg order: param_size, enabled, format, margin, interval, history, algo, wgrad, load_iter, stop_iter
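# For example, the following call (a sketch; sbatch_3.8b.sh is assumed to take
# the same argument order as sbatch_13b.sh in this directory):
#   run_job 3.8b true hybrid 0 1 1 most_recent true 200000 201000
# expands to
#   sbatch --partition=gpu-small --nodes=8 \
#     scripts/pretrain/scripts/fp8-behavior-check/sbatch_3.8b.sh \
#     true hybrid 0 1 1 most_recent true 200000 201000
# which the job script maps to FP8_ENABLED=true, FP8_FORMAT=hybrid, FP8_MARGIN=0,
# FP8_INTERVAL=1, FP8_AMAX_HISTORY_LEN=1, FP8_AMAX_COMPUTE_ALGO=most_recent,
# FP8_WGRAD=true, LOAD_ITER=200000, FORCE_STOP_ITER=201000.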

# All runs are commented out for safety.

#run_job 3.8b false hybrid 0 1 1 most_recent true 0 1000
#run_job 3.8b false hybrid 0 1 1 most_recent true 2000 3000
#run_job 3.8b false hybrid 0 1 1 most_recent true 20000 21000
#run_job 3.8b false hybrid 0 1 1 most_recent true 200000 201000

#run_job 3.8b true hybrid 0 1 1 most_recent true 0 1000
#run_job 3.8b true hybrid 0 1 1 most_recent true 2000 3000
#run_job 3.8b true hybrid 0 1 1 most_recent true 20000 21000
#run_job 3.8b true hybrid 0 1 1 most_recent true 200000 201000

#run_job 3.8b true e3m4 0 1 1 most_recent true 200000 201000

#run_job 3.8b true hybrid 1 1 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 2 1 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 3 1 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 4 1 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 5 1 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 6 1 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 7 1 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 8 1 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 16 1 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 32 1 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 64 1 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 128 1 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 256 1 1 most_recent true 200000 201000

#run_job 3.8b true hybrid 0 2 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 0 4 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 0 8 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 0 16 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 0 32 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 0 64 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 0 128 1 most_recent true 200000 201000
#run_job 3.8b true hybrid 0 256 1 most_recent true 200000 201000

#run_job 3.8b true hybrid 0 1 2 max true 200000 201000
#run_job 3.8b true hybrid 0 1 4 max true 200000 201000
#run_job 3.8b true hybrid 0 1 8 max true 200000 201000
#run_job 3.8b true hybrid 0 1 16 max true 200000 201000
#run_job 3.8b true hybrid 0 1 32 max true 200000 201000
#run_job 3.8b true hybrid 0 1 64 max true 200000 201000
#run_job 3.8b true hybrid 0 1 128 max true 200000 201000
#run_job 3.8b true hybrid 0 1 256 max true 200000 201000

#run_job 3.8b true hybrid 0 1 1 most_recent false 200000 201000

#run_job 13b true hybrid 0 1 1 most_recent true 0 50000
#run_job 13b true hybrid 0 1 1 most_recent true 239000 289000
70 changes: 70 additions & 0 deletions pretrain/scripts/fp8-behavior-check/sbatch_13b.sh
@@ -0,0 +1,70 @@
#!/bin/bash
#SBATCH --job-name=0031_train_13b
#SBATCH --partition={partition}
#SBATCH --nodes=8
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
#SBATCH --output=outputs/%x-%j.out
#SBATCH --error=outputs/%x-%j.err

# PLEASE run this script from the root of the experiment directory.


set -eu -o pipefail

if [ $# -ne 9 ]; then
>&2 echo "Usage $0 ENABLED FORMAT MARGIN INTERVAL AMAX_HIST_LEN AMAX_ALGO WGRAD ITER STOP"
exit 1
fi

FP8_ENABLED=$1; shift
FP8_FORMAT=$1; shift
FP8_MARGIN=$1; shift
FP8_INTERVAL=$1; shift
FP8_AMAX_HISTORY_LEN=$1; shift
FP8_AMAX_COMPUTE_ALGO=$1; shift
FP8_WGRAD=$1; shift
LOAD_ITER=$1; shift
FORCE_STOP_ITER=$1; shift

EXPERIMENT_DIR=$(pwd)
ENV_DIR=${EXPERIMENT_DIR}/environment

source ${ENV_DIR}/scripts/environment.sh
source ${ENV_DIR}/venv/bin/activate

export MASTER_ADDR=$(scontrol show hostname $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=$((10000 + ($SLURM_JOBID % 50000)))  # per-job port in [10000, 59999] to avoid collisions between concurrent jobs

echo "MASTER_ADDR=${MASTER_ADDR}"

NUM_NODES=$SLURM_JOB_NUM_NODES
NUM_GPUS_PER_NODE=$(echo $SLURM_TASKS_PER_NODE | cut -d '(' -f 1)
NUM_GPUS=$((${NUM_NODES} * ${NUM_GPUS_PER_NODE}))

echo NUM_NODES=$NUM_NODES
echo NUM_GPUS_PER_NODE=$NUM_GPUS_PER_NODE
echo NUM_GPUS=$NUM_GPUS

mpirun \
-np $NUM_GPUS \
--npernode $NUM_GPUS_PER_NODE \
-bind-to none \
-map-by slot \
-x EXPERIMENT_DIR=$EXPERIMENT_DIR \
-x MASTER_ADDR=$MASTER_ADDR \
-x MASTER_PORT=$MASTER_PORT \
-x NUM_NODES=$NUM_NODES \
-x NUM_GPUS_PER_NODE=$NUM_GPUS_PER_NODE \
\
-x FP8_ENABLED=$FP8_ENABLED \
-x FP8_FORMAT=$FP8_FORMAT \
-x FP8_MARGIN=$FP8_MARGIN \
-x FP8_INTERVAL=$FP8_INTERVAL \
-x FP8_AMAX_HISTORY_LEN=$FP8_AMAX_HISTORY_LEN \
-x FP8_AMAX_COMPUTE_ALGO=$FP8_AMAX_COMPUTE_ALGO \
-x FP8_WGRAD=$FP8_WGRAD \
-x LOAD_ITER=$LOAD_ITER \
-x FORCE_STOP_ITER=${FORCE_STOP_ITER} \
\
bash scripts/pretrain/scripts/fp8-behavior-check/train_13b.sh