Can't compile Llama-3-8B or Llama-3.1-8B with LoRA if batch size is more than 1 #709

anilozlu opened this issue Oct 5, 2024 · 0 comments

anilozlu commented Oct 5, 2024

System Info

trn1.2xlarge instance on AWS EC2
optimum-neuron version 0.0.25.dev0
transformers version 4.43.2
Amazon Linux 2023 AMI with Python 3.9, and also the Hugging Face Ubuntu 22.04 AMI with Python 3.10 (tried both)

Who can help?

@michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I am trying to fine-tune Llama-3-8B on a single trn1.2xlarge instance. I am following the tutorial here: https://huggingface.co/docs/optimum-neuron/en/training_tutorials/sft_lora_finetune_llm, but changing the PROCESSES_PER_NODE and TP_DEGREE variables. My compilation script looks like this:

#!/bin/bash
set -ex

export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ec2-user/cache_dir_neuron/"

PROCESSES_PER_NODE=2

NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=2
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID

MAX_STEPS=25

XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE train.py \
  --model_id $MODEL_NAME \
  --num_train_epochs $NUM_EPOCHS \
  --do_train \
  --learning_rate 5e-5 \
  --warmup_ratio 0.03 \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $BS \
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
  --gradient_checkpointing true \
  --bf16 \
  --zero_1 false \
  --tensor_parallel_size $TP_DEGREE \
  --pipeline_parallel_size $PP_DEGREE \
  --logging_steps $LOGGING_STEPS \
  --save_total_limit 1 \
  --output_dir $OUTPUT_DIR \
  --lr_scheduler_type "constant" \
  --overwrite_output_dir
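
For context, my understanding of the resulting parallel layout and effective batch size (this is an assumption on my part about how the data-parallel size falls out of the launch settings, not something stated in the tutorial):

# Assumed layout for the script above:
#   data_parallel_size = PROCESSES_PER_NODE / (TP_DEGREE * PP_DEGREE) = 2 / (2 * 1) = 1
#   effective global batch = BS * GRADIENT_ACCUMULATION_STEPS * data_parallel_size
#                          = 2 * 8 * 1 = 16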

However, during compilation of some of the graphs I get this error:

2024-10-02 17:35:11.000783:  103330  ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/22de144e-d107-4885-bf01-4abe86f47a37/model.MODULE_10406581693136771780+6d1be540.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/22de144e-d107-4885-bf01-4abe86f47a37/model.MODULE_10406581693136771780+6d1be540.neff', '--model-type=transformer', '--distribution-strategy=llm-training', '--enable-saturate-infinity', '--model-type=transformer', '--model-type=transformer', '--verbose=35']: 2024-10-02T17:35:11Z [TEN404] Internal tensorizer error: TritiumFusion:Should be able to fuse two loops! - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
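
For reference, the error message suggests using the XLA debug variables; a minimal sketch of how I would re-run the compilation with them enabled (assuming they are plain 1/0 flags read by torch-xla, and reusing the same arguments as the script above):

XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 XLA_USE_BF16=1 neuron_parallel_compile \
  torchrun --nproc_per_node $PROCESSES_PER_NODE train.py \
  --model_id $MODEL_NAME \
  ...  # remaining arguments unchanged from the script above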

I can compile and complete training without error if I set the batch size to 1; however, I would like to increase the batch size to speed up training.
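As a stopgap (my own assumption, not something from the tutorial), I can keep the effective global batch at 16 by trading per-device batch size for gradient accumulation:

BS=1                              # the only per-device batch size that currently compiles for me
GRADIENT_ACCUMULATION_STEPS=16    # was 2 * 8, so the effective global batch stays at 16
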
I also get these warnings, which may be relevant:

torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
[2024-10-05 19:58:00.706: W neuronx_distributed/parallel_layers/parallel_state.py:439] [rank_0_pp-1_tp-1_dp-1] Failed to initialize NKI parallel state with exception intra_layer_model parallel group is not initialized.Proceeding without distributed NKI support.

Expected behavior

I expect the model to compile and the training script to run without error.

anilozlu added the bug label on Oct 5, 2024