Can't compile Llama-3-8B or Llama-3.1-8B with LoRA if batch size is more than 1 #709

anilozlu opened this issue Oct 5, 2024 · 0 comments

anilozlu commented Oct 5, 2024

System Info

trn1.2xlarge instance on AWS EC2
optimum-neuron version 0.0.25.dev0
transformers version 4.43.2
Amazon Linux 2023 AMI with Python 3.9, and also the Hugging Face Ubuntu 22.04 AMI with Python 3.10 (tried both)

Who can help?

@michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I am trying to fine-tune Llama-3-8B on a single trn1.2xlarge instance. I am following the tutorial here: https://huggingface.co/docs/optimum-neuron/en/training_tutorials/sft_lora_finetune_llm, but changing the PROCESSES_PER_NODE and TP_DEGREE variables. My compilation script looks like this:

#!/bin/bash
set -ex

export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ec2-user/cache_dir_neuron/"

PROCESSES_PER_NODE=2

NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=2
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID

MAX_STEPS=25

XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE train.py \
  --model_id $MODEL_NAME \
  --num_train_epochs $NUM_EPOCHS \
  --do_train \
  --learning_rate 5e-5 \
  --warmup_ratio 0.03 \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $BS \
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
  --gradient_checkpointing true \
  --bf16 \
  --zero_1 false \
  --tensor_parallel_size $TP_DEGREE \
  --pipeline_parallel_size $PP_DEGREE \
  --logging_steps $LOGGING_STEPS \
  --save_total_limit 1 \
  --output_dir $OUTPUT_DIR \
  --lr_scheduler_type "constant" \
  --overwrite_output_dir
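
For context, my understanding of the resulting parallel layout and effective batch size (this is an assumption on my part about how the data-parallel size falls out of the launch settings, not something stated in the tutorial):

# Assumed layout for the script above:
#   data_parallel_size = PROCESSES_PER_NODE / (TP_DEGREE * PP_DEGREE) = 2 / (2 * 1) = 1
#   effective global batch = BS * GRADIENT_ACCUMULATION_STEPS * data_parallel_size
#                          = 2 * 8 * 1 = 16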

However, during compilation of some of the graphs I get this error:

2024-10-02 17:35:11.000783:  103330  ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/22de144e-d107-4885-bf01-4abe86f47a37/model.MODULE_10406581693136771780+6d1be540.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/22de144e-d107-4885-bf01-4abe86f47a37/model.MODULE_10406581693136771780+6d1be540.neff', '--model-type=transformer', '--distribution-strategy=llm-training', '--enable-saturate-infinity', '--model-type=transformer', '--model-type=transformer', '--verbose=35']: 2024-10-02T17:35:11Z [TEN404] Internal tensorizer error: TritiumFusion:Should be able to fuse two loops! - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
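
For reference, the error message suggests using the XLA debug variables; a minimal sketch of how I would re-run the compilation with them enabled (assuming they are plain 1/0 flags read by torch-xla, and reusing the same arguments as the script above):

XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 XLA_USE_BF16=1 neuron_parallel_compile \
  torchrun --nproc_per_node $PROCESSES_PER_NODE train.py \
  --model_id $MODEL_NAME \
  ...  # remaining arguments unchanged from the script above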

I can compile and complete training without error if I set the batch size to 1; however, I would like to increase the batch size to speed up training.
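As a stopgap (my own assumption, not something from the tutorial), I can keep the effective global batch at 16 by trading per-device batch size for gradient accumulation:

BS=1                              # the only per-device batch size that currently compiles for me
GRADIENT_ACCUMULATION_STEPS=16    # was 2 * 8, so the effective global batch stays at 16
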
I also get these warnings, which may be relevant:

torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
[2024-10-05 19:58:00.706: W neuronx_distributed/parallel_layers/parallel_state.py:439] [rank_0_pp-1_tp-1_dp-1] Failed to initialize NKI parallel state with exception intra_layer_model parallel group is not initialized.Proceeding without distributed NKI support.

Expected behavior

I expect the model to compile and the training script to run without error.

anilozlu added the bug label on Oct 5, 2024