Llama3-8B finetuning fails with runtime error TDRV:v2_cc_execute #658

Open
jianyinglangaws opened this issue Jul 17, 2024 · 3 comments
Labels: bug (Something isn't working), Stale

@jianyinglangaws

### System Info

The same script works with `Neuron SDK 2.18.0` and `optimum-neuron v0.0.22`, but it fails with the latest software stack listed below.

(aws_neuron_venv_pytorch) [ec2-user@ip-172-31-29-22 text-generation]$ yum list | grep neuron
aws-neuronx-collectives.x86_64                                    2.21.46.0_69b77134b-1                       @neuron         
aws-neuronx-dkms.noarch                                           2.17.17.0-dkms                              @neuron         
aws-neuronx-runtime-lib.x86_64                                    2.21.41.0_fb1705f5f-1                       @neuron         
aws-neuronx-tools.x86_64                                          2.18.3.0-1                                  @neuron         
aws-neuron-dkms.noarch                                            2.3.26.0-dkms                               neuron          
aws-neuron-dkms.src                                               2.3.26.0-dkms                               neuron          
aws-neuron-k8-plugin.x86_64                                       1.9.3.0-1                                   neuron          
aws-neuron-k8-scheduler.x86_64                                    1.9.3.0-1                                   neuron          
aws-neuron-runtime.x86_64                                         1.6.24.0-1                                  neuron          
aws-neuron-runtime-base.x86_64                                    1.6.21.0-1                                  neuron          
aws-neuron-tools.x86_64                                           2.1.4.0-1                                   neuron          
aws-neuronx-dkms.src                                              2.17.17.0-dkms                              neuron          
aws-neuronx-gpsimd-customop.x86_64                                0.2.3.0-1                                   neuron          
aws-neuronx-gpsimd-customop-lib.x86_64                            0.11.4.0-1                                  neuron          
aws-neuronx-gpsimd-tools.x86_64                                   0.11.3.0_36dcb86d4-1                        neuron          
aws-neuronx-k8-plugin.x86_64                                      2.21.14.0-1                                 neuron          
aws-neuronx-k8-scheduler.x86_64                                   2.21.14.0-1                                 neuron          
aws-neuronx-oci-hook.x86_64                                       2.4.4.0-1                                   neuron          
tensorflow-model-server-neuron.x86_64                             2.8.0.2.3.0.0-0                             neuron          
tensorflow-model-server-neuronx.x86_64                            2.10.1.2.11.4.0-0                           neuron       
(aws_neuron_venv_pytorch) [ec2-user@ip-172-31-29-22 text-generation]$ pip list | grep neuron
aws-neuronx-runtime-discovery 2.9
libneuronxla                  2.0.2335
neuronx-cc                    2.13.66.0+6dfecc895
neuronx-distributed           0.7.0
optimum-neuron                0.0.23
torch-neuronx                 2.1.2.2.1.0
transformers-neuronx          0.10.0.21

The run gives the following error:

745142719040221994+6bd63055/model.neff. Exiting with a successfully compiled graph.
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute        [nec_dev 1, gid 1] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/952ba576-b08c-4ac0-b0f1-12e560e5e362/model.MODULE_11203961494150985019+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute        [nec_dev 3, gid 3] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/43f12c55-64e6-45ad-a002-a0676ed72df9/model.MODULE_5948348338269179475+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute        [nec_dev 4, gid 4] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/ebe15e9e-3f50-4061-8ef1-9c80d9a78071/model.MODULE_12660838522657173708+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute        [nec_dev 5, gid 5] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/a12cbe53-3042-4c9a-86d4-4bbcac795471/model.MODULE_15938951179947649509+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute        [nec_dev 6, gid 6] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/eac078e9-79e3-49af-8695-c65797ca89c0/model.MODULE_3281029225498615900+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute        [nec_dev 7, gid 7] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/59a5b5cd-fff2-4315-a603-8a152f5186ca/model.MODULE_12429740934125521760+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info    [nec_dev 1, gid 1] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/952ba576-b08c-4ac0-b0f1-12e560e5e362/model.MODULE_11203961494150985019+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info    [nec_dev 3, gid 3] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/43f12c55-64e6-45ad-a002-a0676ed72df9/model.MODULE_5948348338269179475+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info    [nec_dev 4, gid 4] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/ebe15e9e-3f50-4061-8ef1-9c80d9a78071/model.MODULE_12660838522657173708+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info    [nec_dev 5, gid 5] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/a12cbe53-3042-4c9a-86d4-4bbcac795471/model.MODULE_15938951179947649509+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info    [nec_dev 6, gid 6] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/eac078e9-79e3-49af-8695-c65797ca89c0/model.MODULE_3281029225498615900+6bd63055.neff


### Who can help?

_No response_

### Information

- [X] The official example scripts
- [ ] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction (minimal, reproducible, runnable)

The steps I used are as follows.

Launch the instance with Amazon Linux 2023.
Install the dependencies using the following script.

Configure Linux for Neuron repository updates

sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

Update OS packages

sudo yum update -y

Install OS headers

sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y

Install git

sudo yum install git -y

Install Neuron Driver

sudo yum install aws-neuronx-dkms-2.* -y

Install Neuron Runtime

sudo yum install aws-neuronx-collectives-2.* -y
sudo yum install aws-neuronx-runtime-lib-2.* -y

Install Neuron Tools

sudo yum install aws-neuronx-tools-2.* -y

#Create python3 venv
sudo yum install -y libxcrypt-compat
sudo yum install -y gcc-c++
python3 -m venv /home/ec2-user/aws_neuron_venv_pytorch

#Activate venv
source ~/aws_neuron_venv_pytorch/bin/activate

python -m pip install -U pip

Install Jupyter notebook kernel

pip install ipykernel
python3 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels

Set pip repository pointing to the Neuron repository

python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com

Install wget, awscli

python -m pip install wget
python -m pip install awscli

Install Neuron Compiler and Framework

python -m pip install neuronx-cc==2.* torch-neuronx torchvision

# Install optimum-neuron
pip3 install --upgrade-strategy eager optimum[neuronx]
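As a quick sanity check that the venv can actually reach a Neuron device (a minimal sketch, not part of the original setup; it only relies on the torch_xla backend that torch-neuronx is built on):

import torch
import torch_xla.core.xla_model as xm

# torch-neuronx exposes NeuronCores as XLA devices; running one tiny op
# exercises the driver, runtime and neuronx-cc compiler end to end.
device = xm.xla_device()
x = torch.ones(2, 2, device=device)
print((x + x).cpu())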

Download scripts

git clone https://github.com/huggingface/optimum-neuron.git

cd optimum-neuron/notebooks/text-generation/

Log in with your Hugging Face token to download gated models

huggingface-cli login --token YOUR_TOKEN
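Equivalently, the login can be done from Python via huggingface_hub, which the CLI is built on (shown here only as an alternative):

from huggingface_hub import login

# Same effect as `huggingface-cli login --token YOUR_TOKEN`; the token is stored
# locally so gated models such as meta-llama/Meta-Llama-3-8B can be downloaded.
login(token="YOUR_TOKEN")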

Create a Python file download_data.py to download and process the dataset under the directory optimum-neuron/notebooks/text-generation/:

from random import randint, randrange

from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

print(format_dolly(dataset[randrange(len(dataset))]))

from transformers import AutoTokenizer

# Hugging Face model id
model_id = "meta-llama/Meta-Llama-3-8B"  # gated
# model_id = "meta-llama/Llama-2-7b-hf"  # gated alternative

tokenizer = AutoTokenizer.from_pretrained(model_id)

# add utils method to path for loading dataset
import sys
sys.path.append("./scripts/utils")  # make sure you change this to the correct path
from pack_dataset import pack_dataset

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))

# print random sample
print(dataset[randint(0, len(dataset) - 1)]["text"])

# tokenize dataset
dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
)

# chunk dataset; we use 2048 as the maximum length for packing
lm_dataset = pack_dataset(dataset, chunk_length=2048)

# save train_dataset to disk
dataset_path = "tokenized_dolly"
lm_dataset.save_to_disk(dataset_path)
Run the above script:

python download_data.py
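Before launching the (slow) compile step, a quick check along these lines can confirm that the packed dataset looks sane. This is a sketch, not part of the original repro; it only assumes the tokenized_dolly directory written above and that pack_dataset keeps the input_ids column:

from datasets import load_from_disk

# Confirm that packing produced fixed-length chunks before spending time on the Neuron compile.
lm_dataset = load_from_disk("tokenized_dolly")
print(lm_dataset)
print(len(lm_dataset[0]["input_ids"]))  # expected: 2048, the chunk_length used above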

Compile the finetuning script on inf2.8xlarge with the compile_llama3.sh script

MALLOC_ARENA_MAX=64 neuron_parallel_compile torchrun --nproc_per_node=8 scripts/run_clm.py \
  --model_id "meta-llama/Meta-Llama-3-8B" \
  --dataset_path "tokenized_dolly" \
  --bf16 True \
  --learning_rate 5e-5 \
  --output_dir dolly_llama \
  --overwrite_output_dir True \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing True \
  --tensor_parallel_size 8 \
  --max_steps 10 \
  --logging_steps 10 \
  --gradient_accumulation_steps 16

Run the finetuning on inf2.8xlarge with the run_llama3.sh script

MALLOC_ARENA_MAX=64 torchrun --nproc_per_node=8 scripts/run_clm.py \
  --model_id "meta-llama/Meta-Llama-3-8B" \
  --dataset_path "tokenized_dolly" \
  --bf16 True \
  --learning_rate 5e-5 \
  --output_dir dolly_llama \
  --overwrite_output_dir True \
  --skip_cache_push True \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing True \
  --tensor_parallel_size 8 \
  --num_train_epochs 3 \
  --logging_steps 10 \
  --gradient_accumulation_steps 16


### Expected behavior

The run command should complete the finetuning and report loss/performance numbers.
jianyinglangaws added the bug (Something isn't working) label on Jul 17, 2024
@michaelbenayoun (Member)

It should be fixed on main.
Also, you might encounter an MPMD issue after the first epoch depending on your logging strategy; this is fixed in #654.

@jianyinglangaws (Author)

The script runs with Neuron SDK 2.19.1 and optimum-neuron main. However, the loss value shows nan.

2024-07-22 20:10:28.000737:  280430  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.13.66.0+6dfecc895/MODULE_17784021259853473086+abb26765/model.neff. Exiting with a successfully compiled graph.
{'loss': nan, 'learning_rate': 4.796747967479675e-05, 'epoch': 0.12}                                                                                 
{'loss': nan, 'learning_rate': 4.59349593495935e-05, 'epoch': 0.24}                                                                                  
{'loss': nan, 'learning_rate': 4.390243902439025e-05, 'epoch': 0.36}                                                                                 
{'loss': nan, 'learning_rate': 4.186991869918699e-05, 'epoch': 0.48}                                                                                 
{'loss': nan, 'learning_rate': 3.983739837398374e-05, 'epoch': 0.6}                                                                                  
{'loss': nan, 'learning_rate': 3.780487804878049e-05, 'epoch': 0.72}                                                                                 
{'loss': nan, 'learning_rate': 3.577235772357724e-05, 'epoch': 0.84}                                                                                 
{'loss': nan, 'learning_rate': 3.373983739837399e-05, 'epoch': 0.96}                                                                                 
{'loss': nan, 'learning_rate': 3.170731707317073e-05, 'epoch': 1.09}                                                                                 
{'loss': nan, 'learning_rate': 2.9674796747967482e-05, 'epoch': 1.21}                                                                                
{'loss': nan, 'learning_rate': 2.764227642276423e-05, 'epoch': 1.33}                                                                                 
{'loss': nan, 'learning_rate': 2.5609756097560977e-05, 'epoch': 1.45}                                                                                
{'loss': nan, 'learning_rate': 2.3577235772357724e-05, 'epoch': 1.57}                                                                                
{'loss': nan, 'learning_rate': 2.1544715447154475e-05, 'epoch': 1.69}                                                                                
{'loss': nan, 'learning_rate': 1.9512195121951222e-05, 'epoch': 1.81}                                                                                
{'loss': nan, 'learning_rate': 1.747967479674797e-05, 'epoch': 1.93}                                                                                 
{'loss': nan, 'learning_rate': 1.5447154471544717e-05, 'epoch': 2.05}                                                                                
{'loss': nan, 'learning_rate': 1.3414634146341466e-05, 'epoch': 2.17}                                                                                
{'loss': nan, 'learning_rate': 1.1382113821138211e-05, 'epoch': 2.29}                                                                                
{'loss': nan, 'learning_rate': 9.34959349593496e-06, 'epoch': 2.41}                                                                                  
{'loss': nan, 'learning_rate': 7.317073170731707e-06, 'epoch': 2.53}                                                                                 
{'loss': nan, 'learning_rate': 5.2845528455284555e-06, 'epoch': 2.65}                                                                                
{'loss': nan, 'learning_rate': 3.2520325203252037e-06, 'epoch': 2.77}                                                                                
{'loss': nan, 'learning_rate': 1.2195121951219514e-06, 'epoch': 2.89}      
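
For debugging, a small callback along these lines could stop the run at the first nan loss instead of finishing all three epochs. This is a hypothetical helper, not part of the repro: it uses the standard transformers TrainerCallback API and would have to be registered in scripts/run_clm.py via trainer.add_callback(NanLossStop()).

import math

from transformers import TrainerCallback


class NanLossStop(TrainerCallback):
    """Stop training as soon as a logged loss is NaN so the failing step can be inspected."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and math.isnan(loss):
            print(f"NaN loss detected at step {state.global_step}; stopping.")
            control.should_training_stop = True
        return control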

@github-actions (bot)

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Oct 14, 2024