Llama3-8B finetuning fails with runtime error TDRV:v2_cc_execute #658

Open
jianyinglangaws opened this issue Jul 17, 2024 · 3 comments
Labels: bug (Something isn't working), Stale

@jianyinglangaws

### System Info

The same script works with `Neuron SDK 2.18.0` and `optimum-neuron v0.0.22`, but it fails with the latest software stack listed below.

(aws_neuron_venv_pytorch) [ec2-user@ip-172-31-29-22 text-generation]$ yum list | grep neuron
aws-neuronx-collectives.x86_64                                    2.21.46.0_69b77134b-1                       @neuron         
aws-neuronx-dkms.noarch                                           2.17.17.0-dkms                              @neuron         
aws-neuronx-runtime-lib.x86_64                                    2.21.41.0_fb1705f5f-1                       @neuron         
aws-neuronx-tools.x86_64                                          2.18.3.0-1                                  @neuron         
aws-neuron-dkms.noarch                                            2.3.26.0-dkms                               neuron          
aws-neuron-dkms.src                                               2.3.26.0-dkms                               neuron          
aws-neuron-k8-plugin.x86_64                                       1.9.3.0-1                                   neuron          
aws-neuron-k8-scheduler.x86_64                                    1.9.3.0-1                                   neuron          
aws-neuron-runtime.x86_64                                         1.6.24.0-1                                  neuron          
aws-neuron-runtime-base.x86_64                                    1.6.21.0-1                                  neuron          
aws-neuron-tools.x86_64                                           2.1.4.0-1                                   neuron          
aws-neuronx-dkms.src                                              2.17.17.0-dkms                              neuron          
aws-neuronx-gpsimd-customop.x86_64                                0.2.3.0-1                                   neuron          
aws-neuronx-gpsimd-customop-lib.x86_64                            0.11.4.0-1                                  neuron          
aws-neuronx-gpsimd-tools.x86_64                                   0.11.3.0_36dcb86d4-1                        neuron          
aws-neuronx-k8-plugin.x86_64                                      2.21.14.0-1                                 neuron          
aws-neuronx-k8-scheduler.x86_64                                   2.21.14.0-1                                 neuron          
aws-neuronx-oci-hook.x86_64                                       2.4.4.0-1                                   neuron          
tensorflow-model-server-neuron.x86_64                             2.8.0.2.3.0.0-0                             neuron          
tensorflow-model-server-neuronx.x86_64                            2.10.1.2.11.4.0-0                           neuron       
(aws_neuron_venv_pytorch) [ec2-user@ip-172-31-29-22 text-generation]$ pip list | grep neuron
aws-neuronx-runtime-discovery 2.9
libneuronxla                  2.0.2335
neuronx-cc                    2.13.66.0+6dfecc895
neuronx-distributed           0.7.0
optimum-neuron                0.0.23
torch-neuronx                 2.1.2.2.1.0
transformers-neuronx          0.10.0.21

The run gives the following error:

745142719040221994+6bd63055/model.neff. Exiting with a successfully compiled graph.
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute        [nec_dev 1, gid 1] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/952ba576-b08c-4ac0-b0f1-12e560e5e362/model.MODULE_11203961494150985019+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute        [nec_dev 3, gid 3] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/43f12c55-64e6-45ad-a002-a0676ed72df9/model.MODULE_5948348338269179475+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute        [nec_dev 4, gid 4] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/ebe15e9e-3f50-4061-8ef1-9c80d9a78071/model.MODULE_12660838522657173708+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute        [nec_dev 5, gid 5] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/a12cbe53-3042-4c9a-86d4-4bbcac795471/model.MODULE_15938951179947649509+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute        [nec_dev 6, gid 6] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/eac078e9-79e3-49af-8695-c65797ca89c0/model.MODULE_3281029225498615900+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  TDRV:v2_cc_execute        [nec_dev 7, gid 7] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/59a5b5cd-fff2-4315-a603-8a152f5186ca/model.MODULE_12429740934125521760+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info    [nec_dev 1, gid 1] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/952ba576-b08c-4ac0-b0f1-12e560e5e362/model.MODULE_11203961494150985019+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info    [nec_dev 3, gid 3] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/43f12c55-64e6-45ad-a002-a0676ed72df9/model.MODULE_5948348338269179475+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info    [nec_dev 4, gid 4] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/ebe15e9e-3f50-4061-8ef1-9c80d9a78071/model.MODULE_12660838522657173708+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info    [nec_dev 5, gid 5] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/a12cbe53-3042-4c9a-86d4-4bbcac795471/model.MODULE_15938951179947649509+6bd63055.neff
2024-Jul-17 22:00:32 ERROR  ENC:enc_dump_neff_info    [nec_dev 6, gid 6] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/eac078e9-79e3-49af-8695-c65797ca89c0/model.MODULE_3281029225498615900+6bd63055.neff


### Who can help?

_No response_

### Information

- [X] The official example scripts
- [ ] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction (minimal, reproducible, runnable)

The steps I used are as follows.

Launch the instance with Amazon Linux 2023.
Install the dependencies using the following script.

Configure Linux for Neuron repository updates

sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

Update OS packages

sudo yum update -y

Install OS headers

sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y

Install git

sudo yum install git -y

Install Neuron Driver

sudo yum install aws-neuronx-dkms-2.* -y

Install Neuron Runtime

sudo yum install aws-neuronx-collectives-2.* -y
sudo yum install aws-neuronx-runtime-lib-2.* -y

Install Neuron Tools

sudo yum install aws-neuronx-tools-2.* -y

#Create python3 venv
sudo yum install -y libxcrypt-compat
sudo yum install -y gcc-c++
python3 -m venv /home/ec2-user/aws_neuron_venv_pytorch

#Activate venv
source ~/aws_neuron_venv_pytorch/bin/activate

python -m pip install -U pip

Install Jupyter notebook kernel

pip install ipykernel
python3 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels

Set pip repository pointing to the Neuron repository

python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com

Install wget, awscli

python -m pip install wget
python -m pip install awscli

Install Neuron Compiler and Framework

python -m pip install neuronx-cc==2.* torch-neuronx torchvision

# Install optimum-neuron
pip3 install --upgrade-strategy eager optimum[neuronx]
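As a quick sanity check that the venv can actually reach a Neuron device (a minimal sketch, not part of the original setup; it only relies on the torch_xla backend that torch-neuronx is built on):

import torch
import torch_xla.core.xla_model as xm

# torch-neuronx exposes NeuronCores as XLA devices; running one tiny op
# exercises the driver, runtime and neuronx-cc compiler end to end.
device = xm.xla_device()
x = torch.ones(2, 2, device=device)
print((x + x).cpu())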

Download scripts

git clone https://github.com/huggingface/optimum-neuron.git

cd optimum-neuron/notebooks/text-generation/

Log in with your Hugging Face token to download gated models

huggingface-cli login --token YOUR_TOKEN
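Equivalently, the login can be done from Python via huggingface_hub, which the CLI is built on (shown here only as an alternative):

from huggingface_hub import login

# Same effect as `huggingface-cli login --token YOUR_TOKEN`; the token is stored
# locally so gated models such as meta-llama/Meta-Llama-3-8B can be downloaded.
login(token="YOUR_TOKEN")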

Create a Python file download_data.py to download and process the dataset under the directory optimum-neuron/notebooks/text-generation/:

from random import randint, randrange

from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

print(format_dolly(dataset[randrange(len(dataset))]))

from transformers import AutoTokenizer

# Hugging Face model id
model_id = "meta-llama/Meta-Llama-3-8B"  # gated
# model_id = "meta-llama/Llama-2-7b-hf"  # gated alternative

tokenizer = AutoTokenizer.from_pretrained(model_id)

# add utils method to path for loading dataset
import sys
sys.path.append("./scripts/utils")  # make sure you change this to the correct path
from pack_dataset import pack_dataset

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))

# print random sample
print(dataset[randint(0, len(dataset) - 1)]["text"])

# tokenize dataset
dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
)

# chunk dataset; we use 2048 as the maximum length for packing
lm_dataset = pack_dataset(dataset, chunk_length=2048)

# save train_dataset to disk
dataset_path = "tokenized_dolly"
lm_dataset.save_to_disk(dataset_path)
Run the above script:

python download_data.py
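Before launching the (slow) compile step, a quick check along these lines can confirm that the packed dataset looks sane. This is a sketch, not part of the original repro; it only assumes the tokenized_dolly directory written above and that pack_dataset keeps the input_ids column:

from datasets import load_from_disk

# Confirm that packing produced fixed-length chunks before spending time on the Neuron compile.
lm_dataset = load_from_disk("tokenized_dolly")
print(lm_dataset)
print(len(lm_dataset[0]["input_ids"]))  # expected: 2048, the chunk_length used above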

Compile the finetuning script on inf2.8xlarge with the compile_llama3.sh script

MALLOC_ARENA_MAX=64 neuron_parallel_compile torchrun --nproc_per_node=8 scripts/run_clm.py \
  --model_id "meta-llama/Meta-Llama-3-8B" \
  --dataset_path "tokenized_dolly" \
  --bf16 True \
  --learning_rate 5e-5 \
  --output_dir dolly_llama \
  --overwrite_output_dir True \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing True \
  --tensor_parallel_size 8 \
  --max_steps 10 \
  --logging_steps 10 \
  --gradient_accumulation_steps 16

Run the finetuning on inf2.8xlarge with the run_llama3.sh script

MALLOC_ARENA_MAX=64 torchrun --nproc_per_node=8 scripts/run_clm.py \
  --model_id "meta-llama/Meta-Llama-3-8B" \
  --dataset_path "tokenized_dolly" \
  --bf16 True \
  --learning_rate 5e-5 \
  --output_dir dolly_llama \
  --overwrite_output_dir True \
  --skip_cache_push True \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing True \
  --tensor_parallel_size 8 \
  --num_train_epochs 3 \
  --logging_steps 10 \
  --gradient_accumulation_steps 16


### Expected behavior

The run command should complete the finetuning and report loss/performance numbers.
jianyinglangaws added the bug (Something isn't working) label on Jul 17, 2024
@michaelbenayoun (Member)

It should be fixed on main.
Also, you might encounter an MPMD issue after the first epoch depending on your logging strategy; this is fixed in #654.

@jianyinglangaws (Author)

The script runs with Neuron SDK 2.19.1 and optimum-neuron main. However, the loss value shows nan.

2024-07-22 20:10:28.000737:  280430  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.13.66.0+6dfecc895/MODULE_17784021259853473086+abb26765/model.neff. Exiting with a successfully compiled graph.
{'loss': nan, 'learning_rate': 4.796747967479675e-05, 'epoch': 0.12}                                                                                 
{'loss': nan, 'learning_rate': 4.59349593495935e-05, 'epoch': 0.24}                                                                                  
{'loss': nan, 'learning_rate': 4.390243902439025e-05, 'epoch': 0.36}                                                                                 
{'loss': nan, 'learning_rate': 4.186991869918699e-05, 'epoch': 0.48}                                                                                 
{'loss': nan, 'learning_rate': 3.983739837398374e-05, 'epoch': 0.6}                                                                                  
{'loss': nan, 'learning_rate': 3.780487804878049e-05, 'epoch': 0.72}                                                                                 
{'loss': nan, 'learning_rate': 3.577235772357724e-05, 'epoch': 0.84}                                                                                 
{'loss': nan, 'learning_rate': 3.373983739837399e-05, 'epoch': 0.96}                                                                                 
{'loss': nan, 'learning_rate': 3.170731707317073e-05, 'epoch': 1.09}                                                                                 
{'loss': nan, 'learning_rate': 2.9674796747967482e-05, 'epoch': 1.21}                                                                                
{'loss': nan, 'learning_rate': 2.764227642276423e-05, 'epoch': 1.33}                                                                                 
{'loss': nan, 'learning_rate': 2.5609756097560977e-05, 'epoch': 1.45}                                                                                
{'loss': nan, 'learning_rate': 2.3577235772357724e-05, 'epoch': 1.57}                                                                                
{'loss': nan, 'learning_rate': 2.1544715447154475e-05, 'epoch': 1.69}                                                                                
{'loss': nan, 'learning_rate': 1.9512195121951222e-05, 'epoch': 1.81}                                                                                
{'loss': nan, 'learning_rate': 1.747967479674797e-05, 'epoch': 1.93}                                                                                 
{'loss': nan, 'learning_rate': 1.5447154471544717e-05, 'epoch': 2.05}                                                                                
{'loss': nan, 'learning_rate': 1.3414634146341466e-05, 'epoch': 2.17}                                                                                
{'loss': nan, 'learning_rate': 1.1382113821138211e-05, 'epoch': 2.29}                                                                                
{'loss': nan, 'learning_rate': 9.34959349593496e-06, 'epoch': 2.41}                                                                                  
{'loss': nan, 'learning_rate': 7.317073170731707e-06, 'epoch': 2.53}                                                                                 
{'loss': nan, 'learning_rate': 5.2845528455284555e-06, 'epoch': 2.65}                                                                                
{'loss': nan, 'learning_rate': 3.2520325203252037e-06, 'epoch': 2.77}                                                                                
{'loss': nan, 'learning_rate': 1.2195121951219514e-06, 'epoch': 2.89}      
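
For debugging, a small callback along these lines could stop the run at the first nan loss instead of finishing all three epochs. This is a hypothetical helper, not part of the repro: it uses the standard transformers TrainerCallback API and would have to be registered in scripts/run_clm.py via trainer.add_callback(NanLossStop()).

import math

from transformers import TrainerCallback


class NanLossStop(TrainerCallback):
    """Stop training as soon as a logged loss is NaN so the failing step can be inspected."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and math.isnan(loss):
            print(f"NaN loss detected at step {state.global_step}; stopping.")
            control.should_training_stop = True
        return control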

@github-actions (bot)

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Oct 14, 2024