Neuron SDK 2.18.0 updates #71

Merged: 2 commits, Apr 1, 2024
8 changes: 4 additions & 4 deletions torch-neuronx/README.md
@@ -20,10 +20,10 @@ The following samples are available for training:
| [hf_bert_jp](training/hf_bert_jp) | Fine-tuning & Deployment Hugging Face BERT Japanese model | DataParallel |
| [hf_sentiment_analysis](training/hf_sentiment_analysis) | Examples of training Hugging Face bert-base-cased model for a text classification task with Trn1 Single Neuron and Distributed Training | DataParallel |
| [customop_mlp](training/customop_mlp) | Examples of training a multilayer perceptron model with a custom Relu operator on a single Trn1 | DataParallel |
| [tp_dp_gpt_neox_20b_hf_pretrain](training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain) [Deprecated] | Please note the following sample location has changed to [NeuronX Distributed Repository](https://github.com/aws-neuron/neuronx-distributed). Training GPT-NEOX 20B model using neuronx-distributed | Tensor Parallel & DataParallel |
| [tp_dp_gpt_neox_6.9b_hf_pretrain](training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_6.9b_hf_pretrain) [Deprecated] | Please note the following sample location has changed to [NeuronX Distributed Repository](https://github.com/aws-neuron/neuronx-distributed). Training GPT-NEOX 6.9B model using neuronx-distributed | Tensor Parallel & DataParallel |
| [tp_zero1_llama2_7b_hf_pretrain](training/llama2/tp_zero1_llama2_7b_hf_pretrain) [Deprecated] | Please note the following sample location has changed to [NeuronX Distributed Repository](https://github.com/aws-neuron/neuronx-distributed). Training Llama-2 7B model using neuronx-distributed | Tensor Parallel |
| [tp_pp_llama2_70b_hf_pretrain](training/llama2/tp_pp_llama2_70b_hf_pretrain) [Deprecated] | Please note the following sample location has changed to [NeuronX Distributed Repository](https://github.com/aws-neuron/neuronx-distributed). Training Llama-2 70B model using neuronx-distributed | Tensor Parallel & Pipeline Parallel |
| [tp_dp_gpt_neox_20b_hf_pretrain](https://github.com/aws-neuron/neuronx-distributed/tree/main/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain) | Training GPT-NEOX 20B model using neuronx-distributed | Tensor Parallel & DataParallel |
| [tp_dp_gpt_neox_6.9b_hf_pretrain](https://github.com/aws-neuron/neuronx-distributed/tree/main/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_6.9b_hf_pretrain) | Training GPT-NEOX 6.9B model using neuronx-distributed | Tensor Parallel & DataParallel |
| [tp_zero1_llama2_7b_hf_pretrain](https://github.com/aws-neuron/neuronx-distributed/tree/main/examples/training/llama2/tp_zero1_llama2_7b_hf_pretrain) | Training Llama-2 7B model using neuronx-distributed | Tensor Parallel |
| [tp_pp_llama2_70b_hf_pretrain](https://github.com/aws-neuron/neuronx-distributed/tree/main/examples/training/llama2/tp_pp_llama2_hf_pretrain) | Training Llama-2 70B model using neuronx-distributed | Tensor Parallel & Pipeline Parallel |

## Inference

@@ -54,7 +54,7 @@
"source": [
"%env TOKENIZERS_PARALLELISM=True #Suppresses tokenizer warnings making errors easier to detect\n",
"# torchvision version pinned to avoid pulling in torch 2.0\n",
"!pip install -U transformers torchvision==0.14.1 opencv-python Pillow"
"!pip install -U transformers opencv-python Pillow"
]
},
{
@@ -54,7 +54,7 @@
"source": [
"%env TOKENIZERS_PARALLELISM=True #Suppresses tokenizer warnings making errors easier to detect\n",
"# torchvision version pinned to avoid pulling in torch 2.0\n",
"!pip install -U transformers torchvision==0.14.1 opencv-python Pillow"
"!pip install -U transformers opencv-python Pillow"
]
},
{
@@ -41,6 +41,7 @@
"- `opencv-python-headless`\n",
"- `imageio`\n",
"- `scipy`\n",
"- `accelerate`\n",
"Furthermore, it requires the `ffmpeg` video-audio converter which is used to extract audio from the input videos.\n",
"\n",
"`torch-neuronx` and `neuronx-cc` should be installed when you configure your environment following the Inf2 setup guide. The remaining dependencies can be installed below:"
@@ -53,7 +54,7 @@
"outputs": [],
"source": [
"%env TOKENIZERS_PARALLELISM=True #Suppresses tokenizer warnings making errors easier to detect\n",
"!pip install transformers==4.30.2 opencv-python-headless==4.8.0.74 imageio scipy opencv-python==4.8.0.74\n",
"!pip install transformers==4.30.2 opencv-python-headless==4.8.0.74 imageio scipy accelerate opencv-python==4.8.0.74\n",
"\n",
"!wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz\n",
"!tar xvf ffmpeg-git-amd64-static.tar.xz\n",
107 changes: 93 additions & 14 deletions torch-neuronx/inference/hf_pretrained_sdxl_base_1024_inference.ipynb
@@ -34,13 +34,12 @@
"metadata": {},
"source": [
"**Install Dependencies**\n",
"\n",
"This tutorial requires the following pip packages to be installed:\n",
"- `torch-neuronx`\n",
"- `neuronx-cc`\n",
"- `diffusers==0.20.0`\n",
"- `transformers==4.26.1`\n",
"- `accelerate==0.16.0`\n",
"- `diffusers==0.20.2`\n",
"- `transformers==4.33.1`\n",
"- `accelerate==0.22.0`\n",
"- `matplotlib`\n",
"\n",
"`torch-neuronx` and `neuronx-cc` will be installed when you configure your environment following the Inf2 setup guide. The remaining dependencies can be installed below:"
@@ -53,7 +52,7 @@
"outputs": [],
"source": [
"%env TOKENIZERS_PARALLELISM=True #Suppresses tokenizer warnings making errors easier to detect\n",
"!pip install diffusers==0.20.0 transformers==4.26.1 accelerate==0.16.0 matplotlib"
"!pip install diffusers==0.20.2 transformers==4.33.1 accelerate==0.22.0 matplotlib"
]
},
{
@@ -79,7 +78,8 @@
"from diffusers import DiffusionPipeline\n",
"from diffusers.models.unet_2d_condition import UNet2DConditionOutput\n",
"from diffusers.models.attention_processor import Attention\n",
" \n",
"from transformers.models.clip.modeling_clip import CLIPTextModelOutput\n",
"\n",
"from matplotlib import pyplot as plt\n",
"from matplotlib import image as mpimg\n",
"import time\n",
@@ -96,12 +96,12 @@
"source": [
"**Define utility classes and functions**\n",
"\n",
"The following section defines some utility classes and functions. In particular, we define a double-wrapper for the UNet. These wrappers enable `torch_neuronx.trace` to trace the wrapped models for compilation with the Neuron compiler. In addition, the `get_attention_scores_neuron` utility function performs optimized attention score calculation and is used to replace the original `get_attention_scores` function in the `diffusers` package via a monkey patch (see the next code block under \"Compile UNet and save\" for usage)."
"The following section defines some utility classes and functions. In particular, we define a double-wrapper for the UNet and text encoders. These wrappers enable `torch_neuronx.trace` to trace the wrapped models for compilation with the Neuron compiler. The second wrapper enables the compiled model (which is a TorchScript object and so loses its pre-compilation attributes) to be used in the pipeline without having to modify the pipeline source code. In addition, the `get_attention_scores_neuron` utility function performs optimized attention score calculation and is used to replace the original `get_attention_scores` function in the `diffusers` package via a monkey patch (see the next code block under \"Compile UNet and save\" for usage)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -160,7 +160,29 @@
" encoder_hidden_states,\n",
" added_cond_kwargs[\"text_embeds\"],\n",
" added_cond_kwargs[\"time_ids\"])[0]\n",
" return UNet2DConditionOutput(sample=sample)"
" return UNet2DConditionOutput(sample=sample)\n",
" \n",
"\n",
"class TextEncoderOutputWrapper(nn.Module):\n",
" def __init__(self, traceable_text_encoder, original_text_encoder):\n",
" super().__init__()\n",
" self.traceable_text_encoder = traceable_text_encoder\n",
" self.config = original_text_encoder.config\n",
" self.dtype = original_text_encoder.dtype\n",
" self.device = original_text_encoder.device\n",
"\n",
" def forward(self, text_input_ids, output_hidden_states=True):\n",
" out_tuple = self.traceable_text_encoder(text_input_ids)\n",
" return CLIPTextModelOutput(text_embeds=out_tuple[0], last_hidden_state=out_tuple[1], hidden_states=out_tuple[2])\n",
" \n",
"class TraceableTextEncoder(nn.Module):\n",
" def __init__(self, text_encoder):\n",
" super().__init__()\n",
" self.text_encoder = text_encoder\n",
"\n",
" def forward(self, text_input_ids):\n",
" out_tuple = self.text_encoder(text_input_ids, output_hidden_states=True, return_dict=False)\n",
" return out_tuple"
]
},
{
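Note on the hunk above: the two text-encoder classes follow a double-wrapper pattern that can be illustrated without any Neuron or PyTorch dependency. The following is a minimal sketch in plain Python, not the notebook's code. `OriginalEncoder`, `NamedOutput`, and `fake_compile` are hypothetical stand-ins for the CLIP text encoder, `CLIPTextModelOutput`, and `torch_neuronx.trace`. The inner wrapper pins keyword arguments so the call returns a plain tuple; the outer wrapper restores the attributes and the named output that the pipeline expects.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class NamedOutput:
    """Hypothetical stand-in for a structured output such as CLIPTextModelOutput."""
    text_embeds: Any
    last_hidden_state: Any
    hidden_states: Any


class OriginalEncoder:
    """Hypothetical model with attributes and a structured return value."""
    config = {"hidden_size": 4}
    dtype = "float32"

    def __call__(self, ids, return_dict=True):
        doubled = [i * 2 for i in ids]
        out = (sum(ids), doubled, ([0] * len(ids), doubled))
        return NamedOutput(*out) if return_dict else out


class TraceableWrapper:
    """Inner wrapper: fixes keyword arguments so the call returns a plain tuple."""

    def __init__(self, model):
        self.model = model

    def __call__(self, ids):
        return self.model(ids, return_dict=False)


class OutputWrapper:
    """Outer wrapper: copies attributes from the original model and
    repackages the tuple into the named output type."""

    def __init__(self, compiled_fn, original):
        self.compiled_fn = compiled_fn
        self.config = original.config
        self.dtype = original.dtype

    def __call__(self, ids, output_hidden_states=True):
        out = self.compiled_fn(ids)
        return NamedOutput(text_embeds=out[0],
                           last_hidden_state=out[1],
                           hidden_states=out[2])


def fake_compile(wrapper):
    """Stand-in for tracing: keeps only the call and drops every attribute."""
    return lambda ids: wrapper(ids)


original = OriginalEncoder()
compiled = fake_compile(TraceableWrapper(original))
restored = OutputWrapper(compiled, original)
result = restored([1, 2, 3])
```

In this sketch `compiled` has no `config` or `dtype`, while `restored` exposes both again, which mirrors why `TextEncoderOutputWrapper` takes the original encoder as a second argument.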
@@ -171,9 +193,10 @@
"**Compile the model into an optimized TorchScript and save the TorchScript**\n",
"\n",
"In the following section, we will compile parts of the Stable Diffusion pipeline for execution on Neuron. Note that this only needs to be done once: After you have compiled and saved the model by running the following section of code, you can reuse it any number of times without having to recompile. In particular, we will compile:\n",
"1. The VAE decoder;\n",
"2. The UNet, and\n",
"3. The VAE_post_quant_conv\n",
"1. The text encoders (text_encoder, text_encoder_2)\n",
"2. The VAE decoder;\n",
"3. The UNet, and\n",
"4. The VAE_post_quant_conv\n",
"These blocks are chosen because they represent the bulk of the compute in the pipeline, and performance benchmarking has shown that running them on Neuron yields significant performance benefit.\n",
"\n",
"Several points worth noting are:\n",
@@ -193,6 +216,58 @@
"# Model ID for SD XL version pipeline\n",
"model_id = \"stabilityai/stable-diffusion-xl-base-1.0\"\n",
"\n",
"# --- Compile Text Encoders and save ---\n",
"\n",
"pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)\n",
"\n",
"\n",
"# Apply wrappers to make text encoders traceable\n",
"traceable_text_encoder = copy.deepcopy(TraceableTextEncoder(pipe.text_encoder))\n",
"traceable_text_encoder_2 = copy.deepcopy(TraceableTextEncoder(pipe.text_encoder_2))\n",
"\n",
"del pipe\n",
"\n",
"text_input_ids_1 = torch.tensor([[49406, 736, 1615, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
" 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
" 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
" 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
" 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
" 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
" 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
" 49407, 49407, 49407, 49407, 49407, 49407, 49407]])\n",
"\n",
"\n",
"text_input_ids_2 = torch.tensor([[49406, 736, 1615, 49407, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
" 0, 0, 0, 0, 0, 0, 0]])\n",
"\n",
"\n",
"# Text Encoder 1\n",
"neuron_text_encoder = torch_neuronx.trace(\n",
" traceable_text_encoder,\n",
" text_input_ids_1,\n",
" compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder'),\n",
")\n",
"\n",
"text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')\n",
"torch.jit.save(neuron_text_encoder, text_encoder_filename)\n",
"\n",
"\n",
"# Text Encoder 2\n",
"neuron_text_encoder_2 = torch_neuronx.trace(\n",
" traceable_text_encoder_2,\n",
" text_input_ids_2,\n",
" compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder_2'),\n",
")\n",
"\n",
"text_encoder_2_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder_2/model.pt')\n",
"torch.jit.save(neuron_text_encoder_2, text_encoder_2_filename)\n",
"\n",
"# --- Compile VAE decoder and save ---\n",
"\n",
"# Only keep the model being compiled in RAM to minimize memory pressure\n",
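Note on the example inputs in the hunk above: both `text_input_ids_1` and `text_input_ids_2` are single rows of 77 token ids that share the same first four values and differ only in the padding value (49407 versus 0). A minimal sketch of how such rows could be built in plain Python follows. `pad_ids` is a hypothetical helper that does not appear in the notebook, and reading the two padding values as each tokenizer's own padding convention is an assumption.

```python
MAX_LENGTH = 77                         # length of each example row in the diff
PROMPT_IDS = [49406, 736, 1615, 49407]  # leading ids shared by both examples


def pad_ids(ids, pad_id, length=MAX_LENGTH):
    """Right-pad a list of token ids to a fixed length and add a batch dimension."""
    if len(ids) > length:
        raise ValueError("prompt is longer than the fixed sequence length")
    return [ids + [pad_id] * (length - len(ids))]


example_ids_1 = pad_ids(PROMPT_IDS, pad_id=49407)  # matches text_input_ids_1
example_ids_2 = pad_ids(PROMPT_IDS, pad_id=0)      # matches text_input_ids_2
```

Passing either nested list to `torch.tensor(...)` would give a tensor of shape `(1, 77)`, the same shape as the literals in the diff.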
@@ -296,14 +371,16 @@
},
"outputs": [],
"source": [
"# --- Load all compiled models ---\n",
"# --- Load all compiled models and run pipeline ---\n",
"COMPILER_WORKDIR_ROOT = 'sdxl_compile_dir_1024'\n",
"model_id = \"stabilityai/stable-diffusion-xl-base-1.0\"\n",
"text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')\n",
"text_encoder_2_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder_2/model.pt')\n",
"decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')\n",
"unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')\n",
"post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')\n",
"\n",
"pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32, low_cpu_mem_usage=True)\n",
"pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)\n",
"\n",
"# Load the compiled UNet onto two neuron cores.\n",
"pipe.unet = NeuronUNet(UNetWrap(pipe.unet))\n",
@@ -313,6 +390,8 @@
"# Load other compiled models onto a single neuron core.\n",
"pipe.vae.decoder = torch.jit.load(decoder_filename)\n",
"pipe.vae.post_quant_conv = torch.jit.load(post_quant_conv_filename)\n",
"pipe.text_encoder = TextEncoderOutputWrapper(torch.jit.load(text_encoder_filename), pipe.text_encoder)\n",
"pipe.text_encoder_2 = TextEncoderOutputWrapper(torch.jit.load(text_encoder_2_filename), pipe.text_encoder_2)\n",
"\n",
"# Run pipeline\n",
"prompt = [\"a photo of an astronaut riding a horse on mars\",\n",
@@ -62,17 +62,14 @@ OUTPUT_DIR="/llama_checkpoints"
CURRENT_BATCH_JOB_ID=$(echo "$AWS_BATCH_JOB_ID" | sed 's/#.*//')
CHECKPOINT_PATH="$CHECKPOINT_SAVE_URI$CURRENT_BATCH_JOB_ID"

NODE_ID=0
WORLD_SIZE=1

if [ -v AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS ]
then
export MASTER_ADDR=$AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS
else
export MASTER_ADDR=`ip -f inet addr show eth0 | grep -Po 'inet \K[\d.]+'`
fi

DP=$(($NEURON_RT_NUM_CORES * $WORLD_SIZE / $TP_DEGREE))
DP=$(($NEURON_RT_NUM_CORES * $NTASKS / $TP_DEGREE))
ACC_STEPS=$(($GBS / $MBS / $DP))

EXTRA_ARGS=" "
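Note on the `DP` and `ACC_STEPS` lines in the hunk above: the script derives the data-parallel degree from the total core count (cores per node times `NTASKS`) divided by the tensor-parallel degree, then derives gradient-accumulation steps from the global and micro batch sizes. A small Python sketch of the same integer arithmetic follows. The function name and the sample values are illustrative assumptions, not values taken from the script.

```python
def parallel_layout(num_cores, ntasks, tp_degree, gbs, mbs):
    """Mirror the shell arithmetic:
    DP = NEURON_RT_NUM_CORES * NTASKS / TP_DEGREE
    ACC_STEPS = GBS / MBS / DP
    using integer division, as bash $(( )) does for positive operands."""
    dp = num_cores * ntasks // tp_degree
    acc_steps = gbs // mbs // dp
    return dp, acc_steps


# Illustrative values only: 4 nodes with 32 cores each, tensor parallel degree 8,
# global batch size 1024, micro batch size 1.
dp, acc_steps = parallel_layout(num_cores=32, ntasks=4, tp_degree=8, gbs=1024, mbs=1)
```

With these sample values the layout is consistent when `mbs * dp * acc_steps` equals the global batch size. If the divisions are not exact, integer truncation silently shrinks the effective global batch.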
@@ -190,4 +190,3 @@
"nbformat": 4,
"nbformat_minor": 2
}

2 changes: 1 addition & 1 deletion torch-neuronx/training/hf_summarization/BartLarge.ipynb
@@ -38,7 +38,7 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install -U optimum-neuron==0.0.15 accelerate==0.23.0 datasets>=1.8.0 sentencepiece!=0.1.92 protobuf==3.20.3 rouge-score nltk py7zr evaluate\n",
"%pip install -U optimum-neuron==0.0.15 accelerate==0.23.0 datasets>=1.8.0 sentencepiece!=0.1.92 rouge-score nltk py7zr evaluate\n",
"# now restart the kernel"
]
},
2 changes: 1 addition & 1 deletion torch-neuronx/training/hf_summarization/T5Large.ipynb
@@ -38,7 +38,7 @@
"metadata": {},
"outputs": [],
"source": [
"%pip install -U optimum-neuron==0.0.15 accelerate==0.23.0 datasets>=1.8.0 sentencepiece!=0.1.92 protobuf==3.20.3 rouge-score nltk py7zr evaluate\n",
"%pip install -U optimum-neuron==0.0.15 accelerate==0.23.0 datasets>=1.8.0 sentencepiece!=0.1.92 rouge-score nltk py7zr evaluate\n",
"# now restart the kernel"
]
},
@@ -12,7 +12,7 @@
"1. First compile the model using the utility `neuron_parallel_compile` to compile the model to run on the AWS Trainium device.\n",
"1. Run the fine-tuning script to train the model based on the associated task (e.g. mrpc). The training job will use 2 workers with data parallel to speed up the training. If you have a larger instance (trn1.32xlarge) you can increase the worker count to 8 or 32.\n",
"\n",
"It has been tested and run on a trn1.2xlarge\n",
"It has been tested and run on a trn1.32xlarge\n",
"\n",
"**Reference:** https://huggingface.co/xlm-roberta-base"
]
@@ -73,13 +73,13 @@
"outputs": [],
"source": [
"model_name = \"xlm-roberta-base\"\n",
"env_var_options = \"\"\n",
"num_workers = 2\n",
"env_var_options = \"XLA_USE_BF16=1 NEURON_CC_FLAGS=\\'--model-type=transformer --verbose=info\\'\"\n",
"num_workers = 32\n",
"task_name = \"mrpc\"\n",
"batch_size = 8\n",
"max_seq_length = 128\n",
"batch_size = 16\n",
"max_seq_length = 512\n",
"learning_rate = 2e-05\n",
"num_train_epochs = 5\n",
"num_train_epochs = 100\n",
"model_base_name = model_name"
]
},