aws-neuron · aws-rxgupta · Apr 1, 2024 · Mar 30, 2024 · Apr 1, 2024
@@ -20,10 +20,10 @@ The following samples are available for training:
 | [hf_bert_jp](training/hf_bert_jp)                           | Fine-tuning & Deployment Hugging Face BERT Japanese model                                                                               | DataParallel |
 | [hf_sentiment_analysis](training/hf_sentiment_analysis)     | Examples of training Hugging Face bert-base-cased model for a text classification task with Trn1 Single Neuron and Distributed Training | DataParallel |
 | [customop_mlp](training/customop_mlp)     | Examples of training a multilayer perceptron model with a custom Relu operator on a single Trn1 | DataParallel |
-| [tp_dp_gpt_neox_20b_hf_pretrain](training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain) [Deprecated]   | Please note the following sample location has changed to [NeuronX Distributed Repository](https://github.com/aws-neuron/neuronx-distributed). Training GPT-NEOX 20B model using neuronx-distributed | Tensor Parallel & DataParallel |
-| [tp_dp_gpt_neox_6.9b_hf_pretrain](training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_6.9b_hf_pretrain) [Deprecated] | Please note the following sample location has changed to [NeuronX Distributed Repository](https://github.com/aws-neuron/neuronx-distributed). Training GPT-NEOX 6.9B model using neuronx-distributed | Tensor Parallel & DataParallel |
-| [tp_zero1_llama2_7b_hf_pretrain](training/llama2/tp_zero1_llama2_7b_hf_pretrain) [Deprecated] | Please note the following sample location has changed to [NeuronX Distributed Repository](https://github.com/aws-neuron/neuronx-distributed). Training Llama-2 7B model using neuronx-distributed | Tensor Parallel |
-| [tp_pp_llama2_70b_hf_pretrain](training/llama2/tp_pp_llama2_70b_hf_pretrain) [Deprecated] | Please note the following sample location has changed to [NeuronX Distributed Repository](https://github.com/aws-neuron/neuronx-distributed). Training Llama-2 70B model using neuronx-distributed | Tensor Parallel & Pipeline Parallel |
+| [tp_dp_gpt_neox_20b_hf_pretrain](https://github.com/aws-neuron/neuronx-distributed/tree/main/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_20b_hf_pretrain) | Training GPT-NEOX 20B model using neuronx-distributed | Tensor Parallel & DataParallel |
+| [tp_dp_gpt_neox_6.9b_hf_pretrain](https://github.com/aws-neuron/neuronx-distributed/tree/main/examples/training/tp_dp_gpt_neox_hf_pretrain/tp_dp_gpt_neox_6.9b_hf_pretrain) | Training GPT-NEOX 6.9B model using neuronx-distributed | Tensor Parallel & DataParallel |
+| [tp_zero1_llama2_7b_hf_pretrain](https://github.com/aws-neuron/neuronx-distributed/tree/main/examples/training/llama2/tp_zero1_llama2_7b_hf_pretrain) | Training Llama-2 7B model using neuronx-distributed | Tensor Parallel |
+| [tp_pp_llama2_70b_hf_pretrain](https://github.com/aws-neuron/neuronx-distributed/tree/main/examples/training/llama2/tp_pp_llama2_hf_pretrain) | Training Llama-2 70B model using neuronx-distributed | Tensor Parallel & Pipeline Parallel |
 
 ## Inference
 

@@ -54,7 +54,7 @@
             "source": [
                 "%env TOKENIZERS_PARALLELISM=True #Supresses tokenizer warnings making errors easier to detect\n",
                 "# torchvision version pinned to avoid pulling in torch 2.0\n",
-                "!pip install -U transformers torchvision==0.14.1 opencv-python Pillow"
+                "!pip install -U transformers opencv-python Pillow"
             ]
         },
         {

@@ -54,7 +54,7 @@
             "source": [
                 "%env TOKENIZERS_PARALLELISM=True #Supresses tokenizer warnings making errors easier to detect\n",
                 "# torchvision version pinned to avoid pulling in torch 2.0\n",
-                "!pip install -U transformers torchvision==0.14.1 opencv-python Pillow"
+                "!pip install -U transformers opencv-python Pillow"
             ]
         },
         {

@@ -41,6 +41,7 @@
                 "- `opencv-python-headless`\n",
                 "- `imageio`\n",
                 "- `scipy`\n",
+                "- `accelerate`\n",
                 "Furthermore, it requires the `ffmpeg` video-audio converter which is used to extract audio from the input videos.\n",
                 "\n",
                 "`torch-neuronx` and `neuronx-cc` should be installed when you configure your environment following the Inf2 setup guide. The remaining dependencies can be installed below:"
@@ -53,7 +54,7 @@
             "outputs": [],
             "source": [
                 "%env TOKENIZERS_PARALLELISM=True #Supresses tokenizer warnings making errors easier to detect\n",
-                "!pip install transformers==4.30.2 opencv-python-headless==4.8.0.74 imageio scipy opencv-python==4.8.0.74\n",
+                "!pip install transformers==4.30.2 opencv-python-headless==4.8.0.74 imageio scipy accelerate opencv-python==4.8.0.74\n",
                 "\n",
                 "!wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz\n",
                 "!tar xvf ffmpeg-git-amd64-static.tar.xz\n",

@@ -34,13 +34,12 @@
             "metadata": {},
             "source": [
                 "**Install Dependencies**\n",
-                "\n",
                 "This tutorial requires the following pip packages to be installed:\n",
                 "- `torch-neuronx`\n",
                 "- `neuronx-cc`\n",
-                "- `diffusers==0.20.0`\n",
-                "- `transformers==4.26.1`\n",
-                "- `accelerate==0.16.0`\n",
+                "- `diffusers==0.20.2`\n",
+                "- `transformers==4.33.1`\n",
+                "- `accelerate==0.22.0`\n",
                 "- `matplotlib`\n",
                 "\n",
                 "`torch-neuronx` and `neuronx-cc` will be installed when you configure your environment following the Inf2 setup guide. The remaining dependencies can be installed below:"
@@ -53,7 +52,7 @@
             "outputs": [],
             "source": [
                 "%env TOKENIZERS_PARALLELISM=True #Supresses tokenizer warnings making errors easier to detect\n",
-                "!pip install diffusers==0.20.0 transformers==4.26.1 accelerate==0.16.0 matplotlib"
+                "!pip install diffusers==0.20.2 transformers==4.33.1 accelerate==0.22.0 matplotlib"
             ]
         },
         {
@@ -79,7 +78,8 @@
                 "from diffusers import DiffusionPipeline\n",
                 "from diffusers.models.unet_2d_condition import UNet2DConditionOutput\n",
                 "from diffusers.models.attention_processor import Attention\n",
-                " \n",
+                "from transformers.models.clip.modeling_clip import CLIPTextModelOutput\n",
+                "\n",
                 "from matplotlib import pyplot as plt\n",
                 "from matplotlib import image as mpimg\n",
                 "import time\n",
@@ -96,12 +96,12 @@
             "source": [
                 "**Define utility classes and functions**\n",
                 "\n",
-                "The following section defines some utility classes and functions. In particular, we define a double-wrapper for the UNet. These wrappers enable `torch_neuronx.trace` to trace the wrapped models for compilation with the Neuron compiler. In addition, the `get_attention_scores_neuron` utility function performs optimized attention score calculation and is used to replace the origianl `get_attention_scores` function in the `diffusers` package via a monkey patch (see the next code block under \"Compile UNet and save\" for usage)."
+                "The following section defines some utility classes and functions. In particular, we define a double-wrapper for the UNet and for the text encoders. The first wrapper enables `torch_neuronx.trace` to trace the wrapped model for compilation with the Neuron compiler. The second wrapper enables the compiled model (a TorchScript object, which therefore loses its pre-compilation attributes) to be used in the pipeline without modifying the pipeline source code. In addition, the `get_attention_scores_neuron` utility function performs an optimized attention score calculation and replaces the original `get_attention_scores` function in the `diffusers` package via a monkey patch (see the next code block under \"Compile UNet and save\" for usage)."
             ]
         },
         {
             "cell_type": "code",
-            "execution_count": 4,
+            "execution_count": 3,
             "metadata": {},
             "outputs": [],
             "source": [
@@ -160,7 +160,29 @@
                 "                               encoder_hidden_states,\n",
                 "                               added_cond_kwargs[\"text_embeds\"],\n",
                 "                               added_cond_kwargs[\"time_ids\"])[0]\n",
-                "        return UNet2DConditionOutput(sample=sample)"
+                "        return UNet2DConditionOutput(sample=sample)\n",
+                "    \n",
+                "\n",
+                "class TextEncoderOutputWrapper(nn.Module):\n",
+                "    def __init__(self, traceable_text_encoder, original_text_encoder):\n",
+                "        super().__init__()\n",
+                "        self.traceable_text_encoder = traceable_text_encoder\n",
+                "        self.config = original_text_encoder.config\n",
+                "        self.dtype = original_text_encoder.dtype\n",
+                "        self.device = original_text_encoder.device\n",
+                "\n",
+                "    def forward(self, text_input_ids, output_hidden_states=True):\n",
+                "        out_tuple = self.traceable_text_encoder(text_input_ids)\n",
+                "        return CLIPTextModelOutput(text_embeds=out_tuple[0], last_hidden_state=out_tuple[1], hidden_states=out_tuple[2])\n",
+                "    \n",
+                "class TraceableTextEncoder(nn.Module):\n",
+                "    def __init__(self, text_encoder):\n",
+                "        super().__init__()\n",
+                "        self.text_encoder = text_encoder\n",
+                "\n",
+                "    def forward(self, text_input_ids):\n",
+                "        out_tuple = self.text_encoder(text_input_ids, output_hidden_states=True, return_dict=False)\n",
+                "        return out_tuple"
             ]
         },
         {
@@ -171,9 +193,10 @@
                 "**Compile the model into an optimized TorchScript and save the TorchScript**\n",
                 "\n",
                 "In the following section, we will compile parts of the Stable Diffusion pipeline for execution on Neuron. Note that this only needs to be done once: After you have compiled and saved the model by running the following section of code, you can reuse it any number of times without having to recompile. In particular, we will compile:\n",
-                "1. The VAE decoder;\n",
-                "2. The UNet, and\n",
-                "3. The VAE_post_quant_conv\n",
+                "1. The text encoders (text_encoder, text_encoder_2)\n",
+                "2. The VAE decoder;\n",
+                "3. The UNet, and\n",
+                "4. The VAE_post_quant_conv\n",
                 "These blocks are chosen because they represent the bulk of the compute in the pipeline, and performance benchmarking has shown that running them on Neuron yields significant performance benefit.\n",
                 "\n",
                 "Several points worth noting are:\n",
@@ -193,6 +216,58 @@
                 "# Model ID for SD XL version pipeline\n",
                 "model_id = \"stabilityai/stable-diffusion-xl-base-1.0\"\n",
                 "\n",
+                "# --- Compile Text Encoders and save ---\n",
+                "\n",
+                "pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)\n",
+                "\n",
+                "\n",
+                "# Apply wrappers to make text encoders traceable\n",
+                "traceable_text_encoder = copy.deepcopy(TraceableTextEncoder(pipe.text_encoder))\n",
+                "traceable_text_encoder_2 = copy.deepcopy(TraceableTextEncoder(pipe.text_encoder_2))\n",
+                "\n",
+                "del pipe\n",
+                "\n",
+                "text_input_ids_1 = torch.tensor([[49406,   736,  1615, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
+                "         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
+                "         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
+                "         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
+                "         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
+                "         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
+                "         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,\n",
+                "         49407, 49407, 49407, 49407, 49407, 49407, 49407]])\n",
+                "\n",
+                "\n",
+                "text_input_ids_2 = torch.tensor([[49406,   736,  1615, 49407,     0,     0,     0,     0,     0,     0,\n",
+                "             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
+                "             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
+                "             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
+                "             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
+                "             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
+                "             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,\n",
+                "             0,     0,     0,     0,     0,     0,     0]])\n",
+                "\n",
+                "\n",
+                "# Text Encoder 1\n",
+                "neuron_text_encoder = torch_neuronx.trace(\n",
+                "    traceable_text_encoder,\n",
+                "    text_input_ids_1,\n",
+                "    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder'),\n",
+                ")\n",
+                "\n",
+                "text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')\n",
+                "torch.jit.save(neuron_text_encoder, text_encoder_filename)\n",
+                "\n",
+                "\n",
+                "# Text Encoder 2\n",
+                "neuron_text_encoder_2 = torch_neuronx.trace(\n",
+                "    traceable_text_encoder_2,\n",
+                "    text_input_ids_2,\n",
+                "    compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder_2'),\n",
+                ")\n",
+                "\n",
+                "text_encoder_2_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder_2/model.pt')\n",
+                "torch.jit.save(neuron_text_encoder_2, text_encoder_2_filename)\n",
+                "\n",
                 "# --- Compile VAE decoder and save ---\n",
                 "\n",
                 "# Only keep the model being compiled in RAM to minimze memory pressure\n",
@@ -296,14 +371,16 @@
             },
             "outputs": [],
             "source": [
-                "# --- Load all compiled models ---\n",
+                "# --- Load all compiled models and run pipeline ---\n",
                 "COMPILER_WORKDIR_ROOT = 'sdxl_compile_dir_1024'\n",
                 "model_id = \"stabilityai/stable-diffusion-xl-base-1.0\"\n",
+                "text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder/model.pt')\n",
+                "text_encoder_2_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'text_encoder_2/model.pt')\n",
                 "decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_decoder/model.pt')\n",
                 "unet_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'unet/model.pt')\n",
                 "post_quant_conv_filename = os.path.join(COMPILER_WORKDIR_ROOT, 'vae_post_quant_conv/model.pt')\n",
                 "\n",
-                "pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32, low_cpu_mem_usage=True)\n",
+                "pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32)\n",
                 "\n",
                 "# Load the compiled UNet onto two neuron cores.\n",
                 "pipe.unet = NeuronUNet(UNetWrap(pipe.unet))\n",
@@ -313,6 +390,8 @@
                 "# Load other compiled models onto a single neuron core.\n",
                 "pipe.vae.decoder = torch.jit.load(decoder_filename)\n",
                 "pipe.vae.post_quant_conv = torch.jit.load(post_quant_conv_filename)\n",
+                "pipe.text_encoder = TextEncoderOutputWrapper(torch.jit.load(text_encoder_filename), pipe.text_encoder)\n",
+                "pipe.text_encoder_2 = TextEncoderOutputWrapper(torch.jit.load(text_encoder_2_filename), pipe.text_encoder_2)\n",
                 "\n",
                 "# Run pipeline\n",
                 "prompt = [\"a photo of an astronaut riding a horse on mars\",\n",

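The double-wrapper pattern used for the text encoders in the hunks above can be illustrated without torch or Neuron installed. This is only a minimal sketch: `FakeEncoder`, `EncoderOutput`, `fake_trace`, `TraceableEncoder`, and `OutputWrapper` are hypothetical stand-ins (for the Hugging Face text encoder, `CLIPTextModelOutput`, `torch_neuronx.trace`, `TraceableTextEncoder`, and `TextEncoderOutputWrapper` respectively) and are not part of this PR.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class EncoderOutput:
    """Hypothetical stand-in for CLIPTextModelOutput."""
    text_embeds: list
    last_hidden_state: list
    hidden_states: tuple


class FakeEncoder:
    """Hypothetical stand-in for a text encoder with config/dtype/device."""
    config = SimpleNamespace(hidden_size=4)
    dtype = "float32"
    device = "cpu"

    def __call__(self, ids, output_hidden_states=False, return_dict=True):
        embeds = [float(sum(ids))]
        last = [float(i) for i in ids]
        hidden = (last, last)
        if return_dict:
            return EncoderOutput(embeds, last, hidden)
        return (embeds, last, hidden)


class TraceableEncoder:
    """Inner wrapper: pins keyword arguments and returns a plain tuple."""
    def __init__(self, encoder):
        self.encoder = encoder

    def __call__(self, ids):
        return self.encoder(ids, output_hidden_states=True, return_dict=False)


def fake_trace(module):
    """Stand-in for tracing: the result is a bare callable without attributes."""
    return lambda ids: module(ids)


class OutputWrapper:
    """Outer wrapper: restores the attributes and the structured output."""
    def __init__(self, traced, original):
        self.traced = traced
        self.config = original.config
        self.dtype = original.dtype
        self.device = original.device

    def __call__(self, ids, output_hidden_states=True):
        embeds, last, hidden = self.traced(ids)
        return EncoderOutput(text_embeds=embeds,
                             last_hidden_state=last,
                             hidden_states=hidden)


original = FakeEncoder()
traced = fake_trace(TraceableEncoder(original))
wrapped = OutputWrapper(traced, original)
result = wrapped([49406, 736, 1615, 49407])
print(result.text_embeds, wrapped.config.hidden_size)
```

The inner wrapper exists because a tracer typically needs fixed arguments and tuple outputs. The outer wrapper exists because the calling pipeline reads attributes such as `config` and expects a named output object, both of which the traced callable no longer provides.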
@@ -62,17 +62,14 @@ OUTPUT_DIR="/llama_checkpoints"
 CURRENT_BATCH_JOB_ID=$(echo "$AWS_BATCH_JOB_ID" | sed 's/#.*//')
 CHECKPOINT_PATH="$CHECKPOINT_SAVE_URI$CURRENT_BATCH_JOB_ID"
 
-NODE_ID=0
-WORLD_SIZE=1
-
 if [ -v AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS ]
 then
 	export MASTER_ADDR=$AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS
 else
 	export MASTER_ADDR=`ip -f inet addr show eth0 | grep -Po 'inet \K[\d.]+'`
 fi
 
-DP=$(($NEURON_RT_NUM_CORES * $WORLD_SIZE / $TP_DEGREE))
+DP=$(($NEURON_RT_NUM_CORES * $NTASKS / $TP_DEGREE))
 ACC_STEPS=$(($GBS / $MBS / $DP))
 
 EXTRA_ARGS=" "
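
The arithmetic in the hunk above can be checked with a small Python sketch. It assumes `NTASKS` is the number of nodes in the job (it is set outside this hunk, so that reading is a guess), and all numeric values below are hypothetical examples. The helper `parallel_layout` is not part of the script.

```python
def parallel_layout(cores_per_node, ntasks, tp_degree, gbs, mbs):
    """Mirror DP and ACC_STEPS from the shell script, with explicit checks.

    cores_per_node -> NEURON_RT_NUM_CORES, ntasks -> NTASKS,
    tp_degree -> TP_DEGREE, gbs -> GBS, mbs -> MBS.
    """
    total_cores = cores_per_node * ntasks
    if total_cores % tp_degree != 0:
        raise ValueError("total cores must be divisible by the TP degree")
    dp = total_cores // tp_degree

    if gbs % (mbs * dp) != 0:
        raise ValueError("GBS must be divisible by MBS * DP")
    acc_steps = gbs // mbs // dp
    return dp, acc_steps


# Hypothetical values: 32 cores per node, TP degree 8, GBS 1024, MBS 1.
print(parallel_layout(32, 1, 8, 1024, 1))  # single node
print(parallel_layout(32, 4, 8, 1024, 1))  # four nodes
```

With the previously hard-coded `WORLD_SIZE=1`, the data-parallel degree did not grow with the node count, so `ACC_STEPS` would stay at its single-node value on multi-node jobs. The shell version uses truncating integer division and silently rounds down; the sketch raises an error instead so that mismatched settings are visible.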

@@ -190,4 +190,3 @@
     "nbformat": 4,
     "nbformat_minor": 2
 }
-
@@ -38,7 +38,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "%pip install -U optimum-neuron==0.0.15 accelerate==0.23.0 datasets>=1.8.0 sentencepiece!=0.1.92 protobuf==3.20.3 rouge-score nltk py7zr evaluate\n",
+    "%pip install -U optimum-neuron==0.0.15 accelerate==0.23.0 datasets>=1.8.0 sentencepiece!=0.1.92 rouge-score nltk py7zr evaluate\n",
     "# now restart the kernel"
    ]
   },

@@ -38,7 +38,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "%pip install -U optimum-neuron==0.0.15 accelerate==0.23.0 datasets>=1.8.0 sentencepiece!=0.1.92 protobuf==3.20.3 rouge-score nltk py7zr evaluate\n",
+    "%pip install -U optimum-neuron==0.0.15 accelerate==0.23.0 datasets>=1.8.0 sentencepiece!=0.1.92 rouge-score nltk py7zr evaluate\n",
     "# now restart the kernel"
    ]
   },

@@ -12,7 +12,7 @@
                 "1. First compile the model using the utility `neuron_parallel_compile` to compile the model to run on the AWS Trainium device.\n",
                 "1. Run the fine-tuning script to train the model based on the associated task (e.g. mrpc). The training job will use 2 workers with data parallel to speed up the training. If you have a larger instance (trn1.32xlarge) you can increase the worker count to 8 or 32.\n",
                 "\n",
-                "It has been tested and run on a trn1.2xlarge\n",
+                "It has been tested and run on a trn1.32xlarge\n",
                 "\n",
                 "**Reference:** https://huggingface.co/xlm-roberta-base"
             ]
@@ -73,13 +73,13 @@
             "outputs": [],
             "source": [
                 "model_name = \"xlm-roberta-base\"\n",
-                "env_var_options = \"\"\n",
-                "num_workers = 2\n",
+                "env_var_options = \"XLA_USE_BF16=1 NEURON_CC_FLAGS=\\'--model-type=transformer --verbose=info\\'\"\n",
+                "num_workers = 32\n",
                 "task_name = \"mrpc\"\n",
-                "batch_size = 8\n",
-                "max_seq_length = 128\n",
+                "batch_size = 16\n",
+                "max_seq_length = 512\n",
                 "learning_rate = 2e-05\n",
-                "num_train_epochs = 5\n",
+                "num_train_epochs = 100\n",
                 "model_base_name = model_name"
             ]
         },

@@ -190,4 +190,3 @@
     "nbformat": 4,
     "nbformat_minor": 2
 }