# Transformers Neuron (``transformers-neuronx``) Developer Guide

Transformers Neuron for Trn1 and Inf2 is a software package that enables
PyTorch users to perform large language model (LLM) inference on
second-generation Neuron hardware (See: [NeuronCore-v2](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuron-core-v2.html)).

To install the most rigorously tested stable release, use the PyPI pip wheel:
```
pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com
```
## Development Version
## Hugging Face generate() API support

Transformers Neuron models support the Hugging Face ``generate()`` API via the
``HuggingFaceGenerationModelAdapter`` adapter class. In the following example we
demonstrate how to run sampling with temperature using the ``GPT2`` model:
```
import os
from transformers_neuronx.gpt2.model import GPT2ForSampling
from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter
from transformers_neuronx.module import save_pretrained_split
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ['NEURON_CC_FLAGS'] = '--model-type=transformer-inference'

# Load and save the CPU model
model_cpu = AutoModelForCausalLM.from_pretrained('gpt2')
save_pretrained_split(model_cpu, 'gpt2-split')

# Create and compile the Neuron model
model_neuron = GPT2ForSampling.from_pretrained('gpt2-split', batch_size=1, tp_degree=2, n_positions=256, amp='f32', unroll=None)
model_neuron.to_neuron()

# Wrap the Neuron model so it can be used with the Hugging Face generate() API
model = HuggingFaceGenerationModelAdapter(model_cpu.config, model_neuron)

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt')

# Run sampling with temperature via the generate() API
sample_output = model.generate(
    input_ids=encoded_input.input_ids,
    do_sample=True,
    max_length=256,
    temperature=0.7,
)
print([tokenizer.decode(tok) for tok in sample_output])
```
## int8 weight storage support

Transformers Neuron supports int8 weight storage for the `GPT2` model class.
int8 weight storage can be used to reduce memory bandwidth usage to improve
model performance. int8 weight storage support for additional model classes
will be added in an upcoming release. In the following example we demonstrate
how to apply int8 weight storage to the `GPT2` model via the
`QuantizationConfig` and `NeuronConfig` configs:
```
import os
import torch
from transformers_neuronx.gpt2.model import GPT2ForSampling
from transformers_neuronx.module import save_pretrained_split
from transformers_neuronx.config import NeuronConfig, QuantizationConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ['NEURON_CC_FLAGS'] = '--model-type=transformer-inference'

# Cast attention and MLP layers to low precision only; layernorms stay as f32
def amp_callback(model, dtype):
    for block in model.transformer.h:
        block.attn.to(dtype)
        block.mlp.to(dtype)
    model.lm_head.to(dtype)

# Load and save the CPU model with bfloat16 casting
model_cpu = AutoModelForCausalLM.from_pretrained('gpt2')
amp_callback(model_cpu, torch.bfloat16)
save_pretrained_split(model_cpu, 'gpt2-split')

# Set the weight storage config to use int8 quantization and bf16 dequantization
neuron_config = NeuronConfig(
    quant=QuantizationConfig(quant_dtype='s8', dequant_dtype='bf16'),
)

# Create and compile the Neuron model
model_neuron = GPT2ForSampling.from_pretrained('gpt2-split', batch_size=1, tp_degree=2, n_positions=256, amp='bf16', neuron_config=neuron_config)
model_neuron.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=256, start_ids=None)
    print([tokenizer.decode(tok) for tok in generated_sequence])
```
## Parallel Input Prompt Context Encoding

Transformers Neuron supports parallel input prompt context encoding for the `GPT2`
model class. Parallel context encoding can be used to significantly reduce
the latency of the input prompt context encoding before the autoregressive
decoder token generation loop. Parallel context encoding support for additional
model classes will be added in an upcoming release.

The `GPT2ForSamplingWithContextBroadcasting` class has a `context_length_estimate`
variable that determines the number of input prompt tokens that will be processed in
parallel. For optimal results, this should be set to a power of 2 that is
closest to the most frequently seen input prompt length.
In the following example we demonstrate how to apply parallel context encoding
to the `GPT2` model via the `GPT2ForSamplingWithContextBroadcasting` class.
In this example, we set the `context_length_estimate` to 128, which is
the closest power of 2 to the length of the input prompt (97 tokens).
```
import os
import math
import torch
from transformers_neuronx.gpt2.model import GPT2ForSamplingWithContextBroadcasting
from transformers_neuronx.module import save_pretrained_split
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ['NEURON_CC_FLAGS'] = '--model-type=transformer-inference'

# Load and save the CPU model
model_cpu = AutoModelForCausalLM.from_pretrained('gpt2')
save_pretrained_split(model_cpu, 'gpt2-split')

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, I'm a generative AI language model. Generative AI is a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. It is powered by large models that are pre-trained on vast amounts of data and commonly referred to as foundation models (FMs). With generative AI on AWS, you can reinvent your applications, create entirely new customer experiences, drive unprecedented levels of productivity, and transform your business. "
encoded_input = tokenizer(text, return_tensors='pt')

# Set the number of tokens that will be processed in parallel
prompt_len = encoded_input.input_ids.shape[1]
context_length_estimate = int(2 ** math.ceil(math.log(prompt_len, 2))) # Use the closest power of two bucket size

# Create and compile the Neuron model
model_neuron = GPT2ForSamplingWithContextBroadcasting.from_pretrained('gpt2-split', batch_size=1, tp_degree=2, n_positions=256, amp='bf16', context_length_estimate=context_length_estimate)
model_neuron.to_neuron()

# Run inference
with torch.inference_mode():
    generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=256, start_ids=None)
    print([tokenizer.decode(tok) for tok in generated_sequence])
```
The `GPT2ForSamplingWithContextBroadcasting` class can also process
an input prompt that has a different batch size from the batch size of the
autoregressive decoder output. For example, an input prompt with batch size = 1 can
be used to produce an output of batch size = 5 to generate multiple suggestions
for the same input prompt. The input prompt batch size can be specified using
the `prompt_batch_size` argument and the autoregressive decoder output batch
size can be specified using the `batch_size` argument. In the following example
we demonstrate how to apply parallel context encoding to the `GPT2` model
to generate 5 outputs for a single input.
```
import os
import math
import torch
from transformers_neuronx.gpt2.model import GPT2ForSamplingWithContextBroadcasting
from transformers_neuronx.module import save_pretrained_split
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ['NEURON_CC_FLAGS'] = '--model-type=transformer-inference'

# Load and save the CPU model
model_cpu = AutoModelForCausalLM.from_pretrained('gpt2')
save_pretrained_split(model_cpu, 'gpt2-split')

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, I'm a generative AI language model. Generative AI is a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. It is powered by large models that are pre-trained on vast amounts of data and commonly referred to as foundation models (FMs). With generative AI on AWS, you can reinvent your applications, create entirely new customer experiences, drive unprecedented levels of productivity, and transform your business. "
encoded_input = tokenizer(text, return_tensors='pt')

# Set the number of tokens that will be processed in parallel
prompt_len = encoded_input.input_ids.shape[1]
context_length_estimate = int(2 ** math.ceil(math.log(prompt_len, 2))) # Use the closest power of two bucket size

# Create and compile the Neuron model
model_neuron = GPT2ForSamplingWithContextBroadcasting.from_pretrained('gpt2-split', prompt_batch_size=1, batch_size=5, tp_degree=2, n_positions=256, amp='bf16', context_length_estimate=context_length_estimate)
model_neuron.to_neuron()

# Run inference
with torch.inference_mode():
    generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=256, start_ids=None)

for i, output in enumerate(generated_sequence):
    print('-'*50)
    print(f'Batch {i} output:')
    print(tokenizer.decode(output))
```
## [Experimental] Serialization support
Transformers Neuron supports model serialization (model saving and loading) for
the `GPT2` model class. Serialization support for additional model classes
will be added in an upcoming release. In the following example we demonstrate
how to save and load the `GPT2` model:
```
import os
import torch
from transformers_neuronx.gpt2.model import GPT2ForSampling
from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter
from transformers_neuronx.module import save_pretrained_split
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ['NEURON_CC_FLAGS'] = '--model-type=transformer-inference'

# Load and save the CPU model
model_cpu = AutoModelForCausalLM.from_pretrained('gpt2')
save_pretrained_split(model_cpu, 'gpt2-split')

# Create and compile the Neuron model
model_neuron = GPT2ForSampling.from_pretrained('gpt2-split', batch_size=1, tp_degree=2, n_positions=256, amp='f32', unroll=None)
model_neuron.to_neuron()

model_neuron._save_compiled_artifacts('gpt2-neuron') # Save the compiled Neuron artifacts

# Reconstruct the model and reuse the compiled artifacts
model_neuron = GPT2ForSampling.from_pretrained('gpt2-split', batch_size=1, tp_degree=2, n_positions=256, amp='f32', unroll=None)
model_neuron._load_compiled_artifacts('gpt2-neuron') # Load the compiled Neuron artifacts
model_neuron.to_neuron() # Load the model weights but skip compilation
# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=256, start_ids=None)
    print([tokenizer.decode(tok) for tok in generated_sequence])
```
## model-type=transformer-inference Compiler Flag
We recommend using the `--model-type=transformer-inference` compiler flag for optimized
decoder-only LLM inference. In a future release, this compiler flag may be enabled
by default. It can be enabled via the `NEURON_CC_FLAGS` environment
variable:
```
export NEURON_CC_FLAGS="--model-type=transformer-inference"
```
## Running inference with multiple models

Multiple transformers-neuronx models can be loaded at the same time as long
as the total number of consumed NeuronCores is less than or equal to the total
number of NeuronCores on the instance. For example, three `tp_degree=8` models can be
loaded and run in parallel on an inf2.48xlarge, which has 24 NeuronCores. The
`NEURON_RT_NUM_CORES` and `NEURON_RT_VISIBLE_CORES` environment variables
can be used to allocate the necessary number of NeuronCores to each process
to run multiple transformers-neuronx models in parallel. See the
[NeuronCore Allocation and Model Placement for Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/inference/core-placement.html#torch-neuronx-core-placement-guide)
section for additional information about how to use these environment variables.
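As a minimal sketch, assuming two hypothetical inference scripts `model_a.py` and
`model_b.py` that each load one `tp_degree=2` transformers-neuronx model, each process
can be pinned to its own set of NeuronCores from a shell (the script names and core
ranges are illustrative, and each range must match that model's `tp_degree`):

```
# Hypothetical sketch: pin each process to a disjoint set of NeuronCores
NEURON_RT_VISIBLE_CORES=0-1 python model_a.py &
NEURON_RT_VISIBLE_CORES=2-3 python model_b.py &
wait
```

Alternatively, setting `NEURON_RT_NUM_CORES` to the number of NeuronCores a process
needs lets the runtime choose from the unused NeuronCores automatically.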
# Examples