Update README.rst
- Updated "Latest News" section - Moved old news to new "Blogs" section
- Updated "What is Transformer Engine?" section - more concise with benefits highlighted
- Updated "Installation" section - prefer pip install from wheel and removed paddle mention
- Updated "Integrations" section - added ColossalAI and MaxText, Removed periflow

Signed-off-by: Santosh Bhavani <[email protected]>
sbhavani authored Dec 23, 2024
Parent 06f2286, commit 99ab2af
Showing 1 changed file, README.rst, with 35 additions and 54 deletions.

Latest News
===========

* [05/2024] `Accelerating Transformers with NVIDIA cuDNN 9 <https://developer.nvidia.com/blog/accelerating-transformers-with-nvidia-cudnn-9/>`_
* [03/2024] `Turbocharged Training: Optimizing the Databricks Mosaic AI stack with FP8 <https://www.databricks.com/blog/turbocharged-training-optimizing-databricks-mosaic-ai-stack-fp8>`_
* [03/2024] `FP8 Training Support in SageMaker Model Parallelism Library <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-release-notes.html>`_

What is Transformer Engine?
===========================
.. overview-begin-marker-do-not-remove
Transformer Engine (TE) accelerates Transformer models on NVIDIA GPUs with optimized LLM kernels and 8-bit floating point (FP8) precision on Hopper and Ada GPUs.

* FP8 Performance
  FP8 delivers significant speedups in both training and inference compared to FP16/BF16, with no degradation in accuracy.
* Simplified Mixed Precision
  TE handles the complexities of FP8 training, including scaling factors, making it easy to integrate with existing workflows.
* Accelerated Kernels
  TE includes highly optimized kernels for FlashAttention, PagedAttention, and more, with support for cuDNN and FlashAttention-2.
* Parallelism Support
  TE is designed to work seamlessly with parallelism strategies, including Data Parallelism (DP), Fully Sharded Data Parallelism (FSDP), Tensor Parallelism (TP), Pipeline Parallelism (PP), Expert Parallelism (EP), and Context Parallelism (CP).
* Reduced Memory Usage
  Using FP8 model weights leads to lower memory consumption, enabling larger models to be trained and deployed efficiently.
* Framework Support
  TE supports PyTorch, JAX, and C++ APIs, and is integrated with popular LLM frameworks (see the usage sketch below).
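As a quick illustration, here is a minimal sketch of an FP8 forward and backward pass with the PyTorch API. It assumes the public ``transformer_engine.pytorch`` modules and the ``DelayedScaling`` recipe; the layer sizes and input tensor are illustrative placeholders, not a recommended configuration.

.. code-block:: python

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    # Illustrative sizes, chosen to satisfy FP8 GEMM alignment requirements.
    in_features, out_features, batch = 768, 3072, 32

    # A TE linear layer whose GEMMs can run in FP8 inside the autocast region.
    layer = te.Linear(in_features, out_features, bias=True).cuda()
    inp = torch.randn(batch, in_features, device="cuda")

    # Delayed-scaling FP8 recipe; TE maintains the scaling factors internally.
    fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

    # Forward pass with FP8 enabled, followed by a standard backward pass.
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        out = layer(inp)
    out.sum().backward()

Only TE modules are affected by the autocast context; surrounding PyTorch operations keep their usual precision.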

Examples
========
Installation
============

Docker
^^^^^^

The quickest way to get started with Transformer Engine is by using Docker images on the NVIDIA GPU Cloud (NGC) Catalog, for example:

.. code-block:: bash

    docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.12-py3

Where 24.12 is the container version (the December 2024 release).

pip
^^^^^^^^^^^^^^^^^^^^
To install the latest stable version of Transformer Engine from `Transformer Engine's PyPI <https://pypi.org/project/transformer-engine/>`_,

.. code-block:: bash

    pip install transformer_engine[pytorch]

To build the latest stable version from source,

.. code-block:: bash

    pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable

This will automatically detect if any supported deep learning frameworks are installed and build Transformer Engine support for them. To explicitly specify frameworks, set the environment variable NVTE_FRAMEWORK to a comma-separated list (e.g. NVTE_FRAMEWORK=jax,pytorch).

To use the Python bindings for Transformer Engine, specify the required frameworks as extra dependencies in a comma-separated list, e.g. [jax,pytorch]. The core library is distributed as a pre-built wheel, while the JAX and PyTorch extensions are provided as source distributions.
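As a quick sanity check that the wheel and the framework extension were picked up, a minimal smoke test might look like the following; it assumes the [pytorch] extra was installed and an NVIDIA GPU is visible to the process.

.. code-block:: python

    import torch
    import transformer_engine.pytorch as te

    # Build a small TE module and run a forward pass in the default precision.
    layer = te.LayerNormLinear(256, 256).cuda()
    out = layer(torch.randn(16, 256, device="cuda"))
    print(tuple(out.shape))  # expected: (16, 256)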

From source
^^^^^^^^^^^
Integrations
============

Transformer Engine has been integrated with popular LLM frameworks such as:

* `NVIDIA NeMo Framework <https://github.com/NVIDIA/NeMo-Megatron-Launcher>`_
* `Amazon SageMaker Model Parallel Library <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features-v2-tensor-parallelism.html>`_
* `Levanter <https://github.com/stanford-crfm/levanter>`_
* `Colossal-AI <https://github.com/hpcaitech/ColossalAI>`_
* `MaxText <https://github.com/AI-Hypercomputer/maxtext>`_
* `Hugging Face Nanotron <https://github.com/huggingface/nanotron>`_ - Coming soon!
* `GPT-NeoX <https://github.com/EleutherAI/gpt-neox>`_ - Coming soon!


Contributing
============

Papers
======

* `Megatron-LM sequence parallel <https://arxiv.org/pdf/2205.05198.pdf>`_
* `FP8 Formats for Deep Learning <https://arxiv.org/abs/2209.05433>`_

Blogs
======
* [12/2023] `New NVIDIA NeMo Framework Features and NVIDIA H200 <https://developer.nvidia.com/blog/new-nvidia-nemo-framework-features-and-nvidia-h200-supercharge-llm-training-performance-and-versatility/>`_
* [11/2023] `Inflection-2: The Next Step Up <https://inflection.ai/inflection-2>`_
* [11/2023] `Unleashing The Power Of Transformers With NVIDIA Transformer Engine <https://lambdalabs.com/blog/unleashing-the-power-of-transformers-with-nvidia-transformer-engine>`_
* [11/2023] `Accelerating PyTorch Training Workloads with FP8 <https://towardsdatascience.com/accelerating-pytorch-training-workloads-with-fp8-5a5123aec7d7>`_
* [09/2023] `Transformer Engine added to AWS DL Container for PyTorch Training <https://github.com/aws/deep-learning-containers/pull/3315>`_
* [06/2023] `Breaking MLPerf Training Records with NVIDIA H100 GPUs <https://developer.nvidia.com/blog/breaking-mlperf-training-records-with-nvidia-h100-gpus/>`_
* [04/2023] `Benchmarking Large Language Models on NVIDIA H100 GPUs with CoreWeave (Part 1) <https://www.mosaicml.com/blog/coreweave-nvidia-h100-part-1>`_

Videos
======
