diff --git a/README.rst b/README.rst
index afe110c9f9..93d793ae51 100644
--- a/README.rst
+++ b/README.rst
@@ -19,53 +19,25 @@ Latest News
* [05/2024] `Accelerating Transformers with NVIDIA cuDNN 9 `_
* [03/2024] `Turbocharged Training: Optimizing the Databricks Mosaic AI stack with FP8 `_
* [03/2024] `FP8 Training Support in SageMaker Model Parallelism Library `_
-* [12/2023] `New NVIDIA NeMo Framework Features and NVIDIA H200 `_
-
-.. image:: docs/examples/H200-NeMo-performance.png
- :width: 600
- :alt: H200
-
-* [11/2023] `Inflection-2: The Next Step Up `_
-* [11/2023] `Unleashing The Power Of Transformers With NVIDIA Transformer Engine `_
-* [11/2023] `Accelerating PyTorch Training Workloads with FP8 `_
-* [09/2023] `Transformer Engine added to AWS DL Container for PyTorch Training `_
-* [06/2023] `Breaking MLPerf Training Records with NVIDIA H100 GPUs `_
-* [04/2023] `Benchmarking Large Language Models on NVIDIA H100 GPUs with CoreWeave (Part 1) `_
What is Transformer Engine?
===========================
.. overview-begin-marker-do-not-remove
-Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, including
-using 8-bit floating point (FP8) precision on Hopper GPUs, to provide better performance with lower
-memory utilization in both training and inference. TE provides a collection of highly optimized
-building blocks for popular Transformer architectures and an automatic mixed precision-like API that
-can be used seamlessly with your framework-specific code. TE also includes a framework agnostic
-C++ API that can be integrated with other deep learning libraries to enable FP8 support for Transformers.
-
-As the number of parameters in Transformer models continues to grow, training and inference for
-architectures such as BERT, GPT and T5 become very memory and compute-intensive. Most deep learning
-frameworks train with FP32 by default. This is not essential, however, to achieve full accuracy for
-many deep learning models. Using mixed-precision training, which combines single-precision (FP32)
-with lower precision (e.g. FP16) format when training a model, results in significant speedups with
-minimal differences in accuracy as compared to FP32 training. With Hopper GPU
-architecture FP8 precision was introduced, which offers improved performance over FP16 with no
-degradation in accuracy. Although all major deep learning frameworks support FP16, FP8 support is
-not available natively in frameworks today.
-
-TE addresses the problem of FP8 support by providing APIs that integrate with popular Large Language
-Model (LLM) libraries. It provides a Python API consisting of modules to easily build a Transformer
-layer as well as a framework-agnostic library in C++ including structs and kernels needed for FP8 support.
-Modules provided by TE internally maintain scaling factors and other values needed for FP8 training, greatly
-simplifying mixed precision training for users.
-
-Highlights
-==========
-
-* Easy-to-use modules for building Transformer layers with FP8 support
-* Optimizations (e.g. fused kernels) for Transformer models
-* Support for FP8 on NVIDIA Hopper and NVIDIA Ada GPUs
-* Support for optimizations across all precisions (FP16, BF16) on NVIDIA Ampere GPU architecture generations and later
+Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, providing optimized LLM kernels and 8-bit floating point (FP8) precision on Hopper and Ada GPUs.
+
+* FP8 Performance
+  FP8 delivers significant speedups in both training and inference compared to FP16/BF16, with no degradation in accuracy.
+* Simplified Mixed Precision
+  TE handles the complexities of FP8 training, including scaling factors, making it easy to integrate with existing workflows (see the sketch after this list).
+* Accelerated Kernels
+  TE includes highly optimized kernels for FlashAttention, PagedAttention, and more, with support for cuDNN and FlashAttention-2.
+* Parallelism Support
+  TE is designed to work seamlessly with parallelism strategies, including Data Parallelism (DP), Fully Sharded Data Parallelism (FSDP), Tensor Parallelism (TP), Pipeline Parallelism (PP), Expert Parallelism (EP), and Context Parallelism (CP).
+* Reduced Memory Usage
+  Using FP8 model weights leads to lower memory consumption, enabling larger models to be trained and deployed efficiently.
+* Framework Support
+  TE supports PyTorch, JAX, and C++ APIs, and is integrated with popular LLM frameworks.
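+
+As a quick taste of the PyTorch API, here is a minimal sketch (the layer sizes and recipe settings are illustrative only; see the Examples section for complete walkthroughs). TE modules are drop-in replacements for their ``torch.nn`` counterparts, and ``fp8_autocast`` maintains the FP8 scaling factors internally.
+
+.. code-block:: python
+
+    import torch
+    import transformer_engine.pytorch as te
+    from transformer_engine.common import recipe
+
+    # TE module: a drop-in replacement for torch.nn.Linear.
+    model = te.Linear(768, 768, bias=True)
+    inp = torch.randn(32, 768, device="cuda")
+
+    # DelayedScaling recipe: TE tracks the FP8 scaling factors and amax
+    # history internally (HYBRID = E4M3 forward, E5M2 backward).
+    fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
+
+    # Forward passes of TE modules run in FP8 inside this region.
+    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
+        out = model(inp)
+
+    out.sum().backward()
+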
Examples
========
@@ -166,27 +138,27 @@ The quickest way to get started with Transformer Engine is by using Docker image
.. code-block:: bash
- docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.10-py3
+ docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.12-py3
-Where 23.10 is the container version. For example, 23.10 for the October 2023 release.
+Where 24.12 is the container version, i.e. the December 2024 release.
pip
^^^^^^^^^^^^^^^^^^^^
-To install the latest stable version of Transformer Engine,
+Install from `Transformer Engine's PyPI `_,
.. code-block:: bash
- pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
-
-This will automatically detect if any supported deep learning frameworks are installed and build Transformer Engine support for them. To explicitly specify frameworks, set the environment variable NVTE_FRAMEWORK to a comma-separated list (e.g. NVTE_FRAMEWORK=jax,pytorch,paddle).
+ pip install transformer_engine[pytorch]
-Alternatively, the package can be directly installed from `Transformer Engine's PyPI `_, e.g.
+To build the latest stable version from source,
.. code-block:: bash
- pip install transformer_engine[pytorch]
+ pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
+
+This will automatically detect if any supported deep learning frameworks are installed and build Transformer Engine support for them. To explicitly specify frameworks, set the environment variable NVTE_FRAMEWORK to a comma-separated list (e.g. NVTE_FRAMEWORK=jax,pytorch).
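+
+For example, to build only the PyTorch extension from source (any comma-separated subset of the supported frameworks can be given the same way):
+
+.. code-block:: bash
+
+    NVTE_FRAMEWORK=pytorch pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable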
-To obtain the necessary Python bindings for Transformer Engine, the frameworks needed must be explicitly specified as extra dependencies in a comma-separated list (e.g. [jax,pytorch,paddle]). Transformer Engine ships wheels for the core library as well as the PaddlePaddle extensions. Source distributions are shipped for the JAX and PyTorch extensions.
+To use the Python bindings for Transformer Engine, specify the required frameworks as extra dependencies in a comma-separated list (e.g. [jax,pytorch]). The core library of Transformer Engine ships as pre-built wheels, while the JAX and PyTorch extensions are provided as source distributions.
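+
+For instance, to install both framework extensions in one command (pick whichever subset you actually use):
+
+.. code-block:: bash
+
+    pip install transformer_engine[jax,pytorch]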
From source
^^^^^^^^^^^
@@ -268,12 +240,11 @@ Transformer Engine has been integrated with popular LLM frameworks such as:
* `NVIDIA NeMo Framework `_
* `Amazon SageMaker Model Parallel Library `_
* `Levanter `_
+* `Colossal-AI `_
+* `MaxText `_
* `Hugging Face Nanotron `_ - Coming soon!
-* `Colossal-AI `_ - Coming soon!
-* `PeriFlow `_ - Coming soon!
* `GPT-NeoX `_ - Coming soon!
-
Contributing
============
@@ -288,6 +259,16 @@ Papers
* `Megatron-LM sequence parallel `_
* `FP8 Formats for Deep Learning `_
+Blogs
+======
+* [12/2023] `New NVIDIA NeMo Framework Features and NVIDIA H200 `_
+* [11/2023] `Inflection-2: The Next Step Up `_
+* [11/2023] `Unleashing The Power Of Transformers With NVIDIA Transformer Engine `_
+* [11/2023] `Accelerating PyTorch Training Workloads with FP8 `_
+* [09/2023] `Transformer Engine added to AWS DL Container for PyTorch Training `_
+* [06/2023] `Breaking MLPerf Training Records with NVIDIA H100 GPUs `_
+* [04/2023] `Benchmarking Large Language Models on NVIDIA H100 GPUs with CoreWeave (Part 1) `_
+
Videos
======