diff --git a/README.rst b/README.rst
index afe110c9f9..93d793ae51 100644
--- a/README.rst
+++ b/README.rst
@@ -19,53 +19,25 @@ Latest News
 * [05/2024] `Accelerating Transformers with NVIDIA cuDNN 9 `_
 * [03/2024] `Turbocharged Training: Optimizing the Databricks Mosaic AI stack with FP8 `_
 * [03/2024] `FP8 Training Support in SageMaker Model Parallelism Library `_
-* [12/2023] `New NVIDIA NeMo Framework Features and NVIDIA H200 `_
-
-.. image:: docs/examples/H200-NeMo-performance.png
-   :width: 600
-   :alt: H200
-
-* [11/2023] `Inflection-2: The Next Step Up `_
-* [11/2023] `Unleashing The Power Of Transformers With NVIDIA Transformer Engine `_
-* [11/2023] `Accelerating PyTorch Training Workloads with FP8 `_
-* [09/2023] `Transformer Engine added to AWS DL Container for PyTorch Training `_
-* [06/2023] `Breaking MLPerf Training Records with NVIDIA H100 GPUs `_
-* [04/2023] `Benchmarking Large Language Models on NVIDIA H100 GPUs with CoreWeave (Part 1) `_

 What is Transformer Engine?
 ===========================

 .. overview-begin-marker-do-not-remove

-Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, including
-using 8-bit floating point (FP8) precision on Hopper GPUs, to provide better performance with lower
-memory utilization in both training and inference. TE provides a collection of highly optimized
-building blocks for popular Transformer architectures and an automatic mixed precision-like API that
-can be used seamlessly with your framework-specific code. TE also includes a framework agnostic
-C++ API that can be integrated with other deep learning libraries to enable FP8 support for Transformers.
-
-As the number of parameters in Transformer models continues to grow, training and inference for
-architectures such as BERT, GPT and T5 become very memory and compute-intensive. Most deep learning
-frameworks train with FP32 by default. This is not essential, however, to achieve full accuracy for
-many deep learning models. Using mixed-precision training, which combines single-precision (FP32)
-with lower precision (e.g. FP16) format when training a model, results in significant speedups with
-minimal differences in accuracy as compared to FP32 training. With Hopper GPU
-architecture FP8 precision was introduced, which offers improved performance over FP16 with no
-degradation in accuracy. Although all major deep learning frameworks support FP16, FP8 support is
-not available natively in frameworks today.
-
-TE addresses the problem of FP8 support by providing APIs that integrate with popular Large Language
-Model (LLM) libraries. It provides a Python API consisting of modules to easily build a Transformer
-layer as well as a framework-agnostic library in C++ including structs and kernels needed for FP8 support.
-Modules provided by TE internally maintain scaling factors and other values needed for FP8 training, greatly
-simplifying mixed precision training for users.
-
-Highlights
-==========
-
-* Easy-to-use modules for building Transformer layers with FP8 support
-* Optimizations (e.g. fused kernels) for Transformer models
-* Support for FP8 on NVIDIA Hopper and NVIDIA Ada GPUs
-* Support for optimizations across all precisions (FP16, BF16) on NVIDIA Ampere GPU architecture generations and later
+Transformer Engine (TE) accelerates Transformer models on NVIDIA GPUs with optimized LLM kernels and 8-bit floating point (FP8) precision on Hopper and Ada GPUs.
+
+* FP8 Performance
+  FP8 delivers significant speedups in both training and inference compared to FP16/BF16, with no degradation in accuracy.
+* Simplified Mixed Precision
+  TE handles the complexities of FP8 training, including scaling factors, making it easy to integrate with existing workflows (see the sketch after this list).
+* Accelerated Kernels
+  TE includes highly optimized kernels for FlashAttention, PagedAttention, and more, with support for cuDNN and FlashAttention-2.
+* Parallelism Support
+  TE works seamlessly with parallelism strategies, including Data Parallelism (DP), Fully Sharded Data Parallelism (FSDP), Tensor Parallelism (TP), Pipeline Parallelism (PP), Expert Parallelism (EP), and Context Parallelism (CP).
+* Reduced Memory Usage
+  Using FP8 model weights lowers memory consumption, enabling larger models to be trained and deployed efficiently.
+* Framework Support
+  TE provides PyTorch, JAX, and C++ APIs and is integrated with popular LLM frameworks.
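+
+Below is a minimal sketch of the PyTorch API. It assumes an FP8-capable GPU (e.g. Hopper or Ada), and the dimensions are illustrative; see the Examples section for complete, tested usage:
+
+.. code-block:: python
+
+    import torch
+    import transformer_engine.pytorch as te
+    from transformer_engine.common import recipe
+
+    # te.Linear is a drop-in replacement for torch.nn.Linear with FP8 support.
+    model = te.Linear(768, 3072, bias=True)
+    inp = torch.randn(4096, 768, device="cuda")
+
+    # A delayed-scaling FP8 recipe; TE maintains the scaling factors internally.
+    fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)
+
+    # Run the forward pass with FP8 autocasting enabled.
+    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
+        out = model(inp)
+
+    out.sum().backward()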

 Examples
 ========

@@ -166,27 +138,27 @@ The quickest way to get started with Transformer Engine is by using Docker image

 .. code-block:: bash

-    docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.10-py3
+    docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.12-py3

-Where 23.10 is the container version. For example, 23.10 for the October 2023 release.
+Where 24.12 is the container version; for example, 24.12 is the December 2024 release.

 pip
 ^^^^^^^^^^^^^^^^^^^^

-To install the latest stable version of Transformer Engine,
+Install from `Transformer Engine's PyPI `_,

 .. code-block:: bash

-    pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
-
-This will automatically detect if any supported deep learning frameworks are installed and build Transformer Engine support for them. To explicitly specify frameworks, set the environment variable NVTE_FRAMEWORK to a comma-separated list (e.g. NVTE_FRAMEWORK=jax,pytorch,paddle).
+    pip install transformer_engine[pytorch]

-Alternatively, the package can be directly installed from `Transformer Engine's PyPI `_, e.g.
+To build the latest stable version from source,

 .. code-block:: bash

-    pip install transformer_engine[pytorch]
+    pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
+
+This will automatically detect whether any supported deep learning frameworks are installed and build Transformer Engine support for them. To specify frameworks explicitly, set the environment variable NVTE_FRAMEWORK to a comma-separated list (e.g. NVTE_FRAMEWORK=jax,pytorch), as shown below.
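+
+For instance, a sketch of a source build restricted to the JAX and PyTorch extensions (assuming a POSIX shell, so the variable is set for the build process):
+
+.. code-block:: bash
+
+    # Hypothetical invocation: build only the JAX and PyTorch extensions.
+    NVTE_FRAMEWORK=jax,pytorch pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable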

-To obtain the necessary Python bindings for Transformer Engine, the frameworks needed must be explicitly specified as extra dependencies in a comma-separated list (e.g. [jax,pytorch,paddle]). Transformer Engine ships wheels for the core library as well as the PaddlePaddle extensions. Source distributions are shipped for the JAX and PyTorch extensions.
+To use the Python bindings for Transformer Engine, specify the required frameworks as extra dependencies in a comma-separated list (e.g. [jax,pytorch]). The core Transformer Engine library ships as pre-built wheels, while the JAX and PyTorch extensions are provided as source distributions.

 From source
 ^^^^^^^^^^^
@@ -268,12 +240,11 @@ Transformer Engine has been integrated with popular LLM frameworks such as:

 * `NVIDIA NeMo Framework `_
 * `Amazon SageMaker Model Parallel Library `_
 * `Levanter `_
+* `Colossal-AI `_
+* `MaxText `_
 * `Hugging Face Nanotron `_ - Coming soon!
-* `Colossal-AI `_ - Coming soon!
-* `PeriFlow `_ - Coming soon!
 * `GPT-NeoX `_ - Coming soon!
-

 Contributing
 ============
@@ -288,6 +259,16 @@ Papers

 * `Megatron-LM sequence parallel `_
 * `FP8 Formats for Deep Learning `_

+Blogs
+======
+* [12/2023] `New NVIDIA NeMo Framework Features and NVIDIA H200 `_
+* [11/2023] `Inflection-2: The Next Step Up `_
+* [11/2023] `Unleashing The Power Of Transformers With NVIDIA Transformer Engine `_
+* [11/2023] `Accelerating PyTorch Training Workloads with FP8 `_
+* [09/2023] `Transformer Engine added to AWS DL Container for PyTorch Training `_
+* [06/2023] `Breaking MLPerf Training Records with NVIDIA H100 GPUs `_
+* [04/2023] `Benchmarking Large Language Models on NVIDIA H100 GPUs with CoreWeave (Part 1) `_
+
 Videos
 ======