
Commit 6b815f8

pggPL, pre-commit-ci[bot], and greptile-apps[bot] authored; ksivaman committed
Docs fix (#2301)
* init
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* fix
* line lengths
* fix
* fix
* fix
* subtitle fix in many files
* cross entropy _input -> input rename
* cross entropy _input -> input rename
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* fix
* a lot of small fixes
* torch_version() change
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add missing module and fix warnings
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* fix
* fix
* removed trailing whitespace
* Update docs/api/pytorch.rst (Co-authored-by: greptile-apps[bot])
* Fix import
* Fix more imports
* Fix NumPy docstring parameter spacing and indentation
  - Standardize parameter documentation to use 'param : type' format (space before and after colon) per NumPy style guide
  - Fix inconsistent indentation in cpu_offload.py docstring
  - Modified 51 Python files across transformer_engine/pytorch
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* fix

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Paweł Gadziński <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
1 parent 981e65e commit 6b815f8

File tree

86 files changed: +1609 -1364 lines changed

.github/workflows/docs.yml

Lines changed: 2 additions & 2 deletions

@@ -22,10 +22,10 @@ jobs:
         sudo apt-get install -y pandoc graphviz doxygen
         export GIT_SHA=$(git show-ref --hash HEAD)
     - name: 'Build docs'
-      run: |
+      run: | # SPHINXOPTS="-W" errors out on warnings
         doxygen docs/Doxyfile
         cd docs
-        make html
+        make html SPHINXOPTS="-W"
     - name: 'Upload docs'
       uses: actions/upload-artifact@v4
       with:
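The `-W` flag added above makes Sphinx treat every warning as an error, so documentation regressions now fail CI instead of slipping through. For reference, a minimal sketch of reproducing the same strict build locally through Sphinx's Python entry point (the build paths are assumptions; the repo's `docs/Makefile` may use different ones):

    from sphinx.cmd.build import build_main

    # "-W" turns warnings into errors, mirroring `make html SPHINXOPTS="-W"` above.
    exit_code = build_main(["-W", "-b", "html", "docs", "docs/_build/html"])
    raise SystemExit(exit_code)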

docs/api/jax.rst

Lines changed: 3 additions & 3 deletions

@@ -4,7 +4,7 @@
 See LICENSE for license information.
 
 Jax
-=======
+===
 
 Pre-defined Variable of Logical Axes
 ------------------------------------
@@ -20,11 +20,11 @@ Variables are available in `transformer_engine.jax.sharding`.
 
 
 Checkpointing
-------------------------------------
+-------------
 When using checkpointing with Transformer Engine JAX, please be aware of the checkpointing policy being applied to your model. Any JAX checkpointing policy using `dot`, such as `jax.checkpoint_policies.dots_with_no_batch_dims`, may not work with GEMMs provided by Transformer Engine as they do not always use the `jax.lax.dot_general` primitive. Instead, you can use `transformer_engine.jax.checkpoint_policies.dots_and_te_gemms_with_no_batch_dims` or similar policies that are designed to work with Transformer Engine's GEMMs and `jax.lax.dot_general` GEMMs. You may also use any JAX policies that do not filter by primitive, such as `jax.checkpoint_policies.save_only_these_names` or `jax.checkpoint_policies.everything_saveable`.
 
 Modules
-------------------------------------
+-------
 .. autoapiclass:: transformer_engine.jax.flax.TransformerLayerType
 .. autoapiclass:: transformer_engine.jax.MeshResource()
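As a usage note for the checkpointing paragraph above, here is a minimal sketch of applying the TE-aware policy with `jax.checkpoint`; the policy name comes from the docs text, while `apply_fn` and the shapes are illustrative stand-ins for a real TE flax module:

    import functools

    import jax
    import jax.numpy as jnp
    import transformer_engine.jax as te_jax

    def apply_fn(params, x):
        # Stand-in for a TE flax module's apply; a real model would hit TE GEMMs.
        return jnp.dot(x, params)

    # Policy from the paragraph above: it recognizes TE's GEMMs as well as
    # plain jax.lax.dot_general contractions.
    policy = te_jax.checkpoint_policies.dots_and_te_gemms_with_no_batch_dims

    @functools.partial(jax.checkpoint, policy=policy)
    def block(params, x):
        return apply_fn(params, x)

    params = jnp.ones((16, 16))
    x = jnp.ones((8, 16))
    grads = jax.grad(lambda p: block(p, x).sum())(params)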

docs/api/pytorch.rst

Lines changed: 25 additions & 9 deletions

@@ -3,7 +3,7 @@
 
 See LICENSE for license information.
 
-pyTorch
+PyTorch
 =======
 
 .. autoapiclass:: transformer_engine.pytorch.Linear(in_features, out_features, bias=True, **kwargs)
@@ -37,16 +37,23 @@ pyTorch
 .. autoapiclass:: transformer_engine.pytorch.CudaRNGStatesTracker()
     :members: reset, get_states, set_states, add, fork
 
-.. autoapifunction:: transformer_engine.pytorch.fp8_autocast
-
-.. autoapifunction:: transformer_engine.pytorch.fp8_model_init
 
 .. autoapifunction:: transformer_engine.pytorch.autocast
 
 .. autoapifunction:: transformer_engine.pytorch.quantized_model_init
 
 .. autoapifunction:: transformer_engine.pytorch.checkpoint
 
+
+.. autoapifunction:: transformer_engine.pytorch.make_graphed_callables
+
+.. autoapifunction:: transformer_engine.pytorch.get_cpu_offload_context
+
+.. autoapifunction:: transformer_engine.pytorch.parallel_cross_entropy
+
+Recipe availability
+-------------------
+
 .. autoapifunction:: transformer_engine.pytorch.is_fp8_available
 
 .. autoapifunction:: transformer_engine.pytorch.is_mxfp8_available
@@ -63,9 +70,8 @@ pyTorch
 
 .. autoapifunction:: transformer_engine.pytorch.get_default_recipe
 
-.. autoapifunction:: transformer_engine.pytorch.make_graphed_callables
-
-.. autoapifunction:: transformer_engine.pytorch.get_cpu_offload_context
+Mixture of Experts (MoE) functions
+----------------------------------
 
 .. autoapifunction:: transformer_engine.pytorch.moe_permute
 
@@ -75,17 +81,20 @@ pyTorch
 
 .. autoapifunction:: transformer_engine.pytorch.moe_sort_chunks_by_index
 
-.. autoapifunction:: transformer_engine.pytorch.parallel_cross_entropy
-
 .. autoapifunction:: transformer_engine.pytorch.moe_sort_chunks_by_index_with_probs
 
+
+Communication-computation overlap
+---------------------------------
+
 .. autoapifunction:: transformer_engine.pytorch.initialize_ub
 
 .. autoapifunction:: transformer_engine.pytorch.destroy_ub
 
 .. autoapiclass:: transformer_engine.pytorch.UserBufferQuantizationMode
     :members: FP8, NONE
 
+
 Quantized tensors
 -----------------
 
@@ -133,3 +142,10 @@ Tensor saving and restoring functions
 .. autoapifunction:: transformer_engine.pytorch.prepare_for_saving
 
 .. autoapifunction:: transformer_engine.pytorch.restore_from_saved
+
+Deprecated functions
+--------------------
+
+.. autoapifunction:: transformer_engine.pytorch.fp8_autocast
+
+.. autoapifunction:: transformer_engine.pytorch.fp8_model_init
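Note the net effect of this reorganization: `fp8_autocast` and `fp8_model_init` now live under a "Deprecated functions" heading, while `autocast` and `quantized_model_init` are documented as the current entry points. A hedged migration sketch (keyword arguments are assumptions, not taken from this diff; check the rendered API pages for exact signatures):

    import torch
    import transformer_engine.pytorch as te

    linear = te.Linear(768, 768, bias=True).cuda()
    x = torch.randn(16, 768, device="cuda")

    # Before: with te.fp8_autocast(enabled=True): ...
    # After (assumed usage of the newer, quantization-agnostic API):
    with te.autocast(enabled=True):
        y = linear(x)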

docs/conf.py

Lines changed: 26 additions & 1 deletion

@@ -61,7 +61,11 @@
 ]
 
 templates_path = ["_templates"]
-exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
+exclude_patterns = [
+    "_build",
+    "Thumbs.db",
+    "sphinx_rtd_theme",
+]
 
 source_suffix = ".rst"
 
@@ -94,10 +98,31 @@
     ("Values", "params_style"),
     ("Graphing parameters", "params_style"),
     ("FP8-related parameters", "params_style"),
+    ("Quantization parameters", "params_style"),
 ]
 
 breathe_projects = {"TransformerEngine": root_path / "docs" / "doxygen" / "xml"}
 breathe_default_project = "TransformerEngine"
 
 autoapi_generate_api_docs = False
 autoapi_dirs = [root_path / "transformer_engine"]
+autoapi_ignore = ["*test*"]
+
+
+# There are 2 warnings about the same namespace (transformer_engine) in two different c++ api
+# docs pages. This seems to be the only way to suppress these warnings.
+def setup(app):
+    """Custom Sphinx setup to filter warnings."""
+    import logging
+
+    # Filter out duplicate C++ declaration warnings
+    class DuplicateDeclarationFilter(logging.Filter):
+        def filter(self, record):
+            message = record.getMessage()
+            if "Duplicate C++ declaration" in message and "transformer_engine" in message:
+                return False
+            return True
+
+    # Apply filter to Sphinx logger
+    logger = logging.getLogger("sphinx")
+    logger.addFilter(DuplicateDeclarationFilter())
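The filter added above is plain `logging` machinery, so its behavior can be checked outside Sphinx. A standalone sketch (the handler setup is only for demonstration; Sphinx wires up its own handlers):

    import logging

    class DuplicateDeclarationFilter(logging.Filter):
        def filter(self, record):
            message = record.getMessage()
            # Drop only duplicate-declaration warnings mentioning transformer_engine.
            return not (
                "Duplicate C++ declaration" in message
                and "transformer_engine" in message
            )

    logger = logging.getLogger("sphinx")
    logger.addHandler(logging.StreamHandler())
    logger.addFilter(DuplicateDeclarationFilter())

    logger.warning("Duplicate C++ declaration: namespace transformer_engine")  # suppressed
    logger.warning("unrelated warning")  # still printed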

docs/debug.rst

Lines changed: 2 additions & 1 deletion

@@ -2,8 +2,9 @@
 Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 
 See LICENSE for license information.
+
 Precision debug tools
-==============================================
+=====================
 
 .. toctree::
     :caption: Precision debug tools

docs/debug/1_getting_started.rst

Lines changed: 9 additions & 6 deletions

@@ -4,7 +4,7 @@
 See LICENSE for license information.
 
 Getting started
-==============
+===============
 
 .. note::
 
@@ -38,7 +38,7 @@ To start debugging, one needs to create a configuration YAML file. This file lis
 one - ``UserProvidedPrecision`` - is a custom feature implemented by the user. Nvidia-DL-Framework-Inspect inserts features into the layers according to the config.
 
 Example training script
-----------------------
+-----------------------
 
 Let's look at a simple example of training a Transformer layer using Transformer Engine with FP8 precision. This example demonstrates how to set up the layer, define an optimizer, and perform a few training iterations using synthetic data.
 
@@ -81,7 +81,7 @@ We will demonstrate two debug features on the code above:
 2. Logging statistics for other GEMM operations, such as gradient statistics for data gradient GEMM within the LayerNormLinear sub-layer of the TransformerLayer.
 
 Config file
----------- 
+-----------
 
 We need to prepare the configuration YAML file, as below
 
@@ -114,7 +114,8 @@ We need to prepare the configuration YAML file, as below
 Further explanation on how to create config files is in the :doc:`next part of the documentation <2_config_file_structure>`.
 
 Adjusting Python file
---------------------
+---------------------
+
 
 .. code-block:: python
 
@@ -145,7 +146,8 @@ In the modified code above, the following changes were made:
 3. Added ``debug_api.step()`` after each of the forward-backward pass.
 
 Inspecting the logs
------------------- 
+-------------------
+
 
 Let's look at the files with the logs. Two files will be created:
 
@@ -213,7 +215,8 @@ The second log file (``nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-
 INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm iteration=000004 value=130776.7969
 
 Logging using TensorBoard
-------------------------
+-------------------------
+
 
 Precision debug tools support logging using `TensorBoard <https://www.tensorflow.org/tensorboard>`_. To enable it, one needs to pass the argument ``tb_writer`` to the ``debug_api.initialize()``. Let's modify ``train.py`` file.
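Tying the last hunk together, a hedged sketch of the `tb_writer` hookup described above; the `debug_api` import path and the other `initialize()` keywords are assumptions based on the surrounding docs, not part of this diff:

    from torch.utils.tensorboard import SummaryWriter
    import nvdlfw_inspect.api as debug_api  # import path assumed

    tb_writer = SummaryWriter("./tb_logs")
    debug_api.initialize(
        config_file="./config.yaml",  # the YAML file prepared earlier
        log_dir="./log_dir",
        tb_writer=tb_writer,          # routes logged statistics to TensorBoard
    )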

docs/debug/2_config_file_structure.rst

Lines changed: 9 additions & 6 deletions

@@ -4,13 +4,14 @@
 See LICENSE for license information.
 
 Config File Structure
-====================
+=====================
 
 To enable debug features, create a configuration YAML file to specify the desired behavior, such as determining which GEMMs (General Matrix Multiply operations) should run in higher precision rather than FP8 and defining which statistics to log.
 Below, we outline how to structure the configuration YAML file.
 
 General Format
--------------
+--------------
+
 
 A config file can have one or more sections, each containing settings for specific layers and features:
 
@@ -55,7 +56,8 @@ Sections may have any name and must contain:
 3. Additional fields describing features for those layers.
 
 Layer Specification
------------------- 
+-------------------
+
 
 Debug layers can be identified by a ``name`` parameter:
 
@@ -89,7 +91,8 @@ Examples:
     (...)
 
 Names in Transformer Layers
--------------------------- 
+---------------------------
+
 
 There are three ways to assign a name to a layer in the Transformer Engine:
 
@@ -154,7 +157,7 @@ Below is an example ``TransformerLayer`` with four linear layers that can be inf
 
 
 Structured Configuration for GEMMs and Tensors
---------------------------------------------- 
+----------------------------------------------
 
 Sometimes a feature is parameterized by a list of tensors or by a list of GEMMs.
 There are multiple ways of describing this parameterization.
@@ -216,7 +219,7 @@ We can use both structs for tensors and GEMMs. The tensors_struct should be nest
         gemm_feature_param1: value
 
 Enabling or Disabling Sections and Features
-------------------------------------------
+-------------------------------------------
 
 Debug features can be enabled or disabled with the ``enabled`` keyword:
 
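To make the section layout described in this file concrete, a hedged sketch of a config of this shape parsed with PyYAML; the feature name `DisableFP8GEMM` appears elsewhere in these docs, while the selector and parameter keys here are illustrative, not a schema reference:

    import yaml  # PyYAML

    CONFIG_TEXT = """
    my_section:                  # sections may have any name
      enabled: True              # the ``enabled`` keyword from the docs above
      layers:
        name: transformer_layer.self_attention.layernorm_qkv   # per "Layer Specification"
      DisableFP8GEMM:
        enabled: True
        gemms: [dgrad, wgrad]    # parameter name assumed
    """

    config = yaml.safe_load(CONFIG_TEXT)
    print(config["my_section"]["DisableFP8GEMM"]["gemms"])  # ['dgrad', 'wgrad']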

docs/debug/3_api_debug_setup.rst

Lines changed: 4 additions & 3 deletions

@@ -11,7 +11,8 @@ Please refer to the Nvidia-DL-Framework-Inspect `documentation <https://github.c
 Below, we outline the steps for debug initialization.
 
 initialize()
------------ 
+------------
+
 
 Must be called once on every rank in the global context to initialize Nvidia-DL-Framework-Inspect.
 
@@ -34,7 +35,7 @@ Must be called once on every rank in the global context to initialize Nvidia-DL-
         log_dir="./log_dir")
 
 set_tensor_reduction_group()
---------------------------
+----------------------------
 
 Needed only for logging tensor stats. In multi-GPU training, activation and gradient tensors are distributed across multiple nodes. This method lets you specify the group for the reduction of stats; see the `reduction group section <./4_distributed.rst#reduction-groups>`_ for more details.
 
@@ -61,7 +62,7 @@ If the tensor reduction group is not specified, then statistics are reduced acro
     # activation/gradient tensor statistics are reduced along pipeline_parallel_group
 
 set_weight_tensor_tp_group_reduce()
---------------------------------- 
+-----------------------------------
 
 By default, weight tensor statistics are reduced within the tensor parallel group. This function allows you to disable that behavior; for more details, see `reduction group section <./4_distributed.rst#reduction-groups>`_.
 
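Putting the three calls documented above together, a hedged end-to-end sketch; the `debug_api` import path and the boolean argument to `set_weight_tensor_tp_group_reduce` are assumptions based on the prose:

    import torch.distributed as dist
    import nvdlfw_inspect.api as debug_api  # import path assumed

    debug_api.initialize(
        config_file="./config.yaml",
        log_dir="./log_dir",
    )

    dist.init_process_group(backend="nccl")
    group = dist.new_group(ranks=[0, 1])  # illustrative reduction group

    # Reduce activation/gradient statistics within `group` only.
    debug_api.set_tensor_reduction_group(group)

    # Disable the default TP-group reduction of weight statistics.
    debug_api.set_weight_tensor_tp_group_reduce(False)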

docs/debug/3_api_features.rst

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@
 See LICENSE for license information.
 
 Debug features
-==========
+==============
 
 .. autoapiclass:: transformer_engine.debug.features.log_tensor_stats.LogTensorStats
 .. autoapiclass:: transformer_engine.debug.features.log_fp8_tensor_stats.LogFp8TensorStats

docs/debug/4_distributed.rst

Lines changed: 8 additions & 5 deletions

@@ -4,7 +4,7 @@
 See LICENSE for license information.
 
 Distributed training
-===================
+====================
 
 Nvidia-Pytorch-Inspect with Transformer Engine supports multi-GPU training. This guide describes how to run it and how the supported features work in the distributed setting.
 
@@ -14,7 +14,8 @@ To use precision debug tools in multi-GPU training, one needs to:
 2. If one wants to log stats, one may want to invoke ``debug_api.set_tensor_reduction_group`` with a proper reduction group.
 
 Behavior of the features
------------------------ 
+------------------------
+
 
 In a distributed setting, **DisableFP8GEMM** and **DisableFP8Layer** function similarly to the single-GPU case, with no notable differences.
 
@@ -28,7 +29,8 @@ In a distributed setting, **DisableFP8GEMM** and **DisableFP8Layer** function si
 Logging-related features are more complex and will be discussed further in the next sections.
 
 Reduction groups
--------------- 
+----------------
+
 
 In setups with tensor, data, or pipeline parallelism, some tensors are distributed across multiple GPUs, requiring a reduction operation to compute statistics for these tensors.
 
@@ -65,15 +67,16 @@ Below, we illustrate configurations for a 4-node setup with tensor parallelism s
 
 
 Microbatching
----------- - 
+-------------
+
 
 Let's dive into how statistics collection works with microbatching. By microbatching, we mean invoking multiple ``forward()`` calls for each ``debug_api.step()``. The behavior is as follows:
 
 - For weight tensors, the stats remain the same for each microbatch because the weight does not change.
 - For other tensors, the stats are accumulated.
 
 Logging to files and TensorBoard
------------------------------ 
+--------------------------------
 
 In a single-node setup with ``default_logging_enabled=True``, all logs are saved by default to ``log_dir/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log``. In multi-GPU training, each node writes its reduced statistics to its unique file, named ``log_dir/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-i.log`` for rank i. Because these logs contain reduced statistics, the logged values are identical for all nodes within a reduction group.
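A hedged sketch of the microbatching semantics above: several `forward()` calls per `debug_api.step()`, with activation/gradient statistics accumulating across microbatches while weight statistics stay constant within a step (the tiny model and data are illustrative):

    import torch
    import nvdlfw_inspect.api as debug_api  # import path assumed

    model = torch.nn.Linear(8, 8)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    batch = torch.randn(4, 2, 8)  # 4 microbatches of 2 samples each

    for step in range(3):
        for microbatch in batch:  # multiple forward() calls per debug_api.step()
            model(microbatch).sum().backward()  # activation/gradient stats accumulate
        optimizer.step()
        optimizer.zero_grad()
        debug_api.step()  # weight stats were constant within this step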
