[PyTorch] cuda graph support #575
Conversation
```python
def _reset_caches(self) -> None:
    """Reset cached values

    Should be called after any in-place operation.

    """
    self._transpose = None
```
Removing the automatic cache clearing makes using the transpose cache a much more manual and dangerous process. Consider something like:

```python
matmul(x, w.transpose(0, 1))
w -= learning_rate * w.grad
matmul(x, w.transpose(0, 1))
```

Previously we could just set update_cache="lazy". Now there needs to be manual logic to figure out the microbatch step, or else it will provide stale values.
In this example, caching is not used, so a fresh transpose will be computed each time.
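For concreteness, here is a toy sketch of the staleness risk being discussed when caching is enabled. Plain tensors and a hand-rolled cache stand in for Float8Tensor; none of the names below are the real API:

```python
import torch

w = torch.randn(4, 4)
x = torch.randn(4, 4)
grad = torch.randn(4, 4)
learning_rate = 0.1

cached_wT = None  # toy stand-in for a transpose cache


def w_transpose(use_cache: bool):
    """Return w's transpose, optionally reusing the toy cache."""
    global cached_wT
    if not use_cache:
        return w.transpose(0, 1).contiguous()
    if cached_wT is None:
        cached_wT = w.transpose(0, 1).contiguous()
    return cached_wT


y1 = x @ w_transpose(use_cache=True)
w -= learning_rate * grad             # in-place update: the cached transpose is now stale
y2 = x @ w_transpose(use_cache=True)  # silently reuses the stale transpose unless
                                      # the caller remembers to reset cached_wT
```

With automatic clearing, the in-place update would invalidate the cache and the second call would recompute; without it, the invalidation becomes the caller's responsibility.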
If caching is used, it is reasonable to expect the user to know when to reuse a cached value and when to force a recompute. This is consistent with our design of the is_first_microbatch argument to the forward pass of the module APIs.
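For reference, the existing module-level pattern looks roughly like this; te.Linear, fp8_autocast, and the is_first_microbatch argument are existing Transformer Engine APIs, while the shapes and loop are only illustrative:

```python
import torch
import transformer_engine.pytorch as te

layer = te.Linear(1024, 1024).cuda()
microbatches = [torch.randn(32, 1024, device="cuda") for _ in range(4)]

for step, mb in enumerate(microbatches):
    with te.fp8_autocast(enabled=True):
        # Cast (and transpose) the FP8 weights only on the first microbatch;
        # later microbatches reuse the cached FP8 copies.
        out = layer(mb, is_first_microbatch=(step == 0))
```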
Note: we use two arguments, cache and update_cache, to support this logic.
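A minimal sketch of how the two arguments could compose, using a toy class rather than the actual Float8Tensor implementation (the exact merged signature may differ):

```python
import torch


class _ToyFP8Tensor:
    """Toy stand-in to illustrate the cache / update_cache interaction."""

    def __init__(self, data: torch.Tensor):
        self._data = data
        self._transpose = None

    def transpose(self, cache: bool = False, update_cache: bool = False):
        if not cache:
            # Caching disabled: always return a freshly computed transpose.
            return self._data.transpose(0, 1).contiguous()
        if update_cache or self._transpose is None:
            # Refresh on request, or the first time the cache is needed.
            self._transpose = self._data.transpose(0, 1).contiguous()
        return self._transpose
```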
I think we're overfitting to the Linear weight use-case. For example, in #707 I want to pass Float8Tensors between ops as inputs or dgrads:

```python
class DbiasCastTranspose:
    def backward(self, dy):
        db = dy.sum(dim=0)
        dx = cast_transpose(dy)  # Creates Float8Tensor with transpose cache
        return dx, db

class FP8Linear:  # Part of FP8 attention
    def backward(self, dy):
        if not isinstance(dy, Float8Tensor):
            dy = Float8Tensor.to_float8(dy)
        dx = Float8Tensor(...)  # No transpose cache
        fp8_gemm(w.transpose()._data, dy.transpose()._data, out=dx._data)
        dw = fp8_gemm(x, dy)
        return dx, dw
```

FP8Linear has no idea where its input came from. Maybe it's from DbiasCastTranspose (a Float8Tensor with a cached transpose), FP8Linear (a Float8Tensor without a cached transpose), or a non-FP8 op. Our current approach with lazy transpose caching gives us a lot of flexibility, and I think we should abandon it only when really necessary.
I suppose this is not precisely relevant since it doesn't involve in-place operations, but it's a more general comment about the design of Float8Tensor.
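A hedged sketch of the consumer-side flexibility being argued for: with lazy caching, an op like FP8Linear can ask for a transpose without knowing whether the producer already cached one. The helper name and attributes below are illustrative only:

```python
def _lazy_transpose(t):
    """Return t's transpose, filling t._transpose only if it is still empty.

    Works the same whether t arrived with a pre-filled cache (e.g. from a
    cast-transpose producer) or with no cache at all.
    """
    if getattr(t, "_transpose", None) is None:
        # Stand-in for the FP8 transpose kernel.
        t._transpose = t._data.transpose(0, 1).contiguous()
    return t._transpose
```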
transformer_engine/common/include/transformer_engine/cast_transpose_noop.h
/te-ci
/te-ci pytorch
/te-ci pytorch
/te-ci pytorch
/te-ci pytorch
/te-ci pytorch
/te-ci pytorch
/te-ci pytorch
* FP8 cuda graphs
* Fix numerics
* exclude torch compile from numerics tests
* More numerics fixes
* Fix tests
* Fix CI
* rm fusion from unfused path

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Vasudevan Rengasamy <[email protected]>
Co-authored-by: Charlene Yang <[email protected]>
This PR adds the following features (high-level):

- A make_graphed_callables API, similar to the PyTorch API, with some additional arguments for FP8 usage (see the usage sketch below). Support for fp8 weight caching via the existing is_first_microbatch argument is also retained.
- Float8Tensor, which makes the transposes persistent for graph capture. Also fixes use cases for the vanilla optimizers (non fp8-distopt).
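A rough usage sketch of the new API follows. te.TransformerLayer and DelayedScaling are existing Transformer Engine APIs; the FP8-related keyword arguments to make_graphed_callables follow the description above but should be checked against the merged signature:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

layer = te.TransformerLayer(1024, 4096, 16).cuda()
sample_input = torch.randn(8, 2, 1024, device="cuda", requires_grad=True)

# Capture the layer into a CUDA graph with FP8 enabled inside the capture.
graphed_layer = te.make_graphed_callables(
    layer,
    (sample_input,),
    fp8_enabled=True,             # assumption: toggle for FP8 execution during capture/replay
    fp8_recipe=DelayedScaling(),  # assumption: FP8 recipe argument
)

out = graphed_layer(sample_input)
out.sum().backward()
```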