
[PyTorch] Non-reentrant mode for activation recompute #670

Merged — 12 commits merged on Feb 24, 2024

Conversation

@denera (Collaborator) commented on Feb 15, 2024

The existing implementation of te.distributed.checkpoint() is hard-coded to mimic torch.utils.checkpoint.checkpoint(..., use_reentrant=True) and does not support the use_reentrant=False mode. The reentrant mode requires at least one input tensor to the forward pass to have requires_grad=True, which is not possible when the input to the checkpointed module is not a leaf node.
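For context, here is a minimal PyTorch-only illustration of that constraint; it uses a plain nn.Linear stand-in rather than a TE module, and the shapes are arbitrary:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)  # input does not require grad (e.g. produced under no_grad)

# Reentrant mode: the forward runs under no_grad, so with no grad-requiring
# input the checkpointed output is cut off from autograd entirely.
out = checkpoint(layer, x, use_reentrant=True)
print(out.requires_grad)  # False -> nothing can backprop through this segment

# Non-reentrant mode: gradients reach the parameters even though the input
# itself does not require grad.
out = checkpoint(layer, x, use_reentrant=False)
out.sum().backward()
print(layer.weight.grad is not None)  # True
```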

This PR implements support for use_reentrant=False using a pair of nested torch.autograd.graph.saved_tensors_hooks(pack, unpack) contexts. The logical sequence is as follows (a minimal sketch of the pattern appears after the list):

  • The outer pack_hook(x) intercepts the ctx.save_for_backward(x, ...) calls in the forward pass. Here, we discard the activation tensors we would normally save and replace their nodes in the computation graph with integer indices into a list of recomputed tensors (to be populated later).
  • The autograd engine triggers the outer unpack_hook(idx) to populate ctx.saved_tensors in the backward pass.
  • If the list of recomputed tensors is empty (idx == 0), the outer unpack_hook(idx) triggers the forward recompute. Within the recompute, an inner pack_hook(x) intercepts the ctx.save_for_backward(x, ...) calls to stash the detached activations into the recomputed tensors list.
  • Otherwise, if the activations have already been recomputed (idx >= 1), the outer unpack_hook(idx) simply returns the activation tensor at that index and clears it from the list of recomputed tensors.
  • The inner unpack_hook(idx) is never executed.
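Below is a minimal, self-contained sketch of this nested-hook pattern. The helper names (checkpoint_sketch, recomputed, etc.) are hypothetical, and the sketch omits the input detaching, RNG/FP8 state handling, and error checking that the real implementation needs:

```python
import torch


def checkpoint_sketch(module, *args):
    """Hypothetical illustration of the nested saved_tensors_hooks pattern."""
    recomputed = []   # activations stashed by the inner pack hook during recompute
    counter = [0]     # running index for tensors saved in the original forward

    def recompute():
        # Inner hooks: during the recompute, stash detached activations in the
        # same order they were originally saved. The inner unpack hook is never
        # reached because the recomputed graph's backward is never executed.
        def inner_pack(x):
            recomputed.append(x.detach())
            return x

        def inner_unpack(_):
            raise RuntimeError("inner unpack_hook should never be called")

        # The real implementation also detaches the inputs and restores the
        # RNG/FP8 state before re-running the forward; both are omitted here.
        with torch.enable_grad(), \
             torch.autograd.graph.saved_tensors_hooks(inner_pack, inner_unpack):
            module(*args)

    # Outer hooks: in the original forward, replace every tensor handed to
    # ctx.save_for_backward() with an integer index; in the backward, trigger
    # the recompute once and then serve tensors out of the recomputed list.
    def outer_pack(x):
        idx = counter[0]
        counter[0] += 1
        return idx            # the activation itself is discarded

    def outer_unpack(idx):
        if not recomputed:    # first unpack call triggers the recompute
            recompute()
        return recomputed[idx]

    with torch.autograd.graph.saved_tensors_hooks(outer_pack, outer_unpack):
        return module(*args)


# Usage: the input itself does not need requires_grad=True; gradients still
# reach the module parameters through the recomputed activations.
layer = torch.nn.Linear(16, 16)
x = torch.randn(4, 16)
checkpoint_sketch(layer, x).sum().backward()
assert layer.weight.grad is not None
```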

@denera added the enhancement label (New feature or request) on Feb 15, 2024
@denera force-pushed the databricks/non-reentrant-checkpoint branch from 45ee94f to f4e6fce on February 16, 2024 15:26
…eckpointing in non-reentrant mode

Signed-off-by: Alp Dener <[email protected]>
@denera (Collaborator, Author) commented on Feb 17, 2024

/te-ci pytorch

@ksivaman (Member) commented

For future reference, the code is copied from native PyTorch non-reentrant mode here. @denera Is there a reason we've changed the names for some of the implementations? E.g., _checkpoint_hook → _CheckpointHook. If it's purely for naming, it might be better for us to make an exception here to be consistent with the reference implementation.

@denera (Collaborator, Author) commented on Feb 19, 2024

@ksivaman There's no particular reason for the naming beyond consistency with TE conventions (PascalCase for classes and snake_case for functions). I think making an exception here to remain consistent with the original PyTorch source is a good idea. I'll make the change.

@denera self-assigned this on Feb 21, 2024
@denera force-pushed the databricks/non-reentrant-checkpoint branch from 42eb675 to b1c4bb2 on February 21, 2024 01:39
@denera requested review from ksivaman and ptrendx on February 21, 2024 01:39
@denera (Collaborator, Author) commented on Feb 21, 2024

/te-ci pytorch

Comment on lines -140 to +144:

- @contextmanager
- def activation_recompute_forward(
-     activation_recompute: bool = False,
-     recompute_phase: bool = False,
- ) -> None:
+ class activation_recompute_forward(AbstractContextManager, ContextDecorator):
Review comment (Member):

Curious, why?

Reply (Collaborator, Author):

The native PyTorch non-reentrant checkpoint has an option to provide a context function that returns (forward_ctx, recompute_ctx) objects, which are then combined with the native _checkpoint_hook and _recomputation_hook within PyTorch. Re-implementing our activation_recompute_forward() as a class instead of a function made this workflow easier and cleaner for me.

Of course, this PR does not use the native PyTorch checkpoint, but I've confirmed in my limited testing that it does work when supplied with the right context function. The caveat is that you have to correctly set the RNG states for all the devices once at the beginning, and then make sure the modules never tamper with the RNG state themselves. The preserve_rng_state=True option in PyTorch's native checkpointing makes sure that the initially correct RNG states are preserved through the checkpoint and recompute.

I kept this out of the PR because:

  1. It does not work with TE modules/models that use CudaRNGStateManager. Supporting that requires TE to implement its own non-reentrant checkpoint, which I've done in this PR.
  2. TE modules that implement custom forward/backward ops do not benefit from the early stopping feature in the PyTorch checkpoint because they call ctx.save_for_backward() only once, with all the saved tensors passed in bulk. The internal bookkeeping for early stopping relies on this being called separately for each tensor that needs to be saved.

Since I already did this conversion, I left it in this PR in preparation for the future possibility that we might figure out a way to eliminate CudaRNGStateManager and perhaps restructure how we use ctx.save_for_backward() in custom forward/backward ops. That would let us completely get rid of TE's own checkpoint implementation and just use PyTorch's native API.
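For illustration, here is a hedged sketch of the context-function workflow described above, using PyTorch's native non-reentrant checkpoint. The context_fn argument is available in recent PyTorch releases; RecomputeContext and context_fn are hypothetical stand-ins for TE's activation_recompute_forward, not the actual TE code:

```python
from contextlib import AbstractContextManager, ContextDecorator

import torch
from torch.utils.checkpoint import checkpoint


class RecomputeContext(AbstractContextManager, ContextDecorator):
    """Hypothetical stand-in for a class-based activation-recompute context."""

    def __init__(self, recompute_phase: bool = False):
        self.recompute_phase = recompute_phase

    def __enter__(self):
        # e.g. flip global flags that tell the modules whether this is the
        # original forward pass or the recompute pass
        return self

    def __exit__(self, *exc):
        # restore whatever state __enter__ changed
        return None


def context_fn():
    # checkpoint() enters the first context around the original forward and
    # the second context around the recomputation in the backward pass.
    return RecomputeContext(recompute_phase=False), RecomputeContext(recompute_phase=True)


layer = torch.nn.Linear(16, 16)
x = torch.randn(4, 16)
out = checkpoint(layer, x, use_reentrant=False, context_fn=context_fn)
out.sum().backward()
```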

@denera (Collaborator, Author) commented on Feb 22, 2024

/te-ci pytorch

@denera (Collaborator, Author) commented on Feb 23, 2024

@ksivaman @ptrendx The new non-reentrant checkpoint has a CI failure on test_numerics.py::test_gpt_full_activation_recompute with the float32 dtype, no fp8, and batch size 1 only. The exact same test passes with the float16 and bfloat16 types, and float32 passes the batch size 2 test too. Very odd, and unfortunately I haven't been able to reproduce it with manual testing on the same nodes either.

This particular test was actually one of several batch size 1 failures for activation recompute in previous CI runs. @ptrendx guessed that they may be due to nondeterminism from the bias-GELU fusion, and turning that off for non-reentrant checkpointing resolved all the failures except this specific one. So it looks like I'm still missing something here, but I haven't been able to narrow it down.

Any ideas what might be happening here?

ksivaman and others added 2 commits February 23, 2024 00:16
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
@denera (Collaborator, Author) commented on Feb 23, 2024

/te-ci pytorch

@ksivaman (Member) left a review:

LGTM

@ksivaman merged commit 82bc797 into NVIDIA:main on Feb 24, 2024
17 of 20 checks passed
@ksivaman added the 1.5.0 label on Feb 24, 2024
@ptrendx (Member) commented on Feb 24, 2024

For the record, I don't agree with merging this at this point. The change made to pass the test is not applicable to most workloads using recomputation, and the reason for the failure is not really understood. I was talking to @denera offline and we have some more ideas for debugging the underlying problem; please open a follow-up PR if you find the problem to be caused by the checkpointing logic.

@denera (Collaborator, Author) commented on Feb 24, 2024

@ptrendx I didn't reach out to @ksivaman fast enough after our talk to hold off on merging. Sorry!

Fortunately, the previous reentrant checkpoint is still the default option for TE checkpointing, so existing applications should not see any changes on their end.

In the meantime, I suspect that the pack/unpack hooks in the non-reentrant checkpoint cause the recompute to recover the wrong tensor whenever the compile cache limit triggers a recompile of the hooks. I see the recomputed tensor counter getting reset when it shouldn't be. I think I have a solution for that and will file a PR early next week.
