
Fix cuda graph capture for grouped gemm #1345

Merged

6 commits merged into NVIDIA:main on Nov 27, 2024

Conversation

@xrennvidia (Collaborator) commented on Nov 21, 2024

Description

CUDA graph capture does not work with grouped GEMM: the saved forward activations are corrupted before the backward graph (bwd_graph) is replayed. Explicitly setting retain_graph=True when capturing the backward graph keeps those activations alive and fixes the issue.
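For context, the sketch below is a simplified illustration of where the flag applies (it is not the actual Transformer Engine code, and the warmup iterations and side-stream handling that real graph capture requires are omitted): without retain_graph=True, autograd frees the saved activations once the backward capture finishes, so later replays of bwd_graph read freed memory.

```python
# Simplified illustration (assumed names, not Transformer Engine's code) of
# capturing forward and backward passes into CUDA graphs. Warmup iterations,
# which real capture requires, are omitted for brevity.
import torch

module = torch.nn.Linear(16, 16).cuda()
static_input = torch.randn(8, 16, device="cuda", requires_grad=True)

# Capture the forward pass.
fwd_graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(fwd_graph):
    static_output = module(static_input)

static_grad_output = torch.ones_like(static_output)

# Capture the backward pass. Without retain_graph=True, autograd frees the
# saved forward activations after this capture, so replaying bwd_graph later
# reads corrupted memory; retain_graph=True keeps them alive across replays.
bwd_graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(bwd_graph):
    static_grad_inputs = torch.autograd.grad(
        outputs=static_output,
        inputs=(static_input,) + tuple(module.parameters()),
        grad_outputs=static_grad_output,
        retain_graph=True,  # the fix described above
    )
```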

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@xrennvidia (Collaborator, Author) commented:

/te-ci pytorch

@xrennvidia requested a review from timmoon10 on Nov 22, 2024
@timmoon10 (Collaborator) left a comment:

Wouldn't we expect this to increase memory usage?

I see that torch.cuda.make_graphed_callables doesn't set retain_graph=True:
https://github.com/pytorch/pytorch/blob/c25b201583fc28243b87c460a2f18e2531a676e7/torch/cuda/graphs.py#L326-L336
We want to match plain PyTorch as much as possible unless there is a good reason to introduce divergence. If this is MoE-specific, perhaps we could add a kwarg like retain_graph_in_backward that is False by default.
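As a purely illustrative sketch of that suggestion (the wrapper and its signature here are hypothetical, not Transformer Engine's actual API), the flag could default to False so behaviour matches plain PyTorch, and only grouped-GEMM/MoE callers opt in:

```python
# Hypothetical sketch only: the kwarg name retain_graph_in_backward comes from
# the suggestion above; the wrapper itself is illustrative, not TE's real API.
from typing import Sequence

import torch


def capture_backward_graph(
    static_output: torch.Tensor,
    static_grad_output: torch.Tensor,
    static_inputs: Sequence[torch.Tensor],
    retain_graph_in_backward: bool = False,  # False keeps parity with plain PyTorch
):
    """Capture the backward pass into a CUDA graph, optionally retaining the
    autograd graph so the saved activations survive until replay."""
    bwd_graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(bwd_graph):
        static_grad_inputs = torch.autograd.grad(
            outputs=static_output,
            inputs=tuple(static_inputs),
            grad_outputs=static_grad_output,
            retain_graph=retain_graph_in_backward,
        )
    return bwd_graph, static_grad_inputs


# Grouped-GEMM/MoE callers would opt in with retain_graph_in_backward=True;
# all other callers keep the default PyTorch behaviour.
```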

@xrennvidia (Collaborator, Author) commented:

/te-ci pytorch

@timmoon10 (Collaborator) left a comment:


LGTM

Signed-off-by: Xiaowei Ren <[email protected]>
@xrennvidia (Collaborator, Author) commented:

/te-ci pytorch

@xrennvidia merged commit a132ac4 into NVIDIA:main on Nov 27, 2024
14 of 15 checks passed
@xrennvidia deleted the xren/cg_fix_grouped_gemm branch on Nov 29, 2024