
Conversation

@pggPL pggPL (Collaborator) commented Dec 16, 2025

Description

Add docs for CPU offloading. This must be merged after #2343. Please review only the file with the CPU offloading doc: Features -> Other Optimizations -> CPU Offloading in the docs menu.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

pggPL and others added 11 commits December 15, 2025 16:17
…m low precision training

Signed-off-by: Pawel Gadzinski <[email protected]>
… add GPU checks

Changes:
- Remove optimizer code from all recipe examples (keep only forward/backward)
- Fix Format imports (use Format.E4M3 instead of string 'E4M3')
- Fix params_dtype for PyTorch examples (add params_dtype=torch.bfloat16)
- Add GPU capability assertions before START blocks for blockwise/mxfp8/nvfp4 (see the sketch after this list)
- Fix JAX imports (Float8CurrentScaling from common.recipe, NVFP4BlockScaling)
- Add global_shard_guard for TransformerLayer examples in JAX
- Fix fused_layers_jax.py return tuple unpacking
- Update memory_usage JAX examples with dynamic GPU measurement
- Remove memory_usage_3_jax (JAX doesn't support FP8 weight storage)
- Update performance_considerations.rst for JAX differences
- Delete unused .out files and fp8_autocast_jax.py
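
A minimal sketch of the pattern these changes converge on is below for reference. It assumes the public transformer_engine.pytorch (Linear, fp8_autocast) and transformer_engine.common.recipe (Format, DelayedScaling) APIs; the capability threshold and tensor shapes are illustrative assumptions, not values taken from this PR.

```python
# Sketch of the pattern described above: Format enum instead of a string,
# explicit params_dtype, and a GPU capability check before the FP8 recipe runs.
# The (9, 0) threshold and the layer sizes are illustrative assumptions.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# GPU capability assertion before the START block of a recipe example
# (the required capability depends on the recipe: blockwise, mxfp8, nvfp4).
assert torch.cuda.get_device_capability() >= (9, 0), "FP8 recipe example requires a capable GPU"

# Use the Format enum rather than the string 'E4M3'.
recipe = DelayedScaling(fp8_format=Format.E4M3)

# Set params_dtype explicitly so parameters are created in bfloat16.
layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16, device="cuda")

x = torch.randn(32, 1024, dtype=torch.bfloat16, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)        # forward only; optimizer code is removed from the examples
y.sum().backward()      # backward pass; no optimizer step
```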

Signed-off-by: Pawel Gadzinski <[email protected]>
- Restructure sections: move example to 'CPU Offloading in Transformer Engine', create separate 'Default Offloading Scheduling' section
- Add intro paragraphs explaining when each mode is enabled
- Clarify scheduling algorithm with event-based offloading details
- Document ManualOffloadSynchronizer methods with accurate stream synchronization behavior
- Add use case for manual mode (pipeline parallelism, custom scheduling)
- Improve Caveats section with PyTorch hooks link and clearer explanations
- Use documentation-style language throughout
- Fix grammatical issues and trailing whitespace

Signed-off-by: Pawel Gadzinski <[email protected]>
@pggPL pggPL added the documentation label (Improvements or additions to documentation) on Dec 16, 2025
@greptile-apps greptile-apps bot (Contributor) commented Dec 16, 2025

Greptile Overview

Greptile Summary

This PR adds comprehensive documentation for CPU offloading, a memory optimization technique that moves activation tensors between GPU and CPU memory to reduce GPU memory usage during training.

Key additions:

  • Clear explanation of CPU offloading mechanics (asynchronous transfers, comparison with activation checkpointing)
  • Hardware considerations section comparing PCIe Gen5 (128 GB/s) vs NVLink-C2C (900 GB/s) bandwidth
  • Detailed API documentation for get_cpu_offload_context() with two modes: default scheduling and manual synchronization
  • Three complete code examples: basic usage, manual synchronization, and CUDA graphs integration (the basic pattern is sketched after this list)
  • Visual diagrams illustrating scheduling behavior with optimal overlap and stall scenarios
  • Important caveats section covering heuristic activation detection and memory layout changes
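
For orientation, a minimal sketch of the default-scheduling mode follows, assuming get_cpu_offload_context is exposed from transformer_engine.pytorch and following the per-layer context/sync pattern shown in the sequence diagram further down. The keyword names (num_layers, model_layers, offload_activations, offload_weights) and the TransformerLayer sizes are assumptions here and may differ from the reviewed examples.

```python
# Minimal sketch of the default-scheduling mode, assuming the
# transformer_engine.pytorch.get_cpu_offload_context() API; keyword names and
# layer sizes are assumptions and may differ from the documented examples.
import torch
import transformer_engine.pytorch as te

num_layers = 4
layers = torch.nn.ModuleList(
    [te.TransformerLayer(hidden_size=1024, ffn_hidden_size=4096,
                         num_attention_heads=16, device="cuda")
     for _ in range(num_layers)]
)

# Returns the context manager that wraps each layer's forward pass and the
# synchronization function that must be applied to each layer's output.
offload_context, sync_function = te.get_cpu_offload_context(
    enabled=True,
    num_layers=num_layers - 1,   # offload all but the last layer (assumption)
    model_layers=num_layers,
    offload_activations=True,
    offload_weights=False,
)

x = torch.randn(128, 8, 1024, device="cuda")  # [sequence, batch, hidden]
for layer in layers:
    with offload_context:
        x = layer(x)       # tensors saved for backward are captured here
    x = sync_function(x)   # orders the async GPU->CPU copies against the compute stream

x.sum().backward()         # offloaded activations are reloaded to the GPU as needed
```

As the sequence diagram below illustrates, sync_function(x) is applied to each layer's output so that the asynchronous GPU-to-CPU copies queued on the offload stream are ordered against the compute stream before the backward pass reloads the activations.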

The documentation is well-structured, technically accurate, and provides practical guidance for users implementing CPU offloading in their training pipelines.

Confidence Score: 5/5

  • This PR is safe to merge with no issues found
  • The CPU offloading documentation is comprehensive, technically accurate, and well-structured. Code examples are correct and aligned with the actual API implementation. The documentation includes proper warnings about limitations, clear visual diagrams, and practical usage examples. All files are documentation-only with no runtime code changes.
  • No files require special attention

Important Files Changed

File Analysis

Filename | Score | Overview
docs/features/other_optimizations/cpu_offloading/cpu_offloading.rst | 5/5 | Comprehensive CPU offloading documentation with clear explanations, diagrams, and code examples
docs/features/other_optimizations/cpu_offloading/pytorch_basic_offload_example.py | 5/5 | Clear basic example demonstrating default scheduling mode
docs/features/other_optimizations/cpu_offloading/pytorch_manual_offload_example.py | 5/5 | Well-structured example showing manual synchronization mode
docs/features/other_optimizations/cpu_offloading/pytorch_cuda_graphs_example.py | 5/5 | Demonstrates CUDA graphs integration with CPU offloading

Sequence Diagram

sequenceDiagram
    participant User
    participant Model
    participant Context as cpu_offload_context
    participant Sync as sync_function
    participant GPU
    participant CPU
    participant Stream as offload_stream

    User->>Model: Initialize layers & get_cpu_offload_context()
    Model->>User: Returns (context, sync_function)
    
    Note over User,CPU: Forward Pass
    loop For each layer
        User->>Context: with cpu_offload_context:
        Context->>GPU: Capture tensors saved for backward
        User->>Model: layer.forward(x)
        Model->>GPU: Compute forward pass
        GPU-->>Context: Store activations
        Context->>Stream: Queue async GPU→CPU copy
        Stream->>CPU: Transfer activations (async)
        User->>Sync: sync_function(x)
        Sync->>GPU: Return output tensor
    end
    
    Note over User,CPU: Backward Pass
    User->>GPU: loss.backward()
    loop For each layer (reverse order)
        GPU->>Stream: Check if activation ready
        Stream->>GPU: Wait for CPU→GPU reload
        CPU->>GPU: Transfer activations (async)
        GPU->>GPU: Compute layer backward
    end

@greptile-apps greptile-apps bot (Contributor) left a comment

51 files reviewed, no comments


