
Conversation

@pggPL pggPL (Collaborator) commented Dec 16, 2025

Description

Add docs for CPU offloading. This must be merged after #2343. Please review only the file with the CPU offloading doc: Features -> Other Optimizations -> CPU Offloading in the docs menu.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

pggPL and others added 11 commits December 15, 2025 16:17
…m low precision training

Signed-off-by: Pawel Gadzinski <[email protected]>
… add GPU checks

Changes:
- Remove optimizer code from all recipe examples (keep only forward/backward)
- Fix Format imports (use Format.E4M3 instead of string 'E4M3')
- Fix params_dtype for PyTorch examples (add params_dtype=torch.bfloat16)
- Add GPU capability assertions before START blocks for blockwise/mxfp8/nvfp4 (see the sketch after this list)
- Fix JAX imports (Float8CurrentScaling from common.recipe, NVFP4BlockScaling)
- Add global_shard_guard for TransformerLayer examples in JAX
- Fix fused_layers_jax.py return tuple unpacking
- Update memory_usage JAX examples with dynamic GPU measurement
- Remove memory_usage_3_jax (JAX doesn't support FP8 weight storage)
- Update performance_considerations.rst for JAX differences
- Delete unused .out files and fp8_autocast_jax.py
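
A minimal sketch of the pattern these changes converge on is below for reference. It assumes the public transformer_engine.pytorch (Linear, fp8_autocast) and transformer_engine.common.recipe (Format, DelayedScaling) APIs; the capability threshold and tensor shapes are illustrative assumptions, not values taken from this PR.

```python
# Sketch of the pattern described above: Format enum instead of a string,
# explicit params_dtype, and a GPU capability check before the FP8 recipe runs.
# The (9, 0) threshold and the layer sizes are illustrative assumptions.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# GPU capability assertion before the START block of a recipe example
# (the required capability depends on the recipe: blockwise, mxfp8, nvfp4).
assert torch.cuda.get_device_capability() >= (9, 0), "FP8 recipe example requires a capable GPU"

# Use the Format enum rather than the string 'E4M3'.
recipe = DelayedScaling(fp8_format=Format.E4M3)

# Set params_dtype explicitly so parameters are created in bfloat16.
layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16, device="cuda")

x = torch.randn(32, 1024, dtype=torch.bfloat16, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)        # forward only; optimizer code is removed from the examples
y.sum().backward()      # backward pass; no optimizer step
```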

Signed-off-by: Pawel Gadzinski <[email protected]>
- Restructure sections: move example to 'CPU Offloading in Transformer Engine', create separate 'Default Offloading Scheduling' section
- Add intro paragraphs explaining when each mode is enabled
- Clarify scheduling algorithm with event-based offloading details
- Document ManualOffloadSynchronizer methods with accurate stream synchronization behavior
- Add use case for manual mode (pipeline parallelism, custom scheduling)
- Improve Caveats section with PyTorch hooks link and clearer explanations
- Use documentation-style language throughout
- Fix grammatical issues and trailing whitespace

Signed-off-by: Pawel Gadzinski <[email protected]>
@pggPL pggPL added the documentation label (Improvements or additions to documentation) on Dec 16, 2025
@greptile-apps greptile-apps bot (Contributor) commented Dec 16, 2025

Greptile Overview

Greptile Summary

This PR adds comprehensive documentation for CPU offloading, a memory optimization technique that moves activation tensors between GPU and CPU memory to reduce GPU memory usage during training.

Key additions:

  • Clear explanation of CPU offloading mechanics (asynchronous transfers, comparison with activation checkpointing)
  • Hardware considerations section comparing PCIe Gen5 (128 GB/s) vs NVLink-C2C (900 GB/s) bandwidth
  • Detailed API documentation for get_cpu_offload_context() with two modes: default scheduling and manual synchronization
  • Three complete code examples: basic usage, manual synchronization, and CUDA graphs integration (the basic pattern is sketched after this list)
  • Visual diagrams illustrating scheduling behavior with optimal overlap and stall scenarios
  • Important caveats section covering heuristic activation detection and memory layout changes
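
For orientation, a minimal sketch of the default-scheduling mode follows, assuming get_cpu_offload_context is exposed from transformer_engine.pytorch and following the per-layer context/sync pattern shown in the sequence diagram further down. The keyword names (num_layers, model_layers, offload_activations, offload_weights) and the TransformerLayer sizes are assumptions here and may differ from the reviewed examples.

```python
# Minimal sketch of the default-scheduling mode, assuming the
# transformer_engine.pytorch.get_cpu_offload_context() API; keyword names and
# layer sizes are assumptions and may differ from the documented examples.
import torch
import transformer_engine.pytorch as te

num_layers = 4
layers = torch.nn.ModuleList(
    [te.TransformerLayer(hidden_size=1024, ffn_hidden_size=4096,
                         num_attention_heads=16, device="cuda")
     for _ in range(num_layers)]
)

# Returns the context manager that wraps each layer's forward pass and the
# synchronization function that must be applied to each layer's output.
offload_context, sync_function = te.get_cpu_offload_context(
    enabled=True,
    num_layers=num_layers - 1,   # offload all but the last layer (assumption)
    model_layers=num_layers,
    offload_activations=True,
    offload_weights=False,
)

x = torch.randn(128, 8, 1024, device="cuda")  # [sequence, batch, hidden]
for layer in layers:
    with offload_context:
        x = layer(x)       # tensors saved for backward are captured here
    x = sync_function(x)   # orders the async GPU->CPU copies against the compute stream

x.sum().backward()         # offloaded activations are reloaded to the GPU as needed
```

As the sequence diagram below illustrates, sync_function(x) is applied to each layer's output so that the asynchronous GPU-to-CPU copies queued on the offload stream are ordered against the compute stream before the backward pass reloads the activations.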

The documentation is well-structured, technically accurate, and provides practical guidance for users implementing CPU offloading in their training pipelines.

Confidence Score: 5/5

  • This PR is safe to merge with no issues found
  • The CPU offloading documentation is comprehensive, technically accurate, and well-structured. Code examples are correct and aligned with the actual API implementation. The documentation includes proper warnings about limitations, clear visual diagrams, and practical usage examples. All files are documentation-only with no runtime code changes.
  • No files require special attention

Important Files Changed

File Analysis

Filename | Score | Overview
docs/features/other_optimizations/cpu_offloading/cpu_offloading.rst | 5/5 | Comprehensive CPU offloading documentation with clear explanations, diagrams, and code examples
docs/features/other_optimizations/cpu_offloading/pytorch_basic_offload_example.py | 5/5 | Clear basic example demonstrating default scheduling mode
docs/features/other_optimizations/cpu_offloading/pytorch_manual_offload_example.py | 5/5 | Well-structured example showing manual synchronization mode
docs/features/other_optimizations/cpu_offloading/pytorch_cuda_graphs_example.py | 5/5 | Demonstrates CUDA graphs integration with CPU offloading

Sequence Diagram

sequenceDiagram
    participant User
    participant Model
    participant Context as cpu_offload_context
    participant Sync as sync_function
    participant GPU
    participant CPU
    participant Stream as offload_stream

    User->>Model: Initialize layers & get_cpu_offload_context()
    Model->>User: Returns (context, sync_function)
    
    Note over User,CPU: Forward Pass
    loop For each layer
        User->>Context: with cpu_offload_context:
        Context->>GPU: Capture tensors saved for backward
        User->>Model: layer.forward(x)
        Model->>GPU: Compute forward pass
        GPU-->>Context: Store activations
        Context->>Stream: Queue async GPU→CPU copy
        Stream->>CPU: Transfer activations (async)
        User->>Sync: sync_function(x)
        Sync->>GPU: Return output tensor
    end
    
    Note over User,CPU: Backward Pass
    User->>GPU: loss.backward()
    loop For each layer (reverse order)
        GPU->>Stream: Check if activation ready
        Stream->>GPU: Wait for CPU→GPU reload
        CPU->>GPU: Transfer activations (async)
        GPU->>GPU: Compute layer backward
    end

@greptile-apps greptile-apps bot (Contributor) left a comment

51 files reviewed, no comments


