@CldStlkr
Add checkpointing functionality for workflow resume capability

Description

  • Implemented restart functionality for workflows.
  • While this adds significant functionality, much of the implementation follows consistent patterns across the three workflow types.
  • This PR does not yet follow the IO pattern that @NickGeneva mentioned in the issue thread, but it should be a good base to expand upon.

Core Changes:

  1. New Checkpoint Module (earth2studio/utils/checkpoint.py)
    • save_checkpoint() - Saves simulation state, coordinates, RNG states
    • load_checkpoint() - Restores saved state with device handling
    • validate_checkpoint_compatibility() - Validates that checkpoint works with current model
    • should_checkpoint() - Decision logic for when to save
  2. Enhanced Workflows (earth2studio/run.py)
    • Added 3 optional parameters to all workflow functions:
      • checkpoint_path - Where to save/load checkpoints
      • checkpoint_interval - Save every N steps
      • resume_from_step - Resume from specified step
    • Dual execution paths:
      • Normal: Uses existing iterators (exact same behavior)
      • Resume: Manual time-stepping to account for mid-simulation restart
  3. Comprehensive Testing (test/utils/test_checkpoint.py)
    • 25 tests covering save/load, validation, error handling
    • 90% code coverage on the checkpoint utilities; this can be increased if needed
    • Tested CPU/CUDA compatibility edge cases
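As a rough illustration of how the three parameters could interact, the sketch below shows one possible decision rule for saving and one way to pick the steps that remain after a resume. The signatures and the step-numbering convention (step 0 is the initial condition, and resuming from step k continues with step k + 1) are assumptions made for this example, not necessarily what the PR implements; `steps_to_run` is a hypothetical helper.

```python
from typing import Optional


def should_checkpoint(
    step: int,
    checkpoint_path: Optional[str],
    checkpoint_interval: Optional[int],
) -> bool:
    """Assumed rule: save on every N-th step, and only if a path was given."""
    if checkpoint_path is None:
        return False
    if checkpoint_interval is None or checkpoint_interval <= 0:
        return False
    # Step 0 is treated as the initial condition and is never checkpointed here.
    return step > 0 and step % checkpoint_interval == 0


def steps_to_run(nsteps: int, resume_from_step: Optional[int] = None) -> range:
    """Hypothetical helper: steps still to execute in a run of `nsteps` steps."""
    start = 0 if resume_from_step is None else resume_from_step
    if not 0 <= start <= nsteps:
        raise ValueError(f"resume_from_step={start} is outside 0..{nsteps}")
    # A state saved at step `start` already exists, so continue from start + 1.
    return range(start + 1, nsteps + 1)


# Normal run of 10 steps, saving every 4 steps.
saved_steps = [s for s in steps_to_run(10) if should_checkpoint(s, "ckpt.pt", 4)]

# Resuming from the checkpoint written at step 8 leaves steps 9 and 10.
remaining = list(steps_to_run(10, resume_from_step=8))
```

Keeping the save decision separate from the step range is what lets the parameters be used independently: a run can save without resuming, resume without saving, or do both.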

Notes

  • No breaking changes: existing calls behave identically because the checkpoint parameters are optional
  • Maintains reproducibility through RNG state preservation
  • Used PyTorch's save/load for file-based checkpointing
  • Prevents incompatible resumes via coordinate system validation
  • Parameters are independent and can be used flexibly (save-only, resume-only, or both)
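The notes on RNG preservation and coordinate validation can be sketched with plain PyTorch calls. This is a minimal sketch under stated assumptions: `save_state` and `load_state` are hypothetical names, the checkpoint layout is invented for the example, coordinates are represented as a plain dict of lists, and only the CPU generator is handled (CUDA generator states are left out for brevity).

```python
import os
import tempfile

import torch


def save_state(path: str, step: int, x: torch.Tensor, coords: dict) -> None:
    """Write the state tensor, its coordinates, and the CPU RNG state to disk."""
    torch.save(
        {
            "step": step,
            "state": x.detach().cpu(),
            "coords": coords,
            "torch_rng_state": torch.get_rng_state(),
        },
        path,
    )


def load_state(path: str, expected_coords: dict, device: str = "cpu"):
    """Load a checkpoint, refuse mismatched coordinates, and restore the RNG."""
    ckpt = torch.load(path, map_location="cpu")
    if ckpt["coords"] != expected_coords:
        raise ValueError("checkpoint coordinates do not match the current model")
    torch.set_rng_state(ckpt["torch_rng_state"])
    return ckpt["step"], ckpt["state"].to(device)


# Illustrative coordinates; real workflows would carry the model's own.
coords = {"variable": ["t2m", "u10m"], "lat": [0.0, 1.0, 2.0]}

torch.manual_seed(0)
x = torch.randn(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ckpt.pt")
    save_state(path, 5, x, coords)
    draw_original = torch.randn(3)  # random draw made right after saving

    step, x_restored = load_state(path, coords)
    draw_resumed = torch.randn(3)  # same draw, replayed after the restore

    try:
        load_state(path, {"variable": ["t2m"], "lat": [0.0, 1.0, 2.0]})
        mismatch_rejected = False
    except ValueError:
        mismatch_rejected = True
```

Because the generator state is restored on load, the first random draw after a resume matches the draw the original run made after saving, which is the property that stochastic (for example, perturbed ensemble) runs rely on.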

This PR closes #446

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes. (Added docstring comments)
  • [ ] The CHANGELOG.md is up to date with these changes. (Will add in a new commit once changes have been reviewed)
  • [x] An issue is linked to this pull request.

Dependencies

None - Uses existing PyTorch save/load functionality

- Implement save/resume checkpointing for deterministic, diagnostic, and ensemble workflows
- Add comprehensive test suite with 90% coverage
- Solve GPU memory constraints for long-running simulations
- Maintain full backward compatibility with existing APIs

Fixes NVIDIA#446
Development

Successfully merging this pull request may close these issues.

🚀[FEA]: Adding Restart Functionality?
