
Improvements to the checkpoint functionality and memory occupation of DER #1567

Merged: 13 commits merged into ContinualAI:master from checkpoint_improvements on Jan 25, 2024

Conversation

@lrzpellegrini (Collaborator) commented on Jan 24, 2024:

The main focus of this PR is to introduce various fixes and improvements to the checkpointing functionality:

  • The checkpoint functionality now supports a simplified way to register dataset objects so that only their constructor parameters are stored (a minimal sketch of the idea follows this list)
    • This helper is now applied to more datasets
  • A de-duplication utility has been added to prevent the in-memory duplication of dataset objects
  • Greatly improved the performance of LazyIndices by switching to a NumPy-based storage of indices, which also greatly reduces the time taken to save and load checkpoints (see the comparison below)
    • Maximum depth before eagerifying reduced to 2
    • Added an additional unit test for LazyRange
  • Loading a checkpoint under PyTorch 1.13.* is now handled by a patched version of torch.load(...), which is bugged in that release series
  • Re-introduced the device re-mapping functionality
  • Checkpointing has been moved to a separate module to prevent import issues
    • Adapted the RTD documentation
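To illustrate the registration idea, here is a minimal sketch in plain Python. The helper name, decorator form, and the `BigDataset` class are illustrative assumptions, not the actual Avalanche API: a dataset registered this way pickles as its constructor call rather than its loaded contents.

```python
import pickle


def register_constructor_based_pickling(cls):
    # Hypothetical helper (not the real Avalanche API): make `cls` pickle
    # as (class, constructor arguments) instead of its full state.
    original_init = cls.__init__

    def __init__(self, *args, **kwargs):
        # Remember the constructor arguments at creation time.
        self._ctor_args = (args, kwargs)
        original_init(self, *args, **kwargs)

    def __reduce__(self):
        args, kwargs = self._ctor_args
        return (_rebuild_from_ctor, (type(self), args, kwargs))

    cls.__init__ = __init__
    cls.__reduce__ = __reduce__
    return cls


def _rebuild_from_ctor(cls, args, kwargs):
    # On checkpoint load, re-run the constructor: the dataset is re-read
    # from disk instead of being embedded in the checkpoint file.
    return cls(*args, **kwargs)


@register_constructor_based_pickling
class BigDataset:
    def __init__(self, root):
        self.root = root
        # Stand-in for loading large tensors from `root`: this buffer is
        # rebuilt on load, so it never ends up inside the pickle.
        self.data = [0.0] * 1_000_000


blob = pickle.dumps(BigDataset("/datasets/imagenet"))
print(len(blob))               # a few hundred bytes, not megabytes
restored = pickle.loads(blob)  # constructor re-runs transparently
```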
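And to see why the NumPy switch in LazyIndices pays off, here is a self-contained comparison (plain Python and NumPy, no Avalanche imports) of one million indices held as a Python list versus a single int64 array:

```python
import pickle
import sys
import time

import numpy as np

n = 1_000_000
as_list = list(range(n))                 # boxed int objects + pointers
as_array = np.arange(n, dtype=np.int64)  # one contiguous 8-bytes-per-index buffer

# In-memory footprint: the list pays for its pointer table plus one
# Python int object per index; the array pays 8 bytes per index.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(i) for i in as_list)
print(f"list:  {list_bytes / 1e6:.1f} MB")
print(f"array: {as_array.nbytes / 1e6:.1f} MB")

# Checkpoint save time: pickling the array serializes one contiguous
# buffer, while pickling the list walks a million objects.
for obj in (as_list, as_array):
    t0 = time.perf_counter()
    pickle.dumps(obj)
    print(type(obj).__name__, f"{time.perf_counter() - t0:.3f}s")
```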

Additional elements:

  • Fixes to DER memory occupation and checkpoint size:
    • FlatData can now discard (remove the reference to) unused elements; this must be enabled via a specific constructor parameter (see the sketch after this list)
      • Useful when storing large amounts of data as dataset data attributes (such as in DER, where logits are saved for each training exemplar)
    • Adapted the DER strategy to use this functionality
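A minimal sketch of the discard mechanism. The real FlatData lives in Avalanche's benchmark utilities; the `discard_unused` flag name here is an assumption, since the PR only says the behaviour is opt-in via a constructor parameter:

```python
class FlatDataSketch:
    """Illustrative only, not the actual FlatData implementation."""

    def __init__(self, data, indices=None, discard_unused=False):
        if indices is None:
            indices = list(range(len(data)))
        if discard_unused:
            # Drop elements not reachable through `indices` and remap, so
            # unused entries (e.g. logits of exemplars that left the DER
            # buffer) can be garbage-collected instead of being kept in
            # RAM and written into every checkpoint.
            keep = sorted(set(indices))
            remap = {old: new for new, old in enumerate(keep)}
            data = [data[i] for i in keep]
            indices = [remap[i] for i in indices]
        self._data = data
        self._indices = indices

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        return self._data[self._indices[i]]


# Only elements 1 and 3 survive; 0, 2, and 4 become collectable.
fd = FlatDataSketch(["a", "b", "c", "d", "e"], indices=[3, 1, 3],
                    discard_unused=True)
print(len(fd), fd[0], fd[1])  # 3 d b
```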

Misc:

  • Added the AvalancheImageNet dataset, a clone of the torchvision ImageNet dataset that supports an additional meta_root parameter. Useful when working on certain HPC clusters (a usage sketch follows).
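A hedged usage sketch: the PR states only the class name and the extra meta_root parameter; the import path and the remaining arguments (which mirror the torchvision ImageNet signature) are assumptions:

```python
from avalanche.benchmarks.datasets import AvalancheImageNet  # assumed path

# Typical HPC setup: the ImageNet images sit on a read-only shared
# filesystem, while the devkit metadata must live somewhere writable.
# meta_root lets the two locations differ.
train_set = AvalancheImageNet(
    root="/readonly_share/ILSVRC2012",          # images (read-only)
    split="train",
    meta_root="/scratch/myuser/imagenet_meta",  # metadata (writable)
)
```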

@AntonioCarta (Collaborator) commented:

Thanks, all of these changes are great. Unfortunately, I messed up the CI by trying to add support for Python 3.11. Let's wait until I fix it before merging this PR.

@coveralls commented:

Pull Request Test Coverage Report for Build 7653925810

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.4%) to 73.486%

Totals:
  • Change from base Build 7653912528: -0.4%
  • Covered Lines: 19049
  • Relevant Lines: 25922

💛 - Coveralls

@AntonioCarta merged commit 4776519 into ContinualAI:master on Jan 25, 2024
11 of 12 checks passed
@lrzpellegrini deleted the checkpoint_improvements branch on January 25, 2024 at 13:07