Pickling error when trying to save checkpoints with custom checkpointIO #11955
Comments
Any help with this issue would be greatly appreciated. It should be very easy to recreate with the given script.
@jdnurme With the Megatron strategy, distributed checkpointing via megatron-core is used, as described here: https://docs.nvidia.com/nemo-framework/user-guide/24.09/nemotoolkit/checkpoints/dist_ckpt.html. The pickling errors arise because these functions cannot be serialized by
Since megatron-core dist checkpointing is built on top of PyTorch's distributed checkpoint, is your remote storage compatible with the StorageWriter and StorageReader abstractions laid out here: https://pytorch.org/docs/stable/distributed.checkpoint.html? That may be an easier approach to discuss integration than needing to handle all of the ShardedTensor logic introduced for distributed checkpointing.
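For context, a minimal sketch (not from the thread) of how the StorageWriter/StorageReader abstractions plug into torch.distributed.checkpoint. FileSystemWriter/FileSystemReader and the /tmp/ckpt path are stand-ins; a remote object store would subclass StorageWriter and StorageReader instead.

```python
# Minimal sketch, assuming PyTorch >= 2.2 with torch.distributed.checkpoint.
# The filesystem implementations below only illustrate how the abstractions
# plug into save/load; a remote backend would provide its own subclasses.
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemReader, FileSystemWriter

state_dict = {"model": torch.nn.Linear(4, 4).state_dict()}

# Save through a StorageWriter implementation (illustrative local path).
dcp.save(state_dict, storage_writer=FileSystemWriter("/tmp/ckpt"))

# Load back in place through the matching StorageReader.
dcp.load(state_dict, storage_reader=FileSystemReader("/tmp/ckpt"))
```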
@ananthsub Thank you for the reply. Yes, the remote checkpoint code we intend to integrate with does support the StorageReader/StorageWriter interface. Can you elaborate on how we might be able to pass that checkpoint mechanism into a NeMo pretrain recipe like the one above?
@jdnurme We are working on integrating NVIDIA's multi-storage client into the checkpointing flow to support object stores like GCS: https://github.com/NVIDIA/multi-storage-client. Would this meet your needs if it's available?
Yes, an integration like that would be super helpful. @ananthsub, do you have a sense of when this feature will be ready?
Describe the bug
When providing a custom checkpoint_io to my strategy during NeMo training, the torch.save call fails with a pickling error.
I'm using a minimal custom Lightning CheckpointIO implementation that has been wrapped with the @run.autoconvert decorator. The goal is to eventually augment this implementation to save checkpoints to a remote datastore; for now, the code simply saves to a separate disk location using torch.save(). Training succeeds, but during checkpoint saving the torch.save call fails with a pickling error.
Steps/Code to reproduce bug
Run the following code with:
python filename.py
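The original reproduction script is not preserved in this excerpt. Below is a minimal sketch of the setup described above, under stated assumptions: the CustomCheckpointIO class, the /tmp/custom_ckpts directory, the llm.llama3_8b.pretrain_recipe recipe, and attaching the plugin via the strategy's checkpoint_io field are all illustrative and may differ from the actual script.

```python
# Minimal sketch, not the reporter's original script. Assumes NeMo 2.0 with
# NeMo-Run installed; CustomCheckpointIO, /tmp/custom_ckpts, and the way the
# plugin is attached to the strategy are illustrative assumptions.
import os

import torch
import nemo_run as run
from lightning.pytorch.plugins.io import CheckpointIO


class CustomCheckpointIO(CheckpointIO):
    """Toy CheckpointIO that writes checkpoints to an alternate disk location."""

    def __init__(self, save_dir: str = "/tmp/custom_ckpts"):
        self.save_dir = save_dir
        os.makedirs(save_dir, exist_ok=True)

    def save_checkpoint(self, checkpoint, path, storage_options=None):
        # Flatten the requested path into a filename under save_dir.
        torch.save(checkpoint, os.path.join(self.save_dir, str(path).replace("/", "_")))

    def load_checkpoint(self, path, map_location=None):
        return torch.load(path, map_location=map_location)

    def remove_checkpoint(self, path):
        pass


@run.autoconvert
def custom_checkpoint_io() -> CheckpointIO:
    # Wrapped with run.autoconvert so it can be embedded in a NeMo-Run config.
    return CustomCheckpointIO()


if __name__ == "__main__":
    from nemo.collections import llm

    recipe = llm.llama3_8b.pretrain_recipe(num_nodes=1, num_gpus_per_node=8)
    # Attaching the plugin to the strategy is where the pickling error is
    # reported to surface at checkpoint-save time (field name is an assumption).
    recipe.trainer.strategy.checkpoint_io = custom_checkpoint_io()
    run.run(recipe, executor=run.LocalExecutor())
```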
Expected behavior
I would expect this code to execute fully, with checkpoints saved to the custom location specified in the CheckpointIO implementation.
Environment overview (please complete the following information)
Environment details
If an NVIDIA Docker image is used, you don't need to specify these.
Additional context
If an alternate methodology is better suited for custom checkpointing with NeMo 2, please advise. The hope is that existing Lightning CheckpointIO integrations would be able to plug into NeMo directly.