Fix Non-Contiguous Tensor Issue in Checkpoint Consolidation #708
What does this PR do?
This PR fixes an issue in the `consolidate_tensor_parallel_checkpoints` function where non-contiguous tensors caused a `ValueError` when saving consolidated model checkpoints, particularly with the Safetensors format. The fix ensures that tensors are made contiguous before and after concatenation, resolving memory-layout issues during saving.

Error Example:
When running the following command:
The following error occurs:
ValueError: You are trying to save a non contiguous tensor: model.layers.0.self_attn.o_proj.weight which is not allowed.
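The error arises because operations such as transposing or slicing return views over the same storage, whose memory layout is no longer row-major contiguous, and Safetensors refuses to serialize such tensors. A minimal PyTorch-only illustration of the underlying condition (not the actual consolidation code):

```python
import torch

# A transpose returns a view over the same storage, not a copy,
# so its memory layout is no longer row-major contiguous.
weight = torch.randn(4, 8)
transposed = weight.t()

print(weight.is_contiguous())      # True
print(transposed.is_contiguous())  # False

# .contiguous() copies the data into a fresh row-major buffer,
# which can then be serialized without error.
fixed = transposed.contiguous()
print(fixed.is_contiguous())       # True
```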
This PR solves the issue by applying the `.contiguous()` method to tensors at key points in the consolidation process, ensuring the consolidated model can be saved without errors.

Motivation and Context:
When consolidating distributed model checkpoints (e.g., from tensor-parallel shards), tensors may not be contiguous in memory. Safetensors, a memory-efficient format for storing models, does not allow saving non-contiguous tensors. This PR modifies the `consolidate_tensor_parallel_checkpoints` function to apply `.contiguous()` to tensors so that the checkpoints can be properly consolidated and saved.

This change also affects the overall consolidation pipeline, since `consolidate_model_parallel_checkpoints_to_unified_checkpoint` depends on the lower-level functions to process tensors correctly.
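A minimal sketch of the fix's pattern, assuming shards are concatenated along a parallel dimension (the function and variable names below are illustrative, not the actual Optimum Neuron implementation):

```python
import torch

def consolidate_shards(shards, dim=0):
    """Illustrative sketch: concatenate tensor-parallel shards and
    force a contiguous layout so the result is safe to serialize."""
    # Make each shard contiguous before concatenation (a shard may be
    # a transposed or sliced view), and the result contiguous after.
    return torch.cat([s.contiguous() for s in shards], dim=dim).contiguous()

# Example: two column-parallel shards that are non-contiguous views.
full = torch.randn(8, 8)
shards = [full[:, :4].t(), full[:, 4:].t()]  # transposed slices -> views
merged = consolidate_shards(shards, dim=0)
print(merged.is_contiguous())  # True
```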