Is it possible to convert a model trained using ZeRO-3 and MP=8 to a universal checkpoint?
Tracing through the universal checkpointing conversion tool (ds_to_universal), the model states remain unmerged, with 8 model-parallel shards per data-parallel rank. E.g., with world_size = 2048, there are 2048 model state files, zero_pp_rank_{0-255}_{0-7}, both before and after the conversion.
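To make the shard count concrete, here is a minimal sketch of the layout I am describing. It assumes the data-parallel size is simply world_size / MP (no pipeline parallelism), and the helper name `expected_model_state_shards` is hypothetical. The generated names follow the shorthand used in this issue, not necessarily the exact on-disk filenames.

```python
def expected_model_state_shards(world_size: int, mp_size: int) -> list[str]:
    """Enumerate model state shard names in the issue's shorthand.

    Assumption: data-parallel size = world_size // mp_size, i.e. no
    pipeline parallelism. Names are illustrative shorthand only.
    """
    if world_size % mp_size != 0:
        raise ValueError("world_size must be divisible by mp_size")
    dp_size = world_size // mp_size
    # One shard per (data-parallel rank, model-parallel rank) pair.
    return [
        f"zero_pp_rank_{dp}_{mp}"
        for dp in range(dp_size)
        for mp in range(mp_size)
    ]


shards = expected_model_state_shards(world_size=2048, mp_size=8)
print(len(shards), shards[0], shards[-1])
# prints: 2048 zero_pp_rank_0_0 zero_pp_rank_255_7
```

A fully merged checkpoint would collapse all of these (256 ZeRO-3 partitions times 8 model-parallel slices) into one file, which is what I do not see happening.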
When converting a model with ZeRO <= 2, MP > 1, the model state files are merged into a single file through merge_tp_slices.
If this is not possible, how would one extract and merge only the ZeRO-3 / MP checkpointed model states (along both the ZeRO-3 and model-parallel partitions) into a single file?
The zero_to_fp32 script does not work since it only handles ZeRO-{2,3} without model parallelism.