
Fail to convert trained checkpoint to HF format #12124

Open
Zhihan1996 opened this issue Feb 10, 2025 · 1 comment
Labels
bug Something isn't working

Comments


Zhihan1996 commented Feb 10, 2025

Describe the bug

I am pre-training a Mistral model from scratch with NeMo. I have a checkpoint saved by the trainer in the following format:

  • model
    • context
      • model.yaml
      • io.json
      • tokenizer
    • weights
      • common.pt
      • metadata.json
      • ____0_0.distcp
      • ____0_1.distcp
      • ____1_0.distcp
      • ....

I want to convert it to HuggingFace format using the official script:

from pathlib import Path

from nemo.collections.llm import export_ckpt

if __name__ == "__main__":
    export_ckpt(
        path=Path("/workspace/model"),          # NeMo checkpoint directory shown above
        target="hf",                            # export to HuggingFace format
        output_path=Path("/workspace/model_hf"),
    )

I get the following error. It seems like a bug in NeMo's official code, since the checkpoint was saved automatically by the trainer.

[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/export_ckpt.py", line 6, in <module>
[rank0]:     export_ckpt(
[rank0]:   File "/opt/NeMo/nemo/collections/llm/api.py", line 663, in export_ckpt
[rank0]:     output = io.export_ckpt(path, target, output_path, overwrite, load_connector)
[rank0]:   File "/opt/NeMo/nemo/lightning/io/api.py", line 229, in export_ckpt
[rank0]:     return exporter(overwrite=overwrite, output_path=_output_path)
[rank0]:   File "/opt/NeMo/nemo/lightning/io/connector.py", line 99, in __call__
[rank0]:     to_return = self.apply(_output_path)
[rank0]:   File "/opt/NeMo/nemo/collections/llm/gpt/model/mistral.py", line 202, in apply
[rank0]:     target = self.convert_state(source, target)
[rank0]:   File "/opt/NeMo/nemo/collections/llm/gpt/model/mistral.py", line 220, in convert_state
[rank0]:     return io.apply_transforms(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/NeMo/nemo/lightning/io/state.py", line 180, in apply_transforms
[rank0]:     assert target_orig_dtypes == extract_dtypes(_target.named_parameters()), (
[rank0]: AssertionError: dtype mismatch between source and target state dicts. Left side is {}, Right side is {'model.embed_tokens.weight': torch.float32, 'model.layers.0.self_attn.q_proj.weight': torch.float32, 'model.layers.0.self_attn.k_proj.weight': torch.float32, 'model.layers.0.self_attn.v_proj.weight': torch.float32, 'model.layers.0.self_attn.o_proj.weight': torch.float32, .......
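For context, the failing assertion in nemo/lightning/io/state.py compares the target model's parameter dtypes recorded before the transforms with the dtypes found afterwards; here the recorded left-hand side is an empty dict, so the comparison can never match. A minimal sketch of that style of check (the helper below mirrors the traceback's `extract_dtypes` name but is illustrative, not NeMo's exact implementation):

```python
import torch
import torch.nn as nn

def extract_dtypes(named_parameters):
    # Map each parameter name to its dtype, as the traceback's helper appears to do.
    return {name: p.dtype for name, p in named_parameters}

target = nn.Linear(4, 4)  # stand-in for the HF target model
target_orig_dtypes = extract_dtypes(target.named_parameters())

# ... state-dict transforms would run here. If the snapshot above had been
# taken from an exhausted or empty parameter iterator (yielding {}), this
# post-transform check would fail exactly as in the report:
assert target_orig_dtypes == extract_dtypes(target.named_parameters()), (
    "dtype mismatch between source and target state dicts"
)
```

This only illustrates the shape of the check; the empty left side in the error suggests the dtype snapshot of the target was taken when no parameters were yielded, not that the weights themselves are missing.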

Expected behavior

The trained model is exported to HF format without errors.

Environment overview

Official NeMo container (nvcr.io/nvidia/nemo:24.12).


aflah02 commented Feb 13, 2025

+1 I am facing this as well
