
Fail to convert trained checkpoint to HF format #12124

Open
Zhihan1996 opened this issue Feb 10, 2025 · 1 comment
Labels
bug Something isn't working

Comments


Zhihan1996 commented Feb 10, 2025

Describe the bug

I am pre-training a Mistral model from scratch with NeMo. I have a checkpoint saved by the trainer in the following format:

  • model
    • context
      • model.yaml
      • io.json
      • tokenizer
    • weights
      • common.pt
      • metadata.json
      • ____0_0.distcp
      • ____0_1.distcp
      • ____1_0.distcp
      • ....

I want to convert it to HuggingFace format using the official script:

from pathlib import Path

from nemo.collections.llm import export_ckpt

if __name__ == "__main__":
    export_ckpt(
        path=Path("/workspace/model"),          # NeMo checkpoint directory shown above
        target="hf",                            # export to HuggingFace format
        output_path=Path("/workspace/model_hf"),
    )

I get the following error. It seems like a bug in NeMo's official code, since the checkpoint was saved automatically by the trainer.

[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/export_ckpt.py", line 6, in <module>
[rank0]:     export_ckpt(
[rank0]:   File "/opt/NeMo/nemo/collections/llm/api.py", line 663, in export_ckpt
[rank0]:     output = io.export_ckpt(path, target, output_path, overwrite, load_connector)
[rank0]:   File "/opt/NeMo/nemo/lightning/io/api.py", line 229, in export_ckpt
[rank0]:     return exporter(overwrite=overwrite, output_path=_output_path)
[rank0]:   File "/opt/NeMo/nemo/lightning/io/connector.py", line 99, in __call__
[rank0]:     to_return = self.apply(_output_path)
[rank0]:   File "/opt/NeMo/nemo/collections/llm/gpt/model/mistral.py", line 202, in apply
[rank0]:     target = self.convert_state(source, target)
[rank0]:   File "/opt/NeMo/nemo/collections/llm/gpt/model/mistral.py", line 220, in convert_state
[rank0]:     return io.apply_transforms(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/NeMo/nemo/lightning/io/state.py", line 180, in apply_transforms
[rank0]:     assert target_orig_dtypes == extract_dtypes(_target.named_parameters()), (
[rank0]: AssertionError: dtype mismatch between source and target state dicts. Left side is {}, Right side is {'model.embed_tokens.weight': torch.float32, 'model.layers.0.self_attn.q_proj.weight': torch.float32, 'model.layers.0.self_attn.k_proj.weight': torch.float32, 'model.layers.0.self_attn.v_proj.weight': torch.float32, 'model.layers.0.self_attn.o_proj.weight': torch.float32, .......
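For context, the failing assertion in nemo/lightning/io/state.py compares the target model's parameter dtypes recorded before the transforms with the dtypes found afterwards; here the recorded left-hand side is an empty dict, so the comparison can never match. A minimal sketch of that style of check (the helper below mirrors the traceback's `extract_dtypes` name but is illustrative, not NeMo's exact implementation):

```python
import torch
import torch.nn as nn

def extract_dtypes(named_parameters):
    # Map each parameter name to its dtype, as the traceback's helper appears to do.
    return {name: p.dtype for name, p in named_parameters}

target = nn.Linear(4, 4)  # stand-in for the HF target model
target_orig_dtypes = extract_dtypes(target.named_parameters())

# ... state-dict transforms would run here. If the snapshot above had been
# taken from an exhausted or empty parameter iterator (yielding {}), this
# post-transform check would fail exactly as in the report:
assert target_orig_dtypes == extract_dtypes(target.named_parameters()), (
    "dtype mismatch between source and target state dicts"
)
```

This only illustrates the shape of the check; the empty left side in the error suggests the dtype snapshot of the target was taken when no parameters were yielded, not that the weights themselves are missing.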

Expected behavior

The trained model is exported to HF format without errors.

Environment overview

Official NeMo container (nvcr.io/nvidia/nemo:24.12).


aflah02 commented Feb 13, 2025

+1 I am facing this as well
