Describe the bug
HFGemmaExporter has a wrong mapping for "decoder.layers.*.mlp.linear_fc1.layer_norm_weight": it maps to "model.layers.*.post_attention_layernorm.weight" instead of "model.layers.*.pre_feedforward_layernorm.weight". As a result, the pre-feedforward layer norm is not transferred (it is left at 0), which causes significantly worse performance for converted HF models.
Steps/Code to reproduce bug
Run a Gemma 2 NeMo 2.0 export to HF (the export call I use is sketched after the code below). To show the problem explicitly, I modified the apply function of the exporter to the following:
def apply(self, output_path: Path) -> Path:
    target = self.init()
    source, _ = self.nemo_load(str(self))
    print("Source state (Post FC 2 layernorm) dict:",
          [value for key, value in source.module.state_dict().items()
           if key.endswith("mlp.linear_fc2.post_layernorm.weight")])
    print("Source state (FC 1 layernorm) dict:",
          [value for key, value in source.module.state_dict().items()
           if key.endswith("mlp.linear_fc1.layer_norm_weight")])
    target = self.convert_state(source, target)
    print("Target state (Post feedforward layernorm) dict after conversion:",
          [value for key, value in target.state_dict().items()
           if key.endswith(".post_feedforward_layernorm.weight")])
    print("Target state (Pre feedforward layernorm) dict after conversion:",
          [value for key, value in target.state_dict().items()
           if key.endswith(".pre_feedforward_layernorm.weight")])
    target = target.cpu()
    target.save_pretrained(output_path)
    self.tokenizer.save_pretrained(output_path)
    return output_path
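As for the export call itself, here is a minimal sketch, assuming the NeMo 2.0 llm.export_ckpt entry point with the "hf" target; the checkpoint paths are placeholders and the exact keyword arguments may differ between NeMo versions:

from pathlib import Path

from nemo.collections import llm

# Export a NeMo 2.0 Gemma 2 checkpoint to Hugging Face format.
# Both paths are placeholders for the real checkpoint locations.
llm.export_ckpt(
    path=Path("/checkpoints/gemma2_9b_nemo2"),
    target="hf",
    output_path=Path("/checkpoints/gemma2_9b_hf"),
)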
The output of the modified apply function above (for my continually pretrained version of Gemma 2 9B) is:
Source state (Post FC 2 layernorm) dict: [tensor([-0.4316, 0.5078, -0.2988, ..., -0.4766, -0.3887, -0.4570]), ...
Source state (FC 1 layernorm) dict: [tensor([0.2451, 5.3438, 0.1875, ..., 0.2676, 0.2598, 0.4219]), ...
Target state (Post feedforward layernorm) dict after conversion: [tensor([-0.4316, 0.5078, -0.2988, ..., -0.4766, -0.3887, -0.4570]), ...
Target state (Pre feedforward layernorm) dict after conversion: [tensor([0., 0., 0., ..., 0., 0., 0.]), ...
Note the zeros in the target pre-feedforward layer norm.
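The same can be checked directly on the exported checkpoint, independently of the exporter internals. A minimal sketch using the Hugging Face transformers API; the checkpoint path is a placeholder:

import torch
from transformers import AutoModelForCausalLM

# Load the HF checkpoint written by the exporter (path is a placeholder).
model = AutoModelForCausalLM.from_pretrained(
    "/checkpoints/gemma2_9b_hf", torch_dtype=torch.bfloat16
)

for name, param in model.named_parameters():
    if name.endswith("pre_feedforward_layernorm.weight"):
        # With the buggy mapping these weights keep their zero initialization,
        # because convert_state never writes to them.
        print(name, "all zeros:", bool(torch.all(param == 0)))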
Expected behavior
All model weights are converted correctly and the exported HF model performs the same as the NeMo 2.0 model.
Environment overview (please complete the following information)
I am using the official NeMo container, version 24.12. The issue is still present on the main branch.
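For reference, a possible fix is to change only the target key of the affected entry in HFGemmaExporter's convert_state mapping. A minimal sketch; the dict name and the surrounding entries are illustrative, only the affected entry is meant literally:

mapping = {
    # ... other entries unchanged ...
    # Current (buggy) entry:
    # "decoder.layers.*.mlp.linear_fc1.layer_norm_weight": "model.layers.*.post_attention_layernorm.weight",
    # Corrected entry: the FC1 layer norm corresponds to HF Gemma 2's pre-feedforward layer norm.
    "decoder.layers.*.mlp.linear_fc1.layer_norm_weight": "model.layers.*.pre_feedforward_layernorm.weight",
    # ... other entries unchanged ...
}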