Implement huggingface checkpoint loading and export #1305
Open
lshpku wants to merge 1 commit into PaddlePaddle:develop from
Thanks for your contribution!
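Implement correct Huggingface checkpoint loading and export
Because this model's expert_id is local, from_pretrained previously could not load the expert weights correctly. This PR rewrites the loading logic so that expert_id is now mapped correctly. It works for both the 21B and 300B models.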
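The description does not spell out how a local expert_id is mapped. As a minimal sketch of the general idea, assume each expert-parallel rank holds a contiguous block of experts and that weight names contain an "experts.<id>." segment. The key layout, the function name, and the ep_rank and num_local_experts parameters are all hypothetical and are not taken from this PR:

```python
import re

# Hypothetical key layout: "...experts.<id>.<param>"; real checkpoint key names may differ.
_EXPERT_ID = re.compile(r"(\.experts\.)(\d+)(\.)")


def to_global_expert_key(key: str, ep_rank: int, num_local_experts: int) -> str:
    """Rewrite a rank-local expert index in a weight name to a global index.

    Assumes rank r owns global experts [r * n, (r + 1) * n), where n is
    num_local_experts. Keys without an expert segment are returned unchanged.
    """

    def _shift(match: re.Match) -> str:
        local_id = int(match.group(2))
        if local_id >= num_local_experts:
            raise ValueError(f"local expert id {local_id} out of range")
        global_id = ep_rank * num_local_experts + local_id
        return f"{match.group(1)}{global_id}{match.group(3)}"

    return _EXPERT_ID.sub(_shift, key, count=1)
```

Under these assumptions, local expert 1 on rank 2 with 8 experts per rank becomes global expert 17, while non-expert weights keep their names.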
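Overview
Checkpoints currently come in two formats. The first is a training-only format that also saves the optimizer state, which makes it convenient to resume training from a checkpoint. The second is the unified format, which saves only the model itself and can be used for both training and inference. With the unified format the loss does not continue from where it left off, so it is generally used only when exporting for inference after training has finished.
Because the checkpoint formats differ, initial training, resuming from a checkpoint, and export each require a different configuration, as shown below:
Initial training
When starting training for the first time, set save_to_hf to false under trainer_args.
(If the parameter does not exist, add it; if it does, modify it. The same applies below.)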
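The "add it if missing, otherwise modify it" rule can be sketched on a config that has already been parsed into a Python dict. This assumes trainer_args is a top-level mapping in the YAML. The helper name is hypothetical, and how the YAML file is read and written is left out:

```python
def set_trainer_arg(config: dict, key: str, value) -> dict:
    """Set config["trainer_args"][key], creating the entry if it is missing.

    Assumes trainer_args is a top-level mapping; the config is modified in
    place and also returned for convenience.
    """
    trainer_args = config.setdefault("trainer_args", {})
    trainer_args[key] = value  # overwrites an existing value, adds a new one otherwise
    return config


# Example: first-time training, as described above.
first_run = set_trainer_arg({"trainer_args": {"save_to_hf": True}}, "save_to_hf", False)
```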
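Resuming from a checkpoint
Suppose a checkpoint was saved at training step 100 and the run was then stopped. To continue training from step 100, specify the following under trainer_args:
Final export
Suppose you want to export the checkpoint from training step 500. Specify the following under trainer_args:
(This is equivalent to pretending to resume training from step 500 without running any steps or updating any weights, and saving directly in the unified format.)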
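Warning:
The trainer_args: from_scratch: 0/1 parameter must stay the same throughout. If you used from_scratch: 0 for initial training, you must also use from_scratch: 0 when resuming and for the final export; you cannot change it to 1, and vice versa. Otherwise the loss after loading the weights will be very high!
Running pre-training
1. Download the weights of the corresponding model, taking ERNIE-4.5-300B-A47B-Base-Paddle as an example.
2. The config.json in the downloaded weights is written for inference, and some of its parameters even cause errors, so replace the original config.json with the training-specific model_configs/ERNIE-4p5-300B-A47B/model_config.json from this repository.
3. In the model's yaml, modify the following 2 parameters
4. Launch with scripts/ERNIE-4p5-300B-A47B/train_96_gpus.sh.
Environment recommendations
"vocab_size": 103424: the value in this repository may differ from the model's; the model's value takes precedence.
Correctness verification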
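One simple sanity check related to the vocab_size note above is to build the training config so that it keeps the vocab_size from the downloaded model. The sketch below is not part of this PR. The function name and the way the two paths are passed in are assumptions:

```python
import json
from pathlib import Path


def training_config_with_model_vocab(training_config_path, model_config_path):
    """Return the training config as a dict, with vocab_size taken from the model.

    training_config_path: the repository's training model_config.json
    model_config_path:    the config.json shipped with the downloaded weights
    """
    training_cfg = json.loads(Path(training_config_path).read_text(encoding="utf-8"))
    model_cfg = json.loads(Path(model_config_path).read_text(encoding="utf-8"))
    # The model's value takes precedence when the two differ.
    if "vocab_size" in model_cfg:
        training_cfg["vocab_size"] = model_cfg["vocab_size"]
    return training_cfg
```

The downloaded config.json has to be read before it is overwritten, so it helps to keep a backup copy of it when swapping in the training config.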