
Support TencentPretrain #57

Open · feifeibear opened this issue Aug 26, 2021 · 5 comments

Labels: documentation (Improvements or additions to documentation)

feifeibear (Collaborator) commented Aug 26, 2021

TencentPretrain is a repo from the TEG Data Security Center; we can reuse its model structures and data:
https://git.woa.com/TencentNLP/TencentPretrain/merge_requests/61
TencentPretrain also has a community open-source counterpart:
https://github.com/dbiir/UER-py

feifeibear (Collaborator, Author) commented Sep 2, 2021

I ran TencentPretrain's run_patrickstar.sh for 500 steps on a GeForce RTX 2060 and compared the logs.
PatrickStar
Worker is training ...
| 100/ 500 steps| 6164.26 tokens/s| loss 7.15| acc: 0.045
| 200/ 500 steps| 6226.79 tokens/s| loss 6.30| acc: 0.060
| 300/ 500 steps| 6208.92 tokens/s| loss 6.17| acc: 0.077
| 400/ 500 steps| 6232.11 tokens/s| loss 5.97| acc: 0.097

PyTorch
| 100/ 500 steps| 24822.88 tokens/s| loss 7.11| acc: 0.043
| 200/ 500 steps| 24331.83 tokens/s| loss 6.25| acc: 0.063
| 300/ 500 steps| 24246.47 tokens/s| loss 6.10| acc: 0.080
| 400/ 500 steps| 24210.41 tokens/s| loss 5.92| acc: 0.094
| 500/ 500 steps| 23966.24 tokens/s| loss 5.87| acc: 0.105

Accuracy looks very similar. Throughput is lower, but that is likely because the model is so small that PatrickStar's overhead dominates.
PatrickStar can raise the batch size to 128, reaching 39072.26 tokens/s.
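As a sanity check on throughput figures like these, tokens/s is just batch_size × seq_length / step_time. A minimal sketch (the function name is illustrative, and the sequence length of 1024 is an assumption, not taken from the log):

```python
def tokens_per_second(batch_size: int, seq_length: int, step_time_s: float) -> float:
    """Tokens processed per second for one training step."""
    return batch_size * seq_length / step_time_s

# Working backwards: batch 128 at an assumed seq length of 1024
# reaching ~39072 tokens/s implies roughly 3.35 s per step.
step_time = 128 * 1024 / 39072.26
```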

feifeibear added the documentation (Improvements or additions to documentation) label on Sep 2, 2021
(Two comments by @feifeibear have been minimized.)

feifeibear (Collaborator, Author) commented Sep 6, 2021

An annoying problem: someone may write code like this, and PatrickStar cannot distinguish the case where a weight tensor is shared by two params.
https://git.woa.com/TencentNLP/TencentPretrain/blob/master/tencentpretrain/models/model.py#L21

For tied weights, i.e., the first-layer embedding weight sharing its parameters with the last-layer linear weight, the current problems are:

  1. use_cpu_embedding conflicts with tied weights: in the first layer the embedding weight is treated as a torch param and nn.Embedding is computed on the CPU, but in the last layer the same weight needs to be computed on the GPU, and pre_forward_hook currently cannot handle this correctly.
  2. When PreprocessCtx builds the model, the chunk-tensor-index contains a useless tensor (the one that should have been deleted after sharing).
  3. With use_cpu_embedding=False, convergence is incorrect. I am not sure the backward pass for shared parameters is implemented correctly.
    Bad-case reproduction:
    https://git.woa.com/jiaruifang/TencentPretrain/merge_requests/1
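The tied-weight pattern in question can be sketched as follows (a minimal toy model, not TencentPretrain's actual model.py): the embedding matrix and the output projection reference the same Parameter object, so a per-module hook that moves or frees the tensor for one module silently affects the other.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=100, hidden=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.output = nn.Linear(hidden, vocab_size, bias=False)
        # Tie weights: both modules now share one Parameter object.
        self.output.weight = self.embedding.weight

    def forward(self, ids):
        h = self.embedding(ids)   # first layer touches the shared tensor
        return self.output(h)     # last layer reuses the same tensor

model = TinyLM()
# Both attributes are literally the same object, so a chunk manager
# that tracks tensors by identity sees one tensor owned by two params.
assert model.output.weight is model.embedding.weight
```

With use_cpu_embedding, the first use of this tensor happens on CPU and the second on GPU, which is exactly the conflict described in point 1 above.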

zhuzilin (Collaborator) commented Sep 16, 2021

Environment

1xV100

Commands

python preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --target lm

python -m torch.distributed.launch --nproc_per_node=1 pretrain.py \
                    --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --output_model_path models/output_model.bin \
                    --config_path models/gpt2/config_patrickstar_v2.json --learning_rate 1e-4 \
                    --world_size 1 --gpu_ranks 0 \
                    --embedding word_pos --remove_embedding_layernorm \
                    --encoder transformer --mask causal --layernorm_positioning pre \
                    --target lm \
                    --total_steps 500 --batch_size 64 \
                    --fp16 --report_steps 100 \
                    --use_patrickstar

Configuration

{
  "emb_size": 768,
  "feedforward_size": 3072,
  "hidden_size": 768,
  "hidden_act": "gelu_fast",
  "heads_num": 4,
  "layers_num": 3,
  "max_seq_length": 1024,
  "dropout": 0.1,
  "embedding": "word_pos",
  "remove_embedding_layernorm": true,
  "encoder": "transformer",
  "mask": "causal",
  "layernorm_positioning": "pre",
  "target": "lm"
}
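For a sense of scale, here is a back-of-the-envelope parameter count for this config (ignoring layernorms and biases; the vocab size of 21128 for google_zh_vocab.txt is an assumption, not stated in the log):

```python
cfg = {"emb_size": 768, "feedforward_size": 3072, "hidden_size": 768,
       "heads_num": 4, "layers_num": 3, "max_seq_length": 1024}
vocab = 21128  # assumed size of google_zh_vocab.txt

h, ff, L = cfg["hidden_size"], cfg["feedforward_size"], cfg["layers_num"]
per_layer = 4 * h * h + 2 * h * ff                   # attention (Q,K,V,O) + MLP
embeddings = vocab * h + cfg["max_seq_length"] * h   # word + position tables
total = embeddings + L * per_layer
print(f"~{total / 1e6:.1f}M parameters")
```

With only 3 layers this is a deliberately tiny GPT-2 variant, which is consistent with the overhead-dominated throughput discussed above.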

Results:

  • patrickstar use_cpu_embedding = True
    | 100/ 500 steps| 21735.86 tokens/s| loss 6.90| acc: 0.056
    | 200/ 500 steps| 24045.79 tokens/s| loss 5.90| acc: 0.106
    | 300/ 500 steps| 24777.70 tokens/s| loss 5.49| acc: 0.146
    | 400/ 500 steps| 24675.35 tokens/s| loss 5.26| acc: 0.165
    | 500/ 500 steps| 22838.04 tokens/s| loss 5.09| acc: 0.176

  • patrickstar use_cpu_embedding = False
    | 100/ 500 steps| 49792.88 tokens/s| loss 6.90| acc: 0.056
    | 200/ 500 steps| 73055.65 tokens/s| loss 5.90| acc: 0.106
    | 300/ 500 steps| 72733.26 tokens/s| loss 5.49| acc: 0.146
    | 400/ 500 steps| 71993.03 tokens/s| loss 5.26| acc: 0.165
    | 500/ 500 steps| 59033.95 tokens/s| loss 5.09| acc: 0.176

  • apex O1
    | 100/ 500 steps| 61843.22 tokens/s| loss 6.87| acc: 0.054
    | 200/ 500 steps| 98121.80 tokens/s| loss 5.83| acc: 0.107
    | 300/ 500 steps| 98702.82 tokens/s| loss 5.38| acc: 0.152
    | 400/ 500 steps| 98349.93 tokens/s| loss 5.19| acc: 0.170
    | 500/ 500 steps| 75288.15 tokens/s| loss 5.10| acc: 0.177

  • apex O2
    | 100/ 500 steps| 77366.48 tokens/s| loss 6.87| acc: 0.054
    | 200/ 500 steps| 141294.21 tokens/s| loss 5.83| acc: 0.108
    | 300/ 500 steps| 140895.76 tokens/s| loss 5.37| acc: 0.152
    | 400/ 500 steps| 141854.47 tokens/s| loss 5.18| acc: 0.171
    | 500/ 500 steps| 98582.75 tokens/s| loss 5.10| acc: 0.177
