
Optimize inference performance of ERNIE on P40 GPU #165

Open
Xreki opened this issue Aug 14, 2019 · 14 comments

Xreki commented Aug 14, 2019

Owners

@Xreki @zhaoyuchen2018

Initial performance

  • Test date: August 14, 2019
  • Tester: @Xreki
  • GPU platform: Tesla P40
  • Software:
    • Driver version: 418.39
    • CUDA 9.0
    • cuDNN 7.5
  • Paddle commit:
commit 744279fe685dd0b8b426a686d84ad449da02366e
Author: Kevin <[email protected]>
Date:   Mon Aug 12 10:13:12 2019 +0800

    Refine embedding Api doc (#18820)
  • Test code: Add inference benchmark of ernie #164
  • Docker image used to build Paddle: paddlepaddle/paddle_manylinux_devel:cuda9.0_cudnn7
  • Docker image used to build and run the test program: paddlepaddle/paddle:latest-gpu-cuda9.0-cudnn7-dev
  • Test results:
    • GPU ratio: 96%
    • Runtime: 8.3554 ms/sample

NVIDIA's BERT inference solution, Faster Transformer, has been open-sourced.


Xreki commented Aug 14, 2019

Profiling results

------------------------->     Profiling Report     <-------------------------

Place: All
Time unit: ms
Sorted by total time in descending order in the same thread

Event                               Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::mul                        370666      32332.5     12646.616698 (0.391142) 19685.930916 (0.608858) 0.026226    95.3523     0.0872283   0.341415
thread0::elementwise_add            561008      16561.7     14064.876774 (0.849241) 2496.820884 (0.150759)  0.017456    9.10075     0.0295213   0.174883
thread0::softmax                    60108       12311.9     4285.399399 (0.348069)  8026.520592 (0.651931)  0.18358     6.97239     0.20483     0.130007
thread0::matmul                     125225      7621.9      4999.121478 (0.655889)  2622.776986 (0.344111)  0.026823    6.93527     0.0608656   0.0804832
thread0::layer_norm                 125225      6890.62     3494.377954 (0.507121)  3396.237224 (0.492879)  0.046834    6.54613     0.0550259   0.0727612
thread0::transpose2                 240432      6706.17     4940.736832 (0.736745)  1765.436248 (0.263255)  0.022342    6.87583     0.0278922   0.0708136
thread0::dropout                    190342      4844.09     4380.131806 (0.904222)  463.956297 (0.095778)   0.019757    1.91613     0.0254494   0.051151
thread0::reshape2                   240432      1970.44     1970.427168 (0.999993)  0.013691 (0.000007)     0.005761    0.758181    0.00819542  0.0208068
thread0::scale                      70126       1559.56     1413.281218 (0.906202)  146.283474 (0.093798)   0.017739    6.40694     0.0222395   0.0164682
thread0::relu                       60108       1533.83     1155.052049 (0.753053)  378.773663 (0.246947)   0.020695    1.91584     0.0255178   0.0161964
thread0::stack                      5009        670.06      620.412429 (0.925906)   49.647450 (0.074094)    0.104312    3.43668     0.133771    0.00707548
thread0::lookup_table               15027       499.303     409.325315 (0.819793)   89.978163 (0.180207)    0.025125    0.087498    0.0332271   0.00527238
thread0::fetch                      5009        260.127     260.068816 (0.999777)   0.057970 (0.000223)     0.044476    1.01272     0.0519319   0.0027468
thread0::TensorCopy:GPU->CPU        5009        210.123     210.076980 (0.999782)   0.045869 (0.000218)     0.036519    0.996315    0.0419491   0.00221879
thread0::GpuMemcpyAsync:CPU->GPU    20036       204.476     173.512282 (0.848571)   30.963505 (0.151429)    0.007       1.77937     0.0102054   0.00215916
thread0::GpuMemcpySync:GPU->CPU     5009        203.349     194.875273 (0.958327)   8.474119 (0.041673)     0.035868    0.330009    0.0405968   0.00214726
thread0::slice                      5009        153.525     136.187490 (0.887070)   17.337539 (0.112930)    0.025069    1.98001     0.0306498   0.00162114
thread0::tanh                       5009        110.729     100.944808 (0.911638)   9.784189 (0.088362)     0.017612    1.92101     0.022106    0.00116924
thread0::feed                       20036       57.2692     57.263221 (0.999896)    0.005970 (0.000104)     0.00156     0.032812    0.00285831  0.000604732


Xreki commented Aug 14, 2019

Timeline analysis

  • Overall, the GPU is kept fairly busy.
    image

  • There is quite a bit of GPU idle time at the beginning.
    image

  • The stack op contains cuMalloc and cuFree calls, as well as 2 cuda stream synchronizations.
    image

  • There are many reshape and transpose operations.

  • Does softmax use cuDNN?


Xreki commented Aug 14, 2019

Optimization plan

  • Fuse the multiple lookup_table operations, @Xreki, due August 30, 2019
    • One of the inputs is an id tensor like [00000,11111] (it only takes the values 0 and 1); some implementations replace this part with a matmul.
      image
  • Determine whether the reshape operations in the model are necessary and whether they can be removed; for many reshapes the input and output shapes look identical. @Xreki. See Optimize inference performance of ERNIE on P40 GPU #165 (comment).
    • They cannot simply be removed.

image

  • At inference time, dropout can be converted into scale, @Xreki, due August 22, 2019. See Add a pass to replace dropout_op with scale_op when is_test is true Paddle#19297.
  • Merge the multiple branches of the multi-head attention, @zhaoyuchen2018, due August 30, 2019 (see the numpy sketch after the images below).
    • From left to right there are 3 branches:
      • 1. Transpose(Reshape(X*W0+b0))
      • 2. Transpose(Reshape(X*W1+b1))
      • 3. Transpose(Reshape(X*W2+b2))*scale -> Transpose(Reshape(X*W2*scale+b2*scale)). After this transformation, the branch has exactly the same structure as branches 1 and 2.
    • The branches can be fused into the following computation: Transpose(Reshape(X*[W0+W1+W2*scale]+[b0+b1+b2*scale]))
      image
    • Only a fuse pass needs to be written; no new op is required.
    • The same pattern can be applied in several places in the model.

image

image
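
The branch merging above can be checked numerically. Below is a minimal numpy sketch (not the Paddle pass or op itself), assuming the bracket notation [W0+W1+W2*scale] denotes concatenation of the three weights along the output dimension; shapes follow the ERNIE-base configuration (hidden size 768, 12 heads of 64) and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, hidden, heads, head_dim = 1, 128, 768, 12, 64
scale = 1.0 / np.sqrt(head_dim)

X = rng.standard_normal((batch, seq, hidden)).astype(np.float32)
W = [rng.standard_normal((hidden, hidden)).astype(np.float32) for _ in range(3)]
b = [rng.standard_normal(hidden).astype(np.float32) for _ in range(3)]

def branch(x, w, bias):
    # Transpose(Reshape(X*W + b)): add bias, split heads, move heads forward.
    y = x @ w + bias                              # [batch, seq, hidden]
    y = y.reshape(batch, seq, heads, head_dim)    # Reshape
    return y.transpose(0, 2, 1, 3)                # [batch, heads, seq, head_dim]

# The three separate branches; branch 3 applies the scale after the transpose.
q = branch(X, W[0], b[0])
k = branch(X, W[1], b[1])
v = branch(X, W[2], b[2]) * scale

# One fused GEMM with concatenated weights/biases, then split the result.
W_fused = np.concatenate([W[0], W[1], W[2] * scale], axis=1)  # [hidden, 3*hidden]
b_fused = np.concatenate([b[0], b[1], b[2] * scale])          # [3*hidden]
qkv = X @ W_fused + b_fused                                   # [batch, seq, 3*hidden]
q2, k2, v2 = [t.reshape(batch, seq, heads, head_dim).transpose(0, 2, 1, 3)
              for t in np.split(qkv, 3, axis=-1)]

for separate, fused in [(q, q2), (k, k2), (v, v2)]:
    assert np.allclose(separate, fused, rtol=1e-4, atol=1e-3)
print("one fused QKV GEMM reproduces the three separate branches")
```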


Xreki commented Aug 14, 2019

Some Intel-related work


Xreki commented Aug 14, 2019

Determine whether the reshape operations in the model are necessary and whether they can be removed; for many reshapes the input and output shapes look identical. @Xreki

image

I0814 10:13:53.083331 11085 operator.cc:170] CPUPlace Op(reshape2), inputs:{Shape[], ShapeTensor[], X[fc_0.tmp_1:float[1, 128, 768]({})]}, outputs:{Out[fc_0.tmp_1:float[1, 128, 768]({})], XShape[reshape2_0.tmp_0:[-1]({{}})]}.
I0814 10:13:53.083364 11085 operator.cc:993] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CPUPlace]:library_type[PLAIN]
I0814 10:13:53.083400 11085 tensor_util.cu:28] TensorCopy 1, 128, 12, 64 from CPUPlace to CPUPlace
I0814 10:13:53.083407 11085 tensor_util.cu:43] Skip copy the same data async from CPUPlace to CPUPlace
I0814 10:13:53.083420 11085 operator.cc:190] CPUPlace Op(reshape2), inputs:{Shape[], ShapeTensor[], X[fc_0.tmp_1:float[1, 128, 12, 64]({})]}, outputs:{Out[fc_0.tmp_1:float[1, 128, 12, 64]({})], XShape[reshape2_0.tmp_0:[-1]({{}})]}.
  • From the detailed log, the input dims of reshape2 are [1, 128, 768] and the output dims are [1, 128, 12, 64], so the op cannot simply be removed.
  • Because reshape2 is executed in place, the input dims shown in the graph are incorrect.

Conclusion: the reshape ops cannot simply be removed.


Xreki commented Aug 22, 2019

Summary of optimization results

  • Unit: ms/sample

Version | P40 time, total | Time, excluding the first run | Test date | PR | Description | Speedup
0 | 8.36 | | 2019-08-14 | - | baseline | -
1 (merged) | 8.22 | 8.14 | 2019-08-22 | PaddlePaddle/Paddle#19297 | remove dropout_op | 1.7%
2 (merged) | 7.37 | 6.96 | 2019-08-22 | PaddlePaddle/Paddle#19346 | use cuDNN for softmax | 10%
3 (merged) | 7.20 | 6.82 | 2019-09-10 | PaddlePaddle/Paddle#19687, PaddlePaddle/Paddle#19733 | fuse the FC computation, see #165 (comment) | 2.1%
4 (merged) | 6.93 | 6.55 | 2019-09-11 | PaddlePaddle/Paddle#19776 | fuse FC + elementwise_add + layer_norm, see #165 (comment) | 4%
5 (merged) | 6.13 | 5.73 | 2019-09-19 | PaddlePaddle/Paddle#20167, PaddlePaddle/Paddle#20151 | fuse multi-head attention, see #165 (comment) | 14%
3 | 7.26 | 6.88 | 2019-09-03 | PaddlePaddle/Paddle#19610 | optimize stack_op, see #165 (comment) | 1.2%


Xreki commented Aug 28, 2019

On top of version 1, enabling fuse_elewise_add_act_pass for inference fuses the elementwise_add + relu computation and successfully matches 14 subgraphs, but overall performance drops.

+ /paddle/build_paddle/build_docker_manylinux_cuda90/paddle/fluid/inference/tests/api/samples/ernie_tester --logtostderr --model_dir=/data/ernie/model --data=/data/ernie/seq128_data/test_ds --repeat=1 --warmup_steps=1 --use_gpu=true --use_analysis=true --print_outputs=false --profile=false
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ernie
[ RUN      ] ernie.profile
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [fuse_elewise_add_act_pass]
---  detected 14 subgraphs
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0828 08:49:37.436837  7631 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I0828 08:49:37.726634  7631 analysis_predictor.cc:475] == optimize end ==
I0828 08:49:38.710613  7631 ernie_tester.cc:210] Load 5010 samples from /data/ernie/seq128_data/test_ds
W0828 08:49:39.109542  7631 device_context.cc:261] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 9.0
W0828 08:49:39.115780  7631 device_context.cc:269] device: 0, cuDNN Version: 7.3.
W0828 08:49:39.116070  7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.116091  7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.116104  7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.127769  7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.127822  7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.127832  7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.137923  7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.137972  7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.137981  7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.147920  7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.147967  7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.147977  7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
W0828 08:49:39.157951  7631 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
W0828 08:49:39.157996  7631 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
W0828 08:49:39.158005  7631 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
I0828 08:50:21.810214  7631 ernie_tester.cc:345] Run 5010 samples, average latency: 8.60096 ms per sample.
I0828 08:50:21.810283  7631 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 8.51942 ms per sample.
[       OK ] ernie.profile (46355 ms)
[----------] 1 test from ernie (46355 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (46356 ms total)
[  PASSED  ] 1 test.


zhaoyuchen2018 commented Aug 29, 2019

Fused the fc, reshape, transpose and scale ops of the 3 paths into a single path. Test results on P40 are as follows:
image

The graph after fusion is shown below:
image
Before fusion:
image

Profiling results

Event                               Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::mul                        250450      30086.2     10700.311709 (0.355655) 19385.885160 (0.644345) 0.027581    88.3573     0.120129    0.39598
thread0::elementwise_add            440792      13893.1     11593.975031 (0.834514) 2299.111468 (0.165486)  0.019193    12.8221     0.0315185   0.182854
thread0::matmul                     125225      7716.83     5090.761228 (0.659696)  2626.072493 (0.340304)  0.02993     7.32385     0.0616237   0.101565
thread0::layer_norm                 125225      7091.46     3754.750949 (0.529475)  3336.704465 (0.470525)  0.048728    7.12371     0.0566297   0.0933344
thread0::softmax                    60108       4259.71     1991.152624 (0.467438)  2268.559610 (0.532562)  0.063841    9.24836     0.0708676   0.0560643
thread0::transpose2                 120216      4126.7      2756.571446 (0.667985)  1370.123630 (0.332015)  0.025956    9.50509     0.0343273   0.0543136
thread0::split                      60108       2998.5      1830.882725 (0.610599)  1167.620505 (0.389401)  0.043009    19.4484     0.0498853   0.0394649
thread0::relu                       60108       1620.75     1233.311364 (0.760953)  387.433865 (0.239047)   0.021955    9.30877     0.0269639   0.0213315
thread0::reshape2                   120216      1279.38     1279.361341 (0.999987)  0.016020 (0.000013)     0.008744    1.83394     0.0106423   0.0168386
thread0::stack                      5009        673.79      610.591735 (0.906205)   63.198156 (0.093795)    0.110177    11.1131     0.134516    0.0088681
thread0::lookup_table               15027       517.377     424.928699 (0.821314)   92.448017 (0.178686)    0.025679    6.76539     0.0344298   0.00680947
thread0::fetch                      5009        342.348     342.270593 (0.999773)   0.077819 (0.000227)     0.057001    1.4229      0.0683467   0.00450583
thread0::TensorCopy:GPU->CPU        5009        261.036     260.979443 (0.999784)   0.056311 (0.000216)     0.04429     0.284401    0.0521133   0.00343563
thread0::scale                      10018       253.319     237.712034 (0.938391)   15.606845 (0.061609)    0.020368    0.161032    0.0252864   0.00333406
thread0::GpuMemcpyAsync:CPU->GPU    20036       248.089     208.310791 (0.839661)   39.778466 (0.160339)    0.008811    1.84374     0.0123822   0.00326523
thread0::GpuMemcpySync:GPU->CPU     5009        241.886     232.381808 (0.960708)   9.504255 (0.039292)     0.040716    0.390521    0.0482903   0.00318359
thread0::slice                      5009        160.703     143.249188 (0.891391)   17.453703 (0.108609)    0.02768     0.07693     0.0320828   0.0021151
thread0::tanh                       5009        117.589     107.550767 (0.914632)   10.038396 (0.085368)    0.01952     0.884425    0.0234756   0.00154765
thread0::feed                       20036       90.2748     90.263520 (0.999875)    0.011311 (0.000125)     0.002949    0.964233    0.00450563  0.00118815


Xreki commented Sep 3, 2019

stack and concat provide similar functionality. Two approaches were tried.

Approach 1: replace stack with concat. Running this produces an error; it turns out that the output dimensions of stack and concat differ, so this approach is not feasible (see the numpy sketch after the logs below).

  • Output dimensions of stack
I0820 09:17:21.934635 17390 operator.cc:168] CUDAPlace(0) Op(stack), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}.
I0820 09:17:21.934787 17390 operator.cc:188] CUDAPlace(0) Op(stack), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}.
.
.
.
I0820 09:17:21.935680 17390 operator.cc:168] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}, outputs:{Out[tmp_2:float[1, 12, 128, 128]({})]}.
I0820 09:17:21.935708 17390 operator.cc:188] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 12, 128, 128]({})]}, outputs:{Out[tmp_2:float[1, 12, 128, 128]({})]}.
  • Output dimensions of concat
I0903 05:49:33.774837 17211 operator.cc:165] CUDAPlace(0) Op(concat), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Out[stack_0.tmp_0:[0]({{}})]}.
I0903 05:49:33.774863 17211 operator.cc:988] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0903 05:49:33.774976 17211 operator.cc:185] CUDAPlace(0) Op(concat), inputs:{X[scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({}), scale_0.tmp_0:float[1, 128, 128]({})]}, outputs:{Out[stack_0.tmp_0:float[1, 1536, 128]({})]}.
.
.
.
I0903 05:49:33.775012 17211 operator.cc:165] CUDAPlace(0) Op(elementwise_add), inputs:{X[matmul_1.tmp_0:float[1, 12, 128, 128]({})], Y[stack_0.tmp_0:float[1, 1536, 128]({})]}, outputs:{Out[tmp_2:[0]({{}})]}.
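
For illustration, a minimal numpy sketch of the shape mismatch seen in the logs above (plain numpy, not the Paddle ops): stack inserts a new axis while concat joins along an existing one, so the downstream elementwise_add against a [1, 12, 128, 128] tensor no longer lines up.

```python
import numpy as np

# 12 attention-mask slices, each [1, 128, 128], as in the logs above.
inputs = [np.zeros((1, 128, 128), dtype=np.float32) for _ in range(12)]

stacked = np.stack(inputs, axis=1)             # new axis -> (1, 12, 128, 128)
concatenated = np.concatenate(inputs, axis=1)  # existing axis -> (1, 1536, 128)

print(stacked.shape)       # (1, 12, 128, 128): matches matmul_1.tmp_0, the add works
print(concatenated.shape)  # (1, 1536, 128): does not broadcast against it
```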

Approach 2: re-implement stack, following the concat implementation.

I0903 09:57:36.458448 29283 ernie_tester.cc:345] Run 5010 samples, average latency: 7.26373 ms per sample.
I0903 09:57:36.458526 29283 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.87974 ms per sample.

The test results above are based on version 2.

I0920 09:15:41.505820 25415 ernie_tester.cc:345] Run 5010 samples, average latency: 6.92085 ms per sample.
I0920 09:15:41.505903 25415 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.51521 ms per sample.

The test results above are based on version 4.


Xreki commented Sep 9, 2019

Implement a GPU kernel for fc: PaddlePaddle/Paddle#19687
Fuse FC: PaddlePaddle/Paddle#19733

  • fc = mul + elementwise_add + relu; the main change is fusing the elementwise_add and relu computation into the fc op (a hedged numpy sketch of the fused computation follows the profile below).
  • Test result, speedup: 2.1%
--- Running IR pass [fc_fuse_pass]
---  detected 12 subgraphs
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0909 06:52:40.161475  7826 fc_fuse_pass.cc:122] with_relu: 1, found_fc_count: 12
---  detected 62 subgraphs
I0909 06:52:40.214682  7826 fc_fuse_pass.cc:122] with_relu: 0, found_fc_count: 74
.
.
.
I0909 06:51:05.903086  7821 ernie_tester.cc:345] Run 5010 samples, average latency: 7.19529 ms per sample.
I0909 06:51:05.903168  7821 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.81763 ms per sample.
  • Profile
------------------------->     Profiling Report     <-------------------------

Place: All
Time unit: ms
Sorted by total time in descending order in the same thread

Event                               Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::fc                         370666      37927.8     16796.211849 (0.442847) 21131.552513 (0.557153) 0.038417    81.8839     0.102323    0.511083
thread0::matmul                     125225      7542.57     4853.428136 (0.643472)  2689.138129 (0.356528)  0.028322    6.91776     0.0602321   0.101637
thread0::layer_norm                 125225      6785.64     3458.361240 (0.509659)  3327.277901 (0.490341)  0.046506    6.78461     0.0541876   0.0914376
thread0::transpose2                 240432      6648.82     4993.952105 (0.751103)  1654.869939 (0.248897)  0.022683    6.97236     0.0276536   0.089594
thread0::elementwise_add            190342      4846.03     3903.958820 (0.805599)  942.072546 (0.194401)   0.018709    7.06821     0.0254596   0.0653011
thread0::softmax                    60108       4155.86     1873.948821 (0.450917)  2281.914897 (0.549083)  0.062806    7.22235     0.0691399   0.056001
thread0::reshape2                   240432      2117.24     2117.212819 (0.999986)  0.030675 (0.000014)     0.006925    2.54413     0.008806    0.0285302
thread0::scale                      70126       1639.13     1402.350356 (0.855547)  236.777127 (0.144453)   0.017914    3.11279     0.023374    0.0220875
thread0::stack                      5009        651.86      601.077905 (0.922097)   50.782113 (0.077903)    0.106259    4.12809     0.130138    0.00878393
thread0::lookup_table               15027       479.71      388.084297 (0.808999)   91.625230 (0.191001)    0.024586    0.085041    0.0319232   0.00646417
thread0::fetch                      5009        333.739     333.664085 (0.999776)   0.074696 (0.000224)     0.054763    2.98519     0.0666278   0.00449719
thread0::TensorCopy:GPU->CPU        5009        264.276     264.216676 (0.999777)   0.059003 (0.000223)     0.043246    2.97079     0.0527602   0.00356116
thread0::GpuMemcpySync:GPU->CPU     5009        244.643     235.804134 (0.963870)   8.839005 (0.036130)     0.04009     1.58306     0.0488407   0.00329661
thread0::GpuMemcpyAsync:CPU->GPU    20036       226.793     195.272127 (0.861014)   31.521131 (0.138986)    0.007882    1.17277     0.0113193   0.00305608
thread0::slice                      5009        147.886     130.195979 (0.880383)   17.689569 (0.119617)    0.025414    0.103978    0.029524    0.00199278
thread0::tanh                       5009        122.009     112.002519 (0.917989)   10.006004 (0.082011)    0.019717    0.094528    0.0243579   0.00164409
thread0::feed                       20036       76.5896     76.581383 (0.999892)    0.008248 (0.000108)     0.002877    0.308487    0.0038226   0.00103206
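
For reference, a minimal numpy sketch of what the fused fc op is described as computing above (fc = mul + elementwise_add + relu); this only illustrates the fused formula, not the Paddle kernel, and the shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 768)).astype(np.float32)   # input activations
W = rng.standard_normal((768, 3072)).astype(np.float32)  # fc weight
b = rng.standard_normal(3072).astype(np.float32)         # fc bias

# Unfused: three ops (mul, elementwise_add, relu) and two intermediate tensors.
mul_out = X @ W
add_out = mul_out + b
unfused = np.maximum(add_out, 0.0)

# Fused fc (the with_relu=1 case in the pass log above): one op, same result.
def fused_fc(x, w, bias, with_relu=True):
    out = x @ w + bias
    return np.maximum(out, 0.0) if with_relu else out

assert np.allclose(unfused, fused_fc(X, W, b))
```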


Xreki commented Sep 11, 2019

fc + elementwise_add + layer_norm fusion (a hedged numpy sketch of the fused computation follows the profile below)

--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
---  detected 24 subgraphs
.
.
.
I0911 06:41:55.494792  7238 ernie_tester.cc:345] Run 5010 samples, average latency: 6.92849 ms per sample.
I0911 06:41:55.494873  7238 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 6.54723 ms per sample.
  • Profile
Event                                      Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::fc                                255459      24269.7     12361.669096 (0.509346) 11908.043273 (0.490654) 0.039433    100.758     0.0950043   0.346056
thread0::fused_fc_elementwise_layernorm    115207      17699.9     6439.933305 (0.363841)  11259.921061 (0.636159) 0.103025    6.90034     0.153635    0.252378
thread0::matmul                            125225      7670.31     4985.129880 (0.649925)  2685.183125 (0.350075)  0.029412    6.91239     0.0612522   0.109369
thread0::transpose2                        240432      6889.05     5254.959850 (0.762799)  1634.087984 (0.237201)  0.023907    6.72677     0.0286528   0.0982293
thread0::softmax                           60108       4204.46     1927.686603 (0.458486)  2276.771181 (0.541514)  0.064249    2.83378     0.0699484   0.0599504
thread0::reshape2                          240432      2228.86     2228.844417 (0.999994)  0.014348 (0.000006)     0.007575    3.75591     0.00927023  0.0317808
thread0::elementwise_add                   75135       2160.14     1641.503732 (0.759904)  518.641058 (0.240096)   0.019577    6.72642     0.0287502   0.030801
thread0::scale                             70126       1736.05     1498.515875 (0.863178)  237.529504 (0.136822)   0.018853    6.66338     0.0247561   0.0247539
thread0::stack                             5009        664.48      615.254962 (0.925920)   49.224601 (0.074080)    0.10776     3.887       0.132657    0.00947466
thread0::layer_norm                        10018       592.675     324.577447 (0.547648)   268.097787 (0.452352)   0.050985    6.49622     0.059161    0.00845082
thread0::lookup_table                      15027       499.198     412.769395 (0.826866)   86.428197 (0.173134)    0.025502    6.66372     0.03322     0.00711794
thread0::fetch                             5009        362.799     362.676040 (0.999660)   0.123295 (0.000340)     0.05508     0.625043    0.0724295   0.00517307
thread0::TensorCopy:GPU->CPU               5009        297.886     297.782095 (0.999652)   0.103633 (0.000348)     0.043194    0.612426    0.0594701   0.00424748
thread0::GpuMemcpySync:GPU->CPU            5009        278.55      270.104589 (0.969679)   8.445858 (0.030321)     0.039548    0.609273    0.05561     0.00397179
thread0::GpuMemcpyAsync:CPU->GPU           20036       230.007     198.920144 (0.864846)   31.086396 (0.135154)    0.008387    0.708778    0.0114797   0.00327961
thread0::slice                             5009        152.385     135.526127 (0.889365)   16.859146 (0.110635)    0.026901    0.06877     0.0304223   0.00217283
thread0::tanh                              5009        122.456     112.432469 (0.918148)   10.023237 (0.081852)    0.020671    0.070096    0.0244471   0.00174607
thread0::feed                              20036       73.4209     73.411569 (0.999873)    0.009355 (0.000127)     0.002908    0.042732    0.00366445  0.00104689
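
As above, a minimal numpy sketch of the pattern this pass is described as fusing (fc + elementwise_add + layer_norm): an fc output added to a residual input and normalized over the last dimension. It illustrates the formula only, not the Paddle kernel; gamma, beta and eps are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, hidden = 128, 768
X = rng.standard_normal((seq, hidden)).astype(np.float32)   # fc input
Y = rng.standard_normal((seq, hidden)).astype(np.float32)   # residual branch
W = rng.standard_normal((hidden, hidden)).astype(np.float32)
b = rng.standard_normal(hidden).astype(np.float32)
gamma = np.ones(hidden, dtype=np.float32)
beta = np.zeros(hidden, dtype=np.float32)
eps = 1e-5

def layer_norm(t):
    mean = t.mean(axis=-1, keepdims=True)
    var = t.var(axis=-1, keepdims=True)
    return (t - mean) / np.sqrt(var + eps) * gamma + beta

# Unfused: fc, elementwise_add and layer_norm as three separate ops.
fc_out = X @ W + b
unfused = layer_norm(fc_out + Y)

# The fused op evaluates the same expression in a single kernel,
# skipping the two intermediate tensors.
fused = layer_norm((X @ W + b) + Y)
assert np.allclose(unfused, fused)
```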


Xreki commented Nov 20, 2019

multi-head attention fusion (a hedged numpy sketch of the fused computation follows the timeline below)

  • Before
    image

  • After
    image

--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [cudnn_placement_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [conv_affine_channel_fuse_pass]
--- Running IR pass [conv_eltwiseadd_affine_channel_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass]
I1120 01:40:10.299883  6510 graph_pattern_detector.cc:97] ---  detected 12 subgraphs
--- Running IR pass [fc_fuse_pass]
I1120 01:40:10.344569  6510 graph_pattern_detector.cc:97] ---  detected 12 subgraphs
I1120 01:40:10.357105  6510 graph_pattern_detector.cc:97] ---  detected 26 subgraphs
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
I1120 01:40:10.397455  6510 graph_pattern_detector.cc:97] ---  detected 24 subgraphs
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I1120 01:40:10.509421  6510 ir_params_sync_among_devices_pass.cc:41] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I1120 01:40:10.758424  6510 analysis_predictor.cc:474] ======= optimize end =======
I1120 01:40:11.764863  6510 ernie_tester.cc:210] Load 5010 samples from /data/ernie/seq128_data/test_ds
W1120 01:40:12.444176  6510 device_context.cc:236] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.1, Runtime API Version: 9.0
.
.
.
I1120 01:40:43.862216  6510 ernie_tester.cc:345] Run 5010 samples, average latency: 6.40495 ms per sample.
I1120 01:40:43.862303  6510 ernie_tester.cc:350] Run 5009 samples, average latency [exclude 1 warmup steps]: 5.74544 ms per sample.
  • Profile
------------------------->     Profiling Report     <-------------------------

Place: All
Time unit: ms
Sorted by total time in descending order in the same thread

Event                                      Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::fused_fc_elementwise_layernorm    115207      17021.4     5933.953144 (0.348617)  11087.450962 (0.651383) 0.09927     8.71612     0.147746    0.259676
thread0::multihead_matmul                  60108       11252.9     7224.504494 (0.642012)  4028.405127 (0.357988)  0.169878    6.57812     0.187212    0.171673
thread0::fc                                75135       10707.1     4357.612778 (0.406985)  6349.445604 (0.593015)  0.038466    85.6766     0.142504    0.163345
thread0::mul                               180324      9774.02     4994.708598 (0.511019)  4779.307233 (0.488981)  0.047804    9.68787     0.0542025   0.149111
thread0::fetch                             5009        4630.95     4629.392276 (0.999664)  1.554398 (0.000336)     0.055675    1.92049     0.924525    0.0706489
thread0::TensorCopy:GPU->CPU               5009        4573.44     4571.905707 (0.999664)  1.538373 (0.000336)     0.04468     1.90973     0.913045    0.0697717
thread0::GpuMemcpySync:GPU->CPU            5009        4553.78     4543.954556 (0.997843)  9.821141 (0.002157)     0.040962    1.9062      0.909119    0.0694716
thread0::stack                             5009        684.909     634.119137 (0.925844)   50.789981 (0.074156)    0.106366    2.25957     0.136736    0.0104489
thread0::layer_norm                        10018       549.522     290.654864 (0.528923)   258.867095 (0.471077)   0.048854    0.142508    0.0548535   0.00838342
thread0::lookup_table                      15027       479.099     392.001286 (0.818205)   87.097997 (0.181795)    0.024349    9.00457     0.0318826   0.00730906
thread0::elementwise_add                   15027       345.643     302.006333 (0.873752)   43.636814 (0.126248)    0.017625    3.1705      0.0230015   0.00527307
thread0::GpuMemcpyAsync:CPU->GPU           20036       229.869     198.181250 (0.862150)   31.687294 (0.137850)    0.008188    2.08116     0.0114728   0.00350684
thread0::scale                             10018       224.898     210.406303 (0.935565)   14.491370 (0.064435)    0.018757    0.143304    0.0224494   0.003431
thread0::slice                             5009        169.833     151.921321 (0.894535)   17.911378 (0.105465)    0.030184    0.072767    0.0339055   0.00259094
thread0::matmul                            5009        163.061     149.402692 (0.916239)   13.658130 (0.083761)    0.028052    0.956259    0.0325536   0.00248763
thread0::tanh                              5009        116.672     106.307159 (0.911166)   10.364377 (0.088834)    0.020194    0.067294    0.0232924   0.00177992
thread0::feed                              20036       71.6392     71.630728 (0.999882)    0.008476 (0.000118)     0.002808    0.034718    0.00357552  0.00109292
  • Timeline: fetch is waiting for the GPU computation to finish
    image
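
A minimal numpy sketch of the computation assumed to be folded into the multihead_matmul op, judging by the ops that drop out of the profile above (transpose2, softmax, and most of the scale and matmul calls): scaled dot-product attention over the per-head Q/K/V produced by the fused QKV GEMM. This is an illustration, not the Paddle kernel; attn_bias stands in for the attention mask and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, heads, seq, head_dim = 1, 12, 128, 64
q = rng.standard_normal((batch, heads, seq, head_dim)).astype(np.float32)
k = rng.standard_normal((batch, heads, seq, head_dim)).astype(np.float32)
v = rng.standard_normal((batch, heads, seq, head_dim)).astype(np.float32)
attn_bias = np.zeros((batch, heads, seq, seq), dtype=np.float32)  # attention mask

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Q*K^T / sqrt(d) + mask, softmax, then weight the values.
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim) + attn_bias  # [b, h, s, s]
context = softmax(scores) @ v                                         # [b, h, s, d]

# Merge the heads back so the result feeds the following fc.
out = context.transpose(0, 2, 1, 3).reshape(batch, seq, heads * head_dim)
print(out.shape)  # (1, 128, 768)
```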


Xreki commented Feb 4, 2020

QA test results

  • Test date: December 24, 2019
  • Test environment:
    • CUDA 9.0
    • cuDNN 7.3
    • Paddle inference library version: a5a8d14414213fadcfcd7dc60c794d1a515a390e
    • Test image: paddlepaddle/paddle_manylinux_devel:cuda9.0_cudnn7
    • GPU: Tesla P4 8G (7611MB)
    • GCC version: gcc482 from the image
  • Test conclusions:
    1. ERNIE GPU inference performance meets expectations.
    2. For the same data, the diff between GPU and CPU predictions is < 1e-5.
    3. Running the same dataset repeatedly on the GPU, the diff between predictions is < 1e-6.
  • Performance data:
    • Seqlen128-data

      batch_size average_latency(ms) batch_latency(ms) gpu_memory(MB)
      1 8.664 8.664 737
      10 6.388 63.878 825
      20 6.207 124.149 919
      30 6.530 195.892 985
      40 6.456 258.260 1085
      50 6.445 322.275 1135
      60 6.430 385.805 1279
      70 6.457 451.965 1283
      80 6.415 513.178 1339
      90 6.481 583.301 1431
      100 6.504 650.428 1485
    • Seqlen512-data

      batch_size average_latency(ms) batch_latency(ms) gpu_memory(MB)
      1 36.123 36.123 781
      10 31.663 316.630 1191
      20 31.750 635.002 1637
      30 32.280 968.412 2053
      40 32.130 1285.212 2431
      50 32.917 1645.835 2847
      100 34.254 3425.380 4951
      200 out of GPU memory
      300 out of GPU memory


Xreki commented Feb 4, 2020

Phase 2 optimization work

  • Owner: @NHZlX

  • GPU platform: Tesla P4

  • Software:

    • CUDA 10.0
  • Optimization work

Version | P4 time (ms) | Date recorded | PR | Description | Speedup
0 | 10 | 2020-02-04 | - | baseline; the CUDA 10 time is longer than the CUDA 9 time | -
1 | 9.5 | 2020-02-04 | PaddlePaddle/Paddle#22424, PaddlePaddle/Paddle#22486 | also fuse the 3 fc ops into multihead_matmul_op, merging the 3 small GEMMs into one large GEMM | 5%
2 | 8.4 | 2020-02-04 | PaddlePaddle/Paddle#22424, PaddlePaddle/Paddle#22494 | fuse the 3 lookup_table ops + elementwise_add + layer_norm into fused_embedding_eltwise_layernorm_op | 11.6%
3 | 7.8 | 2020-02-05 | | in the layer_norm computed inside fused_fc_elementwise_layernorm_op, change the block-level reduce from double to float | 

Subgraph pattern matched by embedding_eltwise_layernorm_fuse_pass (a hedged numpy sketch of the fused computation follows the image below)

image
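
A minimal numpy sketch of the pattern described in row 2 of the table above (3 lookup_table ops + elementwise_add + layer_norm), which embedding_eltwise_layernorm_fuse_pass is said to match: word, position and sentence embedding lookups summed and layer-normalized in one op. Vocabulary sizes, gamma, beta and eps are illustrative, not taken from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, hidden = 128, 768
word_emb = rng.standard_normal((18000, hidden)).astype(np.float32)
pos_emb = rng.standard_normal((512, hidden)).astype(np.float32)
sent_emb = rng.standard_normal((2, hidden)).astype(np.float32)
gamma = np.ones(hidden, dtype=np.float32)
beta = np.zeros(hidden, dtype=np.float32)
eps = 1e-5

word_ids = rng.integers(0, 18000, size=seq)
pos_ids = np.arange(seq)
sent_ids = np.zeros(seq, dtype=np.int64)  # the [00000,11111]-style sentence ids

def layer_norm(t):
    mean = t.mean(axis=-1, keepdims=True)
    var = t.var(axis=-1, keepdims=True)
    return (t - mean) / np.sqrt(var + eps) * gamma + beta

# Unfused: 3 lookup_table ops, 2 elementwise_add ops, 1 layer_norm op.
summed = word_emb[word_ids] + pos_emb[pos_ids] + sent_emb[sent_ids]
out = layer_norm(summed)  # fused_embedding_eltwise_layernorm_op produces this directly
print(out.shape)          # (128, 768)
```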

Floating-point compute capability of the P4

image
