+++
author = "SWHL"
title = "ONNXRuntime Inference Tuning Guide"
date = "2023-05-07"
description = "ONNXRuntime Inference Tuning Guide"
tags = ["llm"]
+++

<details open>
<summary>Table of Contents</summary>

- [Introduction](#introduction)
- [Recommended settings](#recommended-settings)
- [`enable_cpu_mem_arena`](#enable_cpu_mem_arena)
- [`enable_profiling`](#enable_profiling)
- [`execution_mode`](#execution_mode)
- [`inter_op_num_threads`](#inter_op_num_threads)
- [`intra_op_num_threads`](#intra_op_num_threads)
- [`graph_optimization_level`](#graph_optimization_level)
- [FAQ](#faq)
  - [Why is my model slower on GPU than on CPU?](#why-is-my-model-slower-on-gpu-than-on-cpu)
- [References](#references)
</details>

#### Introduction
- ONNXRuntime is the inference engine I use most often, and how it is configured has a decisive impact on inference performance. However, information about ONNXRuntime's settings is scattered across many places and does not add up to practical guidance.
- This article therefore collects the relevant settings in one place.
- All of the parameters below come from `SessionOptions`.
- The accompanying test code is available on [AI Studio](https://aistudio.baidu.com/aistudio/projectdetail/6109918?sUid=57084&shared=1&ts=1683438418669).
- Additions and corrections are welcome.
#### Recommended settings
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.log_severity_level = 4
sess_options.enable_cpu_mem_arena = False

# For the remaining options, the defaults are fine.
```
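For context, here is a minimal sketch of passing these options into a session. The model path is a placeholder, and the explicit `CPUExecutionProvider` is just an assumption for a CPU-only setup:
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.log_severity_level = 4
sess_options.enable_cpu_mem_arena = False

# "model.onnx" is a placeholder; point it at your own model.
sess = rt.InferenceSession("model.onnx",
                           sess_options=sess_options,
                           providers=["CPUExecutionProvider"])

# Input/output names are model-specific, so query them instead of hard-coding.
print([inp.name for inp in sess.get_inputs()])
print([out.name for out in sess.get_outputs()])
```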
#### [`enable_cpu_mem_arena`](https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.SessionOptions.enable_cpu_mem_arena)
- What it does: enables the **memory arena** on CPU. The arena may pre-allocate a lot of memory for future use. If you do not want that, set `enable_cpu_mem_arena=False`; the default is `True`.
- Conclusion: turning it off is recommended.
  - With it enabled, memory usage balloons (5618.3 MiB vs. 5.3 MiB) and stays allocated without being released, while inference time improves by only about 13%.

- Test environment:
  - Python: 3.7.13
  - ONNXRuntime: 1.14.1
- Test code (from [issue 11627](https://github.com/microsoft/onnxruntime/issues/11627), [enable_cpu_memory_area_example.zip](https://github.com/microsoft/onnxruntime/files/8772315/enable_cpu_memory_area_example.zip)):
```python
# pip install onnxruntime==1.14.1
# pip install memory_profiler

import numpy as np
import onnxruntime as ort
from memory_profiler import profile


@profile
def onnx_prediction(model_path, input_data):
    ort_sess = ort.InferenceSession(model_path, sess_options=sess_options)
    preds = ort_sess.run(output_names=["predictions"],
                         input_feed={"input_1": input_data})[0]
    return preds


sess_options = ort.SessionOptions()
# Toggle between True and False to reproduce the two runs below.
sess_options.enable_cpu_mem_arena = False
print(f'enable_cpu_mem_arena: {sess_options.enable_cpu_mem_arena}')

input_data = np.load('enable_cpu_memory_area_example/input.npy')
print(f'input_data shape: {input_data.shape}')
model_path = 'enable_cpu_memory_area_example/model.onnx'

onnx_prediction(model_path, input_data)
```
- Results are roughly the same on Windows, macOS, and Linux.
<details>

- `enable_cpu_mem_arena=True`
```bash
(demo) PS G:> python .\test_enable_cpu_mem_arena.py
enable_cpu_mem_arena: True
input_data shape: (32, 200, 200, 1)
Filename: .\test_enable_cpu_mem_arena.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7     69.1 MiB     69.1 MiB           1   @profile
     8                                         def onnx_prediction(model_path, input_data):
     9     77.2 MiB      8.1 MiB           1       ort_sess = ort.InferenceSession(model_path, sess_options=sess_options)
    10     77.2 MiB      0.0 MiB           1       preds = ort_sess.run(output_names=["predictions"],
    11   5695.5 MiB   5618.3 MiB           1                            input_feed={"input_1": input_data})[0]
    12   5695.5 MiB      0.0 MiB           1       return preds
```
- `enable_cpu_mem_arena=False`
```bash
(demo) PS G:> python .\test_enable_cpu_mem_arena.py
enable_cpu_mem_arena: False
input_data shape: (32, 200, 200, 1)
Filename: .\test_enable_cpu_mem_arena.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7     69.1 MiB     69.1 MiB           1   @profile
     8                                         def onnx_prediction(model_path, input_data):
     9     76.9 MiB      7.8 MiB           1       ort_sess = ort.InferenceSession(model_path, sess_options=sess_options)
    10     76.9 MiB      0.0 MiB           1       preds = ort_sess.run(output_names=["predictions"],
    11     82.1 MiB      5.3 MiB           1                            input_feed={"input_1": input_data})[0]
    12     82.1 MiB      0.0 MiB           1       return preds
```
</details>

#### `enable_profiling`
- When this option is enabled, inference produces a log file named something like `onnxruntime_profile__2023-05-07_09-02-15.json`, containing detailed performance data (threads, per-operator latency, and so on).
- Recommended to enable.
- Example:
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.enable_profiling = True
```
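To actually collect the profile, run some inferences and call `end_profiling()`, which writes the JSON file and returns its name. A minimal sketch, with the model path, input shape, and dtype as placeholders:
```python
import numpy as np
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.enable_profiling = True

# "model.onnx" is a placeholder; the input shape/dtype below are placeholders too.
sess = rt.InferenceSession("model.onnx", sess_options=sess_options,
                           providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

for _ in range(10):
    sess.run(None, {input_name: x})

# Writes onnxruntime_profile__<timestamp>.json and returns the file name.
profile_file = sess.end_profiling()
print(profile_file)
```
The resulting file is in the Chrome trace-event format, so it can be inspected in `chrome://tracing`.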
#### `execution_mode`
- Sets how the model is run: `rt.ExecutionMode.ORT_SEQUENTIAL` or `rt.ExecutionMode.ORT_PARALLEL`, i.e. sequential or parallel execution of the graph. The default is sequential.
- **As a rule of thumb, when a model has many branches, setting this to `ORT_PARALLEL` can give better performance.**
- When `sess_options.execution_mode = rt.ExecutionMode.ORT_PARALLEL` is set, `sess_options.inter_op_num_threads` controls the number of threads used to run the graph in parallel (across nodes), as in the sketch below.
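A minimal sketch of the parallel configuration (the model path is a placeholder; whether it actually helps is model-dependent, so benchmark both modes):
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.execution_mode = rt.ExecutionMode.ORT_PARALLEL
# Threads used to run independent branches of the graph concurrently;
# this only takes effect together with ORT_PARALLEL.
sess_options.inter_op_num_threads = 2

# "model.onnx" is a placeholder path.
sess = rt.InferenceSession("model.onnx", sess_options=sess_options,
                           providers=["CPUExecutionProvider"])
```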
#### `inter_op_num_threads`
- Sets the number of threads used to parallelize execution of the graph across nodes; it only applies when `execution_mode` is `ORT_PARALLEL`. The default is 0, which lets onnxruntime decide.
- Example:
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.inter_op_num_threads = 2
```
#### `intra_op_num_threads`
- Sets the number of threads used to parallelize execution within individual nodes. The default is 0, which lets onnxruntime decide; it will usually use all cores on the machine.
- ⚠️ Bigger is not always better; see the ablation study on [AI Studio](https://aistudio.baidu.com/aistudio/projectdetail/6109918?sUid=57084&shared=1&ts=1683438418669), or the benchmarking sketch after the example below.
- Example:
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.intra_op_num_threads = 2
```
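Since the best value is model- and machine-dependent, a small benchmarking loop is an easy way to run the same kind of ablation locally. A minimal sketch; the model path, input shape, and dtype are placeholders to adjust for your model:
```python
import time

import numpy as np
import onnxruntime as rt

# Placeholders: adjust the model path and the dummy input to your model.
model_path = "model.onnx"
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)

for n_threads in [1, 2, 4, 8]:
    sess_options = rt.SessionOptions()
    sess_options.intra_op_num_threads = n_threads
    sess = rt.InferenceSession(model_path, sess_options=sess_options,
                               providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name

    # Warm up once, then time a fixed number of runs.
    sess.run(None, {input_name: dummy})
    start = time.perf_counter()
    for _ in range(20):
        sess.run(None, {input_name: dummy})
    elapsed_ms = (time.perf_counter() - start) / 20 * 1000
    print(f"intra_op_num_threads={n_threads}: {elapsed_ms:.2f} ms/run")
```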
#### [`graph_optimization_level`](https://github.com/microsoft/onnxruntime-openenclave/blob/openenclave-public/docs/ONNX_Runtime_Graph_Optimizations.md)
- Controls the level of graph (operator) optimization applied when running the model. The default enables all optimizations, and keeping the default is recommended.
- The available enum values are: `ORT_DISABLE_ALL | ORT_ENABLE_BASIC | ORT_ENABLE_EXTENDED | ORT_ENABLE_ALL`
- Example:
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
```
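To see what a given level actually does to the graph, `SessionOptions.optimized_model_filepath` can be set so that the optimized graph is serialized to disk when the session is created. A minimal sketch with placeholder paths:
```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
# When set, the graph after optimization is written to this path at session creation.
sess_options.optimized_model_filepath = "model_optimized.onnx"

# "model.onnx" is a placeholder path.
sess = rt.InferenceSession("model.onnx", sess_options=sess_options,
                           providers=["CPUExecutionProvider"])
```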
#### FAQ
##### Why is my model slower on GPU than on CPU?
- Depending on the execution provider you use, it may not fully support every operator in your model, and falling back to the CPU for those operators can hurt performance. In addition, even when an operator is implemented by the CUDA execution provider, ORT may still decide, for performance reasons, not to assign/place it on the CUDA EP. To see the placement ORT decided on, enable verbose logging and check the console output, as in the sketch below.
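A minimal sketch of turning on verbose logging to inspect node placement (assumes the GPU build of onnxruntime, i.e. `onnxruntime-gpu`, is installed; the model path is a placeholder):
```python
import onnxruntime as rt

# 0 = VERBOSE; node-placement decisions are printed to the console
# while the session is being created.
rt.set_default_logger_severity(0)

sess_options = rt.SessionOptions()
sess_options.log_severity_level = 0

# "model.onnx" is a placeholder; nodes not supported by the CUDA EP fall back to CPU.
sess = rt.InferenceSession("model.onnx", sess_options=sess_options,
                           providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
```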
#### References
- [ONNX Runtime Performance Tuning](https://github.com/microsoft/onnxruntime-openenclave/blob/openenclave-public/docs/ONNX_Runtime_Perf_Tuning.md)
- [Python API](https://onnxruntime.ai/docs/api/python/api_summary.html)