
Commit 209600b

feat: add benchmarks and e2e T5 notebook (#92)
* feat: first version of the notebook
* feat: improve notebook
* feat: update README.md
* fix: show error
* fix: linter
* feat: modify text
* feat: add inductor benchmark script
* feat: add info
* feat: add info on AITemplate benchmark
* fix: wording
* feat: complete README.md
* feat: lint
* fix: text
* fix: simple update of the notebook
* fix: update T5 notebook
* feat: add 364 seq len in benchmarks + move inductor on cuda 11.6
* feat: add deepspeed scores
* feat: update text
* feat: add notebook to create the graph + data as csv
* feat: image
* feat: image
* feat: text
* fix: engine name, graph
* fix: time
* fix: typo
* fix: time
* fix: README.md following PR review
* fix: PR review
* feat: improve graph
* fix: typo
1 parent 3c933cc commit 209600b


15 files changed: +1388 / -59 lines


README.md

Lines changed: 83 additions & 10 deletions
@@ -2,9 +2,22 @@
 
 ---
 
-Kernl is a collection of optimized kernels for `transformer` models to speed-up inference and soon training.
+**Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code,**
+**and is designed to be easily hackable.**
 
-## Install dependencies
+<p align="center">
+<img src="./resources/images/speedup.png">
+</p>
+
+*benchmarks run on a 3090 RTX*
+
+Kernl is the first OSS inference engine written in ~~CUDA C~~ [OpenAI Triton](https://openai.com/blog/triton/),
+a new language designed by OpenAI to make it easier to write GPU kernels.
+Each kernel is less than 200 lines of code, and is **easy to understand** and modify.
+
+🎅🎄 Training support coming soon... 🤯
+
+## Installation
 
 **IMPORTANT**: This package requires `pytorch` to be installed.
 Please install it first.
@@ -13,13 +26,11 @@ Please install it first.
 pip install torch -U --extra-index-url https://download.pytorch.org/whl/cu116
 git clone https://github.com/ELS-RD/kernl
 pip install -e .
-# or to enable all benchmarks
-pip install -e ".[benchmark]"
 ```
 
 This project requires `Python` >= 3.9.
 
-## Use
+## Getting started
 
 ```python
 from kernl.model_optimization import optimize_model
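The getting-started snippet is cut off by the diff view. As a hedged illustration of the "single line of code" usage described in the new tagline, a minimal sketch (the model name, input shapes and autocast usage below are assumptions, not taken from this commit):

```python
import torch
from transformers import AutoModel
from kernl.model_optimization import optimize_model

# Hypothetical usage sketch: model choice and input shapes are assumptions.
model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()
optimize_model(model)  # the single line that swaps supported ops for Triton kernels

inputs = {
    "input_ids": torch.randint(0, 1000, (1, 128), device="cuda"),
    "attention_mask": torch.ones(1, 128, dtype=torch.long, device="cuda"),
}
with torch.inference_mode(), torch.cuda.amp.autocast():
    outputs = model(**inputs)
```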
@@ -46,11 +57,13 @@ Note that the original model will raise an error if you try to use it after opti
 pytest
 ```
 
+There are over 2K benchmarks, and they take a while to run.
+
 Some rules on how `PyTest` works, in particular for benchmarks:
 
-- add `-k` to filter tests/benchmarks by their name like `pytest -k benchmark` to run only tests with `benchmark`
-  in their name
-- you can combine expressions in the filter: `pytest -k "benchmark and not bert"` if you want to run all benchmarks
+- add `-k` to filter tests/benchmarks by their name like `pytest -k benchmark` to run only tests with `benchmark`
+  in their name
+- you can combine expressions in the filter: `pytest -k "benchmark and not bert"` if you want to run all benchmarks
   except those related to BERT
 - to group and compare benchmark measures, use `pytest -k benchmark --benchmark-group-by ...`:
   - grouping by names: `pytest -k benchmark --benchmark-group-by fullfunc`
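The same filters can also be driven from Python through `pytest.main`, which can be handy when scripting benchmark runs. A small sketch reusing the selection strings above:

```python
import pytest

# Run only the non-BERT benchmarks and group measures by full function name.
# Equivalent to: pytest -k "benchmark and not bert" --benchmark-group-by fullfunc
pytest.main(["-k", "benchmark and not bert", "--benchmark-group-by", "fullfunc"])
```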
@@ -80,14 +93,74 @@ The easiest way to do this is to [convert the model to a fx graph](https://pytor
 print it with `utils.graph_report` or by printing the code `print(your_graph_module.code)`
 
 Then you can use [replace_pattern](https://pytorch.org/docs/stable/fx.html#torch.fx.replace_pattern) to replace the
-pattern in the graph. We have our own version of `replace_pattern` with some enhancements to work with modules for
+pattern in the graph. We have our own version of `replace_pattern` with some enhancements to work with modules, for
 example. You can find examples of that in the `optimizer` folder.
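For readers unfamiliar with `torch.fx`, a minimal, self-contained sketch of the trace-then-replace workflow described above (the toy module and pattern are illustrative, not Kernl's own replacement code):

```python
import torch
import torch.fx as fx


class Toy(torch.nn.Module):
    def forward(self, x, y):
        return torch.add(x, y).relu()


def pattern(x, y):
    return torch.add(x, y).relu()


def replacement(x, y):
    # stand-in for a call to a fused/optimized kernel
    return torch.nn.functional.relu(x + y)


gm = fx.symbolic_trace(Toy())                 # convert the model to an fx graph
print(gm.code)                                # inspect the generated code
fx.replace_pattern(gm, pattern, replacement)  # rewrite every matching subgraph
gm.recompile()                                # make the python code reflect the new graph
print(gm.code)
```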
 
-## Code formatting
+## Code Formatting
 
 We use `black` / `isort` / `flake8` to format the code. You can run them with:
 
 ```shell
 make source_code_format
 make source_code_check_format
 ```
+
+## Why?
+
+At Lefebvre Sarrut, we run several transformers in production, some of them latency sensitive (mostly search and recsys).
+
+We are using OnnxRuntime and TensorRT, and even created
+[transformer-deploy](https://github.com/ELS-RD/transformer-deploy), an OSS library, to share our knowledge with the community.
+Recently, we were testing generative language models and tried to accelerate them. It proved very difficult with traditional tools.
+
+Basically, and to make it short, it seems to us that Onnx (the main format to feed those tools) is an interesting
+format with support for a wide range of hardware.
+
+However, its ecosystem (and mostly its inference engines) has several limitations when dealing with new LLM architectures:
+
+* Export to Onnx is simple for models without control flow because we can rely on tracing,
+  but dynamic behaviors are harder to obtain (see https://ppwwyyxx.com/blog/2022/TorchScript-Tracing-vs-Scripting/ for
+  more info; it's about TorchScript, but the same applies to Onnx).
+* Unlike PyTorch, ONNX Runtime and TensorRT do not yet have native support for multi-GPU tasks enabling tensor parallelism.
+* TensorRT cannot manage two dynamic axes for transformer models with the same profile.
+  Because we usually want to provide inputs of different lengths, we need to build one model per batch size.
+* Very large models are common, and Onnx (as a protobuf file) has limitations on file size,
+  requiring weights to be stored outside of the model as a workaround.
+
+One very annoying thing is that new models are never accelerated out of the box: you need to wait for someone to write custom CUDA kernels for them.
+
+This is not to say these solutions are bad: one big strength of OnnxRuntime is its multi-hardware support,
+and TensorRT is really fast.
+
+So we wanted something as fast as TensorRT on Python / PyTorch; that's why we built Kernl.
+
+## How?
+
+The simple rule is that memory bandwidth is often the bottleneck in deep learning, so reducing memory accesses
+is usually a good strategy to accelerate inference.
+On short input sequences, the bottleneck is often the CPU overhead, which has to be removed too.
+Counterintuitively, to make things faster, you don't need to be faster at computation.
+
+We mostly leverage 3 technologies:
+
+* [OpenAI Triton](https://triton-lang.org/): a language to write GPU kernels like CUDA (not to be confused with
+  the Nvidia Triton inference server), but much more productive, at least for us; see the small fusion sketch after this list.
+  The improvement comes from fusing several ops, which lets us chain computations without
+  saving intermediate results in GPU memory. We are using it to rewrite:
+
+  * Attention (replaced by Flash Attention),
+  * linear layers and their activation,
+  * and finally Layernorm/Rmsnorm.
+
+* [CUDA graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/): you may have heard that Python is slow,
+  blablabla, and that to limit overhead, C++/Rust should be the solution.
+  It is true, but better than low overhead is no overhead at all. That's CUDA graphs!
+  During a warmup step, it records every kernel launched and its parameters, and then, with a single GPU instruction,
+  we can replay the whole inference.
+
+* [TorchDynamo](https://github.com/pytorch/torchdynamo/): this prototype from Meta helps us cope with dynamic
+  behavior. It's described [here](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747);
+  in a few words, during a warmup step it traces the model and provides an Fx graph (a static computation graph).
+  We replace some operations of this graph with our kernels and recompile it in Python.
+  We do that for every dynamic behavior we expect to encounter. During inference, inputs are analyzed and the correct
+  static graph is used. It's really an awesome project, check their repo to learn more.
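To make the fusion point above concrete, a minimal Triton sketch (not one of Kernl's kernels): a bias add and a fast-GELU activation computed in a single kernel, so the intermediate tensor never round-trips through GPU memory. Names, shapes and block size are illustrative.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_bias_gelu(x_ptr, bias_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    row = tl.program_id(0)             # one program instance per row
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0)
    b = tl.load(bias_ptr + cols, mask=mask, other=0.0)
    y = x + b                          # bias add...
    y = y * tl.sigmoid(1.702 * y)      # ...fused with a fast GELU approximation
    tl.store(out_ptr + row * n_cols + cols, y, mask=mask)


x = torch.randn(32, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")
out = torch.empty_like(x)
fused_bias_gelu[(x.shape[0],)](x, bias, out, x.shape[1], BLOCK=1024)
```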

conftest.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ def pytest_sessionfinish(session: pytest.Session, exitstatus):
 
 def check_all_close(a: torch.Tensor, b: torch.Tensor, rtol=0, atol=1e-1) -> None:
     """
-    Check that all elements of tensors a and b are close.
+    Check that all elements of tensors a and b are within provided thresholds.
     """
     assert a.shape == b.shape, f"Shapes don't match: {a.shape} != {b.shape}"
     assert a.dtype == b.dtype, f"Dtypes don't match: {a.dtype} != {b.dtype}"
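The diff only shows the first assertions of `check_all_close`; a hedged sketch of how such a tolerance check can be completed (illustrative, not the file's actual remaining lines):

```python
import torch


def check_all_close(a: torch.Tensor, b: torch.Tensor, rtol=0, atol=1e-1) -> None:
    """Check that all elements of tensors a and b are within provided thresholds."""
    assert a.shape == b.shape, f"Shapes don't match: {a.shape} != {b.shape}"
    assert a.dtype == b.dtype, f"Dtypes don't match: {a.dtype} != {b.dtype}"
    # element-wise check: |a - b| <= atol + rtol * |b|
    if not torch.allclose(a.float(), b.float(), rtol=rtol, atol=atol):
        max_diff = (a.float() - b.float()).abs().max().item()
        raise AssertionError(f"Tensors differ, max absolute difference: {max_diff}")
```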

experimental/benchmarks/README.md

Lines changed: 159 additions & 20 deletions
@@ -1,7 +1,9 @@
 # Benchmark of Third-Party Libraries
 
 This directory contains benchmarks of third-party libraries. The benchmarks are
-written as simple Python scripts and have been run on an Nvidia 3090 RTX GPU.
+written as simple Python scripts and have been run on an Nvidia 3090 RTX GPU, 128 GB of RAM, and a 12-core Intel CPU.
+
+Measurements are done in wall-clock time, and output tensors are kept on GPU.
 
 ## [TensorRT](https://github.com/NVIDIA/TensorRT/)
 
@@ -14,20 +16,22 @@ We rely on the Docker image `nvcr.io/nvidia/tensorrt:22.09-py3`.
 | batch | sequence length | Time (s) |
 |-------|-----------------|----------|
 | 1 | 16 | 0.0010 |
-| 1 | 32 | 0.0012 |
+| 1 | 32 | 0.0010 |
 | 1 | 64 | 0.0011 |
 | 1 | 128 | 0.0013 |
 | 1 | 256 | 0.0016 |
+| 1 | 384 | 0.0026 |
 | 1 | 512 | 0.0026 |
-| 8 | 16 | 0.0012 |
+| 8 | 16 | 0.0011 |
 | 8 | 32 | 0.0015 |
-| 8 | 64 | 0.0020 |
+| 8 | 64 | 0.0019 |
 | 8 | 128 | 0.0036 |
 | 8 | 256 | 0.0064 |
-| 8 | 512 | 0.0142 |
+| 8 | 384 | 0.0139 |
+| 8 | 512 | 0.0139 |
 | 32 | 16 | 0.0020 |
 | 32 | 32 | 0.0031 |
-| 32 | 64 | 0.0055 |
+| 32 | 64 | 0.0054 |
 | 32 | 128 | 0.0103 |
 | 32 | 256 | 0.0210 |
 
@@ -37,20 +41,21 @@
 docker run --rm -it --gpus all -v $(pwd):/work nvcr.io/nvidia/tensorrt:22.09-py3
 cd /work
 pip install transformers torch -U --extra-index-url https://download.pytorch.org/whl/cu116
-python experimental/benchmarks/tensorrt.py
+python experimental/benchmarks/tensorrt_.py
 ```
 
 ### Notes
 
-As `TensorRT` complains where there are 2 dynamic axes, we build one model per batch size.
+As `TensorRT` (its `Myelin` code generator) complains when there are 2 dynamic axes, we build one model per batch size.
 Only the sequence length axis is dynamic.
 
 Model building takes time, around 30 minutes on a beefy machine.
 
 Most of the code has been taken from [transformer-deploy](https://github.com/ELS-RD/transformer-deploy).
 
-It is important to note that `TensorRT` is a black box and we cannot disable fast GELU, fp16 accumulation,
-or whatever optimization they are using.
+It is important to note that `TensorRT` is a black box and we are not aware of simple ways to disable fast GELU,
+fp16 accumulation, or whatever other optimizations are leveraged (we know they use many of them because of the
+precision issues we experienced in prod with this tool).
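For reference, this constraint maps to TensorRT optimization profiles: one engine per batch size, with only the sequence-length axis left dynamic. A hedged sketch with the TensorRT Python API (input names and shape bounds are assumptions, not taken from the benchmark script):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
# ... parse the ONNX model into `network` with trt.OnnxParser ...
config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# batch axis fixed to 8 for this engine; only the sequence length varies
profile.set_shape("input_ids", min=(8, 16), opt=(8, 256), max=(8, 512))
profile.set_shape("attention_mask", min=(8, 16), opt=(8, 256), max=(8, 512))
config.add_optimization_profile(profile)
serialized_engine = builder.build_serialized_network(network, config)
```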
 
 ## [AITemplate](https://github.com/facebookincubator/AITemplate/)
 
@@ -72,26 +77,28 @@ Main branch commit used:
 | 1 | 64 | 0.0013 |
 | 1 | 128 | 0.0015 |
 | 1 | 256 | 0.0020 |
+| 1 | 384 | 0.0022 |
 | 1 | 512 | 0.0031 |
 | 8 | 16 | 0.0013 |
 | 8 | 32 | 0.0017 |
 | 8 | 64 | 0.0026 |
-| 8 | 128 | 0.0044 |
+| 8 | 128 | 0.0043 |
 | 8 | 256 | 0.0076 |
+| 8 | 384 | 0.0115 |
 | 8 | 512 | 0.0149 |
-| 32 | 16 | 0.0027 |
+| 32 | 16 | 0.0026 |
 | 32 | 32 | 0.0043 |
 | 32 | 64 | 0.0073 |
-| 32 | 128 | 0.0131 |
-| 32 | 256 | 0.0249 |
+| 32 | 128 | 0.0127 |
+| 32 | 256 | 0.0242 |
 
 ### Running the benchmark
 
 ```shell
 # in AITemplate root directory
 ./docker/build.sh cuda
 # in kernl root directory
-docker run --rm -it --gpus all -v $(pwd):/work -v $(pwd):/work ait
+docker run --rm -it --gpus all -v $(pwd):/work -v $(pwd):/work ait
 cp /work/experimental/benchmarks/aitemplate.py /AITemplate/examples/03_bert/demo_new.py
 cd AITemplate/
 python3 ./examples/03_bert/demo_new.py
@@ -103,6 +110,26 @@ cat measures.txt
 
 The script is based on the official [demo script](https://github.com/facebookincubator/AITemplate/tree/main/examples/03_bert).
 
+The model does not support `attention_mask`, so we don't use it in the benchmarks.
+It is important to keep in mind that the `attention mask` adds operations on top of an already compute-bound kernel.
+In other words, it would likely make it slower.
+Moreover, without `attention mask`, batch inference is useless right now.
+There is a multi-threaded mode, which would incur much more overhead than batch mode (launching n threads times more kernels).
+
+**TL;DR**: numbers for AITemplate have to be taken with a grain of salt.
+
+An issue has been opened [here](https://github.com/facebookincubator/AITemplate/issues/46) on the repo:
+
+```cite
+@antinucleon
+The current BERT example is only used for benchmarking purposes on fixed
+length without mask.
+
+We are currently working with CUTLASS team on a grouped Attention
+optimization, which will remove paddings & mask for dynamic sequences. It
+will appear in next CUTLASS & AIT release.
+```
+
 We choose to use the following options:
 
 * accumulation in `FP32` (instead of default `FP16`):
@@ -112,10 +139,122 @@ We choose to use the following options:
 FWIW, `Kernl` also has support for fast GELU but it is disabled by default.
 * `CUDA graphs` enabled: this technology removes kernel launching overhead, and it is good practice to use it when possible.
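As a reminder of what enabling `CUDA graphs` means in PyTorch terms, a minimal capture-and-replay sketch (the tiny model and shapes are illustrative):

```python
import torch

model = torch.nn.Linear(128, 128).cuda().eval()
static_input = torch.randn(8, 128, device="cuda")

# warm up on a side stream before capture, as recommended by the PyTorch docs
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# at inference time: copy new data into the captured buffer and replay the whole graph
static_input.copy_(torch.randn(8, 128, device="cuda"))
graph.replay()
print(static_output.shape)
```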
 
-We don't use the [benchmark](https://github.com/facebookincubator/AITemplate/blob/main/examples/03_bert/benchmark_ait.py)
-script provided in the `AITemplate` repo because it reports GPU times through CUDA events and we want to measure the wall
-clock time which better match our end to end case.
-
-Differences are mostly on short input shapes where CPU overhead dominates.
+We do not use the [benchmark](https://github.com/facebookincubator/AITemplate/blob/main/examples/03_bert/benchmark_ait.py)
+script provided in the `AITemplate` repo because:
+* it reports GPU times through CUDA events, and we compare inference engines on wall-clock times, which better matches our end-to-end use cases;
+* we are not sure of the meaning of the reported times when multiple threads/CUDA streams are used
+  (see [here](https://github.com/facebookincubator/AITemplate/issues/44)); it doesn't match the definition of either latency or throughput.
 
 The C++ implementation of the benchmark function is [here](https://github.com/facebookincubator/AITemplate/blob/44026ba7e7f5376a80cf0f2b333a0f25c0eeda6c/static/csrc/model_container.cpp).
+
+
+## [TorchDynamo + inductor](https://github.com/pytorch/torchdynamo)
+
+### Version
+
+* Triton: https://github.com/openai/triton@af76c989eb4799b015f8b288ccd8421558772e56#subdirectory=python
+* PyTorch (includes TorchDynamo): 1.14.0.dev20221015+cu116
+
+### Results
+
+| batch | sequence length | Time (s) |
+|-------|-----------------|----------|
+| 1 | 16 | 0.0018 |
+| 1 | 32 | 0.0020 |
+| 1 | 64 | 0.0020 |
+| 1 | 128 | 0.0025 |
+| 1 | 256 | 0.0030 |
+| 1 | 384 | 0.0036 |
+| 1 | 512 | 0.0048 |
+| 8 | 16 | 0.0023 |
+| 8 | 32 | 0.0027 |
+| 8 | 64 | 0.0039 |
+| 8 | 128 | 0.0065 |
+| 8 | 256 | 0.0117 |
+| 8 | 384 | 0.0156 |
+| 8 | 512 | 0.0212 |
+| 32 | 16 | 0.0039 |
+| 32 | 32 | 0.0062 |
+| 32 | 64 | 0.0108 |
+| 32 | 128 | 0.0177 |
+| 32 | 256 | 0.0357 |
+
+### Running the benchmark
+
+```shell
+# reuse AITemplate docker image based on nvidia/cuda:11.6.2-devel-ubuntu20.04
+docker run --rm -it --gpus all -v $(pwd):/work -v $(pwd):/work ait
+cd work/
+apt install git
+pip3 install --pre torch==1.14.0.dev20221015+cu116 --extra-index-url https://download.pytorch.org/whl/nightly/cu116 -U
+pip3 install git+https://github.com/openai/triton@af76c989eb4799b015f8b288ccd8421558772e56#subdirectory=python
+python experimental/benchmarks/inductor.py
+```
+
+### Notes
+
+`TorchInductor` is still at the prototype stage; results may differ with the final version.
+We are using the version included in the PyTorch nightly for this benchmark.
+This project depends on an older version that does not require the nightly build, as we only need `TorchDynamo`.
+Project info is at https://github.com/pytorch/torchdynamo, even though the code is no longer updated in that repo.
+
+We tried several optimizations that are disabled by default, but none of them worked:
+
+* `config.aggressive_fusion = True`: significantly slower when enabled on our GPU.
+* `config.inplace_buffers = True`: crash, see https://github.com/pytorch/torchdynamo/issues/823
+* `config.triton.mm = "triton"` (same for `"autotune"`): crash, even when trying with `config.triton.autotune = False`
+
+By default, `CUDA graphs` is enabled.
+
+The last one is important and should bring some speedup once it works.
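For context, the `config.*` switches above live in TorchInductor's config module of the standalone prototype. A hedged sketch of how the benchmark-style invocation and such toggles looked with the `torchdynamo`/`torchinductor` packages of that era (module paths changed once the project was merged into PyTorch, so treat the imports and option names as assumptions):

```python
import torch
import torchdynamo                         # standalone prototype package of that era
import torchinductor.config as inductor_config

# toggles discussed above; they are off by default
inductor_config.aggressive_fusion = False
inductor_config.triton.cudagraphs = True   # CUDA graphs, enabled by default

model = torch.nn.Linear(128, 128).cuda().eval()
x = torch.randn(8, 128, device="cuda")


@torchdynamo.optimize("inductor")
def run(inp):
    return model(inp)


out = run(x)
```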
+
+## [Deepspeed](https://github.com/microsoft/DeepSpeed)
+
+### Version
+
+0.7.4
+
+### Results
+
+| batch | sequence length | Time (s) |
+|-------|-----------------|----------|
+| 1 | 16 | 0.0009 |
+| 1 | 32 | 0.0008 |
+| 1 | 64 | 0.0009 |
+| 1 | 128 | 0.0012 |
+| 1 | 256 | 0.0019 |
+| 1 | 384 | 0.0023 |
+| 1 | 512 | 0.0032 |
+| 8 | 16 | 0.0011 |
+| 8 | 32 | 0.0016 |
+| 8 | 64 | 0.0025 |
+| 8 | 128 | 0.0051 |
+| 8 | 256 | 0.0106 |
+| 8 | 384 | 0.0161 |
+| 8 | 512 | 0.0219 |
+| 32 | 16 | 0.0025 |
+| 32 | 32 | 0.0050 |
+| 32 | 64 | 0.0097 |
+| 32 | 128 | 0.0176 |
+| 32 | 256 | 0.0374 |
+
+### Running the benchmark
+
+```shell
+# reuse AITemplate docker image based on nvidia/cuda:11.6.2-devel-ubuntu20.04
+docker run --rm -it --gpus all -v $(pwd):/work -v $(pwd):/work ait
+pip install deepspeed transformers
+deepspeed --num_gpus 1 experimental/benchmarks/deepspeed_.py --deepspeed
+```
+
+### Notes
+
+The benchmark script is built on top of the
+[one](https://github.com/microsoft/DeepSpeed/blob/master/benchmarks/inference/bert-bench.py) provided in the
+`deepspeed` repo.
+
+We rebuilt a model for each shape to leverage CUDA Graphs and get the best possible performance.
+In a real scenario, we would need something on top of it to handle the multiple graphs.
+
+The model got its weights completely converted to fp16 (instead of doing mixed precision), as it is done that way in the
+benchmark script.
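For reference, the core of such a setup is `deepspeed.init_inference`; a hedged sketch with fully fp16 weights as described above (the model choice and arguments are assumptions, not the benchmark script itself):

```python
import deepspeed
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased").eval()
# convert the whole model to fp16 and inject DeepSpeed's fused inference kernels
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

input_ids = torch.randint(0, 1000, (8, 128), device="cuda")
attention_mask = torch.ones(8, 128, dtype=torch.long, device="cuda")
with torch.inference_mode():
    out = engine(input_ids=input_ids, attention_mask=attention_mask)
```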
