```python
from kernl.model_optimization import optimize_model
```

Note that the original model will raise an error if you try to use it after optimization.
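For context, here is a minimal usage sketch of the optimization entry point. The Hugging Face model name, the dummy inputs
and the `inference_mode`/autocast wrappers are illustrative assumptions, not prescriptions from this document:

```python
import torch
from transformers import AutoModel  # assumption: a Hugging Face transformer is being optimized

from kernl.model_optimization import optimize_model

model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()
optimize_model(model)  # after this call, use only the optimized model (see the note above)

inputs = {
    "input_ids": torch.randint(0, 1000, (1, 16), device="cuda"),
    "attention_mask": torch.ones(1, 16, dtype=torch.int64, device="cuda"),
}
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(**inputs)
```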
```
pytest
```

There are over 2K benchmarks, and they take a while to run.
Some rules on how `PyTest` works, in particular for benchmarks:
- add `-k` to filter tests/benchmarks by their name, like `pytest -k benchmark` to run only tests with `benchmark`
  in their name
- you can combine expressions in the filter: `pytest -k "benchmark and not bert"` if you want to run all benchmarks
  except those related to BERT
- to group and compare benchmark measures, use `pytest -k benchmark --benchmark-group-by ...`:
  - grouping by names: `pytest -k benchmark --benchmark-group-by fullfunc`
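The filter and the grouping options can be combined; an illustrative invocation (not a required workflow):

```shell
# run every benchmark except the BERT ones, grouped by test function name
pytest -k "benchmark and not bert" --benchmark-group-by fullfunc
```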
The easiest way to do this is to [convert the model to a fx graph](https://pytorch.org/docs/stable/fx.html) and
print it with `utils.graph_report` or by printing the code (`print(your_graph_module.code)`).

Then you can use [replace_pattern](https://pytorch.org/docs/stable/fx.html#torch.fx.replace_pattern) to replace the
pattern in the graph. We have our own version of `replace_pattern` with some enhancements, for example to work with
modules. You can find examples of that in the `optimizer` folder.
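As a rough illustration of the stock `torch.fx` workflow (Kernl's enhanced `replace_pattern` lives in the `optimizer`
folder; the toy module and the pattern below are purely hypothetical):

```python
import torch
from torch import fx


class Toy(torch.nn.Module):
    # toy module, only here to have something to trace
    def forward(self, x):
        return torch.relu(x) + torch.relu(x)


gm = fx.symbolic_trace(Toy())
print(gm.code)  # inspect the Python code generated from the graph


def pattern(x):
    return torch.relu(x) + torch.relu(x)


def replacement(x):
    # mathematically equivalent, with a single relu
    return 2 * torch.relu(x)


fx.replace_pattern(gm, pattern, replacement)
print(gm.code)  # the pattern has been rewritten in place
```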
## Code Formatting
We use `black` / `isort` / `flake8` to format the code. You can run them with:
```shell
make source_code_format
make source_code_check_format
```
## Why?
At Lefebvre Sarrut, we run several transformers in production, some of them latency sensitive (mostly search and recsys).

We are using OnnxRuntime and TensorRT, and we even created
[transformer-deploy](https://github.com/ELS-RD/transformer-deploy), an OSS library, to share our knowledge with the community.
Recently, we were testing generative language models and tried to accelerate them. It proved very difficult with traditional tools.

Basically, and to make it short, it seems to us that Onnx (the main format to feed those tools) is an interesting
format with support for a wide range of hardware.
However, its ecosystem (and mostly its inference engines) has several limitations when we deal with new LLM architectures:

* Export to Onnx is simple for models without control flow because we can rely on tracing,
  but dynamic behaviors are harder to obtain (see https://ppwwyyxx.com/blog/2022/TorchScript-Tracing-vs-Scripting/ for
  more info; it's about TorchScript, but the situation is exactly the same for Onnx).
* Unlike PyTorch, neither ONNX Runtime nor TensorRT has native support yet for multi-GPU tasks enabling tensor parallelism.
* TensorRT is not able to manage 2 dynamic axes for transformer models with the same profile.
  Because we usually want to be able to provide inputs of different lengths, we need to build 1 model per batch size.
* Very large models are common, and Onnx (as a protobuf file) has some limitations regarding its file size,
  requiring weights to be stored outside the model as a workaround.
One very annoying thing is that new models are never accelerated out of the box: you need to wait for someone to write custom CUDA kernels for them.

This is not to say these solutions are bad: one big strength of OnnxRuntime is its multi-hardware support, and TensorRT is really fast.

So we wanted something as fast as TensorRT and in Python / PyTorch; that's why we built Kernl.
## How?
The simple rule is that memory bandwidth is often the bottleneck in deep learning, so to accelerate inference,
reducing memory accesses is usually a good strategy.
On short input sequences, the bottleneck is often the CPU overhead, which has to be removed too.
Counterintuitively, to make things faster you don't need to be faster at computation.
We mostly leverage 3 technologies (minimal sketches of each idea follow the list):
* [OpenAI Triton](https://triton-lang.org/): a language to write GPU kernels like CUDA (not to be confused with the
  Nvidia Triton inference server), but much more productive (at least for us).
  The improvement comes from fusing several ops, letting us chain computations without
  saving intermediate results in GPU memory. We are using it to rewrite:
  * attention (replaced by Flash Attention),
  * linear layers and their activation,
  * and finally Layernorm/Rmsnorm.
* [CUDA graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/): you may have heard that Python is slow
  and that, to limit overhead, C++/Rust should be the solution.
  That is true, but better than low overhead is no overhead at all, and that's what CUDA graphs provide!
  During a warmup step, they record every launched kernel and its parameters; then, with a single GPU instruction,
  we can replay the whole inference.
* [TorchDynamo](https://github.com/pytorch/torchdynamo/): this prototype from Meta helps us cope with dynamic
  behavior. It is described [here](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747);
  in a few words, during a warmup step it traces the model and provides an Fx graph (a static computation graph).
  We replace some operations of this graph with our kernels and recompile it, in Python.
  We do that for every dynamic behavior we expect to encounter. During inference, inputs are analyzed and the matching
  static graph is used. It's really an awesome project, check their repo to learn more.
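Below are minimal, self-contained sketches of the three ideas above. They are simplified illustrations with made-up
shapes and toy modules, not Kernl's actual kernels or integration code.

Kernel fusion with Triton: the addition and the activation are computed in a single kernel, so the intermediate result
never goes through GPU global memory.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # add and relu are fused: the intermediate sum is never written to global memory
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)


def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    fused_add_relu_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

CUDA graph capture and replay with PyTorch's built-in API: after a warmup, the whole sequence of kernel launches is
recorded once and replayed with a single call.

```python
import torch

model = torch.nn.Linear(1024, 1024).half().cuda().eval()
static_input = torch.randn(8, 1024, dtype=torch.float16, device="cuda")

# warmup on a side stream, as recommended before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# record one full forward pass into a graph
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# at inference time: copy new data into the static buffer and replay everything
static_input.copy_(torch.randn(8, 1024, dtype=torch.float16, device="cuda"))
graph.replay()
print(static_output[0, :4])  # refreshed in place by the replay
```

Graph rewriting hook: TorchDynamo hands the traced static Fx graph to a backend callback that may replace operations.
The prototype exposed this as `torchdynamo.optimize(...)`; current PyTorch ships the same hook as
`torch.compile(backend=...)`, used below.

```python
import torch


def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    # a real backend would rewrite some nodes (attention, linear + activation,
    # layernorm...) before returning a callable for this specialized graph
    print(gm.code)
    return gm.forward


model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
compiled = torch.compile(model, backend=inspect_backend)
compiled(torch.randn(2, 16))  # tracing + backend call happen on first use
```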
The script is based on the official [demo script](https://github.com/facebookincubator/AITemplate/tree/main/examples/03_bert).
The model does not support `attention_mask`, so we don't use it in the benchmarks.
Keep in mind that `attention_mask` adds operations on top of an already compute-bound kernel;
said otherwise, it would likely make the model slower.
Moreover, without `attention_mask`, batch inference is useless right now.
There is a multi-thread mode, which would bring much more overhead than batch mode (launching n threads times more kernels).
**TL;DR**: numbers for AITemplate have to be taken with a grain of salt.
An issue has been opened [here](https://github.com/facebookincubator/AITemplate/issues/46) on the repo:
```cite
@antinucleon

The current BERT example is only used for benchmarking purposes on fixed
length without mask.

We are currently working with CUTLASS team on a grouped Attention
optimization, which will remove paddings & mask for dynamic sequences. It
will appear in next CUTLASS & AIT release.
```
We choose to use the following options:
* accumulation in `FP32` (instead of default `FP16`):
FWIW, `Kernl` also has support for fast GELU, but it is disabled by default.
* `CUDA graphs` enabled: this technology removes kernel launch overhead, and it is good practice to use it when possible.
We do not use the [benchmark](https://github.com/facebookincubator/AITemplate/blob/main/examples/03_bert/benchmark_ait.py)
script provided in the `AITemplate` repo because:

* it reports GPU times through CUDA events, while we compare inference engines on wall-clock times, which better match our end-to-end use cases;
* we are not sure of the meaning of the reported times when multiple threads/CUDA streams are used
  (see [here](https://github.com/facebookincubator/AITemplate/issues/44)); they match neither the latency nor the throughput definition.
The C++ implementation of the benchmark function is [here](https://github.com/facebookincubator/AITemplate/blob/44026ba7e7f5376a80cf0f2b333a0f25c0eeda6c/static/csrc/model_container.cpp).
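For reference, "wall-clock time" in the sense used above can be approximated with a helper like the one below
(a hypothetical sketch, not the code used in our benchmarks):

```python
import time

import torch


def wall_clock_latency(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    # measures end-to-end latency as seen by the caller, CPU overhead included,
    # unlike CUDA-event timings which only cover the GPU portion of the work
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()  # wait for all queued GPU work before stopping the clock
    return (time.perf_counter() - start) / iters
```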