
Commit 209600b

feat: add benchmarks and e2e T5 notebook (#92)
* feat: first version of the notebook
* feat: improve notebook
* feat: update README.md
* fix: show error
* fix: linter
* feat: modify text
* feat: add inductor benchmark script
* feat: add info
* feat: add info on AITemplate benchmark
* fix: wording
* feat: complete README.md
* feat: lint
* fix: text
* fix: simple update of the notebook
* fix: update T5 notebook
* feat: add 364 seq len in benchmarks + move inductor on cuda 11.6
* feat: add deepspeed scores
* feat: update text
* feat: add notebook to create the graph + data as csv
* feat: image
* feat: image
* feat: text
* fix: engine name, graph
* fix: time
* fix: typo
* fix: time
* fix: README.md following PR review
* fix: PR review
* feat: improve graph
* fix: typo
1 parent 3c933cc commit 209600b


15 files changed: +1388 / -59 lines


README.md

Lines changed: 83 additions & 10 deletions
@@ -2,9 +2,22 @@
 
 ---
 
-Kernl is a collection of optimized kernels for `transformer` models to speed-up inference and soon training.
+**Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code,**
+**and is designed to be easily hackable.**
 
-## Install dependencies
+<p align="center">
+<img src="./resources/images/speedup.png">
+</p>
+
+*benchmarks run on a 3090 RTX*
+
+Kernl is the first OSS inference engine written in ~~CUDA C~~ [OpenAI Triton](https://openai.com/blog/triton/),
+a new language designed by OpenAI to make it easier to write GPU kernels.
+Each kernel is less than 200 lines of code, and is **easy to understand** and modify.
+
+🎅🎄 Training support coming soon... 🤯
+
+## Installation
 
 **IMPORTANT**: This package requires `pytorch` to be installed.
 Please install it first.
@@ -13,13 +26,11 @@ Please install it first.
 pip install torch -U --extra-index-url https://download.pytorch.org/whl/cu116
 git clone https://github.com/ELS-RD/kernl
 pip install -e .
-# or to enable all benchmarks
-pip install -e ".[benchmark]"
 ```
 
 This project requires `Python` >= 3.9.
 
-## Use
+## Getting started
 
 ```python
 from kernl.model_optimization import optimize_model
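The getting-started snippet is cut off by the diff view. As a hedged illustration of the "single line of code" usage described in the new tagline, a minimal sketch (the model name, input shapes and autocast usage below are assumptions, not taken from this commit):

```python
import torch
from transformers import AutoModel
from kernl.model_optimization import optimize_model

# Hypothetical usage sketch: model choice and input shapes are assumptions.
model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()
optimize_model(model)  # the single line that swaps supported ops for Triton kernels

inputs = {
    "input_ids": torch.randint(0, 1000, (1, 128), device="cuda"),
    "attention_mask": torch.ones(1, 128, dtype=torch.long, device="cuda"),
}
with torch.inference_mode(), torch.cuda.amp.autocast():
    outputs = model(**inputs)
```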
@@ -46,11 +57,13 @@ Note that the original model will raise an error if you try to use it after opti
 pytest
 ```
 
+There are over 2K benchmarks, and they take a while to run.
+
 Some rules on how `PyTest` works, in particular for benchmarks:
 
-- add `-k` to filter tests/benchmarks by their name like `pytest -k benchmark` to run only tests with `benchmark`
-  in their name
-- you can combine expressions in the filter: `pytest -k "benchmark and not bert"` if you want to run all benchmarks
+- add `-k` to filter tests/benchmarks by their name like `pytest -k benchmark` to run only tests with `benchmark`
+  in their name
+- you can combine expressions in the filter: `pytest -k "benchmark and not bert"` if you want to run all benchmarks
   except those related to BERT
 - to group and compare benchmark measures, use `pytest -k benchmark --benchmark-group-by ...`:
   - grouping by names: `pytest -k benchmark --benchmark-group-by fullfunc`
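The same filters can also be driven from Python through `pytest.main`, which can be handy when scripting benchmark runs. A small sketch reusing the selection strings above:

```python
import pytest

# Run only the non-BERT benchmarks and group measures by full function name.
# Equivalent to: pytest -k "benchmark and not bert" --benchmark-group-by fullfunc
pytest.main(["-k", "benchmark and not bert", "--benchmark-group-by", "fullfunc"])
```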
@@ -80,14 +93,74 @@ The easiest way to do this is to [convert the model to a fx graph](https://pytor
 print it with `utils.graph_report` or by printing the code `print(your_graph_module.code)`
 
 Then you can use [replace_pattern](https://pytorch.org/docs/stable/fx.html#torch.fx.replace_pattern) to replace the
-pattern in the graph. We have our own version of `replace_pattern` with some enhancements to work with modules for
+pattern in the graph. We have our own version of `replace_pattern` with some enhancements to work with modules, for
 example. You can find examples of that in the `optimizer` folder.
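For readers unfamiliar with `torch.fx`, a minimal, self-contained sketch of the trace-then-replace workflow described above (the toy module and pattern are illustrative, not Kernl's own replacement code):

```python
import torch
import torch.fx as fx


class Toy(torch.nn.Module):
    def forward(self, x, y):
        return torch.add(x, y).relu()


def pattern(x, y):
    return torch.add(x, y).relu()


def replacement(x, y):
    # stand-in for a call to a fused/optimized kernel
    return torch.nn.functional.relu(x + y)


gm = fx.symbolic_trace(Toy())                 # convert the model to an fx graph
print(gm.code)                                # inspect the generated code
fx.replace_pattern(gm, pattern, replacement)  # rewrite every matching subgraph
gm.recompile()                                # make the python code reflect the new graph
print(gm.code)
```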
 
-## Code formatting
+## Code Formatting
 
 We use `black` / `isort` / `flake8` to format the code. You can run them with:
 
 ```shell
 make source_code_format
 make source_code_check_format
 ```
+
+## Why?
+
+At Lefebvre Sarrut, we run several transformers in production, some of them latency sensitive (mostly search and recsys).
+
+We are using OnnxRuntime and TensorRT, and even created
+[transformer-deploy](https://github.com/ELS-RD/transformer-deploy), an OSS library, to share our knowledge with the community.
+Recently, we were testing generative language models and tried to accelerate them. It proved very difficult with traditional tools.
+
+Basically, and to make it short, it seems to us that Onnx (the main format to feed those tools) is an interesting
+format with support for a wide range of hardware.
+
+However, its ecosystem (and mostly its inference engines) has several limitations when dealing with new LLM architectures:
+
+* Export to Onnx is simple for models without control flow because we can rely on tracing,
+  but dynamic behaviors are harder to obtain (see https://ppwwyyxx.com/blog/2022/TorchScript-Tracing-vs-Scripting/ for
+  more info; it's about TorchScript, but the same applies to Onnx).
+* Unlike PyTorch, ONNX Runtime and TensorRT do not yet have native support for multi-GPU tasks enabling tensor parallelism.
+* TensorRT cannot manage two dynamic axes for transformer models with the same profile.
+  Because we usually want to provide inputs of different lengths, we need to build one model per batch size.
+* Very large models are common, and Onnx (as a protobuf file) has limitations on file size,
+  requiring weights to be stored outside of the model as a workaround.
+
+One very annoying thing is that new models are never accelerated out of the box: you need to wait for someone to write custom CUDA kernels for them.
+
+This is not to say these solutions are bad: one big strength of OnnxRuntime is its multi-hardware support,
+and TensorRT is really fast.
+
+So we wanted something as fast as TensorRT on Python / PyTorch; that's why we built Kernl.
+
+## How?
+
+The simple rule is that memory bandwidth is often the bottleneck in deep learning, so reducing memory accesses
+is usually a good strategy to accelerate inference.
+On short input sequences, the bottleneck is often the CPU overhead, which has to be removed too.
+Counterintuitively, to make things faster, you don't need to be faster at computation.
+
+We mostly leverage 3 technologies:
+
+* [OpenAI Triton](https://triton-lang.org/): a language to write GPU kernels like CUDA (not to be confused with
+  the Nvidia Triton inference server), but much more productive, at least for us; see the small fusion sketch after this list.
+  The improvement comes from fusing several ops, which lets us chain computations without
+  saving intermediate results in GPU memory. We are using it to rewrite:
+
+  * Attention (replaced by Flash Attention),
+  * linear layers and their activation,
+  * and finally Layernorm/Rmsnorm.
+
+* [CUDA graphs](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/): you may have heard that Python is slow,
+  blablabla, and that to limit overhead, C++/Rust should be the solution.
+  It is true, but better than low overhead is no overhead at all. That's CUDA graphs!
+  During a warmup step, it records every kernel launched and its parameters, and then, with a single GPU instruction,
+  we can replay the whole inference.
+
+* [TorchDynamo](https://github.com/pytorch/torchdynamo/): this prototype from Meta helps us cope with dynamic
+  behavior. It's described [here](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747);
+  in a few words, during a warmup step it traces the model and provides an Fx graph (a static computation graph).
+  We replace some operations of this graph with our kernels and recompile it in Python.
+  We do that for every dynamic behavior we expect to encounter. During inference, inputs are analyzed and the correct
+  static graph is used. It's really an awesome project, check their repo to learn more.
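To make the fusion point above concrete, a minimal Triton sketch (not one of Kernl's kernels): a bias add and a fast-GELU activation computed in a single kernel, so the intermediate tensor never round-trips through GPU memory. Names, shapes and block size are illustrative.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_bias_gelu(x_ptr, bias_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    row = tl.program_id(0)             # one program instance per row
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0)
    b = tl.load(bias_ptr + cols, mask=mask, other=0.0)
    y = x + b                          # bias add...
    y = y * tl.sigmoid(1.702 * y)      # ...fused with a fast GELU approximation
    tl.store(out_ptr + row * n_cols + cols, y, mask=mask)


x = torch.randn(32, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")
out = torch.empty_like(x)
fused_bias_gelu[(x.shape[0],)](x, bias, out, x.shape[1], BLOCK=1024)
```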

conftest.py

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ def pytest_sessionfinish(session: pytest.Session, exitstatus):
 
 def check_all_close(a: torch.Tensor, b: torch.Tensor, rtol=0, atol=1e-1) -> None:
     """
-    Check that all elements of tensors a and b are close.
+    Check that all elements of tensors a and b are within provided thresholds.
     """
     assert a.shape == b.shape, f"Shapes don't match: {a.shape} != {b.shape}"
     assert a.dtype == b.dtype, f"Dtypes don't match: {a.dtype} != {b.dtype}"
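The diff only shows the first assertions of `check_all_close`; a hedged sketch of how such a tolerance check can be completed (illustrative, not the file's actual remaining lines):

```python
import torch


def check_all_close(a: torch.Tensor, b: torch.Tensor, rtol=0, atol=1e-1) -> None:
    """Check that all elements of tensors a and b are within provided thresholds."""
    assert a.shape == b.shape, f"Shapes don't match: {a.shape} != {b.shape}"
    assert a.dtype == b.dtype, f"Dtypes don't match: {a.dtype} != {b.dtype}"
    # element-wise check: |a - b| <= atol + rtol * |b|
    if not torch.allclose(a.float(), b.float(), rtol=rtol, atol=atol):
        max_diff = (a.float() - b.float()).abs().max().item()
        raise AssertionError(f"Tensors differ, max absolute difference: {max_diff}")
```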

experimental/benchmarks/README.md

Lines changed: 159 additions & 20 deletions
@@ -1,7 +1,9 @@
 # Benchmark of Third-Party Libraries
 
 This directory contains benchmarks of third-party libraries. The benchmarks are
-written as simple Python scripts and have been run on an Nvidia 3090 RTX GPU.
+written as simple Python scripts and have been run on an Nvidia 3090 RTX GPU, 128 GB of RAM, and a 12-core Intel CPU.
+
+Measurements are done in wall-clock time, and output tensors are kept on GPU.
 
 ## [TensorRT](https://github.com/NVIDIA/TensorRT/)
 
@@ -14,20 +16,22 @@ We rely on the Docker image `nvcr.io/nvidia/tensorrt:22.09-py3`.
 | batch | sequence length | Time (s) |
 |-------|-----------------|----------|
 | 1 | 16 | 0.0010 |
-| 1 | 32 | 0.0012 |
+| 1 | 32 | 0.0010 |
 | 1 | 64 | 0.0011 |
 | 1 | 128 | 0.0013 |
 | 1 | 256 | 0.0016 |
+| 1 | 384 | 0.0026 |
 | 1 | 512 | 0.0026 |
-| 8 | 16 | 0.0012 |
+| 8 | 16 | 0.0011 |
 | 8 | 32 | 0.0015 |
-| 8 | 64 | 0.0020 |
+| 8 | 64 | 0.0019 |
 | 8 | 128 | 0.0036 |
 | 8 | 256 | 0.0064 |
-| 8 | 512 | 0.0142 |
+| 8 | 384 | 0.0139 |
+| 8 | 512 | 0.0139 |
 | 32 | 16 | 0.0020 |
 | 32 | 32 | 0.0031 |
-| 32 | 64 | 0.0055 |
+| 32 | 64 | 0.0054 |
 | 32 | 128 | 0.0103 |
 | 32 | 256 | 0.0210 |
 
@@ -37,20 +41,21 @@
 docker run --rm -it --gpus all -v $(pwd):/work nvcr.io/nvidia/tensorrt:22.09-py3
 cd /work
 pip install transformers torch -U --extra-index-url https://download.pytorch.org/whl/cu116
-python experimental/benchmarks/tensorrt.py
+python experimental/benchmarks/tensorrt_.py
 ```
 
 ### Notes
 
-As `TensorRT` complains where there are 2 dynamic axes, we build one model per batch size.
+As `TensorRT` (its `Myelin` code generator) complains when there are 2 dynamic axes, we build one model per batch size.
 Only the sequence length axis is dynamic.
 
 Model building takes time, around 30 minutes on a beefy machine.
 
 Most of the code has been taken from [transformer-deploy](https://github.com/ELS-RD/transformer-deploy).
 
-It is important to note that `TensorRT` is a black box and we cannot disable fast GELU, fp16 accumulation,
-or whatever optimization they are using.
+It is important to note that `TensorRT` is a black box and we are not aware of simple ways to disable fast GELU,
+fp16 accumulation, or whatever other optimizations are leveraged (we know they use many of them because of the
+precision issues we experienced in prod with this tool).
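For reference, this constraint maps to TensorRT optimization profiles: one engine per batch size, with only the sequence-length axis left dynamic. A hedged sketch with the TensorRT Python API (input names and shape bounds are assumptions, not taken from the benchmark script):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
# ... parse the ONNX model into `network` with trt.OnnxParser ...
config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# batch axis fixed to 8 for this engine; only the sequence length varies
profile.set_shape("input_ids", min=(8, 16), opt=(8, 256), max=(8, 512))
profile.set_shape("attention_mask", min=(8, 16), opt=(8, 256), max=(8, 512))
config.add_optimization_profile(profile)
serialized_engine = builder.build_serialized_network(network, config)
```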
 
 ## [AITemplate](https://github.com/facebookincubator/AITemplate/)
 
@@ -72,26 +77,28 @@ Main branch commit used:
 | 1 | 64 | 0.0013 |
 | 1 | 128 | 0.0015 |
 | 1 | 256 | 0.0020 |
+| 1 | 384 | 0.0022 |
 | 1 | 512 | 0.0031 |
 | 8 | 16 | 0.0013 |
 | 8 | 32 | 0.0017 |
 | 8 | 64 | 0.0026 |
-| 8 | 128 | 0.0044 |
+| 8 | 128 | 0.0043 |
 | 8 | 256 | 0.0076 |
+| 8 | 384 | 0.0115 |
 | 8 | 512 | 0.0149 |
-| 32 | 16 | 0.0027 |
+| 32 | 16 | 0.0026 |
 | 32 | 32 | 0.0043 |
 | 32 | 64 | 0.0073 |
-| 32 | 128 | 0.0131 |
-| 32 | 256 | 0.0249 |
+| 32 | 128 | 0.0127 |
+| 32 | 256 | 0.0242 |
 
 ### Running the benchmark
 
 ```shell
 # in AITemplate root directory
 ./docker/build.sh cuda
 # in kernl root directory
-docker run --rm -it --gpus all -v $(pwd):/work -v $(pwd):/work ait
+docker run --rm -it --gpus all -v $(pwd):/work -v $(pwd):/work ait
 cp /work/experimental/benchmarks/aitemplate.py /AITemplate/examples/03_bert/demo_new.py
 cd AITemplate/
 python3 ./examples/03_bert/demo_new.py
@@ -103,6 +110,26 @@ cat measures.txt
 
 The script is based on the official [demo script](https://github.com/facebookincubator/AITemplate/tree/main/examples/03_bert).
 
+The model does not support `attention_mask`, so we don't use it in the benchmarks.
+It is important to keep in mind that the `attention mask` adds operations on top of an already compute-bound kernel.
+In other words, it would likely make it slower.
+Moreover, without `attention mask`, batch inference is useless right now.
+There is a multi-threaded mode, which would incur much more overhead than batch mode (launching n threads times more kernels).
+
+**TL;DR**: numbers for AITemplate have to be taken with a grain of salt.
+
+An issue has been opened [here](https://github.com/facebookincubator/AITemplate/issues/46) on the repo:
+
+```cite
+@antinucleon
+The current BERT example is only used for benchmarking purposes on fixed
+length without mask.
+
+We are currently working with CUTLASS team on a grouped Attention
+optimization, which will remove paddings & mask for dynamic sequences. It
+will appear in next CUTLASS & AIT release.
+```
+
 We choose to use the following options:
 
 * accumulation in `FP32` (instead of default `FP16`):
@@ -112,10 +139,122 @@ We choose to use the following options:
 FWIW, `Kernl` also has support for fast GELU but it is disabled by default.
 * `CUDA graphs` enabled: this technology removes kernel launching overhead, and it is good practice to use it when possible.
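As a reminder of what enabling `CUDA graphs` means in PyTorch terms, a minimal capture-and-replay sketch (the tiny model and shapes are illustrative):

```python
import torch

model = torch.nn.Linear(128, 128).cuda().eval()
static_input = torch.randn(8, 128, device="cuda")

# warm up on a side stream before capture, as recommended by the PyTorch docs
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# at inference time: copy new data into the captured buffer and replay the whole graph
static_input.copy_(torch.randn(8, 128, device="cuda"))
graph.replay()
print(static_output.shape)
```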
 
-We don't use the [benchmark](https://github.com/facebookincubator/AITemplate/blob/main/examples/03_bert/benchmark_ait.py)
-script provided in the `AITemplate` repo because it reports GPU times through CUDA events and we want to measure the wall
-clock time which better match our end to end case.
-
-Differences are mostly on short input shapes where CPU overhead dominates.
+We do not use the [benchmark](https://github.com/facebookincubator/AITemplate/blob/main/examples/03_bert/benchmark_ait.py)
+script provided in the `AITemplate` repo because:
+* it reports GPU times through CUDA events, and we compare inference engines on wall-clock times, which better matches our end-to-end use cases;
+* we are not sure of the meaning of the reported times when multiple threads/CUDA streams are used
+  (see [here](https://github.com/facebookincubator/AITemplate/issues/44)); it doesn't match the definition of either latency or throughput.
 
 The C++ implementation of the benchmark function is [here](https://github.com/facebookincubator/AITemplate/blob/44026ba7e7f5376a80cf0f2b333a0f25c0eeda6c/static/csrc/model_container.cpp).
+
+
+## [TorchDynamo + inductor](https://github.com/pytorch/torchdynamo)
+
+### Version
+
+* Triton: https://github.com/openai/triton@af76c989eb4799b015f8b288ccd8421558772e56#subdirectory=python
+* PyTorch (includes TorchDynamo): 1.14.0.dev20221015+cu116
+
+### Results
+
+| batch | sequence length | Time (s) |
+|-------|-----------------|----------|
+| 1 | 16 | 0.0018 |
+| 1 | 32 | 0.0020 |
+| 1 | 64 | 0.0020 |
+| 1 | 128 | 0.0025 |
+| 1 | 256 | 0.0030 |
+| 1 | 384 | 0.0036 |
+| 1 | 512 | 0.0048 |
+| 8 | 16 | 0.0023 |
+| 8 | 32 | 0.0027 |
+| 8 | 64 | 0.0039 |
+| 8 | 128 | 0.0065 |
+| 8 | 256 | 0.0117 |
+| 8 | 384 | 0.0156 |
+| 8 | 512 | 0.0212 |
+| 32 | 16 | 0.0039 |
+| 32 | 32 | 0.0062 |
+| 32 | 64 | 0.0108 |
+| 32 | 128 | 0.0177 |
+| 32 | 256 | 0.0357 |
+
+### Running the benchmark
+
+```shell
+# reuse AITemplate docker image based on nvidia/cuda:11.6.2-devel-ubuntu20.04
+docker run --rm -it --gpus all -v $(pwd):/work -v $(pwd):/work ait
+cd work/
+apt install git
+pip3 install --pre torch==1.14.0.dev20221015+cu116 --extra-index-url https://download.pytorch.org/whl/nightly/cu116 -U
+pip3 install git+https://github.com/openai/triton@af76c989eb4799b015f8b288ccd8421558772e56#subdirectory=python
+python experimental/benchmarks/inductor.py
+```
+
+### Notes
+
+`TorchInductor` is still at the prototype stage; results may differ with the final version.
+We are using the version included in the PyTorch nightly for this benchmark.
+This project depends on an older version that does not require the nightly build, as we only need `TorchDynamo`.
+Project info is at https://github.com/pytorch/torchdynamo, even though the code is no longer updated in that repo.
+
+We tried several optimizations that are disabled by default, but none of them worked:
+
+* `config.aggressive_fusion = True`: significantly slower when enabled on our GPU.
+* `config.inplace_buffers = True`: crash, see https://github.com/pytorch/torchdynamo/issues/823
+* `config.triton.mm = "triton"` (same for `"autotune"`): crash, even when trying with `config.triton.autotune = False`
+
+By default, `CUDA graphs` is enabled.
+
+The last one is important and should bring some speedup once it works.
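For context, the `config.*` switches above live in TorchInductor's config module of the standalone prototype. A hedged sketch of how the benchmark-style invocation and such toggles looked with the `torchdynamo`/`torchinductor` packages of that era (module paths changed once the project was merged into PyTorch, so treat the imports and option names as assumptions):

```python
import torch
import torchdynamo                         # standalone prototype package of that era
import torchinductor.config as inductor_config

# toggles discussed above; they are off by default
inductor_config.aggressive_fusion = False
inductor_config.triton.cudagraphs = True   # CUDA graphs, enabled by default

model = torch.nn.Linear(128, 128).cuda().eval()
x = torch.randn(8, 128, device="cuda")


@torchdynamo.optimize("inductor")
def run(inp):
    return model(inp)


out = run(x)
```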
+
+## [Deepspeed](https://github.com/microsoft/DeepSpeed)
+
+### Version
+
+0.7.4
+
+### Results
+
+| batch | sequence length | Time (s) |
+|-------|-----------------|----------|
+| 1 | 16 | 0.0009 |
+| 1 | 32 | 0.0008 |
+| 1 | 64 | 0.0009 |
+| 1 | 128 | 0.0012 |
+| 1 | 256 | 0.0019 |
+| 1 | 384 | 0.0023 |
+| 1 | 512 | 0.0032 |
+| 8 | 16 | 0.0011 |
+| 8 | 32 | 0.0016 |
+| 8 | 64 | 0.0025 |
+| 8 | 128 | 0.0051 |
+| 8 | 256 | 0.0106 |
+| 8 | 384 | 0.0161 |
+| 8 | 512 | 0.0219 |
+| 32 | 16 | 0.0025 |
+| 32 | 32 | 0.0050 |
+| 32 | 64 | 0.0097 |
+| 32 | 128 | 0.0176 |
+| 32 | 256 | 0.0374 |
+
+### Running the benchmark
+
+```shell
+# reuse AITemplate docker image based on nvidia/cuda:11.6.2-devel-ubuntu20.04
+docker run --rm -it --gpus all -v $(pwd):/work -v $(pwd):/work ait
+pip install deepspeed transformers
+deepspeed --num_gpus 1 experimental/benchmarks/deepspeed_.py --deepspeed
+```
+
+### Notes
+
+The benchmark script is built on top of the
+[one](https://github.com/microsoft/DeepSpeed/blob/master/benchmarks/inference/bert-bench.py) provided in the
+`deepspeed` repo.
+
+We rebuilt a model for each shape to leverage CUDA Graphs and get the best possible performance.
+In a real scenario, we would need something on top of it to handle the multiple graphs.
+
+The model got its weights completely converted to fp16 (instead of doing mixed precision), as it is done that way in the
+benchmark script.
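For reference, the core of such a setup is `deepspeed.init_inference`; a hedged sketch with fully fp16 weights as described above (the model choice and arguments are assumptions, not the benchmark script itself):

```python
import deepspeed
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased").eval()
# convert the whole model to fp16 and inject DeepSpeed's fused inference kernels
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

input_ids = torch.randint(0, 1000, (8, 128), device="cuda")
attention_mask = torch.ones(8, 128, dtype=torch.long, device="cuda")
with torch.inference_mode():
    out = engine(input_ids=input_ids, attention_mask=attention_mask)
```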
