In case you're interested in optimizing the memory usage, latency or throughput of a PyTorch model served with TorchServe, this is the guide for you.
We have also created a quick checklist of extra things to try beyond what is covered on this page. You can find the checklist here.
There are many tricks to optimize PyTorch models for production, including but not limited to distillation, quantization, fusion, pruning, and setting environment variables. We encourage you to benchmark and see what works best for you.
In general it's hard to optimize models, and the easiest approach can be exporting to a runtime like ORT, TensorRT, IPEX or FasterTransformer. We have many examples of how to integrate these runtimes on the TorchServe GitHub page. If your favorite runtime is not supported, please feel free to open a PR.
Starting with PyTorch 2.0, torch.compile provides an out-of-the-box speedup (~1.8x) for a large number of models. You can refer to this dashboard which tracks this on a nightly basis.
Models which have been fully optimized with torch.compile show performance improvements of up to 10x.
When using smaller batch sizes, using mode="reduce-overhead" with torch.compile can give improved performance as it makes use of CUDA graphs.
You can find all the examples of torch.compile with TorchServe here.
Details regarding torch.compile GenAI examples can be found in this link.
TorchServe has native support for ONNX models which can be loaded via ORT for both accelerated CPU and GPU inference. ONNX operates a bit differently from a regular PyTorch model in that when you're running the conversion you need to explicitly set and name your input and output dimensions. See this example.
At a high level, TorchServe allows you to:
- Package serialized ONNX weights: torch-model-archiver --serialized-file model.onnx ...
- Load those weights from base_handler.py using ort_session = ort.InferenceSession(self.model_pt_path, providers=providers, sess_options=sess_options), which supports reasonable defaults for both CPU and GPU inference
- Define custom pre- and post-processing functions, via a custom handler, to pass in data in the format your ONNX model expects
To use ONNX with GPU on TorchServe Docker, we need to build an image with NVIDIA CUDA runtime as the base image as shown here
TorchServe also supports models optimized via TensorRT. To leverage the TensorRT runtime you can convert your model by following these instructions. Once you're done, you'll have serialized weights which you can load with torch.jit.load().
After a conversion there is no difference in how PyTorch treats a TorchScript model vs a TensorRT model.
Better Transformer from PyTorch implements a backwards-compatible fast path of torch.nn.TransformerEncoder for Transformer Encoder inference and does not require model authors to modify their models. BetterTransformer improvements can exceed 2x in speedup and throughput for many common execution scenarios.
You can find more information on Better Transformer here and here.
The main settings you should vary in config.properties if you're trying to improve the performance of TorchServe are batch_size and batch_delay. A larger batch size means higher throughput at the cost of higher latency.
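The interaction between batch_size and batch_delay can be illustrated with a rough back-of-the-envelope model. This is only a sketch under idealized assumptions, not a description of TorchServe's scheduler: the function name batching_estimate and the parameter inference_ms (time for one forward pass over a full batch) are hypothetical, and a single worker is assumed.

```python
def batching_estimate(batch_size, batch_delay_ms, inference_ms):
    """Idealized estimate of the batching trade-off for a single worker.

    Assumptions (hypothetical, for illustration only):
    - one forward pass over a full batch takes `inference_ms`
    - in the worst case the first request of a batch waits the whole
      `batch_delay_ms` before the batch is dispatched
    - under saturation, batches fill immediately
    """
    if batch_size < 1 or batch_delay_ms < 0 or inference_ms <= 0:
        raise ValueError("invalid batching parameters")
    # Worst-case latency seen by the first request in a batch.
    worst_case_latency_ms = batch_delay_ms + inference_ms
    # Upper bound on requests per second when every batch is full.
    peak_throughput_rps = batch_size * 1000.0 / inference_ms
    return worst_case_latency_ms, peak_throughput_rps


# No batching: 10 ms per request.
print(batching_estimate(1, 0, 10))    # (10, 100.0)
# Batches of 8 that take 20 ms, waiting up to 50 ms to fill.
print(batching_estimate(8, 50, 20))   # (70, 400.0)
```

In this toy model the batched configuration quadruples peak throughput while the worst-case latency grows from 10 ms to 70 ms. Real numbers depend on the model and hardware, so benchmark before settling on values.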
The second most important settings are the number of workers and the number of GPUs, which will have a dramatic impact on CPU and GPU performance.
TorchServe exposes configurations that allow the user to configure the number of worker threads on CPU and GPUs. There is an important config property that can speed up the server depending on the workload. Note: the following property has bigger impact under heavy workloads.
If working with TorchServe on a CPU you can improve performance by setting the following in your config.properties:
cpu_launcher_enable=true
cpu_launcher_args=--use_logical_core
These settings improve performance significantly through launcher core pinning. The theory behind this improvement is discussed in this blog which can be quickly summarized as:
- In a hyperthreading enabled system, avoid logical cores by setting thread affinity to physical cores only via core pinning.
- In a multi-socket system with NUMA, avoid cross-socket remote memory access by setting thread affinity to a specific socket via core pinning.
There is a config property called number_of_gpu that tells the server to use a specific number of GPUs per model. In cases where we register multiple models with the server, this will apply to all the models registered. If this is set to a low value (e.g. 0 or 1), it will result in under-utilization of GPUs. Conversely, setting it to a high value (>= max GPUs available on the system) results in as many workers getting spawned per model. Clearly, this will result in unnecessary contention for GPUs and can result in sub-optimal scheduling of threads to GPUs.
ValueToSet = (Number of Hardware GPUs) / (Number of Unique Models)
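As a small worked example of the formula above, the sketch below computes a candidate value for number_of_gpu. The function name is hypothetical, and rounding down when the division is not exact is an assumption made here; the formula itself does not say how to handle remainders.

```python
def suggested_number_of_gpu(hardware_gpus, unique_models):
    """Apply ValueToSet = hardware GPUs / unique models.

    Rounding down (floor division) is an assumption of this sketch;
    the formula itself does not specify how to treat remainders.
    """
    if hardware_gpus < 0:
        raise ValueError("hardware_gpus must be non-negative")
    if unique_models < 1:
        raise ValueError("unique_models must be at least 1")
    return hardware_gpus // unique_models


print(suggested_number_of_gpu(8, 4))  # 2
print(suggested_number_of_gpu(8, 3))  # 2 (remainder dropped)
```

A result of 0 means there are more models than GPUs. The formula gives no guidance for that case, so treat it as a signal to revisit the deployment rather than as a value to set.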
While NVIDIA GPUs allow multiple processes to run CUDA kernels, this comes with its own drawbacks, namely:
- The execution of the kernels is generally serialized
- Each process creates its own CUDA context, which occupies additional GPU memory
To get around these drawbacks, you can utilize the NVIDIA Multi-Process Service (MPS) to increase performance. You can find more information on how to utilize NVIDIA MPS with TorchServe here.
The NVIDIA Data Loading Library (DALI) is a library for data loading and pre-processing to accelerate deep learning applications. It can be used as a portable drop-in replacement for built in data loaders and data iterators in popular deep learning frameworks. DALI provides a collection of highly optimized building blocks for loading and processing image, video and audio data. You can find an example of DALI optimization integration with TorchServe here.
To make it easier to compare various model and TorchServe configurations, we've added a few helper scripts that output performance data such as p50, p90, and p99 latency in a clean report here. They mostly require you to specify some configuration via either JSON or YAML. You can find more information on TorchServe benchmarking here.
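For readers unfamiliar with the p50, p90, and p99 figures mentioned above, here is a minimal sketch of how such percentiles can be computed from raw latency samples. It uses the nearest-rank method, which is an assumption of this sketch; the benchmark scripts may use a different percentile definition, and the function names are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_report(latencies_ms):
    """Summarize latency samples as p50, p90 and p99."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}


# 100 synthetic samples: 1 ms, 2 ms, ..., 100 ms.
print(latency_report(list(range(1, 101))))
# {'p50': 50, 'p90': 90, 'p99': 99}
```

The p99 value is the one most sensitive to occasional slow requests, which is why it is worth comparing across configurations alongside the median.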
TorchServe has native support for the PyTorch profiler which will help you find performance bottlenecks in your code.
If you created a custom handle or initialize method overriding the BaseHandler, you must define the self.manifest attribute to be able to run _infer_with_profiler.
export ENABLE_TORCH_PROFILER=TRUE
Visit this link to learn more about the PyTorch profiler.
For some insight into fine-tuning TorchServe performance in an application, take a look at this article. The case study shown there uses the Animated Drawings App from Meta to improve TorchServe performance.