Releases: pytorch/serve
TorchServe v0.12.0 Release Notes
Highlights Include
- GenAI updates
  - No code LLM deployments with TorchServe + vLLM & TensorRT-LLM using ts.llm_launcher script
  - OpenAI API support for TorchServe + vLLM
  - Integration of TensorRT-LLM engine
  - Stateful Inference on AWS SageMaker (see blog)
- Support for linux-aarch64
  - CI & nightly regression added
  - Publish docker & KServe images
- PyTorch updates
  - Support for PyTorch 2.4
  - Deprecation of TorchText
PyTorch Updates
- upgrade to PyTorch 2.4 & deprecation of TorchText by @agunapal in #3289
- Resnet152 batch inference torch.compile example by @andrius-meta in #3259
- squeezenet torch.compile example by @wdvr in #3277
GenAI
- Implement stateful inference session timeout by @namannandan in #3263
- Use Case: Enhancing LLM Serving with Torch Compiled RAG on AWS Graviton by @agunapal in #3276
- Feature add openai api for vllm integration by @mreso in #3287
- Set vllm multiproc method to spawn by @mreso in #3310
- TRT LLM Integration with LORA by @agunapal in #3305
- Bump vllm from 0.5.0 to 0.5.5 in /examples/large_models/vllm by @dependabot in #3321
- Use startup time in async worker thread instead of worker timeout by @mreso in #3315
- Rename vllm dockerfile by @mreso in #3330
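With OpenAI API support in the vLLM integration, a client can send an OpenAI-style completions payload to a TorchServe endpoint. The sketch below only builds such a request with the standard library. The host, port, URL path, and the model name "model" are assumptions for illustration and are not confirmed by these notes; it also assumes token authorization has been disabled or is handled separately.

```python
import json
import urllib.request

# Assumed endpoint; adjust host, port, and path to match your deployment.
ASSUMED_URL = "http://localhost:8080/predictions/model/1.0/v1/completions"


def build_completion_request(prompt, max_tokens=50, url=ASSUMED_URL):
    """Build (but do not send) an OpenAI-style completions request."""
    payload = {
        "model": "model",  # hypothetical model name
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request("Hello, my name is")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send it to a running server; the sketch stops short of that so it can be run without one.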
Support for linux-aarch64
- Adding Graviton Regression test CI by @udaij12 in #3273
- adding graviton docker image release by @udaij12 in #3313
- Fixing kserve nightly for arm64 by @udaij12 in #3319
- Docker aarch by @udaij12 in #3323
Documentation
- Security doc update by @udaij12 in #3256
- Remove compile note for hpu by @RafLit in #3271
- doc update of the rag usecase blog by @agunapal in #3280
- Add some hints for java devs by @mreso in #3282
- add TorchServe with Intel® Extension for PyTorch* guidance by @jingxu10 in #3285
- Update quickstart llm docker in serve/readme; added ts.llm_launcher example by @mreso in #3300
- typo fixes in HF Transformers example by @EFord36 in #3307
- docs: update WaveGlow links by @emmanuel-ferdman in #3317
- Fix typo: "a asynchronous" -> "an asynchronous" by @tadayosi in #3314
- Fix typo: vesion -> version, succsesfully -> successfully by @tadayosi in #3322
Improvements and Bug Fixing
- Bump torchserve from 0.10.0 to 0.11.0 in /examples/large_models/ipex_llm_int8 by @dependabot in #3257
- add JDK17 compatible groovy dependency for frontend log4j ScriptFilter by @lanxih in #3235
- Leave response and sendError when request is canceled by @slashvar in #3267
- add kserve gpu tests by @rohithkrn in #3283
- Configurable startup time by @Isalia20 in #3262
- Add REPO_URL in Dockerfile to allow docker builds from contributor repos by @mreso in #3291
- Fix docker repo url in github action workflow by @mreso in #3293
- Fix docker ci repo_url by @mreso in #3294
- Fix/docker repo url3 by @mreso in #3297
- Remove debug step in docker ci by @mreso in #3298
- Fix wild card in extra files by @mreso in #3304
- Example to demonstrate building a custom endpoint plugin by @namannandan in #3306
- Benchmark fix by @udaij12 in #3316
- Update TS version to 0.12.0 by @agunapal in #3318
- Clear up neuron cache by @chen3933 in #3326
- Fix Dockerfile fore renamed forks by @mreso in #3327
- Load all models including targz by @m10an in #3329
- fix for snapshot variables missing/null by @udaij12 in #3328
New Contributors
- @andrius-meta made their first contribution in #3259
- @slashvar made their first contribution in #3267
- @RafLit made their first contribution in #3271
- @wdvr made their first contribution in #3277
- @Isalia20 made their first contribution in #3262
- @jingxu10 made their first contribution in #3285
- @EFord36 made their first contribution in #3307
- @emmanuel-ferdman made their first contribution in #3317
- @tadayosi made their first contribution in #3314
- @m10an made their first contribution in #3329
Platform Support
Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe requires Python >= 3.8 and JDK17.
GPU Support Matrix
TorchServe version | PyTorch version | Python | Stable CUDA | Experimental CUDA |
---|---|---|---|---|
0.12.0 | 2.4.0 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.11.1 | 2.3.0 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.11.0 | 2.3.0 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.10.0 | 2.2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.9.0 | 2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.8.0 | 2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84 |
0.7.0 | 1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96 |
Inferentia2 Support Matrix
TorchServe version | PyTorch version | Python | Neuron SDK |
---|---|---|---|
0.12.0 | 2.1 | >=3.8, <=3.11 | 2.18.2+ |
0.11.1 | 2.1 | >=3.8, <=3.11 | 2.18.2+ |
0.11.0 | 2.1 | >=3.8, <=3.11 | 2.18.2+ |
0.10.0 | 1.13 | >=3.8, <=3.11 | 2.16+ |
0.9.0 | 1.13 | >=3.8, <=3.11 | 2.13.2+ |
TorchServe v0.11.1 Release Notes
This is the release of TorchServe v0.11.1.
Highlights Include
- Security Updates
  - Token Authorization: TorchServe enforces token authorization by default, which requires the correct token to be provided when calling an HTTP/S or gRPC API. This security feature addresses the concern of unauthorized API calls, for example when an unauthorized user tries to access a running TorchServe instance. By default, the feature is enabled and creates a key file with the appropriate tokens to be used for API calls. Users can disable the feature so that API calls no longer require a token. For more details, refer to the token authorization documentation: https://github.com/pytorch/serve/blob/master/docs/token_authorization_api.md
  - Model API Control: TorchServe disables the ability to register and delete models using HTTP/S or gRPC API calls by default once TorchServe is running. This security feature addresses the concern of unintended registration and deletion of models after startup, for example when a user uploads malicious code to the model server in the form of a model, or deletes a model that is in use. Model API control can be enabled to allow users to register and delete models using the TorchServe model load and delete APIs. For more details, refer to the model API control documentation: https://github.com/pytorch/serve/blob/master/docs/model_api_control.md
- PyTorch 2.x updates
  - Standardized torch.compile configuration
  - Added examples for tensorrt & hpu backends
- GenAI updates
  - Support continuous batching in sequence batch streaming
  - Asynchronous backend worker communication for continuous batching
  - No code LLM deployment
- Support for Intel GPUs
Security Updates
- Adding model-control-mode by @udaij12 in #3165
- Enable Token Authorization by default by @udaij12 in #3163
- Updating night CIs to account for model control and token auth by @udaij12 in #3188
- Adding token auth and model api to workflow and https by @udaij12 in #3234
- Enable token authorization and model control for gRPC by @namannandan in #3238
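With token authorization enabled, each API call has to carry the matching token. The sketch below assumes the token is sent as a bearer token in the Authorization header and that the key file is JSON with one entry per API, each holding a "key" field. The file layout, helper names, and URL are assumptions for illustration and should be checked against the token authorization documentation.

```python
import json
import urllib.request


def load_token(key_file_text, api="inference"):
    """Extract the token for one API from key-file JSON text (layout assumed)."""
    return json.loads(key_file_text)[api]["key"]


def authorized_request(url, token, data=None):
    """Build (but do not send) a request carrying the token as a bearer token."""
    return urllib.request.Request(
        url, data=data, headers={"Authorization": f"Bearer {token}"}
    )


# Hypothetical key-file content, for illustration only.
sample_key_file = (
    '{"inference": {"key": "example-token"}, "management": {"key": "other-token"}}'
)
token = load_token(sample_key_file)
# Hypothetical model name in the URL.
request = authorized_request("http://localhost:8080/predictions/my_model", token)
print(request.get_header("Authorization"))
```

Keeping the token lookup separate from request construction makes it easy to use a different token for the inference and management APIs.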
PyTorch 2.x Updates
- torch compile config standardization update by @agunapal in #3166
- Token Authorization fixes by @udaij12 in #3192
- Changing mar file for Bert torch compile by @udaij12 in #3175
- Fixing torch compile benchmark by @udaij12 in #3179
- Add support for hpu_backend and Resnet50 compile example by @wozna in #3182
- Update image_classifier/densenet-161 to include torch.compile by @lzcemma in #3200
- TensorRT example with torch.compile by @agunapal in #3203
- Update documentation for vgg16 to use torch.compile by @ijkilchenko in #3211
- BERT with torch.compile by @agunapal in #3201
- T5 Translation with torch.compile & TensorRT backend by @agunapal in #3223
- Adjust Resnet50 hpu example by @wozna in #3219
GenAI
- Support continuous batching in sequence batch streaming case by @lxning in #3160
- GPT-FAST-MIXTRAL-MOE integration by @alex-kharlamov in #3151
- clean a jobGroup immediately when it finished by @lxning in #3222
- Asynchronous worker communication and vllm integration by @mreso in #3146
- Add single command LLM deployment by @mreso in #3209
- TensorRT-LLM Engine integration by @agunapal in #3228
- Adds torch.compile documentation to alexnet example readme by @crmdias in #3227
Support for Intel GPUs
- Torchserve support for Intel GPUs by @krish-navulla in #3132
- Torchserve Metrics support for Intel GPUs enabled by @krish-navulla in #3141
Documentation
- Update supported TS version in security documentation by @namannandan in #3144
- Update performance documentation by @agunapal in #3159
- model archiver example to multi-line by @GeeCastro in #3155
- fix broken llm deployment link by @msaroufim in #3214
- Security documentation update by @udaij12 in #3183
Improvements and Bug Fixing
- workaround for compile example failure by @agunapal in #3190
- Fix Inf2 benchmark by @namannandan in #3177
- Make a copy of the torchtext utils to remove dependency by @agunapal in #3076
- Pinning setuptools version by @udaij12 in #3152
- Fixing Regression test CI GPU and CPU by @udaij12 in #3147
- Fixing docker CI by @udaij12 in #3194
- Replace pkg_resources.packaging by @udaij12 in #3187
- Kserve ci fix by @udaij12 in #3196
- Benchmark numpy fix by @udaij12 in #3197
- Add workflow dispatch trigger to nightly builds by @agunapal in #3250
- Bug fix for kserve build issue and fixing nightly tests by @agunapal in #3251
- Remove vllm dependency to not bloat docker image size by @agunapal in #3245
- Kserve fix ray & setuptools dependency issue by @udaij12 in #3205
- clean a jobGroup immediately when it finished by @lxning in #3222
- Updating examples for security tags by @udaij12 in #3224
- Fix/llm launcher disable token by @mreso in #3230
- Example update by @udaij12 in #3231
- Updating docker cuda and github branch by @udaij12 in #3233
- Reduce severity of xpu-smi logging by @namannandan in #3239
- Upgrade kserve dependencies by @agunapal in #3246
- Fix/vllm dependency by @mreso in #3249
- Copy remote branch entrypoint to compile and production image stages by @lanxih in #3213
- Fix Condition Checking for Intel GPUs Enabling by @Kanya-Mo in #3220
New Contributors
- @alex-kharlamov made their first contribution in #3151
- @lzcemma made their first contribution in #3200
- @wozna made their first contribution in #3182
- @krish-navulla made their first contribution in #3132
- @ijkilchenko made their first contribution in #3211
- @lanxih made their first contribution in #3213
- @Kanya-Mo made their first contribution in #3220
- @crmdias made their first contribution in #3227
Platform Support
Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe requires Python >= 3.8 and JDK17.
GPU Support Matrix
TorchServe version | PyTorch version | Python | Stable CUDA | Experimental CUDA |
---|---|---|---|---|
0.11.1 | 2.3.0 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.11.0 | 2.3.0 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.10.0 | 2.2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.9.0 | 2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.8.0 | 2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84 |
0.7.0 | 1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96 |
Inferentia2 Support Matrix
TorchServe version | PyTorch version | Python | Neuron SDK |
---|---|---|---|
0.11.1 | 2.1 | >=3.8, <=3.11 | 2.18.2+ |
0.11.0 | 2.1 | >=3.8, <=3.11 | 2.18.2+ |
0.10.0 | 1.13 | >=3.8, <=3.11 | 2.16+ |
0.9.0 | 1.13 | >=3.8, <=3.11 | 2.13.2+ |
TorchServe v0.11.0 Release Notes
This is the release of TorchServe v0.11.0.
Highlights Include
- GenAI inference optimizations showcasing
  - torch.compile with OpenVINO backend for Stable Diffusion
  - Intel IPEX for Llama
- Experimental support for Apple MPS and linux-aarch64
- Security bug fixing
GenAI
- Upgraded Llama2 examples to Llama3
- Examples for LoRA and Mistral #3077 @lxning
- IPEX LLM serving example with Intel AMX #3068 @bbhattar
- Integration of Intel OpenVINO with TorchServe using torch.compile. Example showcase of the openvino torch.compile backend with Stable Diffusion #3116 @suryasidd
- Enabling retrieval of guaranteed sequential order of input sequences with low latency for stateful inference via HTTP, extending this previously gRPC-only feature #3142 @lxning
Linux aarch64 Support:
TorchServe adds support for linux-aarch64 and shows an example working on AWS Graviton. This provides users with a new platform alternative for serving models on CPU.
Apple Silicon Support:
- TorchServe now supports the MPS backend on Apple Silicon #3048 @udaij12 @agunapal
- Added TorchServe quickstart chatbot example #3003 @agunapal
XGBoost Support:
With the XGBoost Classifier example, we show how to deploy any pickled model with TorchServe.
Security
The ability to bypass allowed_urls using relative paths has been fixed by checking for relative paths before the model archive is copied to the model store directory. Also, the default gRPC inference and management addresses are now set to localhost (127.0.0.1) to reduce the scope of default access to gRPC endpoints.
- Fixed allowed_urls filter bypass #3082 @udaij12 @msaroufim
- Fixed GRPC address assignment to localhost by default #3083 @namannandan
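The relative-path bypass can be pictured with a small check that rejects archive locations containing parent-directory segments before the allow-list is applied and before anything is copied into the model store. This is an illustrative sketch only, not TorchServe's actual implementation; the function names and the exact rule are assumptions.

```python
from urllib.parse import unquote, urlparse


def has_relative_segments(model_url):
    """Return True if the URL or path contains '.' or '..' path segments."""
    path = unquote(urlparse(model_url).path)
    segments = path.replace("\\", "/").split("/")
    return any(segment in (".", "..") for segment in segments)


def check_model_url(model_url, allowed_prefixes):
    """Reject relative paths first, then apply the allow-list (hypothetical rule)."""
    if has_relative_segments(model_url):
        raise ValueError(f"relative path detected in: {model_url}")
    if not any(model_url.startswith(prefix) for prefix in allowed_prefixes):
        raise ValueError(f"URL not allowed: {model_url}")
    return model_url
```

Running the relative-path check before the allow-list means a URL that merely starts with an allowed prefix cannot climb out of it through `..` segments.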
C++ Backend
Documentation
- Updated SECURITY.md #3038, #3041, #3043, #3046, #3084 @msaroufim @diogoteles08 @udaij12 @lxning @namannandan
- Updated PT2 examples readme #3029 @chauhang
- Updated Resnet18 torch.compile readme #3130 @SimonTong22
- Updated doc-automation.yml #3105 @svekars
Improvements and Bug Fixing
- Supported PyTorch 2.3 #3109 @agunapal
- Applied Jsonify customized metadata on management API #3059 @harshita-meena
- Accepted empty version in GRPC management API #3095 @harshita-meena
- Added test template #3140 @mreso
- Logged entire stdout and stderr for terminated backend worker process #3036 @namannandan
- Increased test timeout for test_handler_traceback_logging #3113 @namannandan
- Supported gRPC max connection age configuration #3121 @namannandan
- Updated deprecated TorchVision and PyTorch APIs #3074 @kit1980 @agunapal
- Supported Installation from source for a specific branch with docker #3055 @agunapal
- Workaround for kserve nightly failure #3079 @agunapal
- Disabled mac arm64 tests #3057 @agunapal
- Fixed CI and Regression workflows for MAC Arm64 #3128 @namannandan
- Included missing model configuration values in describe model API response #3122 @namannandan
Platform Support
Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK17.
GPU Support Matrix
TorchServe version | PyTorch version | Python | Stable CUDA | Experimental CUDA |
---|---|---|---|---|
0.11.0 | 2.3.0 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.10.0 | 2.2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.9.0 | 2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.8.0 | 2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84 |
0.7.0 | 1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96 |
Inferentia2 Support Matrix
TorchServe version | PyTorch version | Python | Neuron SDK |
---|---|---|---|
0.11.0 | 2.1 | >=3.8, <=3.11 | 2.18.2+ |
0.10.0 | 1.13 | >=3.8, <=3.11 | 2.16+ |
0.9.0 | 1.13 | >=3.8, <=3.11 | 2.13.2+ |
TorchServe v0.10.0 Release Notes
This is the release of TorchServe v0.10.0.
Highlights include
- Extended support for PyTorch 2.x inference
- C++ backend
- GenAI fast series torch.compile showcase examples
- Token authentication support for enhanced security.
C++ Backend
TorchServe presented the experimental C++ backend at the PyTorch Conference 2022. Similar to the Python backend, C++ backend also runs as a process and utilizes the BaseHandler to define APIs for customizing the handler. By providing a backend and handler written in pure C++ for TorchServe, it is now possible to deploy PyTorch models without any Python overhead. This release officially promoted the experimental branch to the master branch and included additional examples and Docker images for development.
- Refactored C++ backend branch and promoted it to master #2840 #2927 #2937 #2953 #2975 #2980 #2958 #3006 #3012 #3014 #3018 @mreso
- C++ backend examples:
  a. Example Baby Llama #2903 #2911 @shrinath-suresh @mreso
  b. Example Llama2 #2904 @shrinath-suresh @mreso
- C++ dev Docker for CPU and GPU #2976 #3015 @namannandan
torch.compile
With the launch of PT2 Inference at the PyTorch Conference 2023, we have added several key examples showcasing out-of-box speedups for torch.compile and AOT Compile. Since there is no new development being done in TorchScript, starting this release, TorchServe is preparing the migration path for customers to switch from TorchScript to torch.compile.
GenAI torch.compile series
The fast series GenAI models (GPTFast, SegmentAnythingFast, DiffusionFast) show 3-10x speedups using torch.compile and native PyTorch optimizations:
- Example GPT Fast #2815 #2834 #2935 @mreso and deployment with KServe #2966 #2895 @agunapal
- Example Segment Anything Fast #2802 @agunapal
- Example Diffusion Fast #2902 @agunapal
Cold start problem solution
To address cold start problems, there is an example included to show how torch._export.aot_load (experimental API) can be used to load a pre-compiled model. TorchServe has also started benchmarking models with torch.compile and tracking their performance compared to TorchScript.
The new TorchServe C++ backend also includes torch.compile and AOTInductor related examples for ResNet50, BERT and Llama2.
- torch.compile
  a. Example torch.compile with image classifier model densenet161 #2915 @agunapal
  b. Example torch._export.aot_compile with image classification model ResNet-18 #2832 #2906 #2932 #2948 @agunapal
  c. Example torch inductor fx graph caching with image classification model densenet161 #2925 @agunapal
- C++ AOTInductor
  a. Example AOT Inductor with Llama2 #2913 @mreso
  b. Example AOT Inductor with ResNet-50 #2944 @lxning
  c. Example AOT Inductor with BERTSequenceClassification #2931 @lxning
Gen AI
- Supported sequence batching for stateful inference in gRPC bi-directional streaming #2513 @lxning
- The fast series Gen AI models using torch.compile and native PyTorch optimizations.
- Example Mistral 7B with vLLM #2781 @agunapal
- Example PyTorch native tensor parallel with Llama2 with continuous batching #2709 @mreso @HamidShojanazeri
- Supported inf2 Neuronx transformer continuous batching for both no-code and advanced users, with a Llama2-70B example #2803 #3016 @lxning
- Example deepspeed mii fastgen with Llama2-13B #2779 @lxning
Security
TorchServe has implemented token authentication for management and inference APIs. This is an optional config and can be enabled using torchserve-endpoint-plugin. This plugin can be downloaded from Maven. This further strengthens TorchServe’s capability as a secure model serving solution. The security features of TorchServe are documented here.
Apple Silicon Support
TorchServe is now supported on Apple Silicon Macs. The current support is for CPU only. We have also posted an RFC for the deprecation of x86 Mac support.
- Include arm64 mac in CI workflows #2934 @udaij12
- Conda binaries build support #3013 @udaij12
- Adding support for regression tests for binaries #3019 @udaij12
KServe Updates
While serving large models, model loading can take some time even though the pod is running. Even though TorchServe is up, the worker is not ready until the model is loaded. To address this, TorchServe now sets the model ready status in KServe after the model has been loaded on workers. TorchServe also includes native open inference protocol support in gRPC. This is an experimental feature.
- Supported native KServe open inference protocol in gRPC #2609 @andyi2it
- Refactored TorchServe configuration in KServe #2995 @sgaist
- Improved KServe protocol version handling #2957 @sgaist
- Updated KServe test script to return model version #2973 @agunapal
- Set model status using TorchServe API in KServe #1878 @byeongjokim
- Supported no-archive model archiver in KServe #2839 @agunapal
- How to deploy MNIST using KServe with minikube #2718 @agunapal
- Changes to support no-model archive mode with KServe #2839 @agunapal
Metrics Updates
In order to extend backwards compatibility support for metrics, auto-detection of backend metrics enables the flexibility to publish custom model metrics without having to explicitly specify them in the metrics configuration file. Furthermore, a customized script to collect system metrics is also now supported.
- Supported backend metrics auto-detection #2769 @namannandan
- Fixed backend metrics backward compatibility #2816 @namannandan
- Supported customized system metrics script via config.properties #3000 @lxning
Improvements and Bug Fixing
- Supported PyTorch 2.2.1 #2959 #2972 and Release version updated #3010 @agunapal
- Enabled option of installing model's 3rd party dependency in Python virtual environment via model config yaml file #2910 #2946 #2954 @namannandan
- Fixed worker auto recovery #2746 @mreso
- Fixed worker thread write and flush incomplete #2833 @lxning
- Fixed the priority of parameters defined in register curl vs model-config.yaml #2858 @lxning
- Refactored sanity check with pytest #2221 @mreso
- Fixed model state if runtime is null from model archiver #2928 @mreso
- Refactored benchmark script for LLM benchmark integration #2897 @mreso
- Added pytest for tensor parallel #2741 @mreso
- Fixed continuous batching unit test #2847 @mreso
- Added separate pytest for send_intermediate_prediction_response #2896 @mreso
- Fixed GPU ID in GPT Fast handler #2872 @sachanub
- Added model archiver API #2751 @GeeCastro
- Updated torch.compile in BaseHandler to accept kwargs via model config yaml file #2796 @eballesteros
- Integrated pytorch-probot into TorchServe #2725 @atalman
- Added queue time in benchmark report #2854 @sachanub
- Replaced no_grad with inference_mode in BaseHandler #2804 @bryant1410
- Fixed env var CUDA_VERSION conflict in Dockerfile #2807 @rsbowman-striveworks
- Fixed var USE_CUDA_VERSION in Dockerfile #2982 @fyang93
- Fixed BASE_IMAGE for k8s docker image #2808 @rsbowman-striveworks
- Fixed workflow store path in config.properties overwritten by the default workflow path #2792 @udaij12
- Removed invalid warning log #2867 @lxning
- Updated PyTorch nightly url and CPU version in install_dependency.py #2971 #3011 @agunapal
- Deprecated Dockerfile.dev, build dev and prod docker image from single source Dockerfile #2782 @sachanub
- Updated transformers version to >= 4.34.0 #2703 @agunapal
- Fixed Neuronx requirements #2887 #2900 @namannandan
- Added neuron SDK installation in install_dependencies.py #2893 @mreso
- Updated ResNet-152 example output #2745 @sachanub
- Clarified that "Not Accepted" is a valid classification in Huggingface_Transformers Sequence Classification example #2786 @nathanweeks
- Added dead link checking in md files #2984 @mreso
- Added comments in model_service_worker.py #2809 @InakiRaba91
- Enabled a new github workflow or updated an existing workflow #2726 #2732 #2737 #2734 #2750 #2767 #2778 #2792 #2835 #2846 #2848 #2855 #2856 #2859 #2864 #2863 #2891 #2938 #2939 #2961 #2960 #2964 #3009 @agunapal @udaij12 @namannandan @sachanub
Documentation
- Updated security readme #2773 #3020 @agunapal @udaij12
- Added security readme to TorchServe site #2784 @sekyondaMeta
- Refactor the README.md #2729 @chauhang
- Updated git clone instruction in gRPC api documentation #2799 @bryant1410
- Highlighted code in README #2805 @bryant1410
- Fixed typos in the README.md #2806 #2871 @bryant1410 @rafijacSense
- Fixed dead links in documentation #2936 @agunapal
Platform Support
Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK17.
GPU Support Matrix
TorchServe version | PyTorch version | Python | Stable CUDA | Experimental CUDA |
---|---|---|---|---|
0.10.0 | 2.2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.9.0 | 2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.8.0 | 2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84 |
0.7.0 | 1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN ... |
TorchServe v0.9.0 Release Notes
This is the release of TorchServe v0.9.0.
Security
Our security process is documented here.
We rely heavily on automation to improve the security of torchserve, namely by
- Updating our gradle and pip dependencies on a monthly basis
- Docker scanning via Snyk
- Code analysis via CodeQL
A key point to remember is that torchserve will allow you to configure things in an insecure way, so make sure to read our security docs and relevant security warnings to make sure your product is secure in production. In general, we do not encourage you to download untrusted mar files from the internet. Running a .mar file is effectively running arbitrary Python code, so make sure to unzip mar files and validate whether they are doing anything suspicious.
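One way to follow this advice is to list the contents of an archive before handing it to the server. The sketch below assumes the .mar file is in zip format and flags entries whose names are absolute or climb out of the extraction directory. The function name and the checks are illustrative assumptions, not a complete audit, and they do not replace reading the handler code inside the archive.

```python
import zipfile
from pathlib import PurePosixPath


def inspect_archive(mar_path):
    """Return (all member names, suspicious member names) without extracting."""
    with zipfile.ZipFile(mar_path) as archive:
        names = archive.namelist()
    # Flag absolute paths and any entry with a parent-directory segment.
    suspicious = [
        name
        for name in names
        if name.startswith("/") or ".." in PurePosixPath(name).parts
    ]
    return names, suspicious
```

Calling `inspect_archive` on a downloaded archive returns every member name for manual review, plus the subset that should never appear in a well-formed archive.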
Code scanning fixes
- Used Sha-256 in ziputils #2629 @msaroufim
- Verified default hostname in Test #2631 @msaroufim
- Fixed zip slip error #2634 @msaroufim
- Used string array as Process arguments input #2632 #2635 @msaroufim
- Enabled Netty HTTP header validation as default #2630 @msaroufim
- Verified 3rd party package installation path #2687 @lxning
- Allowed URL validation #2685 @lxning, including:
  - Disabled loading TS_ALLOWED_URLS from env by default.
  - Moved the model URL validation to the last step.
  - Sanity-checked the model archive name to guard against uncontrolled data used in a path expression.
Address configuration updates
- Updated default address from 0.0.0.0 to 127.0.0.1 #2624 #2704 @namannandan @agunapal
- Bind container ports to localhost ports #2646 @namannandan
Documentation improvements
- Updated security readme #2643 #2690 @msaroufim @agunapal
- Updated security guidance in docker readme #2669 @agunapal
Dependency improvements
- Created dependabot.yml #2642 #2675 @msaroufim
  - Bumped packaging from 23.1 to 23.2
  - Bumped pygit2 from 1.21.1 to 1.13.1
  - Bumped com.github.spotbugs from 4.0.2 to 5.1.3
  - Bumped ONNX from 1.14.0 to 1.14.1
  - Bumped Pillow from 9.3.0 to 10.0.1
  - Bumped com.amazonaws:DynamoDBLocal from 1.13.2 to 2.0.0
- Upgraded node to version 18 #2663 @agunapal
Blogs
- High performance Llama 2 deployments with AWS Inferentia2 using TorchServe
- ML Model Server Resource Saving - Transition From High-Cost GPUs to Intel CPUs and oneAPI powered Software with performance
- Run multiple generative AI models on GPU using Amazon SageMaker multi-model endpoints with TorchServe and save up to 75% in inference costs
New Features
- Support PyTorch 2.1.0 and Python 3.11 #2621 #2691 #2697 @agunapal
- Supported continuous batching for LLM inference #2628 @mreso @lxning
- Supported dynamically loading 3rd party package on SageMaker Multi-Model Endpoint #2535 @lxning
- Added DALI handler to handle preprocess and updated Nvidia DALI example #2485 @jagadeeshi2i
New Examples
- Deploy Llama2 on Inferentia2 #2458 @namannandan
- Using TorchServe on SageMaker Inf2.24xlarge with Llama2-13B @lxning
- PyTorch tensor parallel on Llama2 example #2623 #2689 @HamidShojanazeri
- Enabled better transformer (i.e., flash attention 2) on Llama2 #2700 @HamidShojanazeri @lxning
- Llama2 Chatbot on Mac #2618 @agunapal
- ASR speech recognition example #2047 @husenzhang
Improvements
- Fixed typo in BaseHandler #2547 @a-ys
- Create merge_queue workflow for CI #2548 @msaroufim
- Fixed typo in artifact terminology unification #2551 @park12sj
- Added env hints in model_service_worker #2540 @ZachOBrien
- Refactor conda build scripts to publish all binaries #2561 @agunapal
- Fixed response return type in KServe #2566 @jagadeeshi2i
- Added torchserve-kfs nightly build #2574 @jagadeeshi2i
- Added regression for all CPU binaries #2562 @agunapal
- Updated CICD runners #2586 #2597 #2636 #2627 #2677 #2710 #2696 @agunapal @msaroufim
- Upgraded newman version to 5.3.2 #2598 #2603 @agunapal
- Updated opt benchmark config for inf2 #2617 @namannandan
- Added ModelRequestEncoderTest #2580 @abergmeier
- Added manually dispatch workflow #2686 @msaroufim
- Updated test wheels with PyTorch 2.1.0 #2684 @agunapal
- Allowed parallel level = 1 to run in torchrun mode #2608 @lxning
- Fixed metric unit assignment backward compatibility #2693 @namannandan
Documentation
- Updated MPS readme #2543 @sekyondaMeta
- Updated large model inference readme #2542 @sekyondaMeta
- Fixed bash snippets in examples/image_classifier/mnist/Docker.md #2345 @dmitsf
- Fixed typo in kubernetes/autoscale.md #2393 @CandiedCode
- Fixed path in examples/image_classifier/resnet_18/README.md #2568 @udaij12
- Model Loading Guidance #2592 @agunapal
- Updated Metrics readme #2560 @sekyondaMeta
- Display nightly workflow status badge in README #2619 #2666 @agunapal @msaroufim
- Update torch.compile information in examples/pt2/README.md #2706 @agunapal
- Deploy model using TorchServe on SageMaker tutorial @lxning
Platform Support
Ubuntu 16.04, Ubuntu 18.04, Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK17.
GPU Support
Torch 2.1.0 + Cuda 11.8, 12.1
Torch 2.0.1 + Cuda 11.7
Torch 2.0.0 + Cuda 11.7
Torch 1.13 + Cuda 11.7
Torch 1.11 + Cuda 10.2, 11.3, 11.6
Torch 1.9.0 + Cuda 11.1
Torch 1.8.1 + Cuda 9.2
TorchServe v0.8.2 Release Notes
This is the release of TorchServe v0.8.2.
Security
- Updated snakeyaml version to v2 #2523 @nskool
- Added warning about model allowed urls when default value is applied #2534 @namannandan
Custom metrics backwards compatibility
add_metric is now backwards compatible with versions [< v0.6.1] but the default metric type is inferred to be COUNTER. If the metric is of a different type, it will need to be specified in the call to add_metric as follows:
metrics.add_metric(name='GenericMetric', value=10, unit='count', dimensions=[...], metric_type=MetricTypes.GAUGE)
- When upgrading from versions [v0.6.1 - v0.8.1] to v0.8.2, replace the call to add_metric with add_metric_to_cache.
- All custom metrics updated in the custom handler will need to be included in the metrics configuration file for them to be emitted by TorchServe. This is shown here.
- A detailed upgrade guide is included in the metrics documentation.
New Features
- Supported KServe gRPC v2 #2176 @jagadeeshi2i
- Supported K8S session affinity #2519 @jagadeeshi2i
New Examples
- Example LLama v2 70B chat using HuggingFace Accelerate #2494 @lxning @HamidShojanazeri @agunapal
- Large model example OPT-6.7B on Inferentia2 #2399 @namannandan
  - This example demonstrates how NeuronX compiles the model, detects neuron core availability and runs the inference.
- DeepSpeed deferred init with OPT-30B #2419 @agunapal
  - This PR added the deferred model init feature to the OPT-30B example by leveraging the new DeepSpeed version. This feature is able to significantly reduce model loading latency.
- Torch TensorRT example #2483 @agunapal
  - This PR uses Resnet-50 as an example to demonstrate Torch TensorRT.
- K8S mnist example using minikube #2323 @agunapal
  - This example shows how to use a pre-trained custom MNIST model to perform real-time digit recognition via K8S.
- Example for custom metrics #2516 @namannandan
- Example for object detection with ultralytics YOLO v8 model #2508 @agunapal
Improvements
- Migrated publishing torchserve-plugins-sdk from Maven JCenter to Maven Central #2429 #2422 @namannandan
- Fixed download model from S3 presigned URL #2416 @namannandan
- Enabled opt-6.7b benchmark on inf2 #2400 @namannandan
- Added job Queue Status in describe API #2464 @namannandan
- Added add_metric API to be backward compatible #2525 @namannandan
- Upgraded nvidia base image version to nvidia/cuda:11.7.1-base-ubuntu20.04 in GPU docker image #2442 @agunapal
- Added Docker regression tests in CI #2403 @agunapal
- Updated release version #2533 @agunapal
- Upgraded default cuda to 11.8 in docker image build #2489 @agunapal
- Updated docker nightly build parameters #2493 @agunapal
- Added path to save ab benchmark profile graph in benchmark report #2451 @agunapal
- Added profile information for benchmark #2470 @agunapal
- Fixed manifest null in base handler #2488 @pedrogengo
- Fixed batching input in DALI example #2455 @jagadeeshi2i
- Fixed metrics for K8S setup #2473 @jagadeeshi2i
- Fixed kserve storage optional package in Dockerfile #2537 @jagadeeshi2i
- Fixed typo in ModelConfig.java comments #2506 @arnavmehta7
- Fixed netty direct buffer issues in torchserve-plugins-sdk #2511 @marrodion
- Fixed typo in ts/context.py comments #2536 @ethankim00
- Fixed server error when gRPC client closes connection unexpectedly #2420 @lxning
Documentation
- Updated large model documentation #2468 @sekyondaMeta
- Updated Sphinx landing page and requirements #2428 #2520 @sekyondaMeta
- Updated G analytics in docs #2449 @sekyondaMeta
- Added performance checklist in docs #2526 @sekyondaMeta
- Added performance guidance in FAQ #2524 @sekyondaMeta
- Added instruction for embedding handler examples #2431 @sidharthrajaram
- Updated PyPi description #2445 @bryanwweber @agunapal
- Updated Better Transformer README #2474 @HamidShojanazeri
- Fixed typo in microbatching README #2484 @InakiRaba91
- Fixed broken link in kubernetes AKS README #2490 @agunapal
- Fixed lint error #2497 @ankithagunapal
- Updated instructions for building GPU docker image for ONNX #2435 @agunapal
Platform Support
Ubuntu 16.04, Ubuntu 18.04, Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK17.
GPU Support
Torch 2.0.1 + Cuda 11.7, 11.8
Torch 2.0.0 + Cuda 11.7, 11.8
Torch 1.13 + Cuda 11.7, 11.8
Torch 1.11 + Cuda 10.2, 11.3, 11.6
Torch 1.9.0 + Cuda 11.1
Torch 1.8.1 + Cuda 9.2
TorchServe v0.8.1 Release Notes
This is the release of TorchServe v0.8.1.
New Features
- Supported microbatch in handler to parallel process a batch request from frontend. #2210 @mreso
Because pre- and post- processing are often carried out on the CPU the GPU sits idle until the two CPU bound steps are executed and the worker receives a new batch. Microbatch in handler is able to parallel process inference, pre- and post- processing for a batch request from frontend.
- Supported job ticket #2350 @lxning
This feature helps with use cases where inference latency can be high, such as generative models and auto-regressive decoder models like ChatGPT. Applications can take effective actions, for example, routing the rejected request to a different server, or scaling up model server capacity, based on the business requirements.
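The microbatching idea from #2210 above can be sketched as follows. This is a minimal illustration under stated assumptions, not TorchServe's handler implementation: the three stage functions are toy placeholders, and `handle`, `micro_batch_size` and `parallelism` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_microbatches(batch, micro_batch_size):
    """Cut one frontend batch into consecutive micro-batches."""
    return [batch[i:i + micro_batch_size]
            for i in range(0, len(batch), micro_batch_size)]


# Toy stand-ins for the CPU-bound and GPU-bound stages of a handler.
def preprocess(items):
    return [x * 2 for x in items]


def inference(items):
    return [x + 1 for x in items]


def postprocess(items):
    return [str(x) for x in items]


def handle(batch, micro_batch_size=2, parallelism=2):
    """Run each micro-batch through all stages on its own worker thread."""
    micro_batches = split_into_microbatches(batch, micro_batch_size)

    def run_pipeline(micro_batch):
        return postprocess(inference(preprocess(micro_batch)))

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map preserves input order, so responses line up with requests.
        results = list(pool.map(run_pipeline, micro_batches))
    return [item for micro_batch in results for item in micro_batch]


responses = handle([1, 2, 3, 4, 5])
```

Because every micro-batch progresses independently, the preprocessing of one micro-batch can overlap with the inference of another, which is what keeps the accelerator from idling during the CPU-bound steps.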
New Examples
- Notebook example of TorchServe on SageMaker MME (multi-model endpoint). @lxning
This example demonstrates creative content assisted by generative AI by using TorchServe on SageMaker MME.
Improvements
- Upgraded to PyTorch 2.0.1 #2374 @namannandan
- Significant reduction in Docker Image Size
- Reduced GPU docker image size by 3GB #2392 @agunapal
- Reduced dependency installation time and decreased docker image size #2364 @mreso
GPU
pytorch/torchserve 0.8.1-gpu 04eef250c14e 4 hours ago 2.34GB
pytorch/torchserve 0.8.0-gpu 516bb13a3649 4 weeks ago 5.86GB
pytorch/torchserve 0.6.0-gpu fb6d4b85847d 12 months ago 2.13GB
CPU
pytorch/torchserve 0.8.1-cpu 68a3fcae81af 4 hours ago 662MB
pytorch/torchserve 0.8.0-cpu 958ef6dacea2 4 weeks ago 2.37GB
pytorch/torchserve 0.6.0-cpu af91330a97bd 12 months ago 496MB
- Updated CPU information for IPEX #2372 @min-jean-cho
- Fixed inf2 example handler #2378 @namannandan
- Added inf2 nightly benchmark #2283 @namannandan
- Fixed archiver tgz format model directory structure mismatch on SageMaker #2405 @lxning
- Fixed model archiver to fail if extra files are missing #2212 @mreso
- Fixed device type setting in model config yaml #2408 @lxning
- Fixed batchsize in config.properties not honored #2382 @lxning
- Upgraded torchrun argument names and fixed backend tcp port connection #2377 @lxning
- Fixed error thrown while loading multiple models in KServe #2235 @jagadeeshi2i
- Fixed KServe fastapi migration issues #2175 @jagadeeshi2i
- Added type annotation in model_server.py #2384 @josephcalise
- Speed up unit test by removing sleep in start/stop torchserve #2383 @mreso
- Enabled ONNX CI test #2363 @msaroufim
- Removed session_mocker usage to prevent test cross talking #2375 @mreso
- Enabled regression test in CI #2370 @msaroufim
- Fixed regression test failures #2371 @namannandan
- Bump up transformers version from 4.28.1 to 4.30.0 #2410
Documentation
- Fixed links in FAQ #2351 @sekyondaMeta
- Fixed broken links in index.md #2329 @sekyondaMeta
Platform Support
Ubuntu 16.04, Ubuntu 18.04, Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK17.
GPU Support
Torch 2.0.1 + Cuda 11.7, 11.8
Torch 2.0.0 + Cuda 11.7, 11.8
Torch 1.13 + Cuda 11.7, 11.8
Torch 1.11 + Cuda 10.2, 11.3, 11.6
Torch 1.9.0 + Cuda 11.1
Torch 1.8.1 + Cuda 9.2
TorchServe v0.8.0 Release Notes
This is the release of TorchServe v0.8.0.
New Features
- Supported large model inference in distributed environment #2193 #2320 #2209 #2215 #2310 #2218 @lxning @HamidShojanazeri
TorchServe added deep integration to support large model inference. It provides a PyTorch-native large model inference solution by integrating PiPPy. It also provides the flexibility and extensibility to support other popular libraries such as Microsoft DeepSpeed and HuggingFace Accelerate.
To improve UX in Generative AI inference, TorchServe allows for sending intermediate token responses to the client side by supporting gRPC server-side streaming and HTTP 1.1 chunked encoding.
By leveraging torch.compile it's now possible to run TorchServe using XLA, which is optimized for both GPU and TPU deployments.
- Implemented New Metrics platform #2199 #2190 #2165 @namannandan @lxning
TorchServe fully supports metrics in Prometheus mode or Log mode. Both frontend and backend metrics can be configured in a central metrics YAML file.
Added config-file option for model config to the model archiver tool. Users are able to flexibly define customized parameters in this YAML file, and easily access them in the backend handler via the variable context.model_yaml_config. This new feature also makes it easier for TorchServe to support the other new features and enhancements.
- Refactored PT2.0 support #2222 @msaroufim
We've refactored our model optimization utilities and improved logging to help debug compilation issues. We've also deprecated compile.json in favor of the new YAML config format; follow our guide at https://github.com/pytorch/serve/blob/master/examples/pt2/README.md to learn more. The main difference is that, while archiving a model, instead of passing in compile.json via --extra-files we can pass in --config-file model_config.yaml.
By default, TorchServe uses a round-robin algorithm to assign GPUs to a worker on a host. Starting from v0.8.0, TorchServe allows users to define deviceIds in model_config.yaml to assign GPUs to a model.
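A minimal sketch of the round-robin assignment described above, assuming that a model's workers also cycle through its deviceIds when that list is set (the note does not spell this out). The function name and parameters are hypothetical and this is not TorchServe's frontend code.

```python
from itertools import cycle


def assign_gpus(num_workers, num_gpus, device_ids=None):
    """Return the GPU id chosen for each worker, in worker order."""
    # Assumption: an explicit deviceIds list restricts the candidate GPUs;
    # otherwise every GPU on the host is a candidate.
    candidates = list(device_ids) if device_ids else list(range(num_gpus))
    if not candidates:
        raise ValueError("no GPUs available to assign")
    gpu_cycle = cycle(candidates)
    return [next(gpu_cycle) for _ in range(num_workers)]


# Default behaviour: five workers spread over four GPUs.
default_assignment = assign_gpus(num_workers=5, num_gpus=4)
# With deviceIds set to [2, 3], only those two GPUs are used.
pinned_assignment = assign_gpus(num_workers=3, num_gpus=4, device_ids=[2, 3])
```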
TorchServe supports hybrid mode on a GPU host. Users are able to define deviceType in model config YAML file to deploy a model on CPU of a GPU host.
TorchServe allows users to define clientTimeoutInMills in a model config YAML file. TorchServe calculates the expired timestamp of an incoming inference request if clientTimeoutInMills is set, and drops the request once it is expired.
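The expiry check described above can be sketched like this. It is an illustration only: `InferenceRequest`, `drop_expired` and the millisecond clock values are hypothetical, and treating an unset or non-positive timeout as "never expires" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InferenceRequest:
    payload: str
    received_at_ms: int
    # Hypothetical per-request copy of the model's clientTimeoutInMills value.
    client_timeout_ms: Optional[int] = None

    @property
    def expires_at_ms(self) -> Optional[int]:
        # Assumption: an unset or non-positive timeout means "never expires".
        if not self.client_timeout_ms or self.client_timeout_ms <= 0:
            return None
        return self.received_at_ms + self.client_timeout_ms


def drop_expired(queue, now_ms):
    """Keep only the requests whose expiry timestamp has not passed."""
    return [r for r in queue
            if r.expires_at_ms is None or now_ms < r.expires_at_ms]


queue = [
    InferenceRequest("a", received_at_ms=0, client_timeout_ms=100),
    InferenceRequest("b", received_at_ms=0, client_timeout_ms=1000),
    InferenceRequest("c", received_at_ms=0),  # no timeout configured
]
still_valid = drop_expired(queue, now_ms=500)
```

Computing the expiry timestamp once on arrival keeps the later check to a single comparison, however long the request waits in the queue.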
Supported maxRetryTimeoutInSec, which defines the maximum time window for recovering a dead backend worker of a model, in the model config YAML file. The default value is 5 min. Users are able to adjust it in the model config YAML file. The ping endpoint returns 200 if all models have enough healthy workers (i.e., equal to or larger than minWorkers); otherwise it returns 500.
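The ping rule in the last sentence amounts to a single check over all models. A minimal sketch, where the dictionary layout and the `ping_status` name are hypothetical:

```python
def ping_status(models):
    """Return 200 when every model has at least minWorkers healthy workers."""
    all_healthy = all(
        info["healthy_workers"] >= info["min_workers"]
        for info in models.values()
    )
    return 200 if all_healthy else 500


healthy_host = {
    "resnet": {"healthy_workers": 2, "min_workers": 2},
    "bert": {"healthy_workers": 3, "min_workers": 1},
}
degraded_host = {
    "resnet": {"healthy_workers": 1, "min_workers": 2},
    "bert": {"healthy_workers": 3, "min_workers": 1},
}
```

In this sketch one model below its minimum is enough to turn the whole host's status to 500.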
New Examples
- Example of Pippy onboarding Open platform framework for distributed model inference #2215 @HamidShojanazeri
- Example of DeepSpeed onboarding Open platform framework for distributed model inference #2218 @lxning
Improvements
- Enabled Core pinning in CPU nightly benchmark #2166 #2237 @min-jean-cho
TorchServe can be used with Intel® Extension for PyTorch* to give a performance boost on Intel hardware. Intel® Extension for PyTorch* is a Python package extending PyTorch with up-to-date features optimizations that take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI), Intel® Advanced Matrix Extensions (Intel® AMX), and more.
Enabling core pinning in TorchServe CPU nightly benchmark shows significant performance speedup. This feature is implemented via a script under PyTorch Xeon backend, initiated from Intel® Extension for PyTorch*. To try out core pinning on your workload, add cpu_launcher_enable=true in config.properties.
To try out more optimizations with Intel® Extension for PyTorch*, install Intel® Extension for PyTorch* and add ipex_enable=true in config.properties.
- Added Neuron nightly benchmark dashboard #2171 #2167 @namannandan
- Enabled torch.compile support for torch 2.0.0 pre-release #2256 @morgandu
- Fixed torch.compile mac regression test #2250 @msaroufim
- Added configuration option to disable system metrics #2104 @namannandan
- Added regression test cases for SageMaker MME contract #2200 @agunapal
In case of OOM, return error code 507 instead of the generic code 503.
- Fixed Error thrown in KServe while loading multi-models #2235 @jagadeeshi2i
- Added Docker CI for TorchServe #2226 @fabridamicelli
- Change docker image release from dev to production #2227 @agunapal
- Supported building docker images with specified Python version #2154 @agunapal
- Model archiver optimizations:
a). Added wildcard file search in model archiver --extra-files #2142 @gustavhartz
b). Added zip-store option to model archiver tool #2196 @mreso
c). Made model archiver tests runnable from any directory #2191 @mreso
d). Supported tgz format model decompression in TorchServe frontend #2214 @lxning
- Enabled batch processing in example scripted tokenizer #2130 @mreso
- Made handler tests callable with pytest #2173 @mreso
- Refactored sanity tests #2219 @mreso
- Improved benchmark tool #2228 and added auto-validation #2144 #2157 @agunapal
Automatically flag deviation of metrics from the average of last 30 runs
- Added notification for CI jobs' (benchmark, regression test) failure @agunapal
- Updated CI to run on ubuntu 20.04 #2153 @agunapal
- Added github code scanning codeql.yml #2149 @msaroufim
- freeze pynvml version to avoid crash in nvgpu #2138 @mreso
- Made pre-commit usage clearer in error message #2241 and upgraded isort version #2132 @msaroufim
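The benchmark auto-validation mentioned in the list above (flagging deviation of metrics from the average of the last 30 runs) can be sketched as follows. The 10% relative tolerance, the metric names and the function name are assumptions made for illustration; the release notes do not state the actual threshold or data layout.

```python
from statistics import mean


def flag_deviations(history, latest, window=30, tolerance=0.10):
    """Return {metric: (baseline, value)} for metrics that drift too far."""
    flagged = {}
    for name, value in latest.items():
        recent = history.get(name, [])[-window:]
        if not recent:
            continue  # nothing to compare against yet
        baseline = mean(recent)
        if baseline == 0:
            continue  # relative deviation is undefined for a zero baseline
        deviation = abs(value - baseline) / abs(baseline)
        if deviation > tolerance:
            flagged[name] = (baseline, value)
    return flagged


history = {"throughput": [100] * 30, "p99_latency_ms": [20] * 30}
latest = {"throughput": 80, "p99_latency_ms": 20.5}
flagged = flag_deviations(history, latest)
```

Here throughput is 20% below its baseline and is flagged, while the 2.5% latency change stays within tolerance.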
Dependency Upgrades
Documentation
This study compares TPS between TorchServe with Nvidia MPS enabled and TorchServe without Nvidia MPS enabled on P3 and G4. It can help you decide whether to enable MPS for your deployment.
- Updated TorchServe page on pytorch.org #2243 @agunapal
- Lint fixed broken windows Conda link #2240 @msaroufim
- Corrected example PT2 doc #2244 @samils7
- Fixed regex error in Configuration.md #2172 @mpoemsl
- Fixed dead Kubectl links #2160 @msaroufim
- Updated model file docs in example doc #2148 @tmc
- Example for serving TorchServe using docker #2118 @agunapal
- Updated walmart blog link #2117 @agunapal
Platform Support
Ubuntu 16.04, Ubuntu 18.04, Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK17.
GPU Support
Torch 2.0.0 + Cuda 11.7, 11.8
Torch 1.13 + Cuda 11.7, 11.8
Torch 1.11 + Cuda 10.2, 11.3, 11.6
Torch 1.9.0 + Cuda 11.1
Torch 1.8.1 + Cuda 9.2
TorchServe v0.7.1 Release Notes
This is the release of TorchServe v0.7.1.
Security
- Upgraded com.google.code.gson:gson from 2.10 to 2.10.1 in serving sdk - #2096 @snyk-bot
- Upgraded ubuntu from 20.04 to rolling in Dockerfile files - #2066, #2065, #2064 @msaroufim
- Update to safe snakeyaml, grpc and gradle - #2081 @jack-gits
Updated Dockerfile.dev to install gnupg before calling apt-key del 7fa2af80 - #2076 @yeahdongcn
Dependency Upgrades
Improvements
- Removed bad eval when onnx session used - #2034 @msaroufim
- Updated runner label in regression_tests_gpu.yml - #2080 @lxning
- Updated nightly benchmark config - #2092 @lxning
Documentation
- Added TorchServe 2022 blogs in Readme - #2060 @msaroufim
The blogs are Torchserve Performance Tuning, Animated Drawings Case-Study, Walmart Search: Serving Models at a Scale on TorchServe, Scaling inference on CPU with TorchServe, and TorchServe C++ backend.
- Fixed HuggingFace large model instruction - #2087 @HamidShojanazeri
- Reworded examples Readme to highlight examples - #2086 @agunapal
- Updated torchserve_on_win_native.md - #2050 @blackrabbit
- Fixed typo in batch inference md - #2049 @MasoudKaviani
Deprecation
- Deprecated future package and dropped Python2 support - #2082 @namannandan
Platform Support
Ubuntu 16.04, Ubuntu 18.04, Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK17.
GPU Support
Torch 1.13 + Cuda 11.7
Torch 1.11 + Cuda 10.2, 11.3, 11.6
Torch 1.9.0 + Cuda 11.1
Torch 1.8.1 + Cuda 9.2
TorchServe v0.7.0 Release Notes
This is the release of TorchServe v0.7.0.
New Examples
- HF + Better Transformer integration #2002 @HamidShojanazeri
Better Transformer / Flash Attention & Xformer Memory Efficient provides out-of-the-box performance with major speedups for PyTorch Transformer encoders. This has been integrated into the TorchServe HF Transformer example; please read more about this integration here.
The main speedups in Better Transformer come from exploiting sparsity on padded inputs and from kernel fusions. As a result, you will see the biggest gains when dealing with larger workloads, such as sequences with longer padding and larger batch sizes.
In our benchmarks on P3 instances with 4 V100 GPUs, using TorchServe benchmarking workloads, throughput has shown significant improvement with large batch sizes: 17.2% increase with batch size 4; 45.5% increase with batch size 8; 50.8% increase with batch size 16; 45.2% increase with batch size 32; and 47.2% increase with batch size 64. These numbers can vary based on your workload (batch size, padding percentage) and your hardware. Please look up some other benchmarks in the blog post.
- torch.compile() support #1960 @msaroufim
We've added experimental support for PT 2.0, in the form of torch.compile() support within TorchServe. To use it you need to supply a file compile.json when archiving your model to specify which backend you want. We've also enabled mode=reduce-overhead by default, which is ideally suited for the smaller batch sizes that are more common for inference. For now we recommend GPUs with tensor cores, like A10G or A100, since you're likely to see the greatest speedups there.
On training we've seen speedups ranging from 30% to 2x (https://pytorch.org/get-started/pytorch-2.0/), but we haven't run any performance benchmarks yet for inference. Until then we recommend you continue leveraging other runtimes like TensorRT or IPEX for accelerated inference, which we highlight in our performance_guide.md. There are a few important caveats to consider when you're using torch.compile: changes in batch sizes will cause recompilations, so make sure to leverage a small batch size; there will be additional overhead to start a model since you need to compile it first; and you'll likely still see the largest speedups with TensorRT.
However, we hope that adding this support will make it easier for you to benchmark and try out PT 2.0. Learn more here https://github.com/pytorch/serve/tree/master/examples/pt2
Dependency Upgrades
- Support Python 3.10 #2031 @agunapal
- Support PyTorch 1.13 and Cuda 11.7 #1980 @agunapal
- Update docker default from Ubuntu 18.04 to Ubuntu 20.04 (LTS) #1970 @LuigiCerone
Improvements
- KFServe upgrade to 0.9 - #1860 @jagadeesh
- Added pyyaml for python venv #2014 @lxning
- Added HG BERT better transformer benchmark #2024 @lxning
Documentation
Platform Support
Ubuntu 16.04, Ubuntu 18.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK17.
GPU Support
Torch 1.13 + Cuda 11.7
Torch 1.11 + Cuda 10.2, 11.3, 11.6
Torch 1.9.0 + Cuda 11.1
Torch 1.8.1 + Cuda 9.2