Releases: intel/neural-compressor
Intel® Neural Compressor v2.2 Release
- Highlights
- Features
- Improvement
- Productivity
- Bug Fixes
- Examples
- External Contributions
Highlights
- Expanded SmoothQuant support to mainstream frameworks including PyTorch/IPEX, TensorFlow/ITEX, and ONNX Runtime, and validated popular large language models (LLMs) such as GPT-J, LLaMA, OPT, BLOOM, Dolly, MPT, LaMini-LM, and RedPajama-INCITE (see the usage sketch after this list).
- Introduced two new productivity components: Neural Solution for distributed quantization and Neural Insights for quantization accuracy debugging.
- Successfully integrated Intel Neural Compressor into MSFT Olive (#157) and DeepSpeed (#3300).
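For context, a minimal sketch of how SmoothQuant can be enabled through the 2.x post-training quantization API; the `recipes` keys follow the project documentation, while the model and dataloader names are user-supplied placeholders:

```python
# Minimal sketch, assuming the 2.x PTQ API; fp32_model and
# calib_dataloader are user-supplied placeholders.
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    recipes={
        "smooth_quant": True,                 # enable the SmoothQuant recipe
        "smooth_quant_args": {"alpha": 0.5},  # smoothing strength between 0 and 1
    }
)
q_model = quantization.fit(fp32_model, conf, calib_dataloader=calib_dataloader)
q_model.save("./quantized_model")
```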
Features
- [Quantization] Support TensorFlow SmoothQuant (1f4127)
- [Quantization] Support ITEX SmoothQuant (1f4127)
- [Quantization] Support PyTorch FX SmoothQuant (6a39f6, 603811)
- [Quantization] Support ONNX Runtime SmoothQuant (3df647, 1e1d70)
- [Quantization] Support dictionary inputs for IPEX quantization (4ba233)
- [Quantization] Enable calibration algorithm Entropy/KL & Percentile for ONNX Runtime (dae494)
- [MixedPrecision] Support mixed precision op name/type dict option (a9c2cb); see the sketch after this list
- [Strategy] Support block wise tuning (9c26ed)
- [Strategy] Enable mse_v2 for ONNX Runtime (62122d)
- [Pruning] Support retrain-free sparsity (d29aa0)
- [Pruning] Support TensorFlow pruning with 2.x API (072c13)
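Below is a hedged sketch of the mixed precision op name/type dict option mentioned above; the exact dict schema is an assumption mirroring the `op_name_dict`/`op_type_dict` convention of the quantization config, and `fp32_model` is a placeholder:

```python
# Hedged sketch: keep a numerically sensitive op in FP32 while the rest of
# the model is converted; the dict schema below is an assumption based on
# the quantization config convention, not a verified signature.
from neural_compressor import mix_precision
from neural_compressor.config import MixedPrecisionConfig

conf = MixedPrecisionConfig(
    op_name_dict={"layer1.0.conv1": {"activation": {"dtype": ["fp32"]}}},
)
converted_model = mix_precision.fit(fp32_model, config=conf)
```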
Improvement
- [Quantization] Enhance Keras functional model quantization with Keras model in, quantized Keras model out (699751)
- [Quantization] Enhance MatMul and Gather quantization for ONNX Runtime (1f9c4f)
- [Quantization] Add new recipe for ONNX Runtime NLP models (10d82c)
- [MixedPrecision] Add more FP16 OPs support for ONNX Runtime (15d551)
- [MixedPrecision] Add more BF16 OPs support for TensorFlow (369b9d)
- [Pruning] Enhance multi-head attention slimming (f3de50)
- [Pruning] Enable progressive pruning in N:M pattern (483e80)
- [Model Export] Refine PT2ONNX export (877adb)
- Remove redundant classes for quantization, benchmark and mixed precision (c51096)
Productivity
- [Neural Solution] Support multi-node distributed tuning with model-level parallelism (ee049c)
- [Neural Insights] Support quantization and benchmark diagnosis with GUI (5dc9ea, 3bde2e, 898344)
- [Neural Coder] Migrate Neural Coder support into 2.x API (113ca1, e74a8a)
- [Ecosystem] MSFT Olive integration (#157)
- [Ecosystem] MSFT DeepSpeed integration (#3300)
- Support ITEX 1.2 (5519e2)
- Support Python 3.11 (6fa053)
- Enhance documentation for mixed precision, diagnosis, dataloader, metric, etc.
Bug Fixes
- Fix ONNX Runtime SmoothQuant issues (85c6a0, 1b26c0)
- Fix bug in IPEX fallback (b4f9c7)
- Fix ITEX quantize/dequantize before BN u8 issue (5519e2)
- Fix example inputs issue for IPEX SmoothQuant (c8b753)
- Fix IPEX mixed precision (d1e734)
- Fix inspect tensor (8f5f5d)
- Fix accuracy issues of the PyTorch peleenet and 3dunet models after migrating to the 2.x API
- Fix CVEs (04c482, efcd98, 6e9f7b, 7abe32)
Examples
- Enable 4 ONNX Runtime examples: layoutlmv3, layoutlmft, deberta-v3, and GPTJ-6B
- Enable 2 TensorFlow LLM examples with SmoothQuant: facebook-opt-125m and gpt2-medium
External Contributions
- Add a mathematical check for SmoothQuant transform (5c04ac)
- Fix mismatch absorb layers due to tracing and named modules for SmoothQuant (bccc89)
- Fix trace issue when input is dictionary for SmoothQuant (6a3c64)
- Allow dictionary model inputs for ONNX export (17b642)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.7, 3.8, 3.9, 3.10, 3.11
- TensorFlow 2.10.0, 2.11.0, 2.12.0
- ITEX 1.1.0, 1.2.0
- PyTorch/IPEX 1.12.1+cpu, 1.13.0+cpu, 2.0.1+cpu
- ONNX Runtime 1.13.1, 1.14.1, 1.15.0
- MXNet 1.9.1
Intel® Neural Compressor v2.1.1 Release
- Bug Fixes
- Examples
Bug Fixes
- Fix calibration max value issue for SmoothQuant (commit b28bfd)
- Fix exception for untraceable model during SmoothQuant (commit b28bfd)
- Fix depthwise conv issue for SmoothQuant (commit 0e5942)
- Fix Keras model mixed precision conversion issue (commit 997c57)
Examples
- Add gpt-j alpha-tuning example (commit 3b7d28)
- Migrate notebook examples to the INC 2.0 API (commit 54d2f5)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.8
- TensorFlow 2.11.0
- ITEX 1.1.0
- PyTorch/IPEX 1.13.0+cpu
- ONNX Runtime 1.13.1
- MXNet 1.9.1
Intel® Neural Compressor v2.1 Release
- Highlights
- Features
- Improvement
- Bug Fixes
- Examples
- Documentations
Highlights
- Support and enhance SmoothQuant on popular large language models (LLMs) (e.g., BLOOM-176B, OPT-30B, GPT-J-6B, etc.)
- Support native Keras model quantization (Keras model as input, and quantized Keras model as output)
- Provide auto-tuning strategy to improve quantization productivity
- Support model conversion from TensorFlow INT8 to ONNX INT8 model (see the export sketch after this list)
- Polish documentation to help users get started more easily
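As an illustration of the TensorFlow INT8 to ONNX INT8 conversion above, a hedged sketch; `q_model` is assumed to be the object returned by `quantization.fit` on a TensorFlow model, and the file name is illustrative:

```python
# Sketch: export an INC-quantized TensorFlow model to ONNX INT8 (QDQ).
from neural_compressor.config import TF2ONNXConfig

export_config = TF2ONNXConfig(dtype="int8")  # "fp32" export is also supported
q_model.export("resnet50_int8.onnx", export_config)
```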
Features
- [Quantization] Support SmoothQuant and verify with LLMs (commit cbb5cf) (commit 08e255) (commit 12c101)
- [Quantization] Support Keras functional model quantization with Keras model in, quantized Keras model out (commit efd737); see the sketch after this list
- [Strategy] Add auto quantization level as the default tuning process (commit cdfb99)
- [Strategy] Integrate quantization recipes into tuning strategy (commit 44d176)
- [Strategy] Extend the strategy capability for adding the new data type (commit d0059c)
- [Strategy] Enable tuning-strategy-level multi-node distributed quantization (commit e1fe50)
- [AMP] Support ONNX Runtime with FP16 (commit 108c24)
- [Productivity] Export TensorFlow models into ONNX QDQ mode at both FP32 and INT8 precisions (commit 33a235)
- [Productivity] Support PT/IPEX v2.0 (commit dbf138)
- [Productivity] Support ONNX Runtime v1.14.1 (commit 146759)
- [Productivity] Support historical versions in GitHub IO docs
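A short sketch of the Keras model in, quantized Keras model out flow; `calib_dataloader` is a user placeholder and the saved path is illustrative:

```python
# Sketch: native Keras quantization returns a quantized Keras model.
import tensorflow as tf
from neural_compressor import PostTrainingQuantConfig, quantization

keras_model = tf.keras.applications.ResNet50(weights="imagenet")
conf = PostTrainingQuantConfig()
q_model = quantization.fit(keras_model, conf, calib_dataloader=calib_dataloader)
q_model.save("./quantized_resnet50_keras")  # saved as a Keras model
```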
Improvement
- Remove the dependency on experimental API (commit 6e10ef)
- Enhance the GUI diagnosis function for model graph and tensor histogram display (commit 9f0891)
- Optimize memory usage for PyTorch adaptor (commit c295a7), ONNX adaptor (commit 8cbf2e), TensorFlow adaptor (commit ad0f1e), and tuning strategy (commit c49300) to support LLM
- Refine ONNX Runtime QDQ quantization graph (commit c64a5b)
- Enable ONNX model quantization with NVIDIA GPU TRT EP (commit ba42d0)
- Improve code line coverage to 85%
Bug Fixes
- Fix mixed precision config setting (commit 4b71a8)
- Fix multi-instance benchmark on Windows (commit 1f89aa)
- Fix domain detection for large ONNX model (commit 70a566)
Examples
- Migrate examples with INC v2.0 API
- Enable LLMs (e.g., GPT-NeoX, T5 Large, BLOOM-176B, OPT-30B, GPT-J-6B, etc.)
- Enable examples for Keras in Keras out (commit efd737)
- Enable multi-node training examples on CPU (e.g., RN50 distillation, QAT, pruning examples)
- Add 15+ HuggingFace (HF) examples with ONNX Runtime backend and upstream quantized models to HF (commit a4228d)
- Add 2 examples for PT2ONNX model export (commit 26db4a)
Documentations
- Polish documentation: a simplified GitHub main page, an easier-to-read IO Docs structure, a hands-on API migration user guide, more detailed new API instructions, a refreshed API docs template, etc.
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.7, 3.8, 3.9, 3.10
- TensorFlow 2.10.1, 2.11.0, 2.12.0
- ITEX 1.0.0, 1.1.0
- PyTorch/IPEX 1.12.1+cpu, 1.13.0+cpu, 2.0.0+cpu
- ONNX Runtime 1.12.1, 1.13.1, 1.14.1
- MXNet 1.9.1
Intel® Neural Compressor v2.0 Release
- Highlights
- Features
- Bug Fixes
- Examples
- Documentations
Highlights
- Support the quantization for Intel® Xeon® Scalable Processors (e.g., Sapphire Rapids), Intel® Data Center GPU Flex Series, and Intel® Max Series CPUs & GPUs
- Provide the new unified APIs for post-training optimizations (static/dynamic quantization) and during-training optimizations (quantization-aware training, pruning/sparsity, distillation, etc.); see the sketch after this list
- Support advanced fine-grained auto mixed precision (AMP) across all supported precisions (e.g., INT8, BF16, and FP32)
- Improve the model conversion from PyTorch INT8 model to ONNX INT8 model
- Support the zero-code quantization in Visual Studio Code and JupyterLab with Neural Coder plugins
- Support the quantization for 10K+ transformer-based models including large language models (e.g., T5, GPT, Stable Diffusion, etc.)
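For reference, a minimal sketch of the unified post-training API; `model`, `calib_dataloader`, and `eval_func` are user-supplied placeholders:

```python
# Sketch of the unified 2.0 API: a config object plus one fit() entry point.
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(approach="static")  # or approach="dynamic"
q_model = quantization.fit(
    model,
    conf,
    calib_dataloader=calib_dataloader,  # calibration data for static PTQ
    eval_func=eval_func,                # drives the accuracy-aware tuning loop
)
```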
Features
- [Quantization] Experimental Keras model in, quantized Keras model out (commit 4fa753)
- [Quantization] Support quantization for ITEX v1.0 on Intel CPU and Intel GPU (commit a2fcb2)
- [Quantization] Support hardware-neutral quantized ONNX QDQ models and validate on multiple devices (Intel CPU, NVIDIA GPU, AMD CPU, and ARM CPU) through ONNX Runtime
- [Quantization] Enhance TensorFlow QAT: remove TFMOT dependency (commit 1deb7d)
- [Quantization] Distinguish frameworks, backends and output formats for the ONNX Runtime backend (commit 2483a8)
- [Quantization] Support PyTorch/IPEX 1.13 and TensorFlow 2.11 (commit b7a2ef)
- [AMP] Support more TensorFlow bf16 ops (commit 98d3c8)
- [AMP] Add torch.amp bf16 support for IPEX backend (commit 2a361b)
- [Strategy] Add accuracy-first tuning strategies: MSE_v2 (commit 80311f) and HAWQ (commit 83018e) to solve the accuracy problem of specific models; see the strategy sketch after this list
- [Strategy] Refine the tuning strategy to add more data types and more op attributes (e.g., per-tensor/per-channel, dynamic/static)
- [Pruning] Add progressive pruning and pattern lock pruning_type (commit f46bb1)
- [Pruning] Add per_channel sparse pattern (commit f46bb1)
- [Distillation] Support self-distillation towards efficient and compact neural networks (commit acdd4c)
- [Distillation] Enhance API of intermediate layers knowledge distillation (commit 3183f6)
- [Neural Coder] Detect devices and ISA to adjust the optimization (commit 691d0b)
- [Neural Coder] Automatically quantize with ONNX Runtime backend (commit f711b4)
- [Neural Coder] Add Neural Coder Python Launcher (commit 7bb92d)
- [Neural Coder] Add Visual Studio Plugin (commit dd39ca)
- [Productivity] Support Pruning in GUI (commit d24fea)
- [Productivity] Replace YAML files with the config-driven API
- [Productivity] Export ONNX QLinear to QDQ format (commit e996a9)
- [Productivity] Validate 10K+ transformer-based models including large language models (e.g., T5, GPT, Stable Diffusion, etc.)
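A hedged sketch of selecting one of the accuracy-first tuning strategies named above; the strategy identifiers follow the strategy documentation, while the model, dataloader, and eval function names are placeholders:

```python
# Sketch: choose the MSE_v2 (or HAWQ) strategy through TuningCriterion.
from neural_compressor import PostTrainingQuantConfig, quantization
from neural_compressor.config import TuningCriterion

conf = PostTrainingQuantConfig(
    tuning_criterion=TuningCriterion(strategy="mse_v2", max_trials=100),
)
q_model = quantization.fit(model, conf,
                           calib_dataloader=calib_dataloader,
                           eval_func=eval_func)
```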
Bug Fixes
- Fix quantization failure for ONNX models over 2GB in size (commit 8d83cc)
- Fix bf16 disabled by default (commit 83825a)
- Fix PyTorch DLRM quantization out of memory (commit ff1725)
- Fix ITEX resnetv2_50 tuning accuracy (commit ae1e05)
- Fix bf16 ops error in QAT when torch version < 1.11 (commit eda8cb)
- Fix the key comparison in the Bayesian strategy (commit 1e9c12)
- Fix PyTorch T5 failing static quantization (commit ee3ef0)
Examples
- Add quantization examples of HuggingFace models with the ONNX Runtime backend (commit f4aeb5)
- Add large language model quantization example: GPT-J (commit 01899d)
- Add Distributed Distillation examples: MobileNetV2 (commit d33ebe) and CNN-2 (commit ebe9e2)
- Update examples with INC v2.0 new API
- Add Stable Diffusion example
Documentations
- Update the accuracy results on broad hardware (commit 71b056)
- Refine API helper and documents
Validated Configurations
- CentOS 8.4 & Ubuntu 20.04
- Python 3.7, 3.8, 3.9, 3.10
- TensorFlow 2.9.3, 2.10.1, 2.11.0, ITEX 1.0
- PyTorch/IPEX 1.11.0+cpu, 1.12.1+cpu, 1.13.0+cpu
- ONNX Runtime 1.11.0, 1.12.1, 1.13.1
- MXNet 1.7.0, 1.8.0, 1.9.1
Intel® Neural Compressor v1.14.2 Release
- Highlights
- Features
- Bug Fixes
- Examples
Highlights
- We provide experimental quantization support for ITEX v1.0 on Intel CPU and GPU, the first time quantization is supported on Intel GPU. We also support hardware-neutral quantized ONNX models and validate them on multiple devices (Intel CPU, NVIDIA GPU, AMD CPU, and ARM CPU) through ONNX Runtime.
Features
- Support quantization on PyTorch v1.13 (commit 97c946)
- Support experimental quantization for ITEX v1.0 on Intel CPU and GPU (commit a2fcb2)
- Support GUI on native Windows (commit fe9923)
- Support INT8 model load and save API with IPEX backend (commit 23c585)
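A hedged sketch of the save/load flow from the last item; the load utility path follows the 1.x PyTorch utilities, and the paths and model names are placeholders:

```python
# Sketch: rebuild an INT8 model from a saved checkpoint for inference.
from neural_compressor.utils.pytorch import load

# After quantization: q_model.save("./saved_results")
int8_model = load("./saved_results", fp32_model)  # fp32_model is the original model
```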
Bug Fixes
- Fix GPT2 quantization failure with the ONNX Runtime backend (commit aea121)
Examples
- Support personalized Stable Diffusion with few-shot fine-tuning (commit 4247fd)
- Add ITEX examples efficientnet_v2_b0, mobilenet_v1, mobilenet_v2, inception_resnet_v2, inception_v3, resnet101, resnet50, vgg16, xception, densenet121, etc. (commit 6ab557)
- Validate quantized ONNX model on multiple devices (Intel CPU, NVIDIA GPU, AMD CPU, and ARM CPU) (commit 288340)
Validated Configurations
- CentOS 8.4
- Python 3.8
- TensorFlow 2.10, ITEX 1.0
- PyTorch 1.12.0+cpu, 1.13.0+cpu, IPEX 1.12.0
- ONNX Runtime 1.12
- MXNet 1.9
Intel® Neural Compressor v1.14.1 Release
- Bug Fixes
- Productivity
- Examples
Bug Fixes
- Fix name matching issue of scale and zero-point in PyTorch (commit fd7a53)
- Fix incorrect output quantization mode of MatMul + Relu fusion in TensorFlow (commit 9b5293)
Productivity
- Support ONNX models with Python 3.10 (commit 2faf0b)
- Use the TensorFlow create_file_writer API to support TensorBoard histograms (commit f34852)
Examples
- Add NAS notebooks (commit 5f0adf)
- Add Bert-Mini 2:4, 1x4, and mixed-pattern examples with the new Pruning API (commit a52074)
- Add Keras-in/saved_model-out examples: resnet101, inception_v3, mobilenetv2, xception, resnetv2 (commit fdd40e)
Validated Configurations
- Python 3.7, 3.8, 3.9, 3.10
- CentOS 8.3 & Ubuntu 18.04 & Win10
- TensorFlow 2.9, 2.10
- Intel TensorFlow 2.7, 2.8, 2.9
- PyTorch 1.10.0+cpu, 1.11.0+cpu, 1.12.0+cpu
- IPEX 1.10.0, 1.11.0, 1.12.0
- MXNet 1.7, 1.9
- ONNX Runtime 1.10, 1.11, 1.12
Intel® Neural Compressor v1.14 Release
- Highlights
- New Features
- Improvements
- Bug Fixes
- Productivity
- Examples
Highlights
We are excited to announce the release of Intel® Neural Compressor v1.14! We release a new Pruning API for PyTorch, allowing users to select better combinations of criteria, patterns, and schedulers to achieve better pruning accuracy. This release also supports Keras input for TensorFlow quantization and self-distilled quantization for better quantization accuracy.
New Features
- Pruning/Sparsity
- Quantization
- GUI
  - Add mixed precision (commit 26e902)
Improvements
- Enhance tuning for quantization with IPEX 1.12 to remove additional Quant/DeQuant ops (commit 192100)
- Add upstream and download APIs for the HuggingFace model hub, handling configuration files, tokenizer files, and INT8 model weights in the transformers format (commit 46d945); see the loading sketch after this list
- Align with the new Intel Extension for PyTorch API (commit cc368a)
- Add loading with YAML and .pt files for compatibility with the older PyTorch model saving format (commit a28705)
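A sketch of the download path mentioned above; the utility module follows the 1.x API, and the model id is illustrative of the INT8 models published on the hub:

```python
# Sketch: load an INC-quantized INT8 transformers model from the HF hub.
from neural_compressor.utils.load_huggingface import OptimizedModel

int8_model = OptimizedModel.from_pretrained(
    "Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-static"
)
```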
Bug Fixes
- Quantization
- Export
  - Fix export_to_onnx API (commit 158c7f)
Productivity
- Support TensorFlow 2.10.0 (commit d6b6c9 & 8130e7)
- Support ONNX Runtime 1.12 (commit 498ac4)
- Export PyTorch QAT to ONNX (commit 029a63)
- Add TensorFlow and PyTorch container TPP file (commit d245b5)
Examples
- Add example of download from HuggingFace model hub and example of upstream models to the hub (commit 46d945)
- Add notebooks for Neural Coder (commit 105db7)
- Add 2 IPEX examples: bert_large (squad), distilbert_base (squad) (commit 192100)
- Add 2 DDP prune-once-for-all examples: roberta-base and bert-base (commit 26a476)
Validated Configurations
- Python 3.7, 3.8, 3.9, 3.10
- CentOS 8.3 & Ubuntu 18.04 & Win10
- TensorFlow 2.9, 2.10
- Intel TensorFlow 2.7, 2.8, 2.9
- PyTorch 1.10.0+cpu, 1.11.0+cpu, 1.12.0+cpu
- IPEX 1.10.0, 1.11.0, 1.12.0
- MXNet 1.7, 1.9
- ONNX Runtime 1.10, 1.11, 1.12
Intel® Neural Compressor v1.13.1 Release
Features
- Support experimental auto-coding quantization for PyTorch
  - Post-training static and dynamic quantization for PyTorch
  - Post-training static quantization for IPEX
  - Mixed-precision (BF16, INT8, and FP32) for PyTorch
- Refactor quantization utilities for ONNX Runtime
Bug Fixes
- Fixed model compression orchestration issue caused by PyTorch v1.11
- Fixed GUI issues
Validated Configurations
- Python 3.8
- CentOS 8.4
- TensorFlow 2.9
- Intel TensorFlow 2.9
- PyTorch 1.12.0+cpu
- IPEX 1.12.0
- MXNet 1.7.0
- ONNX Runtime 1.11.0
Intel® Neural Compressor v1.13 Release
Features
- Quantization
  - Support new quantization APIs for Intel TensorFlow
  - Support FakeQuant (QDQ) quantization format for ITEX
  - Improve INT8 quantization recipes for ONNX Runtime
- Mixed Precision
  - Enhance mixed precision interface to support BF16 (FP16) mixed with FP32
- Neural Architecture Search
  - Support SuperNet-based neural architecture search (DyNAS)
- Sparsity
  - Support training for block-wise structured sparsity
- Strategy
  - Support operator-type based tuning strategy
Productivity
- Support light (default) and full binary packages (default package size 0.5MB, full package size 2MB)
- Add experimental accuracy diagnostic feature for INT8 quantization including tensor statistics visualization and fine-grained precision setting
- Add experimental one-click BF16/INT8 low precision enabling & inference optimization, the industry's first code-free solution
Ecosystem
- Upstream 4 more quantized models (emotion_ferplus, ultraface, arcface, bidaf) to ONNX Model Zoo
- Upstream 10 quantized Transformers-based models to HuggingFace Model Hub
Examples
- Add notebooks for Quantization on Intel DevCloud, Distillation/Sparsity/Quantization for BERT-Mini SST-2, and Neural Architecture Search (DyNAS)
- Add more quantization examples from TensorFlow Model Zoo
Validated Configurations
- Python 3.8, 3.9, 3.10
- CentOS 8.3 & Ubuntu 18.04 & Win10
- TensorFlow 2.7, 2.8, 2.9
- Intel TensorFlow 2.7, 2.8, 2.9
- PyTorch 1.10.0+cpu, 1.11.0+cpu, 1.12.0+cpu
- IPEX 1.10.0, 1.11.0, 1.12.0
- MXNet 1.6.0, 1.7.0, 1.8.0
- ONNX Runtime 1.9.0, 1.10.0, 1.11.0
Intel® Neural Compressor v1.12 Release
Features
- Quantization
  - Support accuracy-aware AMP (INT8/BF16/FP32) on PyTorch
  - Improve post-training quantization (static & dynamic) on PyTorch
  - Improve post-training quantization on TensorFlow
  - Improve QLinear and QDQ quantization modes on ONNX Runtime
  - Improve accuracy-aware AMP (INT8/FP32) on ONNX Runtime
- Pruning
  - Improve pruning-once-for-all for NLP models
- Sparsity
  - Support experimental sparse kernel for reference examples
Productivity
- Support model deployment by loading INT8 models directly from HuggingFace model hub
- Improve GUI with optimized model downloading, performance profiling, etc.
Ecosystem
- Highlight simple quantization usage with a few clicks on ONNX Model Zoo
- Upstream INC quantized models (ResNet101, Tiny YOLOv3) to ONNX Model Zoo
Examples
- Add Bert-mini distillation + quantization notebook example
- Add DLRM & SSD-ResNet34 quantization examples on IPEX
- Improve BERT structured sparsity training example
Validated Configurations
- Python 3.8, 3.9, 3.10
- CentOS 8.3 & Ubuntu 18.04 & Win10
- TensorFlow 2.6.2, 2.7, 2.8
- Intel TensorFlow 1.15.0 UP3, 2.7, 2.8
- PyTorch 1.8.0+cpu, 1.9.0+cpu, 1.10.0+cpu
- IPEX 1.8.0, 1.9.0, 1.10.0
- MXNet 1.6.0, 1.7.0, 1.8.0
- ONNX Runtime 1.8.0, 1.9.0, 1.10.0