Releases: quic/efficient-transformers
release/v1.21.0
Newly Onboarded Models
Causal Models
GPT OSS (Text)
Example script: efficient-transformers/examples/disagg_serving
Mistral 3.1 (Text)
Example script: mistral3_example.py
Qwen 3 / Qwen3-MoE
Example script: qwen3moe_example
Olmo
Example script: efficient-transformers/examples/text_generation
Multimodal Models
Qwen 2.5-VL (Vision-Language)
Example script: qwen2_5_vl_example.py
Molmo (Vision-Language)
Example script: molmo_example.py
Gemma 3 (Vision-Language)
Example script: gemma3_example
InternVL 3.5 (Vision-Language)
Example script: intern_example
Audio
Wav2Vec2 (ASR)
Example script: wav2vec2_example
Diffusion Models
Example scripts: efficient-transformers/examples/diffusers
New Features
Diffusers Pipeline Support
Diffusers pipeline support enables seamless integration of diffusion models into the QEfficient library. (#604) (#669)
Supported models:
Example scripts: efficient-transformers/examples/diffusers
GPT OSS with Disaggregated Serving
Support for GPT OSS using the disaggregated serving model. (#608)
Example scripts: efficient-transformers/examples/disagg_serving
Compute-Context-Length (CCL) Support
This feature allows users to optimize the throughput of large language models (LLMs) when handling very large context lengths. (#576) (#663)
Example scripts: efficient-transformers/examples/performance/compute_context_length
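Conceptually, the model is compiled with several compute-context-length buckets, and at run time attention spans only the smallest bucket that covers the tokens seen so far, so early decode steps do not pay for the full ctx_len. A minimal sketch of that selection logic (bucket values and names are illustrative; the real mechanism lives inside QEfficient):

```python
def select_ccl(current_len: int, ccl_buckets: list) -> int:
    """Pick the smallest compiled compute-context-length that covers the
    current sequence, so attention cost tracks the live context rather
    than the full ctx_len."""
    for ccl in sorted(ccl_buckets):
        if current_len <= ccl:
            return ccl
    return max(ccl_buckets)  # fall back to the largest compiled graph

buckets = [4096, 8192, 16384]  # hypothetical CCL values chosen at compile time
assert select_ccl(1500, buckets) == 4096
assert select_ccl(9000, buckets) == 16384
```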
ONNX Sub Functions Export Feature for AutoModelForCausalLM
This feature enables more efficient model compilation and execution on hardware. Users can enable it by passing use_onnx_subfunctions=True during export. (#621) (#642)

model.export(tmp_path, use_onnx_subfunctions=True)

Note: Currently, we are seeing some performance degradation and output discrepancies with subfunctions enabled. We will continue to monitor and evaluate this behavior, and once these issues are resolved, subfunctions will be enabled by default.
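For context, a minimal sketch of the full export flow (the model card and export directory are placeholders; from_pretrained/export follow the library's standard AutoModel API):

```python
from QEfficient import QEFFAutoModelForCausalLM

# Placeholder model card; any supported CausalLM should work the same way.
model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")

# Export to ONNX with subfunctions enabled (see the note above about
# the current performance caveats of this path).
onnx_path = model.export("./qeff_onnx", use_onnx_subfunctions=True)
print(onnx_path)
```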
Continuous Batching for VLMs
VLMs now support continuous batching, including scenarios with multiple images and prompts. (#610)
BlockedKV Attention in CausalLM
Implements a blocked K/V cache layout so attention reads/processes the cache block-by-block, improving long-context decode performance. (#618)
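For intuition, a minimal NumPy sketch of reading the cache block-by-block with an online softmax, so partial results merge exactly without materializing the full cache at once (a conceptual illustration of the layout's benefit, not the actual kernel):

```python
import numpy as np

def blocked_attention(q, k_cache, v_cache, block=128):
    """Single-query attention over a K/V cache traversed block-by-block.
    Shapes: q (d,); k_cache/v_cache (T, d)."""
    d = q.shape[-1]
    m = -np.inf          # running max of the attention logits
    denom = 0.0          # running softmax denominator
    acc = np.zeros(d)    # running weighted sum of values
    for s in range(0, k_cache.shape[0], block):
        kb, vb = k_cache[s:s + block], v_cache[s:s + block]
        logits = kb @ q / np.sqrt(d)
        m_new = max(m, logits.max())
        scale = np.exp(m - m_new)      # rescale previously merged partials
        w = np.exp(logits - m_new)
        denom = denom * scale + w.sum()
        acc = acc * scale + w @ vb
        m = m_new
    return acc / denom

# Check against the unblocked reference on random data.
rng = np.random.default_rng(0)
q = rng.normal(size=16)
K = rng.normal(size=(1000, 16))
V = rng.normal(size=(1000, 16))
logits = K @ q / np.sqrt(16)
w = np.exp(logits - logits.max())
assert np.allclose(blocked_attention(q, K, V), (w / w.sum()) @ V)
```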
Memory Profiling Tool
Adds scripts to profile memory during export/compile/infer (peak usage, cache footprint) for quicker diagnosis. (#674)
Scripts: efficient-transformers/scripts/memory_profiling
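For a quick manual check of the kind these scripts automate, peak RSS can be sampled around a stage; a minimal Unix-only sketch using Python's resource module (the bundled scripts remain the source of truth):

```python
import resource

def peak_rss_mb() -> float:
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

baseline = peak_rss_mb()
# ... run the stage of interest here: export, compile, or inference ...
# Note: ru_maxrss is a high-water mark, so the delta is non-negative and
# only reflects new peaks set during this stage.
print(f"peak RSS grew by {peak_rss_mb() - baseline:.1f} MB during this stage")
```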
Extended On-Device Sampling
This feature extends on-device sampling support to the language decoder of dual-QPC vision-language models and adds guided decoding capabilities to on-device sampling. (#597) (#624)
Example script: efficient-transformers/examples/performance/on_device_sampling.py
ONNX Transform, Memory & Time Optimizations
Adds periodic memory cleanup (e.g., to FP16ClipTransform / SplitTensorsTransform) during large-tensor processing, and avoids redundant external data loading when already present. (#640)
Dependency Upgrades
- Transformers 4.55
- Torch 2.7.0+cpu
- Torchvision 0.22.0+cpu
- Python ≥3.9
Removed Platform SDK Dependency
Supports QPC generation on systems without the Platform SDK. (#609)
Example Scripts Revamp
This includes:
- Onboarding Guide for adding new Causal models (#574)
- Onboarding Guide for adding new Custom ops in QEff (#638)
- Organized examples into domain-specific subdirectories (#615)
Fine Tuning
Checkpoint Management
Resume from epochs with proper state restoration. Adds resume-from-epoch & epoch checkpoint loading so runs can be restarted with the correct optimizer/scaler/model state. (#614)
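A minimal sketch of the state a resumable run needs to persist and restore (hypothetical helpers; the finetuning script's actual checkpoint format may differ):

```python
import torch

def save_checkpoint(path, epoch, model, optimizer, scaler):
    # Persist everything needed to resume training mid-schedule.
    torch.save({
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scaler": scaler.state_dict(),  # mixed-precision GradScaler state
    }, path)

def load_checkpoint(path, model, optimizer, scaler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scaler.load_state_dict(ckpt["scaler"])
    return ckpt["epoch"] + 1  # next epoch to run
```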
Enhanced Loss Tracking
Corrected data type handling for accurate loss computation. Refinements in the finetune/eval path improve numerical stability when computing losses and metrics. (#606)
Custom Dataset Support
Improved handling with better tokenization. Fixes around padding/token typing (e.g., pad_to_max_length) ensure robust dataset ingestion across varied corpora. (#599)
Device-Aware Scaling
Optimized GradScaler for multi-device training. DDP + pipeline-parallel fixes improve device mapping/scaling behavior during mixed-precision training. (#544)
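A minimal sketch of the device-aware pattern (recent torch versions, including the 2.7 pinned by this release, accept the device type directly in torch.amp.GradScaler, so the same loop runs on CUDA or CPU):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# Device-aware scaler: pass the device type instead of assuming CUDA.
scaler = torch.amp.GradScaler(device)

x = torch.randn(4, 8, device=device)
with torch.autocast(device_type=device):  # fp16 on CUDA, bf16 on CPU by default
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```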
release/v1.20.0
Newly Onboarded Models
• Added support for Llama-4-Scout-17B-16E-Instruct
o Sample script for Text only (Recommended class for testing is QEFFAutoModelForImageTextToText).
o Sample script for Image + Text (Recommended class for testing is QEFFAutoModelForImageTextToText).
o Single QPC + Dual QPC support (please check the comment section of the example script for running single QPC).
o Added support for chunk attention in Llama4.
o Continuous batching and multi-batch execution are planned for release 1.21.0.
o With the redefined interface between QEFF and vLLM, multiple images can now be run in a single prompt; please follow the (example) and see the sample completion below.
• Added support for Grok-1
o Since the architecture for this model is Grok1ModelForCausalLM, it can be executed using QEffAutoModelForCausalLM.
• Added support for Gemma3
o Sample script for Text only (Recommended class for testing is QEFFAutoModelForImageTextToText).
o Sample script for Image + Text.
o Added support for sliding window.
o Continuous batching and multi-batch execution are planned for release 1.21.0.
• Added support for Granite Vision models
o Sample script
• Added support for Granite MOE models
New Features
• Upgraded Transformers version to 4.51.3.
• SpD (speculative decoding) multi-projection heads
o Implemented post-attention hidden-size projections to speculate tokens ahead of the base model.
• Added compilation support for the io_encrypt flag
o Added support for the Model-IP I/O encryption feature using qaic-exec (compile only).
o Users can now directly pass the --io-encrypt flag in both high-level APIs (compile) and command-line APIs (infer and compile).
• Support for separate prefill and decode compilation
o Added support for separate prefill and decode compilation for encoder (vision) and language models. This feature will be utilized for disaggregated serving.
• New features for Embedding Models –
o Flexible pooling configuration:
Users can specify popular pooling strategies via string identifiers or provide custom pooling methods (a sketch follows this feature list).
This enables seamless integration of pooling at the end of the embedding model, offering flexibility for various use cases. Pooling will also run on AI 100 for improved performance.
Sample script
o Added support for sentence embedding.
With pooling added, Efficient-Transformers now enables direct sentence-embedding generation on AI 100, improving efficiency and semantic quality for downstream tasks.
o Support for compilation with multiple sequence lengths.
Users can specify a single seq_len value or a list of values during compilation (example).
At generation time, the closest greater-or-equal seq_len graph from the QPC is auto-selected for optimal execution.
• Added support for On Device Sampling for CausalLM models.
o Sampling now runs directly on the QAIC device, reducing host-device communication and boosting inference throughput and scalability.
o Documentation and Usage guide.
• Added support for SwiftKV model (Snowflake/Llama-3.1-SwiftKV-8B-Instruct)
o Added support for both continuous and non-continuous batching execution in SwiftKV.
o Since the architecture for this model is LlamaSwiftKVForCausalLM, it can be executed using QEffAutoModelForCausalLM.
• Added support for execution of GGUF models (without quantized weights).
o Sample script.
• Added support for compressed quantization for FP8 models.
o Example model: Infermatic/Llama-3.3-70B-Instruct-FP8-Dynamic (Hugging Face)
• QNN updates –
o Updated the QNN custom IO generation method to adhere to compiler changes.
o Added --target_backend AIC as default parameter in QNN Converter.
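As a concrete illustration of the custom pooling hook mentioned under Embedding Models above, here is a minimal mean-pooling callable in the usual (last_hidden_state, attention_mask) form; the exact signature QEfficient expects may differ, so treat this as a sketch:

```python
import torch

def mean_pooling(last_hidden_state: torch.Tensor,
                 attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings while ignoring padded positions.
    Shapes: last_hidden_state (batch, seq, hidden); attention_mask (batch, seq)."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```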
Fine Tuning
• Added BERT fine-tuning support, doc
• Documentation and a code template to run fine-tuning on a custom dataset.
• Added a --help option describing the training parameters.
• Added support for gradient checkpointing in the finetuning script.
• Added support for passing the device type to torch GradScaler.
• Detailed documentation is here
Upcoming models
• Qwen3
• Mistral 3.1
Upcoming features
• Compute context length support (planned for 1.21.0).
• Support for passing an MDP file to the compiler during compilation (planned as a bug fix in 1.20.0).
• Upgrading the ONNX dependency is required to address a security vulnerability identified in the current version of ONNX.
o (onnx==1.18.0, onnxruntime==1.22, onnxscript==0.2.5, protobuf==6.31.0) (planned for 1.21.0)
• Support for -inf for pad tokens, enabling optimized softmax handling in the compiler (planned for 1.21.0).
release/v1.19.3
Added Features
- Vision Language Model
- Speech Sequence to Sequence Model
- Support for FP8 Execution
- Prompt-Lookup Decoding sample script.
release/v1.17.0
What's Changed
- Update README.md to fix broken links and hf_download function to allo… by @quic-mamta in #4
- bugfix and checks for making sure tokenizer is always padded left by @ochougul in #7
- constrained versions for non-essential packages by @ochougul in #6
- renamed exec APIs by @ochougul in #3
- fix mdp config at compilation by @vbaddi in #13
- Update README.md with full system requirements by @quic-aashwins in #17
- Enable arg trust_remote_code to use custom tokenizers by @vbaddi in #16
- Update infer and execute API to take prompts from txt file for BS>=1 by @quic-mamta in #11
- Add Optional CustomIO Config for MXINT8 Precision by @vbaddi in #18
- Add support for new MOE model mistralai/Mixtral-8x7B-v0.1 by @quic-akuruvil in #8
- Add Json Based Pytest Pipeline for Scalability by @vbaddi in #26
- Update text generation interface to remove slash from qpc path by @quic-mamta in #28
- Mixtral readme update by @quic-amitraj in #29
- Update README.md by @anujgupt-github in #32
- Update vicuna model cards in the readme by @vbaddi in #36
- adding codeowners to restrict who can merge PRs by @ochougul in #35
- Adding QEFFAutoModel i.e. model loader for loading any type of model. by @ochougul in #31
- Update templates by @anujgupt-github in #15
- Add .gitignore file by @hupreti in #9
- Fixed input_len to work with prompt chunking by @quic-morteza in #37
- Add modeling and utils changes for transformers=v4.41.2 by @vbaddi in #33
- add fix for missing xgen pad_token_id by @quic-amitraj in #38
- Update CODEOWNERS by @anujgupt-github in #46
- Add support for GPTJ by @quic-mamta in #5
- Add custom model support by @quic-shagun in #41
- Use ctx_len from specializations.json by @vbaddi in #49
- fix: torch version reverted to 2.0.0 by @irajagop in #52
- Use eager mode attention implementation in Infer API by @quic-mamta in #54
- Deploying High Level Pytest Infra cloud api tests by @abukhoy in #57
- Continuous batching by @quic-rishinr in #61
- Added support for Mixtral architecture by @quic-rishinr in #67
- fix setup issues for the release/v1.16 branch by @vbaddi in #69
- Updated README for Release v1.16 with continuous batching flag usage by @quic-rishinr in #72
- [QEff Release v1.16]: Add support for updated rope calculations in Llama by @vbaddi in #83
- Added support for creating model path hash when model card is not pro… by @quic-rishinr in #97
New Contributors
- @quic-aashwins made their first contribution in #17
- @quic-akuruvil made their first contribution in #8
- @anujgupt-github made their first contribution in #32
- @hupreti made their first contribution in #9
- @quic-morteza made their first contribution in #37
- @abukhoy made their first contribution in #57
Full Changelog: https://github.com/quic/efficient-transformers/commits/V1.17