Releases: quic/efficient-transformers

release/v1.21.0

22 Dec 18:08
9757c11


Newly Onboarded Models

Causal Models

  • GPT OSS (Text). Example script: efficient-transformers/examples/disagg_serving
  • Mistral 3.1 (Text). Example script: mistral3_example.py
  • Qwen 3 / Qwen3-MoE. Example script: qwen3moe_example
  • OLMo. Example script: efficient-transformers/examples/text_generation

Multimodal Models

  • Qwen 2.5-VL (Vision-Language). Example script: qwen2_5_vl_example.py
  • Molmo (Vision-Language). Example script: molmo_example.py
  • Gemma 3 (Vision-Language). Example script: gemma3_example
  • InternVL 3.5 (Vision-Language). Example script: intern_example

Audio

  • Wav2Vec2 (ASR). Example script: wav2vec2_example

Diffusion Models

  • Example scripts: efficient-transformers/examples/diffusers


New Features

Diffusers Pipeline Support

Diffusers pipeline support enables seamless integration of diffusion models into the QEfficient library. (#604) (#669)

Supported models and example scripts: efficient-transformers/examples/diffusers
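
A minimal usage sketch under stated assumptions: the class name QEFFStableDiffusionPipeline, the compile() call, and the model checkpoint are illustrative guesses rather than the confirmed API; see the example scripts above for actual usage.

# Hypothetical sketch: class name and compile arguments are assumptions;
# refer to efficient-transformers/examples/diffusers for the real interface.
from QEfficient import QEFFStableDiffusionPipeline  # assumed entry point

pipeline = QEFFStableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
pipeline.compile()  # compile the pipeline components for the AI 100 target
image = pipeline("A photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")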

GPT OSS with Disaggregated Serving

Support for GPT OSS using the disaggregated serving model. (#608)

Example scripts: efficient-transformers/examples/disagg_serving

Compute-Context-Length (CCL) Support

This feature allows users to optimize the throughput of large language models (LLMs) when handling very long context lengths. (#576) (#663)

Example scripts: efficient-transformers/examples/performance/compute_context_length
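
A hedged sketch of enabling CCL at compile time: QEFFAutoModelForCausalLM and the prefill_seq_len/ctx_len compile knobs are the library's documented entry points, but the comp_ctx_lengths parameter name is an assumption; the exact interface lives in the example scripts above.

from QEfficient import QEFFAutoModelForCausalLM

model = QEFFAutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# comp_ctx_lengths is an assumed parameter name: the idea is to compile several
# compute-context-length graphs so decode runs on the smallest graph that fits
# the current sequence length instead of always paying for the full ctx_len.
model.compile(prefill_seq_len=128, ctx_len=131072, comp_ctx_lengths=[4096, 32768, 131072])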

ONNX Sub Functions Export Feature for AutoModelForCausalLM

This feature enables more efficient model compilation and execution on hardware. Users can enable it by passing use_onnx_subfunctions=True during export. (#621) (#642)

model.export(tmp_path, use_onnx_subfunctions=True)

Note: Currently, we are seeing some performance degradation and output discrepancies with subfunctions enabled. We will continue to monitor and evaluate this behavior, and once these issues are resolved, subfunctions will be enabled by default.
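
For context, the snippet above in slightly fuller form; the model name and output directory are illustrative.

from QEfficient import QEFFAutoModelForCausalLM

model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")  # illustrative model
# Export with ONNX subfunctions enabled (opt-in for now; see the note above).
model.export("qeff_onnx_out", use_onnx_subfunctions=True)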

Continuous Batching for VLMs

VLMs now support continuous batching, including scenarios with multiple images and prompts. (#610)

BlockedKV Attention in CausalLM

Implements a blocked K/V cache layout so attention reads/processes the cache block-by-block, improving long-context decode performance. (#618)
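
For intuition only, a toy PyTorch sketch of block-by-block cache reads; this illustrates the access pattern, not the library's actual kernel.

import torch

def blocked_scores(q, k_cache, block_size=128):
    # q: (heads, 1, dim) decode-step query; k_cache: (heads, seq_len, dim).
    # Walk the cache one block at a time so the working set stays small
    # even at very long context lengths.
    out = []
    for start in range(0, k_cache.shape[1], block_size):
        k_block = k_cache[:, start:start + block_size, :]
        out.append(q @ k_block.transpose(-1, -2))  # (heads, 1, block)
    return torch.cat(out, dim=-1)  # (heads, 1, seq_len)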

Memory Profiling Tool

Adds scripts to profile memory during export/compile/infer (peak usage, cache footprint) for quicker diagnosis. (#674)

Scripts: efficient-transformers/scripts/memory_profiling

Extended On-Device Sampling

This feature extends on-device sampling support to the language decoder of dual-QPC vision-language models and adds guided decoding capabilities to on-device sampling. (#597) (#624)

Example script: efficient-transformers/examples/performance/on_device_sampling.py
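
A hedged sketch of turning on-device sampling on; the qaic_config keys below are assumptions modeled on the example script above, not a definitive schema.

from QEfficient import QEFFAutoModelForCausalLM

# Assumed configuration keys; consult on_device_sampling.py for the
# options this release actually supports (e.g., guided decoding).
model = QEFFAutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    qaic_config={"include_sampler": True, "return_pdfs": False},
)
model.compile(num_cores=16)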

ONNX Transform, Memory & Time Optimizations

Adds periodic memory cleanup (e.g., to FP16ClipTransform / SplitTensorsTransform) during large-tensor processing, and avoids redundant external data loading when already present. (#640)

Dependency Upgrades

  • Transformers 4.55
  • Torch 2.7.0+cpu
  • Torchvision 0.22.0+cpu
  • Python ≥3.9

Removed Platform SDK Dependency

Enables QPC generation on systems without the Platform SDK. (#609)

Example Scripts Revamp

This includes:

  • Onboarding Guide for adding new Causal models (#574)
  • Onboarding Guide for adding new Custom ops in QEff (#638)
  • Organized examples into domain-specific subdirectories (#615)

Fine Tuning

Checkpoint Management

Resume from epochs with proper state restoration: adds resume-from-epoch and epoch-checkpoint loading so runs can be restarted with the correct optimizer/scaler/model state. (#614)
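
In generic PyTorch terms (a sketch, not the finetuning script itself), restoring a run with the correct state looks like this:

import torch

def save_epoch_checkpoint(path, epoch, model, optimizer, scaler):
    # Persist everything needed to restart mid-training.
    torch.save({
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scaler": scaler.state_dict(),
    }, path)

def resume_from_checkpoint(path, model, optimizer, scaler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scaler.load_state_dict(ckpt["scaler"])
    return ckpt["epoch"] + 1  # first epoch to run after resuming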

Enhanced Loss Tracking

Corrected data type handling for accurate loss computation. Refinements in the finetune/eval path improve numerical stability when computing losses and metrics. (#606)

Custom Dataset Support

Improved handling with better tokenization. Fixes around padding/token typing (e.g., pad_to_max_length) ensure robust dataset ingestion across varied corpora. (#599)
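
For illustration, standard Hugging Face tokenization with explicit max-length padding, the kind of call the fix hardens (generic usage, not the patched code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model
tokenizer.pad_token = tokenizer.eos_token  # make sure a pad token exists
batch = tokenizer(
    ["short prompt", "a considerably longer training prompt"],
    padding="max_length", max_length=32, truncation=True, return_tensors="pt",
)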

Device-Aware Scaling

Optimized GradScaler for multi-device training. DDP + pipeline-parallel fixes improve device mapping/scaling behavior during mixed-precision training. (#544)
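
In plain PyTorch terms, passing the device type to the scaler looks like this (illustrative; the script's actual wiring may differ):

import torch

device_type = "cuda" if torch.cuda.is_available() else "cpu"
# torch.amp.GradScaler accepts the device type directly, so the same
# mixed-precision loop can scale gradients on different backends.
scaler = torch.amp.GradScaler(device_type)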

release/v1.20.0

20 Jun 17:26
4da283e


Newly Onboarded Models

Added support for Llama-4-Scout-17B-16E-Instruct
  o Sample script for text only (recommended class for testing is QEFFAutoModelForImageTextToText).
  o Sample script for image + text (recommended class for testing is QEFFAutoModelForImageTextToText).
  o Single-QPC and dual-QPC support (please check the comment section of the example script for running single QPC).
  o Added support for chunk attention in Llama4.
  o Continuous batching and multi-batch execution are planned for release 1.21.0.
  o With the redefined interface between QEFF and vLLM, multiple images can be run in a single prompt; please follow the example.

Added support for Grok-1
  o Since the architecture for this model is Grok1ModelForCausalLM, it can be executed using QEffAutoModelForCausalLM.

Added support for Gemma3
  o Sample script for text only (recommended class for testing is QEFFAutoModelForImageTextToText).
  o Sample script for image + text.
  o Added support for sliding window.
  o Continuous batching and multi-batch execution are planned for release 1.21.0.

Added support for Granite Vision models
  o Sample script

Added support for Granite MoE models

New Features

Upgraded the Transformers version to 4.51.3.

SpD multi-projection heads
  o Implemented post-attention hidden-size projections to speculate tokens ahead of the base model.

Added compilation support for the io_encrypt flag
  o Added support for the Model-IP I/O encryption feature using qaic-exec (compile only).
  o Users can now directly pass the --io-encrypt flag in both high-level APIs (compile) and command-line APIs (infer and compile).

Support for separate prefill and decode compilation
  o Added support for separate prefill and decode compilation for encoder (vision) and language models. This feature will be utilized for disaggregated serving.

New features for embedding models:
  o Flexible pooling configuration:
    - Users can specify popular pooling strategies via string identifiers or provide custom pooling methods.
    - This enables seamless integration of pooling at the end of the embedding model, offering flexibility for various use cases. Pooling will also run on AI 100 for improved performance.
    - Sample script
    - Added support for sentence embedding. With pooling added, Efficient-Transformers now enables direct sentence-embedding generation on AI 100, improving efficiency and semantic quality for downstream tasks.
  o Support for compilation with multiple sequence lengths (see the sketch after this list):
    - Users can specify a single seq_len value or a list of values during compilation (example).
    - At generation time, the closest greater-than-or-equal seq_len graph in the QPC is auto-selected for optimal execution.

Added support for On-Device Sampling for CausalLM models.
  o Sampling now runs directly on the QAIC device, reducing host-device communication and boosting inference throughput and scalability.
  o Documentation and usage guide.

Added support for the SwiftKV model (Snowflake/Llama-3.1-SwiftKV-8B-Instruct)
  o Added support for both continuous and non-continuous batching execution in SwiftKV.
  o Since the architecture for this model is LlamaSwiftKVForCausalLM, it can be executed using QEffAutoModelForCausalLM.

Added support for execution of GGUF models (without quantized weights)
  o Sample script.

Added support for compressed quantization status for FP8 models.
  o Infermatic/Llama-3.3-70B-Instruct-FP8-Dynamic · Hugging Face

QNN updates:
  o Updated the QNN custom IO generation method to adhere to compiler changes.
  o Added --target_backend AIC as a default parameter in the QNN Converter.
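
As referenced in the embedding-model item above, a hedged sketch of compiling with multiple sequence lengths; the list-of-seq_len form follows the release note, while the model name and the pooling argument are illustrative assumptions.

from QEfficient import QEFFAutoModel

# One QPC holding graphs for several sequence lengths; at generation time
# the closest greater-than-or-equal seq_len graph is auto-selected.
model = QEFFAutoModel.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2", pooling="mean"  # pooling arg assumed
)
model.compile(seq_len=[32, 64, 128], num_cores=16)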

Fine Tuning

• Added BERT fine-tuning support (doc).
• Documentation and a code template for running fine-tuning on a custom dataset.
• Added a --help option describing the training parameters.
• Added support for gradient checkpointing in the fine-tuning script.
• Added support for passing the device type in torch GradScaler.
• Detailed documentation is available here.

Upcoming models

• Qwen3
• Mistral 3.1

Upcoming features

• Compute-context-length support (planned for 1.21.0).
• Support for passing an MDP file to the compiler during compilation (planned as a bug fix in 1.20.0).
• Upgrading the ONNX dependency is required to address a security vulnerability identified in the current version of ONNX.
  o (onnx==1.18.0, onnxruntime==1.22, onnxscript==0.2.5, protobuf==6.31.0) (planned for 1.21.0)
• Support for -inf pad tokens for optimized softmax handling in the compiler (planned for 1.21.0).

release/v1.19.3

28 Feb 16:40
2b17ebd


Added Features

  • Vision Language Model
  • Speech Sequence to Sequence Model
  • Support for FP8 Execution
  • Prompt-Lookup Decoding sample script.

release/v1.17.0

22 Dec 18:09
7961c3f


What's Changed

New Contributors

Full Changelog: https://github.com/quic/efficient-transformers/commits/V1.17