
[RFC] XPU Enabling Plan for TorchAO #3576

@liangan1

Description


Motivation

TorchAO is one of the core PyTorch libraries and an important part of the AI infrastructure. The Intel XPU device is an in-tree device in PyTorch, like CUDA. As the hardware vendor, it is critical for us to enrich the software ecosystem so that users can develop their applications on Intel GPUs as demand grows. As described in the PyTorch XPU getting-started guide (https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html), PyTorch can be used to develop and deploy AI applications on both Intel® Arc™ and Intel® Core™ Ultra series platforms. We want users to get the same OS and hardware support matrix in TorchAO as in PyTorch.
XPU is the device name for Intel GPUs in PyTorch; it is not prototype hardware. Both Intel® Arc™ and Intel® Core™ Ultra series platforms use this device name in PyTorch.

Since the software stack maps almost one-to-one between CUDA and XPU, we plan to align the TorchAO feature set on XPU with CUDA (except for CUDA-specific features) so that users can migrate seamlessly from the CUDA device to the XPU device.
In general, we will first focus on features used by internal tasks. In the past few months, we have upstreamed int4, int8, and fp8 with tensor-wise and channel-wise scaling.
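
To illustrate the intended migration path, here is a minimal sketch of fp8 quantization with row-wise (channel-wise) scaling in which the only CUDA-to-XPU change is the device string. It assumes the current torchao `quantize_` API with `Float8DynamicActivationFloat8WeightConfig` and `PerRow`; exact config names may differ between torchao versions.

```python
import torch
import torch.nn as nn
from torchao.quantization import (
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
    quantize_,
)

# The same TorchAO call is intended to work on both backends;
# only the device string changes ("cuda" -> "xpu").
device = "xpu" if torch.xpu.is_available() else "cuda"

model = nn.Sequential(nn.Linear(1024, 1024, bias=False)).to(device, dtype=torch.bfloat16)

# fp8 dynamic activation + fp8 weight, scaled per row (channel-wise).
quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

x = torch.randn(16, 1024, device=device, dtype=torch.bfloat16)
with torch.inference_mode():
    y = model(x)
```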

Goal

  • The same user experience as CUDA
  • Almost zero migration effort from CUDA to XPU
  • On-par feature scope with CUDA, with the same code quality ensured by CI

Methodology

  • Reuse TorchAO code as much as possible
    E.g., the same quantization configs for int8/fp8/mxfp8 as CUDA
  • Device dispatch is handled by PyTorch, and the kernels are kept in PyTorch core (see the sketch after this list)
    E.g., scaled_mm for fp8, int_mm for int8
  • Define an XPU-specific config only when the CUDA version is not applicable to XPU
    E.g., Int4PlainInt32Tensor for XPU, used by weight-only int4 (WOQ int4)
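
As a concrete example of the second bullet, the sketch below calls the fp8 matmul op in PyTorch core that the TorchAO fp8 inference path dispatches to. `torch._scaled_mm` is a private op and its argument list has changed across PyTorch releases, so treat the exact signature here as an assumption; the point is that the kernel and the device dispatch live in PyTorch, not in TorchAO.

```python
import torch

device = "xpu"  # the same call dispatches to the CUDA kernel when device == "cuda"

M, K, N = 16, 64, 32
a = torch.randn(M, K, device=device).to(torch.float8_e4m3fn)      # row-major (M, K)
b = torch.randn(N, K, device=device).to(torch.float8_e4m3fn).t()  # column-major (K, N)

# Per-tensor scales for simplicity; the rowwise recipe passes per-row scales instead.
scale_a = torch.tensor(1.0, device=device)
scale_b = torch.tensor(1.0, device=device)

out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
```

Because the op lives in PyTorch core, enabling it for XPU there is enough for the TorchAO-side configs and tensor-subclass code to stay device-agnostic.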

Features Plan

Based on the above philosophy and the feature scope in the TorchAO Features Overview, we summarize the plan and status of features for XPU below.

Stable Workflows

🟢 = stable, 🟡 = prototype, 🟠 = planned, ⚪ = not supported

| recommended hardware | weight | activation | quantized training | QAT | PTQ data algorithms | quantized inference |
|---|---|---|---|---|---|---|
| BMG GPUs | float8 rowwise | float8 rowwise | 🟠 | 🟠 | 🟠 | 🟢 (link) |
| BMG GPUs | int4 | bfloat16/float16 | 🟠 | | 🟠 HQQ, 🟡 AWQ, GPTQ | 🟢 (link) |
| BMG GPUs | int8 | bfloat16 | 🟠 | | | 🟢 (link) |
| BMG GPUs | int8 | int8 | 🟠 | 🟠 | | 🟢 (link) |

Prototype Workflows

🟢 = stable, 🟡 = prototype, 🟠 = planned, ⚪ = not supported

| recommended hardware | weight | activation | quantized training | QAT | PTQ data algorithms | quantized inference |
|---|---|---|---|---|---|---|
| | mxfp8 | mxfp8 | 🟠 | | | 🟡 (link) |
| | mxfp4 | mxfp4 | ⚪ not supported | 🟠 | 🟠 | 🟠 |
| BMG, CRI GPUs | float8 128x128 (blockwise) | float8 1x128 | 🟠 | 🟠 | | |

Other Features

Product Specs

All specs for Intel® Arc™ and Intel® Core™ Ultra series platforms can be found on the Intel® Arc™ GPUs page. The following are the details of the Intel® Arc™ B580.
