Motivation
TorchAO is one of the core PyTorch libraries and an important part of the AI infrastructure. The Intel XPU device is an in-tree device in PyTorch, like CUDA. As a hardware vendor, it is critical for us to enrich the software ecosystem so that users can develop their applications on Intel GPUs as demand grows. As described in https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html, one can use PyTorch to develop and deploy AI applications on both Intel® Arc and Intel® Core™ Ultra series platforms. We want users to get the same OS and hardware support matrix in TorchAO as in PyTorch.
XPU is the device code name for Intel GPUs in PyTorch; it is not prototype hardware. Both Intel® Arc and Intel® Core™ Ultra series platforms use this code name in PyTorch.
- The Intel® Core™ Ultra series platforms, such as Lunar Lake and the recently announced Panther Lake (CES 2026), are important AI PC products and are widely used by individuals to develop their AI applications. We have published a blog showing the usage of TorchAO on AI PCs: https://pytorch.org/blog/pytorch-2-8torchao-unlock-efficient-llm-inference-on-intel-ai-pcs/.
- Intel® Arc series GPUs, such as Alchemist and Battlemage, are discrete GPUs.
Since the software stack maps almost one-to-one between CUDA and XPU, we plan to align TorchAO features on XPU with CUDA (except for CUDA-specific features) to help users migrate seamlessly from CUDA devices to XPU devices.
In general, we will first focus on features used by internal tasks. In the past few months, we have upstreamed int4, int8, and fp8 with tensor- and channel-wise scaling.
Goal
- User experience similar to CUDA's
- Almost zero migration effort from CUDA to XPU
- Feature scope on par with CUDA, with the same code quality, ensured by CI
Methodology
- Reuse TorchAO code as much as possible (see the sketch after this list)
  - E.g., the same quantization configs for int8/fp8/mxfp8 as on CUDA
- Let PyTorch handle device dispatch and keep the kernels in PyTorch core
  - E.g., scaled_mm for fp8, int_mm for int8
- Define XPU-specific configs only when the CUDA version is not applicable to XPU
  - E.g., Int4PlainInt32Tensor for XPU, used by weight-only int4 (woq-int4)
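To make the reuse concrete, below is a minimal sketch of weight-only int8 quantization that runs unchanged on XPU. It assumes a recent TorchAO build with XPU support; `quantize_` and `Int8WeightOnlyConfig` are TorchAO's public APIs, and the model and shapes are illustrative.

```python
import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# The same TorchAO config used on CUDA works on XPU;
# PyTorch routes the kernels via its device dispatcher.
device = "xpu" if torch.xpu.is_available() else "cuda"

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(device=device, dtype=torch.bfloat16)

# In-place weight-only int8 quantization; the underlying int matmul
# lives in PyTorch core, so no TorchAO code change is needed for XPU.
quantize_(model, Int8WeightOnlyConfig())

x = torch.randn(8, 1024, device=device, dtype=torch.bfloat16)
with torch.no_grad():
    y = model(x)
print(y.shape)
```

The only device-specific line is the device string; the same script runs on CUDA unmodified, which is the migration experience we are aiming for.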
Features Plan
Based on the above philosophy and the feature scope in the TorchAO Features Overview, we summarize the plan and status of XPU features below.
Stable Workflows
🟢 = stable, 🟡 = prototype, 🟠 = planned, ⚪ = not supported
| recommended hardware | weight | activation | quantized training | QAT | PTQ data algorithms | quantized inference |
|---|---|---|---|---|---|---|
| BMG GPUs | float8 rowwise | float8 rowwise | 🟠 | 🟠 | 🟠 | 🟢 (link) |
| BMG GPUs | int4 | bfloat16/float16 | ⚪ | 🟠 | 🟠 HQQ, 🟡 AWQ/GPTQ | 🟢 (link) |
| BMG GPUs | int8 | bfloat16 | ⚪ | 🟠 | ⚪ | 🟢 (link) |
| BMG GPUs | int8 | int8 | 🟠 | 🟠 | ⚪ | 🟢 (link) |
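As a usage sketch for the stable inference rows above (assuming a recent TorchAO build with XPU support), the float8-rowwise and int4 weight-only recipes can be applied with TorchAO's public configs; the model, shapes, and `group_size=128` below are illustrative choices.

```python
import torch
from torchao.quantization import (
    quantize_,
    Float8DynamicActivationFloat8WeightConfig,
    Int4WeightOnlyConfig,
    PerRow,
)

def make_model():
    return torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
    ).to("xpu", dtype=torch.bfloat16)

# Row 1: float8 weight + float8 activation with rowwise scaling.
fp8_model = make_model()
quantize_(fp8_model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

# Row 2: int4 weight with bfloat16 activation; group_size=128 is illustrative.
int4_model = make_model()
quantize_(int4_model, Int4WeightOnlyConfig(group_size=128))

x = torch.randn(4, 4096, device="xpu", dtype=torch.bfloat16)
with torch.no_grad():
    print(fp8_model(x).shape, int4_model(x).shape)
```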
Prototype Workflows
🟢 = stable, 🟡 = prototype, 🟠 = planned, ⚪ = not supported
| recommended hardware | weight | activation | quantized training | QAT | PTQ data algorithms | quantized inference |
|---|---|---|---|---|---|---|
| | mxfp8 | mxfp8 | 🟠 | ⚪ | ⚪ | 🟡 (link) |
| | mxfp4 | mxfp4 | ⚪ | 🟠 | 🟠 | 🟠 |
| BMG, CRI GPUs | float8 128x128 (blockwise) | float8 1x128 | 🟠 | ⚪ | ⚪ | 🟠 |
Other Features
- Integrations
- Benchmarks
- SafeTensor support: [xpu][feat] Add xpu support for safetensor #3575
Product Specs
All specs for Intel® Arc and Intel® Core™ Ultra series platforms can be found in Intel® Arc™ GPUs. The following are the details of the Intel® Arc™ B580: