
[RFC] XPU Enabling Plan for TorchAO #3576

@liangan1

Description


Motivation

TorchAO is one of the core PyTorch libraries and an important part of the AI infrastructure. The Intel XPU device is an in-tree device in PyTorch, like CUDA. As the hardware vendor, it is critical for us to enrich the software ecosystem so that users can develop their applications on Intel GPUs as demand grows. As described in the PyTorch XPU getting-started guide (https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html), PyTorch can be used to develop and deploy AI applications on both Intel® Arc™ and Intel® Core™ Ultra series platforms. We want users to get the same OS and hardware support matrix in TorchAO as in PyTorch.
XPU is the device name for Intel GPUs in PyTorch; it is not prototype hardware. Both Intel® Arc™ and Intel® Core™ Ultra series platforms use this device name in PyTorch.

Since the software stack maps almost one-to-one between CUDA and XPU, we plan to align the TorchAO feature set on XPU with CUDA (except for CUDA-specific features) so that users can migrate seamlessly from the CUDA device to the XPU device.
In general, we will first focus on features used by internal tasks. In the past few months, we have upstreamed int4, int8, and fp8 with tensor-wise and channel-wise scaling.
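
To illustrate the intended migration path, here is a minimal sketch of fp8 quantization with row-wise (channel-wise) scaling in which the only CUDA-to-XPU change is the device string. It assumes the current torchao `quantize_` API with `Float8DynamicActivationFloat8WeightConfig` and `PerRow`; exact config names may differ between torchao versions.

```python
import torch
import torch.nn as nn
from torchao.quantization import (
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
    quantize_,
)

# The same TorchAO call is intended to work on both backends;
# only the device string changes ("cuda" -> "xpu").
device = "xpu" if torch.xpu.is_available() else "cuda"

model = nn.Sequential(nn.Linear(1024, 1024, bias=False)).to(device, dtype=torch.bfloat16)

# fp8 dynamic activation + fp8 weight, scaled per row (channel-wise).
quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

x = torch.randn(16, 1024, device=device, dtype=torch.bfloat16)
with torch.inference_mode():
    y = model(x)
```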

Goal

  • The same user experience as CUDA
  • Almost zero migration effort from CUDA to XPU
  • On-par feature scope with CUDA, with the same code quality ensured by CI

Methodology

  • Reuse TorchAO code as much as possible
    E.g., the same quantization configs for int8/fp8/mxfp8 as CUDA
  • Device dispatch is handled by PyTorch, and the kernels are kept in PyTorch core (see the sketch after this list)
    E.g., scaled_mm for fp8, int_mm for int8
  • Define an XPU-specific config only when the CUDA version is not applicable to XPU
    E.g., Int4PlainInt32Tensor for XPU, used by weight-only int4 (WOQ int4)
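
As a concrete example of the second bullet, the sketch below calls the fp8 matmul op in PyTorch core that the TorchAO fp8 inference path dispatches to. `torch._scaled_mm` is a private op and its argument list has changed across PyTorch releases, so treat the exact signature here as an assumption; the point is that the kernel and the device dispatch live in PyTorch, not in TorchAO.

```python
import torch

device = "xpu"  # the same call dispatches to the CUDA kernel when device == "cuda"

M, K, N = 16, 64, 32
a = torch.randn(M, K, device=device).to(torch.float8_e4m3fn)      # row-major (M, K)
b = torch.randn(N, K, device=device).to(torch.float8_e4m3fn).t()  # column-major (K, N)

# Per-tensor scales for simplicity; the rowwise recipe passes per-row scales instead.
scale_a = torch.tensor(1.0, device=device)
scale_b = torch.tensor(1.0, device=device)

out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
```

Because the op lives in PyTorch core, enabling it for XPU there is enough for the TorchAO-side configs and tensor-subclass code to stay device-agnostic.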

Features Plan

Based on the above philosophy and the feature scope in the TorchAO Features Overview, we summarize the plan and status of features for XPU below.

Stable Workflows

🟢 = stable, 🟡 = prototype, 🟠 = planned, ⚪ = not supported

| recommended hardware | weight | activation | quantized training | QAT | PTQ data algorithms | quantized inference |
|---|---|---|---|---|---|---|
| BMG GPUs | float8 rowwise | float8 rowwise | 🟠 | 🟠 | 🟠 | 🟢 (link) |
| BMG GPUs | int4 | bfloat16/float16 | 🟠 | | 🟠 HQQ, 🟡 AWQ, GPTQ | 🟢 (link) |
| BMG GPUs | int8 | bfloat16 | 🟠 | | | 🟢 (link) |
| BMG GPUs | int8 | int8 | 🟠 | 🟠 | | 🟢 (link) |

Prototype Workflows

🟢 = stable, 🟡 = prototype, 🟠 = planned, ⚪ = not supported

| recommended hardware | weight | activation | quantized training | QAT | PTQ data algorithms | quantized inference |
|---|---|---|---|---|---|---|
| | mxfp8 | mxfp8 | 🟠 | | | 🟡 (link) |
| | mxfp4 | mxfp4 | ⚪ not supported | 🟠 | 🟠 | 🟠 |
| BMG, CRI GPUs | float8 128x128 (blockwise) | float8 1x128 | 🟠 | 🟠 | | |

Other Features

Product Specs

All specs for Intel® Arc™ and Intel® Core™ Ultra series platforms can be found on the Intel® Arc™ GPUs page. The following are the details of the Intel® Arc™ B580.
