[AutoBump] Merge with 2a8b1110 (Jan 29) (7) #288

Open
wants to merge 7 commits into base: bump_to_7b0dd650
10 changes: 9 additions & 1 deletion docs/AddCustomAccelerators.md
@@ -20,8 +20,9 @@ The folder content is flexible depending on each accelerator. However, we recomm
To build accelerators in onnx-mlir, use the cmake variable `ONNX_MLIR_ACCELERATORS` when building onnx-mlir. `ONNX_MLIR_ACCELERATORS` accepts a semicolon-separated list of accelerator names. For example,
```bash
$ cd build
$ cmake .. -DONNX_MLIR_ACCELERATORS=accel1;accel2
$ cmake .. -DONNX_MLIR_ACCELERATORS='accel1;accel2'
```
Note that the list should be quoted.

### 1.2 Compile a model to run with selected accelerators.

@@ -92,6 +93,13 @@ virtual void registerDialects(mlir::DialectRegistry &registry) const = 0;
/// command line options.
virtual void registerPasses(int optLevel) const = 0;

//===--------------------------------------------------------------------===//
// Hooks for both onnx-mlir and onnx-mlir-opt drivers
//===--------------------------------------------------------------------===//

/// Configure passes for the accelerator.
virtual void configurePasses() const = 0;

//===--------------------------------------------------------------------===//
// Hooks for onnx-to-krnl pass
//===--------------------------------------------------------------------===//
65 changes: 65 additions & 0 deletions docs/Quantization-NNPA.md
@@ -0,0 +1,65 @@
<!--- SPDX-License-Identifier: Apache-2.0 -->

# Overview

NNPA in IBM Telum II supports 8-bit signed-integer quantized matrix multiplications. This document shows how to compile an ONNX model for 8-bit quantization on NNPA. Even without following these steps, models are still accelerated when targeting Telum systems, using a mixture of 16-bit floating-point numbers for computations mapped to the Telum's Integrated AI accelerator and 32-bit floating-point numbers for computations mapped to the Telum CPUs.

There are two approaches to using quantization in the onnx-mlir compiler, depending on the input ONNX model to the compiler:
- The input model is already quantized by another framework such as ONNX Runtime. In this case, the input ONNX model contains 8-bit operations, and the onnx-mlir compiler selects suitable 8-bit operations to run on NNPA. No special compile flags are needed to enable quantization when compiling such a model, so this case is not discussed further in this document.
  - In this approach, the compiler supports both statically and dynamically quantized models.
- The input model is a non-quantized model, i.e. its operations operate on float32 data types. In this case, the onnx-mlir compiler provides several quantization options to quantize the model during compilation, then run the compiled model on NNPA. The remainder of this document describes this approach.
  - In this approach, the compiler supports only dynamic quantization.

In both approaches, the following constraints apply:
- Only per-tensor quantization is supported, meaning `scale` and `zero_point` are computed per-tensor and are scalar values.
- The target quantization data type is 8-bit signed integer.

Quantization requires NNPA in IBM Telum II, meaning that the following compile flags must be specified to enable quantization: `-maccel=NNPA -march=arch15`.

# Dynamic quantization by the compiler

Again, it is important to note that the onnx-mlir compiler currently:
- supports per-tensor dynamic quantization, and
- quantizes data tensors from float32 to 8-bit signed integer. If a data tensor in the input model is already in 8-bit signed integer, the compiler will not quantize it again.

The compiler provides two compile flags for dynamically quantizing a model at compile time:
- `--nnpa-quant-dynamic` to enable dynamic quantization.
- `--nnpa-quant-op-types` to specify the types of ONNX operations to quantize manually, e.g. `MatMul,Conv`.

Users can choose symmetric or asymmetric quantization for activations and weights by passing the options `symActivation, asymActivation, symWeight, asymWeight` as values for `--nnpa-quant-dynamic`.
For example, to use asymmetric quantization for activations and symmetric quantization for weights, one can use `--nnpa-quant-dynamic=asymActivation,symWeight`.

When `--nnpa-quant-dynamic` is specified without a value, the compiler will decide the quantization options and operation types by itself.
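
For example, assuming a model file named `model.onnx` (the file name here is only a placeholder), the two styles of invocation might look like the following sketch:
```bash
# Let the compiler pick quantization options and operation types automatically.
onnx-mlir -maccel=NNPA -march=arch15 --nnpa-quant-dynamic model.onnx

# Explicitly request asymmetric activations, symmetric weights, and MatMul only.
onnx-mlir -maccel=NNPA -march=arch15 \
  --nnpa-quant-dynamic=asymActivation,symWeight \
  --nnpa-quant-op-types=MatMul model.onnx
```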

## Computing `scale` and `zero_point`
The compiler uses the following equations to compute `scale` and `zero_point` for 8-bit signed integer quantization.

Asymmetric quantization
```
scale = (maximum(0, max(x)) - minimum(0, min(x))) / (qmax - qmin)
zero_point = cast(round(saturate(qmin - min(x)/scale)))
```
where
- `x` is the input tensor to quantize,
- the data range is adjusted to include 0,
- `qmax=127` and `qmin=-128` are the maximum and minimum values of the quantization range, and
- `saturate` clamps its argument to the range `[-128, 127]` (see the worked example below).
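
For illustration, consider a hypothetical tensor whose values range from `min(x) = -0.5` to `max(x) = 2.0` (numbers chosen arbitrarily):
```
scale      = (max(0, 2.0) - min(0, -0.5)) / (127 - (-128)) = 2.5 / 255 ≈ 0.0098
zero_point = round(saturate(-128 - (-0.5) / 0.0098)) = round(-128 + 51) = -77
```
With these values, `min(x)` maps to `-0.5/scale + zero_point = -128` and `max(x)` maps to `2.0/scale + zero_point = 127`, covering the full quantization range.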

Symmetric quantization
```
scale = max(abs(x)) / 127
zero_point = 0
```

Given `scale` and `zero_point`, the input `x` is quantized to
```
quantized_x = x/scale + zero_point
```
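
The following C++ sketch illustrates the asymmetric formulas end to end. It is an illustrative example only, not the compiler's actual implementation; the function and variable names are hypothetical, and rounding and saturation are made explicit.
```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Per-tensor asymmetric quantization to 8-bit signed integer, following the
// equations above (illustrative sketch only).
void quantizeAsymmetric(const std::vector<float> &x, float &scale,
    int8_t &zeroPoint, std::vector<int8_t> &quantized) {
  constexpr float qmin = -128.0f, qmax = 127.0f;
  // Adjust the data range to include 0.
  float minX = std::min(0.0f, *std::min_element(x.begin(), x.end()));
  float maxX = std::max(0.0f, *std::max_element(x.begin(), x.end()));
  scale = (maxX - minX) / (qmax - qmin);
  // A real implementation would guard against scale == 0 (all-zero input).
  // zero_point = cast(round(saturate(qmin - min(x)/scale))).
  float zp = std::round(std::clamp(qmin - minX / scale, qmin, qmax));
  zeroPoint = static_cast<int8_t>(zp);
  // quantized_x = x/scale + zero_point, rounded and saturated to [-128, 127].
  quantized.clear();
  for (float v : x) {
    float q = std::round(v / scale) + zeroPoint;
    quantized.push_back(static_cast<int8_t>(std::clamp(q, qmin, qmax)));
  }
}
```
For symmetric quantization, `scale` would instead be `max(abs(x)) / 127` and `zeroPoint` would be 0.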

# Performance notes

Symmetric quantization often yields better inference performance but lower accuracy than asymmetric quantization.
Users may want to experiment with different quantization schemes to find the best combination for their own model.

# Resources
- [A visual guide to quantization](https://www.maartengrootendorst.com/blog/quantization/)
4 changes: 2 additions & 2 deletions docs/SupportedONNXOps-NNPA.md
@@ -3,11 +3,11 @@

# Supported ONNX Operation for Target *NNPA*.

Onnx-mlir currently supports ONNX operations targeting up to opset 21. Limitations are listed when applicable. This documentation highlights the minimum and maximum opset versions that are fully supported by onnx-mlir and not the version changes.
Onnx-mlir currently supports ONNX operations targeting up to opset 22. Limitations are listed when applicable. This documentation highlights the minimum and maximum opset versions that are fully supported by onnx-mlir and not the version changes.

* Operations are defined by the [ONNX Standard](https://github.com/onnx/onnx/blob/main/docs/Operators.md).
* **Supported Opsets** indicates the lowest and highest opset a model may have for onnx-mlir to support compiling a model with the operator.
* A * indicates onnx-mlir is compatible with the latest version of that operator available as of opset 21.
* A * indicates onnx-mlir is compatible with the latest version of that operator available as of opset 22.
* A ^ indicates onnx-mlir is compatible with the latest level of the NNPA Architecture which is z16.


4 changes: 2 additions & 2 deletions docs/SupportedONNXOps-cpu.md
@@ -3,11 +3,11 @@

# Supported ONNX Operation for Target *cpu*.

Onnx-mlir currently supports ONNX operations targeting up to opset 21. Limitations are listed when applicable. This documentation highlights the minimum and maximum opset versions that are fully supported by onnx-mlir and not the version changes.
Onnx-mlir currently supports ONNX operations targeting up to opset 22. Limitations are listed when applicable. This documentation highlights the minimum and maximum opset versions that are fully supported by onnx-mlir and not the version changes.

* Operations are defined by the [ONNX Standard](https://github.com/onnx/onnx/blob/main/docs/Operators.md).
* **Supported Opsets** indicates the lowest and highest opset a model may have for onnx-mlir to support compiling a model with the operator.
* A * indicates onnx-mlir is compatible with the latest version of that operator available as of opset 21.
* A * indicates onnx-mlir is compatible with the latest version of that operator available as of opset 22.


| Op |Supported Opsets (inclusive) |Limitations |Notes |
7 changes: 7 additions & 0 deletions src/Accelerators/Accelerator.hpp
@@ -108,6 +108,13 @@ class Accelerator {
/// command line options.
virtual void registerPasses(int optLevel) const = 0;

//===--------------------------------------------------------------------===//
// Hooks for both onnx-mlir and onnx-mlir-opt drivers
//===--------------------------------------------------------------------===//

/// Configure passes for the accelerator.
virtual void configurePasses() const = 0;

//===--------------------------------------------------------------------===//
// Hooks for onnx-to-krnl pass
//===--------------------------------------------------------------------===//
6 changes: 4 additions & 2 deletions src/Accelerators/CMakeLists.txt
@@ -1,7 +1,8 @@
# SPDX-License-Identifier: Apache-2.0

# Populate the accelerator list and add the accelerator subdirectories.
# ONNX_MLIR_ACCELERATORS is the list of accelerators user specified
# ONNX_MLIR_ACCELERATORS is the semicolon-separated list of accelerators specified by the user
# Note that the list should be quoted, e.g. -DONNX_MLIR_ACCELERATORS='A;B'
# ACCEL_TARGET_LIST is the list of cmake targets
# ACCEL_LINK_LIST is the lists of accelerator libraries
# ACCEL_INCLUDE_LIST is the list passed to inc generator
@@ -10,7 +11,8 @@ if (ONNX_MLIR_ACCELERATORS)
add_subdirectory(${t})

# If the accelerator can be built
if (${t}_ENABLED)
string(TOUPPER ${t} T)
if (${T}_ENABLED)
list(APPEND ACCEL_TARGET_LIST "${t}Accel")
list(APPEND ACCEL_LINK_LIST "OM${t}Accel")
list(APPEND ACCEL_INCLUDE_LIST "${t}")
52 changes: 39 additions & 13 deletions src/Accelerators/NNPA/Compiler/NNPACompilerOptions.cpp
@@ -17,6 +17,10 @@

namespace onnx_mlir {

// Use external storage for the options so that they are globally accessible
std::vector<NNPAQuantOptions> nnpaQuantDynamic; // common for both
std::vector<std::string> nnpaQuantOpTypes; // common for both

llvm::cl::opt<NNPAEmissionTargetType> nnpaEmissionTarget(
llvm::cl::desc("[Optional] Choose NNPA-related target to emit "
"(once selected it will cancel the other targets):"),
@@ -101,6 +105,41 @@ llvm::cl::opt<bool> nnpaEnableSaturation("nnpa-saturation",
"Default is false."),
llvm::cl::init(false), llvm::cl::cat(OnnxMlirCommonOptions));

llvm::cl::list<NNPAQuantOptions, std::vector<NNPAQuantOptions>>
nnpaQuantDynamicOpt("nnpa-quant-dynamic",
llvm::cl::desc(
"Enable dynamic quantization of the input model. If enabled, it "
"only quantizes from fp32 to i8. If an ONNX operation is already "
"in i8, no quantization is applied to that operation. Optionally, "
"a comma-separated list of quantization options can be specified "
"as its value, e.g. -nnpa-quant-dynamic=symActivation,symWeight."),
llvm::cl::values(clEnumVal(symWeight, "Symmetric quant for weights."),
clEnumVal(asymWeight, "Asymmetric quant for weights."),
clEnumVal(symActivation, "Symmetric quant for activations."),
clEnumVal(asymActivation, "Asymmetric quant for activations."),
// Use an empty string for the case where `--nnpa-quant-dynamic` is
// specified on the command line WITHOUT value, which is different
// from the case where `--nnpa-quant-dynamic` is NOT specified on
// the command line.
clEnumValN(autoQuantOpt, "",
"Compiler automatically finds the best options. Once this "
"option (an empty string) is in the list, the other options "
"are ignored. This is the default option when "
"`-nnpa-quant-dynamic` is specified without any value.")),
llvm::cl::location(nnpaQuantDynamic), llvm::cl::ValueOptional,
llvm::cl::CommaSeparated, llvm::cl::cat(OnnxMlirCommonOptions));

llvm::cl::list<std::string, std::vector<std::string>> nnpaQuantOpTypesOpt(
"nnpa-quant-op-types",
llvm::cl::desc(
"A comma-separated list of types of operations that are quantized. "
"E.g. 'MatMul,Conv'. Strings for types are the same as ONNX operator "
"names in https://onnx.ai/onnx/operators/. Currently, only MatMul is "
"supported. Without specifying this option, the compiler will "
"determine the operation types by itself."),
llvm::cl::location(nnpaQuantOpTypes), llvm::cl::ValueOptional,
llvm::cl::CommaSeparated, llvm::cl::cat(OnnxMlirCommonOptions));

llvm::cl::opt<bool> nnpaUseDynamicQuantizeLinearOnCPU("nnpa-cpu-dql",
llvm::cl::desc("Use dynamic quantized linear on CPU. Default is false"),
llvm::cl::init(false), llvm::cl::cat(OnnxMlirCommonOptions));
@@ -111,17 +150,4 @@ llvm::cl::opt<bool> nnpaUseDynamicQuantizeLinearOnCPUForScaleOffset(
" scale and offset on CPU. Default is false"),
llvm::cl::init(false), llvm::cl::cat(OnnxMlirCommonOptions));

llvm::cl::opt<NNPAQuantType> nnpaQuantization("nnpa-quantization",
llvm::cl::desc("Enable quantization with a specific type. Only "
"MatMul whose weight is a constant is supported."),
llvm::cl::values(
clEnumVal(DynSymI8,
"Dynamic Quantization to signed integer 8. Asymmetric "
"quant for activations and symmetric quant for weights."),
clEnumVal(SymSymI8,
"Dynamic Quantization to signed integer 8. Symmetric "
"quant for activations and symmetric quant for weights."),
clEnumVal(QNONE, "No quantization (default).")),
llvm::cl::init(QNONE), llvm::cl::cat(OnnxMlirOptions));

} // namespace onnx_mlir
15 changes: 8 additions & 7 deletions src/Accelerators/NNPA/Compiler/NNPACompilerOptions.hpp
@@ -57,12 +57,12 @@ typedef enum {

// Quantization type
typedef enum {
DynSymI8, /* Dynamic quantization to signed integer 8. Asymmetric quant for
activations and symmetric quant for weights.*/
SymSymI8, /* Dynamic quantization to signed integer 8. Symmetric quant for
activations and symmetric quant for weights.*/
QNONE, /* Only qualifying ops that are faster on NNPA. */
} NNPAQuantType;
symWeight,
asymWeight,
symActivation,
asymActivation,
autoQuantOpt,
} NNPAQuantOptions;

extern llvm::cl::OptionCategory OnnxMlirOptions;
extern llvm::cl::OptionCategory OnnxMlirCommonOptions;
@@ -79,7 +79,8 @@ extern llvm::cl::opt<std::string> nnpaSaveDevicePlacementFile;
extern llvm::cl::opt<bool> nnpaEnableSaturation;
extern llvm::cl::opt<bool> nnpaUseDynamicQuantizeLinearOnCPU;
extern llvm::cl::opt<bool> nnpaUseDynamicQuantizeLinearOnCPUForScaleOffset;
extern llvm::cl::opt<NNPAQuantType> nnpaQuantization;
extern std::vector<NNPAQuantOptions> nnpaQuantDynamic;
extern std::vector<std::string> nnpaQuantOpTypes;

} // namespace onnx_mlir
#endif
51 changes: 48 additions & 3 deletions src/Accelerators/NNPA/Compiler/NNPACompilerUtils.cpp
@@ -49,11 +49,56 @@ using namespace onnx_mlir;
namespace onnx_mlir {

void configurePassesNNPA() {
configureOnnxToZHighLoweringPass(optReport == OptReport::NNPAUnsupportedOps);
// z16 does not support hardware saturation.
// So, force its usage to compiler generated sticks.
if (nnpaEnableSaturation && isLessEqualNNPALevel(NNPALevel::M14))
nnpaEnableCompilerStickUnstick = true;

// Configure ONNXToZHighLoweringPass.
bool isDynQuant = !nnpaQuantDynamic.empty();
// Default/auto mode: symmetric for weights and asymmetric for activations.
bool isActivationSym = false;
bool isWeightSym = true;
std::vector<std::string> quantOpTypes;
if (isDynQuant) {
// Set options for activations and weights if they are given.
// When auto mode is specified, the other specified options are ignored.
if (!llvm::is_contained(nnpaQuantDynamic, NNPAQuantOptions::autoQuantOpt)) {
for (unsigned i = 0; i < nnpaQuantDynamic.size(); ++i) {
switch (nnpaQuantDynamic[i]) {
case NNPAQuantOptions::symWeight:
isWeightSym = true;
break;
case NNPAQuantOptions::asymWeight:
isWeightSym = false;
break;
case NNPAQuantOptions::symActivation:
isActivationSym = true;
break;
case NNPAQuantOptions::asymActivation:
isActivationSym = false;
break;
default:
llvm_unreachable("Unsupported quantization options");
break;
}
}
}
if (!isWeightSym) {
// TODO: Support asymmetric quantization for weights.
llvm::outs()
<< "Asymmetric quantization for weights is not yet supported. "
"Turning off quantization.\n";
isDynQuant = false;
}
if (nnpaQuantOpTypes.empty()) {
quantOpTypes.emplace_back("MatMul");
} else {
quantOpTypes = nnpaQuantOpTypes;
}
}
configureONNXToZHighLoweringPass(optReport == OptReport::NNPAUnsupportedOps,
isDynQuant, isActivationSym, isWeightSym, quantOpTypes);
}

void addONNXToZHighPasses(mlir::PassManager &pm) {
@@ -85,7 +130,8 @@ void addONNXToZHighPasses(mlir::PassManager &pm) {
pm.addNestedPass<func::FuncOp>(
onnx_mlir::createInstrumentPass(instrumentOps, instrumentActions));

pm.addPass(onnx_mlir::createONNXToZHighPass(nnpaQuantization));
// Lowering ONNX to ZHigh.
pm.addPass(onnx_mlir::createONNXToZHighPass());
pm.addNestedPass<func::FuncOp>(onnx_mlir::createShapeInferencePass());

// There are more opportunities for const propagation once all zhigh ops were
@@ -191,7 +237,6 @@ void addPassesNNPA(mlir::OwningOpRef<mlir::ModuleOp> &module,

// Override pass configurations.
configurePasses();
configurePassesNNPA();

// LLVM_DEBUG(llvm::dbgs() << "Adding NNPA passes" << std::endl;);
if (emissionTarget >= EmitONNXIR) {
@@ -161,7 +161,7 @@ void DevicePlacementPass::runOnOperation() {

// Disable reporting on NNPA unsupported ops in this pass even if
// `-opt-report=NNPAUnsupportedOps` is specified.
OnnxToZHighLoweringConfiguration::reportOnNNPAUnsupportedOps = 0;
ONNXToZHighLoweringConfiguration::reportOnNNPAUnsupportedOps = 0;

// Run the unknown dimension analysis to help check equality of unknown
// dimensions at compile time.
@@ -200,13 +200,13 @@
// Call ONNXToZHigh pass for lowering multiple ONNX ops at once to ZHigh.
// E.g. `onnx.ReLu (onnx.Conv)` to zhigh.Conv.
RewritePatternSet Patterns2(context);
getONNXToZHighMultipleOpPatterns(Patterns2, nnpaQuantization);
getONNXToZHighMultipleOpPatterns(Patterns2);
(void)applyAnalysisConversion(module, target, std::move(Patterns2),
ConversionConfig{.legalizableOps = &legalizedOps2});

// Call ONNXToZHigh pass for lowering a single ONNX op to ZHigh.
RewritePatternSet Patterns3(context);
getONNXToZHighOneOpPatterns(Patterns3, nnpaQuantization);
getONNXToZHighOneOpPatterns(Patterns3);
getONNXToZHighOneOpDynamicallyLegal(&target, &dimAnalysis);
(void)applyAnalysisConversion(module, target, std::move(Patterns3),
ConversionConfig{.legalizableOps = &legalizedOps3});
@@ -27,7 +27,7 @@ using namespace onnx_mlir;

/// Report NNPA unsupported case.
bool onnxToZHighUnsupportedReport(Operation *op, const std::string &message) {
if (OnnxToZHighLoweringConfiguration::reportOnNNPAUnsupportedOps &&
if (ONNXToZHighLoweringConfiguration::reportOnNNPAUnsupportedOps &&
!message.empty()) {
StringAttr opName = op->getName().getIdentifier();
std::string nodeNameStr = getNodeNameInPresenceOfOpt(op);