[AutoBump] Merge with 2a8b1110 (Jan 29) (7) #288

Open
wants to merge 7 commits into base: bump_to_7b0dd650
10 changes: 9 additions & 1 deletion docs/AddCustomAccelerators.md
@@ -20,8 +20,9 @@ The folder content is flexible depending on each accelerator. However, we recomm
To build accelerators in onnx-mlir, use the cmake variable `ONNX_MLIR_ACCELERATORS` when building onnx-mlir. `ONNX_MLIR_ACCELERATORS` accepts a semicolon-separated list of accelerator names. For example,
```bash
$ cd build
$ cmake .. -DONNX_MLIR_ACCELERATORS=accel1;accel2
$ cmake .. -DONNX_MLIR_ACCELERATORS='accel1;accel2'
```
Note that the list should be quoted.

### 1.2 Compile a model to run with selected accelerators.

@@ -92,6 +93,13 @@ virtual void registerDialects(mlir::DialectRegistry &registry) const = 0;
/// command line options.
virtual void registerPasses(int optLevel) const = 0;

//===--------------------------------------------------------------------===//
// Hooks for both onnx-mlir and onnx-mlir-opt drivers
//===--------------------------------------------------------------------===//

/// Configure passes for the accelerator.
virtual void configurePasses() const = 0;

//===--------------------------------------------------------------------===//
// Hooks for onnx-to-krnl pass
//===--------------------------------------------------------------------===//
65 changes: 65 additions & 0 deletions docs/Quantization-NNPA.md
@@ -0,0 +1,65 @@
<!--- SPDX-License-Identifier: Apache-2.0 -->

# Overview

NNPA in IBM Telum II supports 8-bit signed-integer quantized matrix multiplications. This document shows how to compile an ONNX model for 8-bit quantization on NNPA. Even without following these steps, models are still accelerated when targeting Telum systems, using a mixture of 16-bit floating-point numbers for computations mapped to the Telum's Integrated AI accelerator and 32-bit floating-point numbers for computations mapped to the Telum CPUs.

There are two approaches to using quantization in the onnx-mlir compiler, depending on the input ONNX model to the compiler:
- The input model is already quantized by another framework such as ONNX Runtime. In this case, the input ONNX model contains 8-bit operations, and the onnx-mlir compiler selects suitable 8-bit operations to run on NNPA. No special compile flags are needed to enable quantization when compiling such a model, so this case is not discussed further in this document.
  - In this approach, the compiler supports both statically and dynamically quantized models.
- The input model is a non-quantized model, i.e. its operations operate on float32 data types. In this case, the onnx-mlir compiler provides several quantization options to quantize the model during compilation, then run the compiled model on NNPA. The remainder of this document describes this approach.
  - In this approach, the compiler supports only dynamic quantization.

In both approaches, the following constraints apply:
- Only per-tensor quantization is supported, meaning `scale` and `zero_point` are computed per-tensor and are scalar values.
- The target quantization data type is 8-bit signed integer.

Quantization requires NNPA in IBM Telum II, meaning that the following compile flags must be specified to enable quantization: `-maccel=NNPA -march=arch15`.

# Dynamic quantization by the compiler

Again, it is important to note that the onnx-mlir compiler currently:
- supports per-tensor dynamic quantization, and
- quantizes data tensors from float32 to 8-bit signed integer. If a data tensor in the input model is already in 8-bit signed integer, the compiler will not quantize it again.

The compiler provides two compile flags for dynamically quantizing a model at compile time:
- `--nnpa-quant-dynamic` to enable dynamic quantization.
- `--nnpa-quant-op-types` to specify the types of ONNX operations to quantize manually, e.g. `MatMul,Conv`.

Users can choose symmetric or asymmetric quantization for activations and weights by passing the options `symActivation, asymActivation, symWeight, asymWeight` as values for `--nnpa-quant-dynamic`.
For example, to use asymmetric quantization for activations and symmetric quantization for weights, one can use `--nnpa-quant-dynamic=asymActivation,symWeight`.

When `--nnpa-quant-dynamic` is specified without a value, the compiler will decide the quantization options and operation types by itself.
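
For example, assuming a model file named `model.onnx` (the file name here is only a placeholder), the two styles of invocation might look like the following sketch:
```bash
# Let the compiler pick quantization options and operation types automatically.
onnx-mlir -maccel=NNPA -march=arch15 --nnpa-quant-dynamic model.onnx

# Explicitly request asymmetric activations, symmetric weights, and MatMul only.
onnx-mlir -maccel=NNPA -march=arch15 \
  --nnpa-quant-dynamic=asymActivation,symWeight \
  --nnpa-quant-op-types=MatMul model.onnx
```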

## Computing `scale` and `zero_point`
The compiler uses the following equations to compute `scale` and `zero_point` for 8-bit signed integer quantization.

Asymmetric quantization
```
scale = (maximum(0, max(x)) - minimum(0, min(x))) / (qmax - qmin)
zero_point = cast(round(saturate(qmin - min(x)/scale)))
```
where
- `x` is the input tensor to quantize,
- the data range is adjusted to include 0,
- `qmax=127` and `qmin=-128` are the maximum and minimum values of the quantization range, and
- `saturate` clamps its argument to the range `[-128, 127]` (see the worked example below).
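
For illustration, consider a hypothetical tensor whose values range from `min(x) = -0.5` to `max(x) = 2.0` (numbers chosen arbitrarily):
```
scale      = (max(0, 2.0) - min(0, -0.5)) / (127 - (-128)) = 2.5 / 255 ≈ 0.0098
zero_point = round(saturate(-128 - (-0.5) / 0.0098)) = round(-128 + 51) = -77
```
With these values, `min(x)` maps to `-0.5/scale + zero_point = -128` and `max(x)` maps to `2.0/scale + zero_point = 127`, covering the full quantization range.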

Symmetric quantization
```
scale = max(abs(x)) / 127
zero_point = 0
```

Given `scale` and `zero_point`, the input `x` is quantized to
```
quantized_x = x/scale + zero_point
```
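
The following C++ sketch illustrates the asymmetric formulas end to end. It is an illustrative example only, not the compiler's actual implementation; the function and variable names are hypothetical, and rounding and saturation are made explicit.
```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Per-tensor asymmetric quantization to 8-bit signed integer, following the
// equations above (illustrative sketch only).
void quantizeAsymmetric(const std::vector<float> &x, float &scale,
    int8_t &zeroPoint, std::vector<int8_t> &quantized) {
  constexpr float qmin = -128.0f, qmax = 127.0f;
  // Adjust the data range to include 0.
  float minX = std::min(0.0f, *std::min_element(x.begin(), x.end()));
  float maxX = std::max(0.0f, *std::max_element(x.begin(), x.end()));
  scale = (maxX - minX) / (qmax - qmin);
  // A real implementation would guard against scale == 0 (all-zero input).
  // zero_point = cast(round(saturate(qmin - min(x)/scale))).
  float zp = std::round(std::clamp(qmin - minX / scale, qmin, qmax));
  zeroPoint = static_cast<int8_t>(zp);
  // quantized_x = x/scale + zero_point, rounded and saturated to [-128, 127].
  quantized.clear();
  for (float v : x) {
    float q = std::round(v / scale) + zeroPoint;
    quantized.push_back(static_cast<int8_t>(std::clamp(q, qmin, qmax)));
  }
}
```
For symmetric quantization, `scale` would instead be `max(abs(x)) / 127` and `zeroPoint` would be 0.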

# Performance notes

Symmetric quantization often yields better inference performance but lower accuracy than asymmetric quantization.
Users may want to experiment with different quantization schemes to find the best combination for their own model.

# Resources
- [A visual guide to quantization](https://www.maartengrootendorst.com/blog/quantization/)
4 changes: 2 additions & 2 deletions docs/SupportedONNXOps-NNPA.md
@@ -3,11 +3,11 @@

# Supported ONNX Operation for Target *NNPA*.

Onnx-mlir currently supports ONNX operations targeting up to opset 21. Limitations are listed when applicable. This documentation highlights the minimum and maximum opset versions that are fully supported by onnx-mlir and not the version changes.
Onnx-mlir currently supports ONNX operations targeting up to opset 22. Limitations are listed when applicable. This documentation highlights the minimum and maximum opset versions that are fully supported by onnx-mlir and not the version changes.

* Operations are defined by the [ONNX Standard](https://github.com/onnx/onnx/blob/main/docs/Operators.md).
* **Supported Opsets** indicates the lowest and highest opset a model may have for onnx-mlir to support compiling a model with the operator.
* A * indicates onnx-mlir is compatible with the latest version of that operator available as of opset 21.
* A * indicates onnx-mlir is compatible with the latest version of that operator available as of opset 22.
* A ^ indicates onnx-mlir is compatible with the latest level of the NNPA Architecture which is z16.


4 changes: 2 additions & 2 deletions docs/SupportedONNXOps-cpu.md
@@ -3,11 +3,11 @@

# Supported ONNX Operation for Target *cpu*.

Onnx-mlir currently supports ONNX operations targeting up to opset 21. Limitations are listed when applicable. This documentation highlights the minimum and maximum opset versions that are fully supported by onnx-mlir and not the version changes.
Onnx-mlir currently supports ONNX operations targeting up to opset 22. Limitations are listed when applicable. This documentation highlights the minimum and maximum opset versions that are fully supported by onnx-mlir and not the version changes.

* Operations are defined by the [ONNX Standard](https://github.com/onnx/onnx/blob/main/docs/Operators.md).
* **Supported Opsets** indicates the lowest and highest opset a model may have for onnx-mlir to support compiling a model with the operator.
* A * indicates onnx-mlir is compatible with the latest version of that operator available as of opset 21.
* A * indicates onnx-mlir is compatible with the latest version of that operator available as of opset 22.


| Op |Supported Opsets (inclusive) |Limitations |Notes |
7 changes: 7 additions & 0 deletions src/Accelerators/Accelerator.hpp
@@ -108,6 +108,13 @@ class Accelerator {
/// command line options.
virtual void registerPasses(int optLevel) const = 0;

//===--------------------------------------------------------------------===//
// Hooks for both onnx-mlir and onnx-mlir-opt drivers
//===--------------------------------------------------------------------===//

/// Configure passes for the accelerator.
virtual void configurePasses() const = 0;

//===--------------------------------------------------------------------===//
// Hooks for onnx-to-krnl pass
//===--------------------------------------------------------------------===//
6 changes: 4 additions & 2 deletions src/Accelerators/CMakeLists.txt
@@ -1,7 +1,8 @@
# SPDX-License-Identifier: Apache-2.0

# Populate the accelerator list and add the accelerator subdirectories.
# ONNX_MLIR_ACCELERATORS is the list of accelerators user specified
# ONNX_MLIR_ACCELERATORS is the semicolon-separated list of accelerators specified by the user
# Note that the list should be quoted, e.g. -DONNX_MLIR_ACCELERATORS='A;B'
# ACCEL_TARGET_LIST is the list of cmake targets
# ACCEL_LINK_LIST is the lists of accelerator libraries
# ACCEL_INCLUDE_LIST is the list passed to inc generator
@@ -10,7 +11,8 @@ if (ONNX_MLIR_ACCELERATORS)
add_subdirectory(${t})

# If the accelerator can be built
if (${t}_ENABLED)
string(TOUPPER ${t} T)
if (${T}_ENABLED)
list(APPEND ACCEL_TARGET_LIST "${t}Accel")
list(APPEND ACCEL_LINK_LIST "OM${t}Accel")
list(APPEND ACCEL_INCLUDE_LIST "${t}")
52 changes: 39 additions & 13 deletions src/Accelerators/NNPA/Compiler/NNPACompilerOptions.cpp
@@ -17,6 +17,10 @@

namespace onnx_mlir {

// Use external storage for the options so that they are globally accessible
std::vector<NNPAQuantOptions> nnpaQuantDynamic; // common for both
std::vector<std::string> nnpaQuantOpTypes; // common for both

llvm::cl::opt<NNPAEmissionTargetType> nnpaEmissionTarget(
llvm::cl::desc("[Optional] Choose NNPA-related target to emit "
"(once selected it will cancel the other targets):"),
@@ -101,6 +105,41 @@ llvm::cl::opt<bool> nnpaEnableSaturation("nnpa-saturation",
"Default is false."),
llvm::cl::init(false), llvm::cl::cat(OnnxMlirCommonOptions));

llvm::cl::list<NNPAQuantOptions, std::vector<NNPAQuantOptions>>
nnpaQuantDynamicOpt("nnpa-quant-dynamic",
llvm::cl::desc(
"Enable dynamic quantization of the input model. If enabled, it "
"only quantizes from fp32 to i8. If an ONNX operation is already "
"in i8, no quantization is applied to that operation. Optionally, "
"a comma-separated list of quantization options can be specified "
"as its value, e.g. -nnpa-quant-dynamic=symActivation,symWeight."),
llvm::cl::values(clEnumVal(symWeight, "Symmetric quant for weights."),
clEnumVal(asymWeight, "Asymmetric quant for weights."),
clEnumVal(symActivation, "Symmetric quant for activations."),
clEnumVal(asymActivation, "Asymmetric quant for activations."),
// Use an empty string for the case where `--nnpa-quant-dynamic` is
// specified on the command line WITHOUT value, which is different
// from the case where `--nnpa-quant-dynamic` is NOT specified on
// the command line.
clEnumValN(autoQuantOpt, "",
"Compiler automatically finds the best options. Once this "
"option (an empty string) is in the list, the other options "
"are ignored. This is the default option when "
"`-nnpa-quant-dynamic` is specified without any value.")),
llvm::cl::location(nnpaQuantDynamic), llvm::cl::ValueOptional,
llvm::cl::CommaSeparated, llvm::cl::cat(OnnxMlirCommonOptions));

llvm::cl::list<std::string, std::vector<std::string>> nnpaQuantOpTypesOpt(
"nnpa-quant-op-types",
llvm::cl::desc(
"A comma-separated list of types of operations that are quantized. "
"E.g. 'MatMul,Conv'. Strings for types are the same as ONNX operator "
"names in https://onnx.ai/onnx/operators/. Currently, only MatMul is "
"supported. Without specifying this option, the compiler will "
"determine the operation types by itself."),
llvm::cl::location(nnpaQuantOpTypes), llvm::cl::ValueOptional,
llvm::cl::CommaSeparated, llvm::cl::cat(OnnxMlirCommonOptions));

llvm::cl::opt<bool> nnpaUseDynamicQuantizeLinearOnCPU("nnpa-cpu-dql",
llvm::cl::desc("Use dynamic quantized linear on CPU. Default is false"),
llvm::cl::init(false), llvm::cl::cat(OnnxMlirCommonOptions));
@@ -111,17 +150,4 @@ llvm::cl::opt<bool> nnpaUseDynamicQuantizeLinearOnCPUForScaleOffset(
" scale and offset on CPU. Default is false"),
llvm::cl::init(false), llvm::cl::cat(OnnxMlirCommonOptions));

llvm::cl::opt<NNPAQuantType> nnpaQuantization("nnpa-quantization",
llvm::cl::desc("Enable quantization with a specific type. Only "
"MatMul whose weight is a constant is supported."),
llvm::cl::values(
clEnumVal(DynSymI8,
"Dynamic Quantization to signed integer 8. Asymmetric "
"quant for activations and symmetric quant for weights."),
clEnumVal(SymSymI8,
"Dynamic Quantization to signed integer 8. Symmetric "
"quant for activations and symmetric quant for weights."),
clEnumVal(QNONE, "No quantization (default).")),
llvm::cl::init(QNONE), llvm::cl::cat(OnnxMlirOptions));

} // namespace onnx_mlir
15 changes: 8 additions & 7 deletions src/Accelerators/NNPA/Compiler/NNPACompilerOptions.hpp
@@ -57,12 +57,12 @@ typedef enum {

// Quantization type
typedef enum {
DynSymI8, /* Dynamic quantization to signed integer 8. Asymmetric quant for
activations and symmetric quant for weights.*/
SymSymI8, /* Dynamic quantization to signed integer 8. Symmetric quant for
activations and symmetric quant for weights.*/
QNONE, /* Only qualifying ops that are faster on NNPA. */
} NNPAQuantType;
symWeight,
asymWeight,
symActivation,
asymActivation,
autoQuantOpt,
} NNPAQuantOptions;

extern llvm::cl::OptionCategory OnnxMlirOptions;
extern llvm::cl::OptionCategory OnnxMlirCommonOptions;
@@ -79,7 +79,8 @@ extern llvm::cl::opt<std::string> nnpaSaveDevicePlacementFile;
extern llvm::cl::opt<bool> nnpaEnableSaturation;
extern llvm::cl::opt<bool> nnpaUseDynamicQuantizeLinearOnCPU;
extern llvm::cl::opt<bool> nnpaUseDynamicQuantizeLinearOnCPUForScaleOffset;
extern llvm::cl::opt<NNPAQuantType> nnpaQuantization;
extern std::vector<NNPAQuantOptions> nnpaQuantDynamic;
extern std::vector<std::string> nnpaQuantOpTypes;

} // namespace onnx_mlir
#endif
51 changes: 48 additions & 3 deletions src/Accelerators/NNPA/Compiler/NNPACompilerUtils.cpp
@@ -49,11 +49,56 @@ using namespace onnx_mlir;
namespace onnx_mlir {

void configurePassesNNPA() {
configureOnnxToZHighLoweringPass(optReport == OptReport::NNPAUnsupportedOps);
// z16 does not support hardware saturation.
// So, force its usage to compiler generated sticks.
if (nnpaEnableSaturation && isLessEqualNNPALevel(NNPALevel::M14))
nnpaEnableCompilerStickUnstick = true;

// Configure ONNXToZHighLoweringPass.
bool isDynQuant = !nnpaQuantDynamic.empty();
// Default/auto mode: symmetric for weights and asymmetric for activations.
bool isActivationSym = false;
bool isWeightSym = true;
std::vector<std::string> quantOpTypes;
if (isDynQuant) {
// Set options for activations and weights if they are given.
// When auto mode is specified, the other specified options are ignored.
if (!llvm::is_contained(nnpaQuantDynamic, NNPAQuantOptions::autoQuantOpt)) {
for (unsigned i = 0; i < nnpaQuantDynamic.size(); ++i) {
switch (nnpaQuantDynamic[i]) {
case NNPAQuantOptions::symWeight:
isWeightSym = true;
break;
case NNPAQuantOptions::asymWeight:
isWeightSym = false;
break;
case NNPAQuantOptions::symActivation:
isActivationSym = true;
break;
case NNPAQuantOptions::asymActivation:
isActivationSym = false;
break;
default:
llvm_unreachable("Unsupported quantization options");
break;
}
}
}
if (!isWeightSym) {
// TODO: Support asymmetric quantization for weights.
llvm::outs()
<< "Asymmetric quantization for weights is not yet supported. "
"Turning off quantization.\n";
isDynQuant = false;
}
if (nnpaQuantOpTypes.empty()) {
quantOpTypes.emplace_back("MatMul");
} else {
quantOpTypes = nnpaQuantOpTypes;
}
}
configureONNXToZHighLoweringPass(optReport == OptReport::NNPAUnsupportedOps,
isDynQuant, isActivationSym, isWeightSym, quantOpTypes);
}

void addONNXToZHighPasses(mlir::PassManager &pm) {
@@ -85,7 +130,8 @@ void addONNXToZHighPasses(mlir::PassManager &pm) {
pm.addNestedPass<func::FuncOp>(
onnx_mlir::createInstrumentPass(instrumentOps, instrumentActions));

pm.addPass(onnx_mlir::createONNXToZHighPass(nnpaQuantization));
// Lowering ONNX to ZHigh.
pm.addPass(onnx_mlir::createONNXToZHighPass());
pm.addNestedPass<func::FuncOp>(onnx_mlir::createShapeInferencePass());

// There are more opportunities for const propagation once all zhigh ops were
@@ -191,7 +237,6 @@ void addPassesNNPA(mlir::OwningOpRef<mlir::ModuleOp> &module,

// Override pass configurations.
configurePasses();
configurePassesNNPA();

// LLVM_DEBUG(llvm::dbgs() << "Adding NNPA passes" << std::endl;);
if (emissionTarget >= EmitONNXIR) {
@@ -161,7 +161,7 @@ void DevicePlacementPass::runOnOperation() {

// Disable reporting on NNPA unsupported ops in this pass even if
// `-opt-report=NNPAUnsupportedOps` is specified.
OnnxToZHighLoweringConfiguration::reportOnNNPAUnsupportedOps = 0;
ONNXToZHighLoweringConfiguration::reportOnNNPAUnsupportedOps = 0;

// Run the unknown dimension analysis to help check equality of unknown
// dimensions at compile time.
@@ -200,13 +200,13 @@
// Call ONNXToZHigh pass for lowering multiple ONNX ops at once to ZHigh.
// E.g. `onnx.ReLu (onnx.Conv)` to zhigh.Conv.
RewritePatternSet Patterns2(context);
getONNXToZHighMultipleOpPatterns(Patterns2, nnpaQuantization);
getONNXToZHighMultipleOpPatterns(Patterns2);
(void)applyAnalysisConversion(module, target, std::move(Patterns2),
ConversionConfig{.legalizableOps = &legalizedOps2});

// Call ONNXToZHigh pass for lowering a single ONNX op to ZHigh.
RewritePatternSet Patterns3(context);
getONNXToZHighOneOpPatterns(Patterns3, nnpaQuantization);
getONNXToZHighOneOpPatterns(Patterns3);
getONNXToZHighOneOpDynamicallyLegal(&target, &dimAnalysis);
(void)applyAnalysisConversion(module, target, std::move(Patterns3),
ConversionConfig{.legalizableOps = &legalizedOps3});
@@ -27,7 +27,7 @@ using namespace onnx_mlir;

/// Report NNPA unsupported case.
bool onnxToZHighUnsupportedReport(Operation *op, const std::string &message) {
if (OnnxToZHighLoweringConfiguration::reportOnNNPAUnsupportedOps &&
if (ONNXToZHighLoweringConfiguration::reportOnNNPAUnsupportedOps &&
!message.empty()) {
StringAttr opName = op->getName().getIdentifier();
std::string nodeNameStr = getNodeNameInPresenceOfOpt(op);