
Does calibration batch size affect quantization scales significantly? (Entropy Calibration) #619

@luxi78

Description


I am using modelopt.onnx.quantization.quantize to quantize a Super Resolution model (Real-ESRGAN) to INT8 using the Entropy calibration method.

I observed a significant difference in the resulting quantization parameters (specifically scale) when I changed the CAL_BATCH_SIZE (batch size used during calibration), even though the calibration dataset and the total number of calibration samples remained exactly the same.

Reproduction Steps

  • Model: realesr-general-x4v3-dn0.0.onnx
  • Calibration Data: A fixed set of images (.npy format).
  • Code Snippet: I ran the quantization process multiple times, changing only the batch size in calibration_shapes.
from modelopt.onnx.quantization import quantize

# ... setup code ...
# Tested CAL_BATCH_SIZE values: 1, 2, 3, 6

quantize(
    log_level='DEBUG',
    onnx_path="realesr-general-x4v3-dn0.0.onnx",
    quantize_mode="int8",
    calibration_data={"input": calib_data},  # the same full dataset every time
    calibration_shapes=f"input:{CAL_BATCH_SIZE}x3x512x512",  # <--- only this changes between runs
    calibration_method="entropy",
    output_path=f"model_int8_b{CAL_BATCH_SIZE}.onnx",
    calibration_eps=['cpu'],
    simplify=True,
    calibrate_per_node=False,
)
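For reference, the elided setup might look roughly like the following. This is a hypothetical sketch (the calib/*.npy path, file layout, and dtype are assumptions, not taken from the issue), shown only to make the reproduction self-contained:

import glob
import numpy as np

CAL_BATCH_SIZE = 1  # varied across runs: 1, 2, 3, 6

# Hypothetical: each .npy file holds one 3x512x512 image; stack them into
# a single (N, 3, 512, 512) float32 array passed as calibration_data above.
files = sorted(glob.glob("calib/*.npy"))
calib_data = np.stack([np.load(f) for f in files]).astype(np.float32)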

Observed Behavior

I compared the generated ONNX models in Netron. The scale values in the QuantizeLinear / DequantizeLinear nodes differ significantly between models calibrated with different batch sizes (a sketch for extracting these scales programmatically follows the list below).

  • Batch Size = 1: Scale A (e.g., ~1.06)
  • Batch Size = 6: Scale B (e.g., ~1.48)
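To reproduce the comparison without Netron, here is a minimal sketch using the onnx Python package; it assumes the two models were exported with the output_path pattern above (model_int8_b1.onnx and model_int8_b6.onnx):

import onnx
from onnx import numpy_helper

def collect_scales(path):
    """Map each QuantizeLinear node to its scale initializer."""
    model = onnx.load(path)
    inits = {t.name: numpy_helper.to_array(t) for t in model.graph.initializer}
    scales = {}
    for node in model.graph.node:
        if node.op_type == "QuantizeLinear":
            scale_name = node.input[1]  # inputs are: x, y_scale, y_zero_point
            if scale_name in inits:
                scales[node.name or node.output[0]] = inits[scale_name]
    return scales

scales_b1 = collect_scales("model_int8_b1.onnx")
scales_b6 = collect_scales("model_int8_b6.onnx")
for name in sorted(set(scales_b1) & set(scales_b6)):
    print(name, float(scales_b1[name].ravel()[0]), float(scales_b6[name].ravel()[0]))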

Questions

  1. Is this expected behavior? Theoretically, since the calibration dataset is identical, shouldn't the aggregated statistics (histogram) for Entropy calibration be the same regardless of whether data is fed in batches of 1 or 6?
  2. Why does the batch size impact the scale so heavily? Is it related to how the histogram range is updated or expanded dynamically during the calibration loop? (A toy illustration of this suspicion is sketched after this list.)
  3. What is the recommended practice? Should we always aim for the largest possible calibration batch size that fits in memory to get the "correct" or "best" scales?
  4. Inference Batch Size: Does the inference batch size need to match the calibration batch size to maintain accuracy, given that the scales seem sensitive to it?
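Regarding question 2, the toy sketch below is explicitly not ModelOpt's actual calibrator; it only illustrates how any mechanism in which per-batch statistics influence the histogram bin edges can make the final histogram, and hence the entropy-derived scale, depend on batch size even when the total calibration data is identical:

import numpy as np

rng = np.random.default_rng(0)
data = np.abs(rng.standard_normal(6 * 100)).astype(np.float32)  # 6 "images" of 100 values

def streaming_hist(samples, batch_size, bins=128):
    """Toy streaming histogram whose range is fixed by the first batch's max."""
    edges, counts = None, np.zeros(bins, dtype=np.int64)
    for i in range(0, len(samples), batch_size):
        batch = samples[i:i + batch_size]
        if edges is None:
            edges = np.linspace(0.0, batch.max(), bins + 1)  # range set by first batch only
        counts += np.histogram(batch, bins=edges)[0]          # values above the range are dropped
    return edges, counts

edges_b1, _ = streaming_hist(data, batch_size=100)  # analogous to CAL_BATCH_SIZE = 1
edges_b6, _ = streaming_hist(data, batch_size=600)  # analogous to CAL_BATCH_SIZE = 6
print("histogram range, bs=1:", edges_b1[-1], "  bs=6:", edges_b6[-1])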

Environment

  • OS: Ubuntu 24.04.3 LTS
  • CPU architecture: x86_64
  • GPU name: NVIDIA GeForce RTX 4060 Ti
  • GPU memory size: 8.0 GB
  • Number of GPUs: 1
  • Library versions (if applicable):
    • Python: 3.12.3
    • ModelOpt version or commit hash: 0.39.0
    • CUDA: 13.0
    • PyTorch: 2.9.1+cu128
    • Transformers: 4.57.1
    • TensorRT-LLM: ?
    • ONNXRuntime: 1.22.0
    • TensorRT: 10.13.3.9
