Description
I am using modelopt.onnx.quantization.quantize to quantize a Super Resolution model (Real-ESRGAN) to INT8 using the Entropy calibration method.
I observed a significant difference in the resulting quantization parameters (specifically scale) when I changed the CAL_BATCH_SIZE (batch size used during calibration), even though the calibration dataset and the total number of calibration samples remained exactly the same.
Reproduction Steps
- Model: realesr-general-x4v3-dn0.0.onnx
- Calibration Data: A fixed set of images (.npy format).
- Code Snippet: I ran the quantization process multiple times, changing only the batch size in calibration_shapes.
from modelopt.onnx.quantization import quantize

# ... setup code ...
# Tested CAL_BATCH_SIZE values: 1, 2, 3, 6
quantize(
    log_level='DEBUG',
    onnx_path="realesr-general-x4v3-dn0.0.onnx",
    quantize_mode="int8",
    calibration_data={"input": calib_data},  # the same full dataset every time
    calibration_shapes=f"input:{CAL_BATCH_SIZE}x3x512x512",  # <--- only this changed
    calibration_method="entropy",
    output_path=f"model_int8_b{CAL_BATCH_SIZE}.onnx",
    calibration_eps=['cpu'],
    simplify=True,
    calibrate_per_node=False,
)
Observed Behavior
I compared the generated ONNX models in Netron. The scale values on the QuantizeLinear / DequantizeLinear nodes differ significantly between models calibrated with different batch sizes (a short script after the list below dumps them for a side-by-side comparison):
- Batch Size = 1: Scale A (e.g., ~1.06)
- Batch Size = 6: Scale B (e.g., ~1.48)
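For reference, here is a minimal sketch that reads the scale initializers feeding the QuantizeLinear nodes, so the two models can be compared programmatically rather than by inspection in Netron. It assumes the output file names produced by the snippet above (model_int8_b1.onnx, model_int8_b6.onnx) and per-tensor scales stored as scalar initializers; the helper name quantize_linear_scales is just for illustration.

import onnx
from onnx import numpy_helper

def quantize_linear_scales(path):
    """Map each QuantizeLinear scale-initializer name to its scalar value."""
    model = onnx.load(path)
    inits = {i.name: numpy_helper.to_array(i) for i in model.graph.initializer}
    scales = {}
    for node in model.graph.node:
        if node.op_type == "QuantizeLinear":
            scale_name = node.input[1]  # y_scale is the second input
            if scale_name in inits and inits[scale_name].size == 1:
                scales[scale_name] = float(inits[scale_name])
    return scales

s1 = quantize_linear_scales("model_int8_b1.onnx")
s6 = quantize_linear_scales("model_int8_b6.onnx")
for name in sorted(s1.keys() & s6.keys()):
    print(f"{name}: b1={s1[name]:.4g}  b6={s6[name]:.4g}")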
Questions
- Is this expected behavior? Theoretically, since the calibration dataset is identical, shouldn't the aggregated statistics (histogram) for Entropy calibration be the same regardless of whether data is fed in batches of 1 or 6?
- Why does the batch size impact the scale so heavily? Is it related to how the histogram range is updated or expanded dynamically during the calibration loop? (A toy illustration of this suspected mechanism is sketched after this list.)
- What is the recommended practice? Should we always aim for the largest possible calibration batch size that fits in memory to get the "correct" or "best" scales?
- Inference Batch Size: Does the inference batch size need to match the calibration batch size to maintain accuracy, given that the scales seem sensitive to it?
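To make question 2 concrete, below is a toy sketch of a histogram collector. This is not ModelOpt's or TensorRT's actual entropy calibrator; it assumes a simplified expansion rule (fixed bin count; when a batch exceeds the current range, the range is doubled and adjacent bins are merged). Under that assumption, feeding the same samples in batches of 1 versus 6 can produce different bin edges and counts, and a different histogram would in turn move the KL-divergence clipping threshold and hence the scale.

import numpy as np

class ToyHistogramCollector:
    """Fixed-bin histogram whose range doubles (merging adjacent bins)
    whenever a new batch exceeds the current range. The merge step is lossy,
    so the final histogram can depend on how samples are grouped into batches."""

    def __init__(self, num_bins=128):
        self.num_bins = num_bins
        self.counts = None
        self.max_val = None

    def collect(self, batch):
        batch_max = float(np.abs(batch).max())
        if self.counts is None:
            # The first batch defines the initial range.
            self.max_val = batch_max
            self.counts = np.zeros(self.num_bins)
        while batch_max > self.max_val:
            # Double the range: merge pairs of bins, pad with empty bins.
            merged = self.counts.reshape(-1, 2).sum(axis=1)
            self.counts = np.concatenate(
                [merged, np.zeros(self.num_bins - merged.size)])
            self.max_val *= 2.0
        hist, _ = np.histogram(np.abs(batch), bins=self.num_bins,
                               range=(0.0, self.max_val))
        self.counts += hist


rng = np.random.default_rng(0)
data = rng.standard_normal((12, 3, 8, 8)).astype(np.float32)  # 12 fake samples

c1, c6 = ToyHistogramCollector(), ToyHistogramCollector()
for i in range(12):            # batch size 1
    c1.collect(data[i:i + 1])
for i in range(0, 12, 6):      # batch size 6
    c6.collect(data[i:i + 6])

print("final range  b=1:", c1.max_val, " b=6:", c6.max_val)
print("histograms identical:", np.array_equal(c1.counts, c6.counts))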
Environment
- OS: Ubuntu 24.04.3 LTS
- CPU architecture: x86_64
- GPU name: NVIDIA GeForce RTX 4060 Ti
- GPU memory size: 8.0 GB
- Number of GPUs: 1
- Library versions (if applicable):
- Python: 3.12.3
- ModelOpt version or commit hash: 0.39.0
- CUDA: 13.0
- PyTorch: 2.9.1+cu128
- Transformers: 4.57.1
- TensorRT-LLM: ?
- ONNXRuntime: 1.22.0
- TensorRT: 10.13.3.9