Description
I am using modelopt.onnx.quantization.quantize to quantize a Super Resolution model (Real-ESRGAN) to INT8 using the Entropy calibration method.
I observed a significant difference in the resulting quantization parameters (specifically scale) when I changed the CAL_BATCH_SIZE (batch size used during calibration), even though the calibration dataset and the total number of calibration samples remained exactly the same.
Reproduction Steps
- Model: realesr-general-x4v3-dn0.0.onnx
- Calibration Data: A fixed set of images (.npy format).
- Code Snippet: I ran the quantization process multiple times, changing only the batch size in calibration_shapes.
from modelopt.onnx.quantization import quantize

# ... setup code ...
# Tested CAL_BATCH_SIZE values: 1, 2, 3, 6
quantize(
    log_level='DEBUG',
    onnx_path="realesr-general-x4v3-dn0.0.onnx",
    quantize_mode="int8",
    calibration_data={"input": calib_data},  # the same full dataset every time
    calibration_shapes=f"input:{CAL_BATCH_SIZE}x3x512x512",  # <--- only this changed
    calibration_method="entropy",
    output_path=f"model_int8_b{CAL_BATCH_SIZE}.onnx",
    calibration_eps=['cpu'],
    simplify=True,
    calibrate_per_node=False,
)
Observed Behavior
I compared the generated ONNX models in Netron. The scale values on the QuantizeLinear / DequantizeLinear nodes differ significantly between models calibrated with different batch sizes (a short script after the list below dumps them for a side-by-side comparison):
- Batch Size = 1: Scale A (e.g., ~1.06)
- Batch Size = 6: Scale B (e.g., ~1.48)
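For reference, here is a minimal sketch that reads the scale initializers feeding the QuantizeLinear nodes, so the two models can be compared programmatically rather than by inspection in Netron. It assumes the output file names produced by the snippet above (model_int8_b1.onnx, model_int8_b6.onnx) and per-tensor scales stored as scalar initializers; the helper name quantize_linear_scales is just for illustration.

import onnx
from onnx import numpy_helper

def quantize_linear_scales(path):
    """Map each QuantizeLinear scale-initializer name to its scalar value."""
    model = onnx.load(path)
    inits = {i.name: numpy_helper.to_array(i) for i in model.graph.initializer}
    scales = {}
    for node in model.graph.node:
        if node.op_type == "QuantizeLinear":
            scale_name = node.input[1]  # y_scale is the second input
            if scale_name in inits and inits[scale_name].size == 1:
                scales[scale_name] = float(inits[scale_name])
    return scales

s1 = quantize_linear_scales("model_int8_b1.onnx")
s6 = quantize_linear_scales("model_int8_b6.onnx")
for name in sorted(s1.keys() & s6.keys()):
    print(f"{name}: b1={s1[name]:.4g}  b6={s6[name]:.4g}")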
Questions
- Is this expected behavior? Theoretically, since the calibration dataset is identical, shouldn't the aggregated statistics (histogram) for Entropy calibration be the same regardless of whether data is fed in batches of 1 or 6?
- Why does the batch size impact the scale so heavily? Is it related to how the histogram range is updated or expanded dynamically during the calibration loop? (A toy illustration of this suspected mechanism is sketched after this list.)
- What is the recommended practice? Should we always aim for the largest possible calibration batch size that fits in memory to get the "correct" or "best" scales?
- Inference Batch Size: Does the inference batch size need to match the calibration batch size to maintain accuracy, given that the scales seem sensitive to it?
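To make question 2 concrete, below is a toy sketch of a histogram collector. This is not ModelOpt's or TensorRT's actual entropy calibrator; it assumes a simplified expansion rule (fixed bin count; when a batch exceeds the current range, the range is doubled and adjacent bins are merged). Under that assumption, feeding the same samples in batches of 1 versus 6 can produce different bin edges and counts, and a different histogram would in turn move the KL-divergence clipping threshold and hence the scale.

import numpy as np

class ToyHistogramCollector:
    """Fixed-bin histogram whose range doubles (merging adjacent bins)
    whenever a new batch exceeds the current range. The merge step is lossy,
    so the final histogram can depend on how samples are grouped into batches."""

    def __init__(self, num_bins=128):
        self.num_bins = num_bins
        self.counts = None
        self.max_val = None

    def collect(self, batch):
        batch_max = float(np.abs(batch).max())
        if self.counts is None:
            # The first batch defines the initial range.
            self.max_val = batch_max
            self.counts = np.zeros(self.num_bins)
        while batch_max > self.max_val:
            # Double the range: merge pairs of bins, pad with empty bins.
            merged = self.counts.reshape(-1, 2).sum(axis=1)
            self.counts = np.concatenate(
                [merged, np.zeros(self.num_bins - merged.size)])
            self.max_val *= 2.0
        hist, _ = np.histogram(np.abs(batch), bins=self.num_bins,
                               range=(0.0, self.max_val))
        self.counts += hist


rng = np.random.default_rng(0)
data = rng.standard_normal((12, 3, 8, 8)).astype(np.float32)  # 12 fake samples

c1, c6 = ToyHistogramCollector(), ToyHistogramCollector()
for i in range(12):            # batch size 1
    c1.collect(data[i:i + 1])
for i in range(0, 12, 6):      # batch size 6
    c6.collect(data[i:i + 6])

print("final range  b=1:", c1.max_val, " b=6:", c6.max_val)
print("histograms identical:", np.array_equal(c1.counts, c6.counts))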
Environment
- OS: Ubuntu 24.04.3 LTS
- CPU architecture: x86_64
- GPU name: NVIDIA GeForce RTX 4060 Ti
- GPU memory size: 8.0 GB
- Number of GPUs: 1
- Library versions (if applicable):
- Python: 3.12.3
- ModelOpt version or commit hash: 0.39.0
- CUDA: 13.0
- PyTorch: 2.9.1+cu128
- Transformers: 4.57.1
- TensorRT-LLM: ?
- ONNXRuntime: 1.22.0
- TensorRT: 10.13.3.9