Integrate Automated QDQ placement tool - Part 3 #703
willg-nv wants to merge 1 commit into NVIDIA:main
Conversation
@vishalpandya1990 could you help me review this PR? thanks!
Sorry for the delay. Added Ajinkya for review.
Signed-off-by: Will Guo <willg@nvidia.com>
```python
if needs_fp8_conversion:
    logger.debug("Converting INT8 to FP8")
    model = int8_to_fp8(model)
```
Is this conversion function needed or can we insert Q/DQ nodes already at the correct precision?
Tried the following test code:

```python
def test_export_quantized_model(self):
    """Test exporting quantized model with Q/DQ."""
    model = create_simple_conv_model()
    autotuner = QDQAutotuner(model)
    config = self._create_test_config()
    autotuner.initialize(config)
    with open("/tmp/autotuner_model.quant.onnx", "w") as f:  # tempfile.NamedTemporaryFile(suffix=".onnx", delete=False) as f:
        output_path = f.name
    try:
        autotuner.export_onnx(output_path, insert_qdq=True)
        # Verify the file was created
        assert os.path.exists(output_path)
        # Verify it's a valid ONNX model
        exported_model = onnx.load(output_path)
        assert exported_model is not None
        # Verify that it contains Q/DQ nodes
        qdq_nodes = [n for n in exported_model.graph.node if n.op_type in ["QuantizeLinear", "DequantizeLinear"]]
        assert qdq_nodes, "Q/DQ nodes not found in quantized model"
        print("✓ QDQAutotuner export quantized model")
    finally:
        print()
        # if os.path.exists(output_path):
        #     os.unlink(output_path)
```

But the simple Conv->Relu model didn't get quantized. Is this expected?
```
[modelopt][onnx] - DEBUG - Region 0 (level 0)
[modelopt][onnx] - DEBUG - → Pattern signature: Conv->Relu
[modelopt][onnx] - DEBUG - → No scheme available, skipping
[modelopt][onnx] - DEBUG - Matched 0/1 regions, total 0 unique insertion points
[modelopt][onnx] - DEBUG - Inserting 0 Q/DQ pairs into graph
[modelopt][onnx] - DEBUG - Serializing to ONNX format
[modelopt][onnx] - INFO - Exported INT8 model with 0 Q/DQ pairs → /tmp/autotuner_model.quant.onnx
✓ QDQAutotuner export quantized model
```
I think the above result is expected, because export_onnx(insert_qdq=True) means using the autotuner's insertion points to insert Q/DQ. Since the regions in the autotuner have not been tuned, no Q/DQ nodes are inserted.
For model = int8_to_fp8(model): I don't know how to create an FP8 QDQ ONNX model natively, so I use INT8 Q/DQ nodes and convert them to FP8.
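To illustrate the idea behind such a rewrite, here is a minimal sketch that switches INT8 Q/DQ nodes to FP8 over a plain-dict stand-in for the graph. This is not the actual `int8_to_fp8` implementation: a real version would rewrite the zero-point initializers of `QuantizeLinear`/`DequantizeLinear` nodes to an FP8 element type (e.g. FLOAT8E4M3FN) using the onnx library, and the dict-based graph below is purely illustrative.

```python
# Schematic INT8 -> FP8 Q/DQ rewrite over a toy dict-based graph.
# The real tool operates on onnx.GraphProto; this stand-in only
# demonstrates the "find Q/DQ nodes and retype them" pattern.

def int8_to_fp8_sketch(graph):
    """Return a copy of `graph` with Q/DQ quant dtypes switched to FP8."""
    converted = []
    for node in graph:
        node = dict(node)  # shallow copy; don't mutate the input graph
        if node["op_type"] in ("QuantizeLinear", "DequantizeLinear"):
            if node.get("quant_dtype") == "INT8":
                node["quant_dtype"] = "FLOAT8E4M3FN"
        converted.append(node)
    return converted

graph = [
    {"op_type": "QuantizeLinear", "quant_dtype": "INT8"},
    {"op_type": "Conv"},
    {"op_type": "DequantizeLinear", "quant_dtype": "INT8"},
]
fp8_graph = int8_to_fp8_sketch(graph)
print([n.get("quant_dtype") for n in fp8_graph])
# → ['FLOAT8E4M3FN', None, 'FLOAT8E4M3FN']
```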
```python
scheme_idx = autotuner.generate()

# Should return a valid index (>= 0) or -1 if no more unique schemes
assert isinstance(scheme_idx, int)
```
What's the expected scheme_idx for create_simple_conv_model()? Please update this assert accordingly. Thanks.
Can we add a test file for
```python
# TensorRT Benchmark
trt_group = parser.add_argument_group("TensorRT Benchmark")
trt_group.add_argument(
    "--use_trtexec",
```
The following CLI fails to perform benchmark / quantize the model (this uses TensorRTPyBenchmark):

```shell
$ python -m modelopt.onnx.quantization.autotune --onnx_path=conv_relu.onnx
```

Error:

```
[modelopt][onnx] - ERROR - Benchmark instance not initialized
[modelopt][onnx] - INFO - Results: 3.73 ms → failed (invalid measurement)
```
This failure happens because pycuda was not installed. After installing that dependency, no error is thrown but the model is not quantized.
- @ajrasane should we create another optional_dep in setup.py with autotune's dependencies?
If --use_trtexec is used, autotune does not fail, but it also doesn't generate a quantized model.
This is because Latency is used as the measurement instead of GPU Compute Time.
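As a hedged sketch of the fix being suggested, a benchmark wrapper could parse the GPU Compute Time summary (rather than end-to-end Latency) from trtexec's log. The sample log line below is only representative; the exact formatting varies across TensorRT versions, so the regex is an assumption, not the tool's actual parser.

```python
import re

# Representative trtexec-style summary line (format is an assumption):
LOG = "[I] GPU Compute Time: min = 0.512 ms, max = 1.204 ms, mean = 0.731 ms, median = 0.729 ms"

def parse_gpu_compute_mean_ms(log_text):
    """Extract the mean GPU Compute Time in ms from a trtexec-style log."""
    m = re.search(r"GPU Compute Time:.*?mean\s*=\s*([\d.]+)\s*ms", log_text)
    if m is None:
        raise ValueError("GPU Compute Time not found in log")
    return float(m.group(1))

mean_ms = parse_gpu_compute_mean_ms(LOG)
print(mean_ms)  # → 0.731
```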
If it is just pycuda, we can probably just include it in the modelopt ONNX dependencies. But if we have more dependencies, it would be better to create a new section in setup.py with the autotune dependencies.
@willg-nv how should we approach the tensorrt / trtexec requirements for autotune? Are we just adding a disclaimer for the user in the README or adding that in setup.py?
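For reference, an optional-dependency section of the kind discussed above might look like the following. The extra's name ("onnx-autotune") and its dependency list are assumptions for illustration, not the repository's actual packaging.

```python
# Hypothetical extras_require entry for setup.py. Both the extra's
# name and the pinned packages are assumed, not taken from the repo.
extras_require = {
    "onnx-autotune": [
        "pycuda",    # needed by TensorRTPyBenchmark
        "tensorrt",  # engine build + benchmarking
    ],
}

print(sorted(extras_require["onnx-autotune"]))  # → ['pycuda', 'tensorrt']
```

Users would then opt in with something like `pip install nvidia-modelopt[onnx-autotune]` (the package name here is also an assumption).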
Suggestion for
## What does this PR do?

**Type of change:** new feature

**Overview:** This PR integrates an automatic QDQ placement tool into ModelOpt. This PR is part 1 of 4 and contains the following changes:

1. Defines common types: Region, RegionType, error types
2. Defines InsertionPoints (the logical locations to place QDQ pairs) and InsertionScheme (a set of insertion points)
3. Unit tests for the new types

Part 1: #701
Part 2: #702
Part 3: #703
Part 4: #704

## Usage

```python
# Region type usage:
region = Region(region_id=1, level=0, region_type=RegionType.LEAF)
assert region.get_id() == 1
assert region.get_level() == 0
region.add_node(1)  # 1 is the index of an ONNX graph node
...
point = NodeInputInsertionPoint(node_index=0, input_index=2)
assert point.node_index == 0  # relative node index in the region
assert point.input_index == 2  # relative input tensor index in the specific node
resolved = point.resolve(region, graph)
...
```

## Testing

Implemented unit tests; all tests pass.

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: No, documentation changes will be included in part 4.
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: No, this can be done once all parts of the change are merged.

## Additional Information

## Summary by CodeRabbit

* **New Features**
  * Added foundational autotuner infrastructure for quantization optimization, including region hierarchies and insertion scheme management.
  * Introduced an insertion point system for managing quantize/dequantize operation placement across ONNX graph regions.
  * Added utility functions for tensor consumer mapping and boolean operation identification.
* **Tests**
  * Added comprehensive test coverage for autotuner components, insertion points, and region management.

Signed-off-by: Will Guo <willg@nvidia.com>
What does this PR do?
Type of change: new feature
Overview: This PR integrates an automated QDQ placement tool into ModelOpt. This PR is part 3 of 4 and contains the following changes:
Part 1: #701
Part 2: #702
Part 3: #703
Part 4: #704
Usage
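The usage snippet was elided in this copy of the PR body. As a hedged sketch only, the autotune loop discussed in this conversation (generate a candidate scheme, benchmark it, keep the fastest) might look like the following; QDQAutotuner's API details are assumed from the review comments (generate() returns a scheme index, or -1 once no unique schemes remain), and the benchmark is stubbed out.

```python
# Schematic autotune loop: keep generating candidate Q/DQ insertion
# schemes and retain the one with the lowest measured latency.
# Both callables are stand-ins for the real QDQAutotuner/benchmark.

def autotune(generate, benchmark):
    best_idx, best_ms = -1, float("inf")
    while True:
        idx = generate()
        if idx < 0:  # no more unique schemes
            break
        ms = benchmark(idx)
        if ms < best_ms:
            best_idx, best_ms = idx, ms
    return best_idx, best_ms

# Stub: three schemes with fake latencies, then exhaustion (-1).
schemes = iter([0, 1, 2, -1])
latencies = {0: 3.7, 1: 2.9, 2: 3.1}
result = autotune(lambda: next(schemes), latencies.__getitem__)
print(result)  # → (1, 2.9)
```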
Testing
Implemented unit tests for QDQAutotuner and Config classes.
Before your PR is "Ready for review"
Additional Information