Add FP8 QKVO + NVFP4 MLP PTQ recipe#1213

Open
yueshen2016 wants to merge 2 commits into main from yueshen/fp8_qkvo-nvfp4_mlp

Conversation

@yueshen2016
Contributor

@yueshen2016 yueshen2016 commented Apr 9, 2026

Add a new PTQ recipe that applies FP8 per-tensor (W8A8) quantization to attention Q/K/V/O projections and NVFP4 block-wise (W4A4) quantization to MLP/MoE layers with max calibration.

Example usage on Gemma-4-31B-IT:

cd /opt/Model-Optimizer/examples/llm_ptq && python hf_ptq.py \
  --pyt_ckpt_path /models/gemma-4-31B-it \
  --recipe general/ptq/fp8_qkvo-nvfp4_mlp \
  --calib_size 512 \
  --dataset cnn_dailymail \
  --export_path /models/gemma-4-31B-it-fp8qkvo-nvfp4mlp

What does this PR do?

Type of change: New feature

Adds a new built-in PTQ recipe fp8_qkvo-nvfp4_mlp that combines two quantization strategies:

  • FP8 W8A8 (e4m3, per-tensor) for attention Q, K, V, O projections
  • NVFP4 W4A4 (e2m1, block size 16, dynamic scaling with e4m3 scales) for MLP and MoE layers

This mixed-precision recipe targets a balance between model quality and inference performance: attention projections stay at higher precision (FP8) while MLP layers are quantized aggressively (NVFP4). Standard components (routers, norms, lm_head, BatchNorm, etc.) are left unquantized.
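The recipe file itself is not reproduced in this description. As a rough sketch only, a config of this shape might look like the following; pattern names, key spellings, and ordering are inferred from the review walkthrough below, not copied from the shipped file:

```yaml
# Hypothetical sketch of modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml;
# key names and value spellings are assumptions, not the actual file contents.
metadata:
  recipe_type: ptq
  description: FP8 per-tensor QKVO (W8A8) + NVFP4 MLP/MoE (W4A4), max calibration.
ptq_cfg:
  algorithm: max
  quant_cfg:
    default:
      enable: false                # deny-all first; later entries re-enable targets
    '*mlp*weight_quantizer':
      num_bits: e2m1               # NVFP4 values
      block_sizes: {-1: 16, type: dynamic, scale_bits: e4m3}
    '*mlp*input_quantizer':
      num_bits: e2m1
      block_sizes: {-1: 16, type: dynamic, scale_bits: e4m3}
    '*q_proj*weight_quantizer':
      num_bits: e4m3               # FP8, per-tensor
    '*q_proj*input_quantizer':
      num_bits: e4m3
    '*lm_head*':
      enable: false
```

Analogous entries would cover the k/v/o projections and the block-sparse MoE patterns.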

Usage

cd /opt/Model-Optimizer/examples/llm_ptq && python hf_ptq.py \
  --pyt_ckpt_path /models/gemma-4-31B-it \
  --recipe general/ptq/fp8_qkvo-nvfp4_mlp \
  --calib_size 512 \
  --dataset cnn_dailymail \
  --export_path /models/gemma-4-31B-it-fp8qkvo-nvfp4mlp

Testing

  • Verified recipe loads correctly via load_recipe("general/ptq/fp8_qkvo-nvfp4_mlp")
  • Tested end-to-end PTQ + export on Gemma-4-31B-IT

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ (additive recipe only)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (YAML recipe, no new code)
  • Did you update Changelog?: ❌ (can add if needed)

Additional Information

Recipe file: modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml

Summary by CodeRabbit

  • New Features
    • Added a new post-training quantization recipe that applies FP8 precision to attention projection weights/activations and NVFP4 precision to MLP and MoE layers, with targeted enable/disable rules to restrict quantization to intended components for more efficient, fine-grained model compression.

Add a new PTQ recipe that applies FP8 per-tensor (W8A8) quantization to
attention Q/K/V/O projections and NVFP4 block-wise (W4A4) quantization
to MLP/MoE layers with max calibration.

Example usage on Gemma-4-31B-IT:

  cd /opt/Model-Optimizer/examples/llm_ptq && python hf_ptq.py \
    --pyt_ckpt_path /models/gemma-4-31B-it \
    --recipe general/ptq/fp8_qkvo-nvfp4_mlp \
    --calib_size 512 \
    --dataset cnn_dailymail \
    --export_path /models/gemma-4-31B-it-fp8qkvo-nvfp4mlp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
@yueshen2016 yueshen2016 requested a review from a team as a code owner April 9, 2026 00:41
@yueshen2016 yueshen2016 requested a review from shengliangxu April 9, 2026 00:41
@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough


Adds a new PTQ recipe YAML that sets max calibration and configures FP8 (e4m3) per-module quantization for attention Q/K/V/O projections and NVFP4 (e2m1) dynamic quantization for MLP and block-sparse MoE paths, while disabling quantization for many non-target modules.

Changes

Recipe Configuration: modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml
Added a PTQ recipe with metadata.recipe_type: ptq and ptq_cfg.algorithm: max. Enables dynamic NVFP4 (e2m1) with scale_bits: e4m3 and block_sizes: {-1: 16} for *mlp* and *block_sparse_moe* weight/input quantizers. Enables FP8 (e4m3) per-module quantization for *{q,k,v,o}_proj* weight/input quantizers with axis:. Disables quantization for broad defaults and patterns (e.g., default, router/gate, *lm_head*, nn.BatchNorm*, nn.LeakyReLU).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks: ✅ 4 passed
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately and concisely describes the main change: adding a new PTQ recipe that combines FP8 quantization for attention projections (QKVO) with NVFP4 quantization for MLP layers.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.
  • Security Anti-Patterns ✅ Passed: PR adds only a YAML configuration file with no Python code changes; the security anti-patterns check targets Python code only.




@github-actions
Contributor

github-actions bot commented Apr 9, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1213/

Built to branch gh-pages at 2026-04-09 00:51 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml`:
- Line 18: The description is inaccurate: NVFP4 is configured with dynamic
weight quantization (see block_sizes.type: dynamic in the weight quantizer
blocks) but the text says "static weight"; update the description string in
fp8_qkvo-nvfp4_mlp.yml to reflect "NVFP4 dynamic weight and dynamic activation"
(or similar), and audit other NVFP4 recipes (e.g., nvfp4_mlp_only-fp8_kv.yml,
nvfp4_experts_only-fp8_kv.yml) to ensure their description matches the actual
config where weight quantizers use block_sizes.type: dynamic; alternatively, if
you intended static weights, change the weight quantizer block_sizes.type values
to "static" in the weight quantizer sections to match the existing description.
- Around line 85-86: Move the catch-all deny entry "default: enable: false" so
it is the first entry under quant_cfg (so later per-layer enables override it)
and update the YAML ordering accordingly; also reconcile the metadata string
"NVFP4 static weight and dynamic activation" with the actual weight quantizer
settings by either changing the metadata to state "dynamic weights" or changing
the weight quantizers (the entries under e.g. the weight quantizers at lines
where type: dynamic is set) to use a static weight quantizer type, ensuring
metadata and the weight quantizer "type" settings are consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6ad2a150-8d54-4f2d-9f30-ac49bc376654

📥 Commits

Reviewing files that changed from the base of the PR and between cccfded and f1e85d9.

📒 Files selected for processing (1)
  • modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml


metadata:
  recipe_type: ptq
  description: FP8 per-tensor weight and activation for attention QKVO projections (W8A8), NVFP4 static weight and dynamic activation for MLP layers (W4A4), max calibration.
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check wording/config consistency with other PTQ recipes.
rg -n -C2 'static weight|dynamic activation|block_sizes:|type:\s*dynamic|NVFP4' modelopt_recipes/general/ptq

Repository: NVIDIA/Model-Optimizer

Length of output: 13727


Update description to match config: NVFP4 uses dynamic weight quantization, not static.

Line 18 describes "NVFP4 static weight and dynamic activation" but the weight quantizers at lines 23-29 and 37-43 both set block_sizes.type: dynamic. This discrepancy also appears in other NVFP4 recipes (e.g., nvfp4_mlp_only-fp8_kv.yml, nvfp4_experts_only-fp8_kv.yml), all with "static weight" in descriptions but type: dynamic in configs. Clarify the intended quantization scheme across all affected recipes.
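To make the static-vs-dynamic distinction concrete: with dynamic scaling, per-block scales are recomputed from the tensor at quantization time rather than frozen during calibration. The sketch below is framework-free and is not ModelOpt's implementation; integer rounding is a coarse stand-in for the actual e2m1 value grid, and the function name is hypothetical:

```python
def quantize_block_dynamic(row, block=16, qmax=6.0):
    """Fake-quantize one weight row in blocks. Each block's scale is computed
    dynamically from that block's absolute max; qmax=6.0 mimics the largest
    e2m1 magnitude. Rounding to integers is a simplification of e2m1."""
    assert len(row) % block == 0, "length must be divisible by the block size"
    out = []
    for i in range(0, len(row), block):
        chunk = row[i:i + block]
        amax = max(abs(x) for x in chunk)
        scale = amax / qmax if amax > 0 else 1.0  # recomputed per call: "dynamic"
        out.extend(max(-qmax, min(qmax, round(x / scale))) * scale for x in chunk)
    return out

row = [0.03 * ((-1) ** i) * i for i in range(32)]
deq = quantize_block_dynamic(row)
print(len(deq))  # 32
```

Because the scales are derived from the data on every call, applying the same routine to activations at inference time is what "dynamic activation" quantization refers to; "static weight" would instead mean the weight scales are fixed once during calibration.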


Comment on lines +85 to +86:

  default:
    enable: false
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify precedence + legacy "default" conversion behavior in quant config handling.
rg -n -C3 'Entries are applied \*\*in list order\*\*|if key == "default"|key = "\*"' modelopt/torch/quantization/config.py

Repository: NVIDIA/Model-Optimizer

Length of output: 841


🏁 Script executed:

cat -n modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml | head -120

Repository: NVIDIA/Model-Optimizer

Length of output: 3915


Move default: enable: false to the start of quant_cfg to match documented precedence rules.

Per the config documentation, entries are applied in order with later entries overriding earlier ones. The recommended pattern is to start with a deny-all entry (default: enable: false), then add specific enables afterward. The current YAML has this reversed—the catch-all appears at line 85-86 after all the enable: true entries, which violates the intended precedence logic.

Additionally, the metadata (line 18) describes "NVFP4 static weight and dynamic activation" but the weight quantizers (lines 25-27, 40-42) configure type: dynamic. Clarify whether weights should be static or dynamic per the intended recipe behavior.

Proposed precedence fix
 ptq_cfg:
   algorithm: max
   quant_cfg:
+    default:
+      enable: false
     # NVFP4 W4A4 for MLP / MoE layers
     '*mlp*weight_quantizer':
       num_bits: e2m1
       block_sizes:
         type: dynamic
@@
-    # Standard disables (routers, norms, lm_head, etc.)
-    default:
-      enable: false
+    # Standard disables (routers, norms, lm_head, etc.)
     '*block_sparse_moe.gate*':
       enable: false

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml (2)

18-19: ⚠️ Potential issue | 🟡 Minor

Metadata description is inconsistent with the actual quantization config.

Line 18 says NVFP4 uses static weights for MLP, but Lines 24-50 configure block_sizes.type: dynamic and also target MoE (*block_sparse_moe*). Please align description with config behavior.

Proposed metadata fix
-  description: FP8 per-tensor weight and activation for attention QKVO projections (W8A8), NVFP4 static weight and dynamic activation for MLP layers (W4A4),
+  description: FP8 per-tensor weight and activation for attention QKVO projections (W8A8), NVFP4 dynamic weight and dynamic activation for MLP/MoE layers (W4A4),
     max calibration.

86-87: ⚠️ Potential issue | 🔴 Critical

default: enable: false ordering can disable all prior enables.

With precedence applied in list order, placing default at Line 86 after the enable: true entries risks overriding them. Move default to the top of quant_cfg so specific patterns can re-enable targeted quantizers.

Proposed ordering fix
 ptq_cfg:
   algorithm: max
   quant_cfg:
+    default:
+      enable: false
     # NVFP4 W4A4 for MLP / MoE layers
     '*mlp*weight_quantizer':
       num_bits: e2m1
@@
-    # Standard disables (routers, norms, lm_head, etc.)
-    default:
-      enable: false
+    # Standard disables (routers, norms, lm_head, etc.)
     '*block_sparse_moe.gate*':
       enable: false

Use this to confirm precedence/default handling against repo code and this recipe:

#!/bin/bash
set -euo pipefail

rg -n -C3 'Entries are applied \*\*in list order\*\*|if key == "default"|key = "\*"' modelopt/torch/quantization/config.py
cat -n modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml | sed -n '20,100p'

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0956c4d8-c96c-424e-975c-62cbdd03ec92

📥 Commits

Reviewing files that changed from the base of the PR and between f1e85d9 and 0b53b66.

📒 Files selected for processing (1)
  • modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml

@codecov

codecov bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.00%. Comparing base (cccfded) to head (0b53b66).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1213      +/-   ##
==========================================
- Coverage   71.07%   66.00%   -5.08%     
==========================================
  Files         353      353              
  Lines       40430    40430              
==========================================
- Hits        28735    26684    -2051     
- Misses      11695    13746    +2051     
Flag Coverage Δ
unit 55.17% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

max calibration.
ptq_cfg:
  algorithm: max
  quant_cfg:
Collaborator


quant_cfg now uses a list format; please convert. Check the doc:

https://nvidia.github.io/Model-Optimizer/guides/_quant_cfg.html
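A list-format conversion might look roughly like this; the exact schema should be taken from the linked guide, as this is an illustrative sketch rather than verified syntax:

```yaml
ptq_cfg:
  algorithm: max
  quant_cfg:
    - default:
        enable: false
    - '*mlp*weight_quantizer':
        num_bits: e2m1
        block_sizes: {-1: 16, type: dynamic, scale_bits: e4m3}
    - '*q_proj*weight_quantizer':
        num_bits: e4m3
```

With a list, entry order is explicit in the file, which makes the deny-all-first precedence discussed in the review comments unambiguous.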
