Add FP8 QKVO + NVFP4 MLP PTQ recipe#1213

Open
yueshen2016 wants to merge 2 commits into main from yueshen/fp8_qkvo-nvfp4_mlp

Conversation

@yueshen2016
Contributor

@yueshen2016 yueshen2016 commented Apr 9, 2026

Add a new PTQ recipe that applies FP8 per-tensor (W8A8) quantization to attention Q/K/V/O projections and NVFP4 block-wise (W4A4) quantization to MLP/MoE layers with max calibration.

Example usage on Gemma-4-31B-IT:

cd /opt/Model-Optimizer/examples/llm_ptq && python hf_ptq.py \
  --pyt_ckpt_path /models/gemma-4-31B-it \
  --recipe general/ptq/fp8_qkvo-nvfp4_mlp \
  --calib_size 512 \
  --dataset cnn_dailymail \
  --export_path /models/gemma-4-31B-it-fp8qkvo-nvfp4mlp

What does this PR do?

Type of change: New feature

Adds a new built-in PTQ recipe fp8_qkvo-nvfp4_mlp that combines two quantization strategies:

  • FP8 W8A8 (e4m3, per-tensor) for attention Q, K, V, O projections
  • NVFP4 W4A4 (e2m1, block size 16, dynamic scaling with e4m3 scales) for MLP and MoE layers

This mixed-precision recipe targets a balance between model quality and inference performance: attention projections stay at higher precision (FP8) while MLP layers are quantized aggressively (NVFP4). Standard components (routers, norms, lm_head, BatchNorm, etc.) are left unquantized.
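The recipe file itself is not reproduced in this description. As a rough sketch only, a config of this shape might look like the following; pattern names, key spellings, and ordering are inferred from the review walkthrough below, not copied from the shipped file:

```yaml
# Hypothetical sketch of modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml;
# key names and value spellings are assumptions, not the actual file contents.
metadata:
  recipe_type: ptq
  description: FP8 per-tensor QKVO (W8A8) + NVFP4 MLP/MoE (W4A4), max calibration.
ptq_cfg:
  algorithm: max
  quant_cfg:
    default:
      enable: false                # deny-all first; later entries re-enable targets
    '*mlp*weight_quantizer':
      num_bits: e2m1               # NVFP4 values
      block_sizes: {-1: 16, type: dynamic, scale_bits: e4m3}
    '*mlp*input_quantizer':
      num_bits: e2m1
      block_sizes: {-1: 16, type: dynamic, scale_bits: e4m3}
    '*q_proj*weight_quantizer':
      num_bits: e4m3               # FP8, per-tensor
    '*q_proj*input_quantizer':
      num_bits: e4m3
    '*lm_head*':
      enable: false
```

Analogous entries would cover the k/v/o projections and the block-sparse MoE patterns.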

Usage

cd /opt/Model-Optimizer/examples/llm_ptq && python hf_ptq.py \
  --pyt_ckpt_path /models/gemma-4-31B-it \
  --recipe general/ptq/fp8_qkvo-nvfp4_mlp \
  --calib_size 512 \
  --dataset cnn_dailymail \
  --export_path /models/gemma-4-31B-it-fp8qkvo-nvfp4mlp

Testing

  • Verified recipe loads correctly via load_recipe("general/ptq/fp8_qkvo-nvfp4_mlp")
  • Tested end-to-end PTQ + export on Gemma-4-31B-IT

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ (additive recipe only)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (YAML recipe, no new code)
  • Did you update Changelog?: ❌ (can add if needed)

Additional Information

Recipe file: modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml

Summary by CodeRabbit

  • New Features
    • Added a new post-training quantization recipe that applies FP8 precision to attention projection weights/activations and NVFP4 precision to MLP and MoE layers, with targeted enable/disable rules to restrict quantization to intended components for more efficient, fine-grained model compression.

Add a new PTQ recipe that applies FP8 per-tensor (W8A8) quantization to
attention Q/K/V/O projections and NVFP4 block-wise (W4A4) quantization
to MLP/MoE layers with max calibration.

Example usage on Gemma-4-31B-IT:

  cd /opt/Model-Optimizer/examples/llm_ptq && python hf_ptq.py \
    --pyt_ckpt_path /models/gemma-4-31B-it \
    --recipe general/ptq/fp8_qkvo-nvfp4_mlp \
    --calib_size 512 \
    --dataset cnn_dailymail \
    --export_path /models/gemma-4-31B-it-fp8qkvo-nvfp4mlp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
@yueshen2016 yueshen2016 requested a review from a team as a code owner April 9, 2026 00:41
@yueshen2016 yueshen2016 requested a review from shengliangxu April 9, 2026 00:41
@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough


Adds a new PTQ recipe YAML that sets max calibration and configures FP8 (e4m3) per-module quantization for attention Q/K/V/O projections and NVFP4 (e2m1) dynamic quantization for MLP and block-sparse MoE paths, while disabling quantization for many non-target modules.

Changes

Recipe Configuration: modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml
Added a PTQ recipe with metadata.recipe_type: ptq and ptq_cfg.algorithm: max. Enables dynamic NVFP4 (e2m1) with scale_bits: e4m3 and block_sizes: {-1: 16} for *mlp* and *block_sparse_moe* weight/input quantizers. Enables FP8 (e4m3) per-module quantization for *{q,k,v,o}_proj* weight/input quantizers with axis:. Disables quantization for broad defaults and patterns (e.g., default, router/gate, *lm_head*, nn.BatchNorm*, nn.LeakyReLU).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks: ✅ 4 passed
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately and concisely describes the main change: adding a new PTQ recipe that combines FP8 quantization for attention projections (QKVO) with NVFP4 quantization for MLP layers.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.
  • Security Anti-Patterns ✅ Passed: PR adds only a YAML configuration file with no Python code changes; the security anti-patterns check targets Python code only.




@github-actions
Contributor

github-actions bot commented Apr 9, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1213/

Built to branch gh-pages at 2026-04-09 00:51 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml`:
- Line 18: The description is inaccurate: NVFP4 is configured with dynamic
weight quantization (see block_sizes.type: dynamic in the weight quantizer
blocks) but the text says "static weight"; update the description string in
fp8_qkvo-nvfp4_mlp.yml to reflect "NVFP4 dynamic weight and dynamic activation"
(or similar), and audit other NVFP4 recipes (e.g., nvfp4_mlp_only-fp8_kv.yml,
nvfp4_experts_only-fp8_kv.yml) to ensure their description matches the actual
config where weight quantizers use block_sizes.type: dynamic; alternatively, if
you intended static weights, change the weight quantizer block_sizes.type values
to "static" in the weight quantizer sections to match the existing description.
- Around line 85-86: Move the catch-all deny entry "default: enable: false" so
it is the first entry under quant_cfg (so later per-layer enables override it)
and update the YAML ordering accordingly; also reconcile the metadata string
"NVFP4 static weight and dynamic activation" with the actual weight quantizer
settings by either changing the metadata to state "dynamic weights" or changing
the weight quantizers (the entries under e.g. the weight quantizers at lines
where type: dynamic is set) to use a static weight quantizer type, ensuring
metadata and the weight quantizer "type" settings are consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6ad2a150-8d54-4f2d-9f30-ac49bc376654

📥 Commits

Reviewing files that changed from the base of the PR and between cccfded and f1e85d9.

📒 Files selected for processing (1)
  • modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml


metadata:
  recipe_type: ptq
  description: FP8 per-tensor weight and activation for attention QKVO projections (W8A8), NVFP4 static weight and dynamic activation for MLP layers (W4A4), max calibration.
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check wording/config consistency with other PTQ recipes.
rg -n -C2 'static weight|dynamic activation|block_sizes:|type:\s*dynamic|NVFP4' modelopt_recipes/general/ptq

Repository: NVIDIA/Model-Optimizer

Length of output: 13727


Update description to match config: NVFP4 uses dynamic weight quantization, not static.

Line 18 describes "NVFP4 static weight and dynamic activation" but the weight quantizers at lines 23-29 and 37-43 both set block_sizes.type: dynamic. This discrepancy also appears in other NVFP4 recipes (e.g., nvfp4_mlp_only-fp8_kv.yml, nvfp4_experts_only-fp8_kv.yml), all with "static weight" in descriptions but type: dynamic in configs. Clarify the intended quantization scheme across all affected recipes.
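To make the static-vs-dynamic distinction concrete: with dynamic scaling, per-block scales are recomputed from the tensor at quantization time rather than frozen during calibration. The sketch below is framework-free and is not ModelOpt's implementation; integer rounding is a coarse stand-in for the actual e2m1 value grid, and the function name is hypothetical:

```python
def quantize_block_dynamic(row, block=16, qmax=6.0):
    """Fake-quantize one weight row in blocks. Each block's scale is computed
    dynamically from that block's absolute max; qmax=6.0 mimics the largest
    e2m1 magnitude. Rounding to integers is a simplification of e2m1."""
    assert len(row) % block == 0, "length must be divisible by the block size"
    out = []
    for i in range(0, len(row), block):
        chunk = row[i:i + block]
        amax = max(abs(x) for x in chunk)
        scale = amax / qmax if amax > 0 else 1.0  # recomputed per call: "dynamic"
        out.extend(max(-qmax, min(qmax, round(x / scale))) * scale for x in chunk)
    return out

row = [0.03 * ((-1) ** i) * i for i in range(32)]
deq = quantize_block_dynamic(row)
print(len(deq))  # 32
```

Because the scales are derived from the data on every call, applying the same routine to activations at inference time is what "dynamic activation" quantization refers to; "static weight" would instead mean the weight scales are fixed once during calibration.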


Comment on lines +85 to +86:

  default:
    enable: false
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify precedence + legacy "default" conversion behavior in quant config handling.
rg -n -C3 'Entries are applied \*\*in list order\*\*|if key == "default"|key = "\*"' modelopt/torch/quantization/config.py

Repository: NVIDIA/Model-Optimizer

Length of output: 841


🏁 Script executed:

cat -n modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml | head -120

Repository: NVIDIA/Model-Optimizer

Length of output: 3915


Move default: enable: false to the start of quant_cfg to match documented precedence rules.

Per the config documentation, entries are applied in order with later entries overriding earlier ones. The recommended pattern is to start with a deny-all entry (default: enable: false), then add specific enables afterward. The current YAML has this reversed—the catch-all appears at line 85-86 after all the enable: true entries, which violates the intended precedence logic.

Additionally, the metadata (line 18) describes "NVFP4 static weight and dynamic activation" but the weight quantizers (lines 25-27, 40-42) configure type: dynamic. Clarify whether weights should be static or dynamic per the intended recipe behavior.

Proposed precedence fix
 ptq_cfg:
   algorithm: max
   quant_cfg:
+    default:
+      enable: false
     # NVFP4 W4A4 for MLP / MoE layers
     '*mlp*weight_quantizer':
       num_bits: e2m1
       block_sizes:
         type: dynamic
@@
-    # Standard disables (routers, norms, lm_head, etc.)
-    default:
-      enable: false
+    # Standard disables (routers, norms, lm_head, etc.)
     '*block_sparse_moe.gate*':
       enable: false

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml (2)

18-19: ⚠️ Potential issue | 🟡 Minor

Metadata description is inconsistent with the actual quantization config.

Line 18 says NVFP4 uses static weights for MLP, but Lines 24-50 configure block_sizes.type: dynamic and also target MoE (*block_sparse_moe*). Please align description with config behavior.

Proposed metadata fix
-  description: FP8 per-tensor weight and activation for attention QKVO projections (W8A8), NVFP4 static weight and dynamic activation for MLP layers (W4A4),
+  description: FP8 per-tensor weight and activation for attention QKVO projections (W8A8), NVFP4 dynamic weight and dynamic activation for MLP/MoE layers (W4A4),
     max calibration.

86-87: ⚠️ Potential issue | 🔴 Critical

default: enable: false ordering can disable all prior enables.

With precedence applied in list order, placing default at Line 86 after the enable: true entries risks overriding them. Move default to the top of quant_cfg so specific patterns can re-enable targeted quantizers.

Proposed ordering fix
 ptq_cfg:
   algorithm: max
   quant_cfg:
+    default:
+      enable: false
     # NVFP4 W4A4 for MLP / MoE layers
     '*mlp*weight_quantizer':
       num_bits: e2m1
@@
-    # Standard disables (routers, norms, lm_head, etc.)
-    default:
-      enable: false
+    # Standard disables (routers, norms, lm_head, etc.)
     '*block_sparse_moe.gate*':
       enable: false

Use this to confirm precedence/default handling against repo code and this recipe:

#!/bin/bash
set -euo pipefail

rg -n -C3 'Entries are applied \*\*in list order\*\*|if key == "default"|key = "\*"' modelopt/torch/quantization/config.py
cat -n modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml | sed -n '20,100p'

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0956c4d8-c96c-424e-975c-62cbdd03ec92

📥 Commits

Reviewing files that changed from the base of the PR and between f1e85d9 and 0b53b66.

📒 Files selected for processing (1)
  • modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml

@codecov

codecov bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.00%. Comparing base (cccfded) to head (0b53b66).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1213      +/-   ##
==========================================
- Coverage   71.07%   66.00%   -5.08%     
==========================================
  Files         353      353              
  Lines       40430    40430              
==========================================
- Hits        28735    26684    -2051     
- Misses      11695    13746    +2051     
Flag Coverage Δ
unit 55.17% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

max calibration.
ptq_cfg:
  algorithm: max
  quant_cfg:
Collaborator


quant_cfg now uses a list format; please convert. Check the doc:

https://nvidia.github.io/Model-Optimizer/guides/_quant_cfg.html
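A list-format conversion might look roughly like this; the exact schema should be taken from the linked guide, as this is an illustrative sketch rather than verified syntax:

```yaml
ptq_cfg:
  algorithm: max
  quant_cfg:
    - default:
        enable: false
    - '*mlp*weight_quantizer':
        num_bits: e2m1
        block_sizes: {-1: 16, type: dynamic, scale_bits: e4m3}
    - '*q_proj*weight_quantizer':
        num_bits: e4m3
```

With a list, entry order is explicit in the file, which makes the deny-all-first precedence discussed in the review comments unambiguous.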
