Add a new PTQ recipe that applies FP8 per-tensor (W8A8) quantization to
attention Q/K/V/O projections and NVFP4 block-wise (W4A4) quantization
to MLP/MoE layers with max calibration.
Example usage on Gemma-4-31B-IT:
cd /opt/Model-Optimizer/examples/llm_ptq && python hf_ptq.py \
--pyt_ckpt_path /models/gemma-4-31B-it \
--recipe general/ptq/fp8_qkvo-nvfp4_mlp \
--calib_size 512 \
--dataset cnn_dailymail \
--export_path /models/gemma-4-31B-it-fp8qkvo-nvfp4mlp
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
Actionable comments posted: 2
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6ad2a150-8d54-4f2d-9f30-ac49bc376654
📒 Files selected for processing (1)
modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml
```yaml
metadata:
  recipe_type: ptq
  description: FP8 per-tensor weight and activation for attention QKVO projections (W8A8), NVFP4 static weight and dynamic activation for MLP layers (W4A4), max calibration.
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Check wording/config consistency with other PTQ recipes.
rg -n -C2 'static weight|dynamic activation|block_sizes:|type:\s*dynamic|NVFP4' modelopt_recipes/general/ptq
```

Repository: NVIDIA/Model-Optimizer
Length of output: 13727
Update description to match config: NVFP4 uses dynamic weight quantization, not static.
Line 18 describes "NVFP4 static weight and dynamic activation" but the weight quantizers at lines 23-29 and 37-43 both set block_sizes.type: dynamic. This discrepancy also appears in other NVFP4 recipes (e.g., nvfp4_mlp_only-fp8_kv.yml, nvfp4_experts_only-fp8_kv.yml), all with "static weight" in descriptions but type: dynamic in configs. Clarify the intended quantization scheme across all affected recipes.
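For reference, the two variants under discussion differ only in the `block_sizes.type` field. A sketch of both, with the field layout inferred from the snippets quoted in this review (whether `type: static` is an accepted value is the reviewer's suggestion, not verified here):

```yaml
# As configured today -- dynamic weight quantization:
'*mlp*weight_quantizer':
  num_bits: e2m1
  block_sizes:
    type: dynamic   # scales recomputed on the fly
---
# As the description claims -- static weight quantization (illustrative only):
'*mlp*weight_quantizer':
  num_bits: e2m1
  block_sizes:
    type: static    # scales fixed at calibration time
```

Either the description or the config should change so the two agree; the fix direction depends on the intended recipe behavior.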
```yaml
default:
  enable: false
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Verify precedence + legacy "default" conversion behavior in quant config handling.
rg -n -C3 'Entries are applied \*\*in list order\*\*|if key == "default"|key = "\*"' modelopt/torch/quantization/config.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 841

🏁 Script executed:

```shell
cat -n modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml | head -120
```

Repository: NVIDIA/Model-Optimizer
Length of output: 3915
Move `default: enable: false` to the start of `quant_cfg` to match documented precedence rules.
Per the config documentation, entries are applied in order with later entries overriding earlier ones. The recommended pattern is to start with a deny-all entry (default: enable: false), then add specific enables afterward. The current YAML has this reversed—the catch-all appears at line 85-86 after all the enable: true entries, which violates the intended precedence logic.
Additionally, the metadata (line 18) describes "NVFP4 static weight and dynamic activation" but the weight quantizers (lines 25-27, 40-42) configure type: dynamic. Clarify whether weights should be static or dynamic per the intended recipe behavior.
Proposed precedence fix:

```diff
 ptq_cfg:
   algorithm: max
   quant_cfg:
+    default:
+      enable: false
     # NVFP4 W4A4 for MLP / MoE layers
     '*mlp*weight_quantizer':
       num_bits: e2m1
       block_sizes:
         type: dynamic
@@
-    # Standard disables (routers, norms, lm_head, etc.)
-    default:
-      enable: false
+    # Standard disables (routers, norms, lm_head, etc.)
     '*block_sparse_moe.gate*':
       enable: false
```
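The ordering argument can be sketched in a few lines. This is a simulation of the documented "entries apply in list order, later matches override earlier ones" rule, not the actual modelopt resolution code; the `resolve` helper and the pattern entries are illustrative only:

```python
from fnmatch import fnmatch

def resolve(quant_cfg, layer_name):
    """Apply pattern entries in list order; later matches override earlier ones."""
    setting = {}
    for pattern, cfg in quant_cfg:
        # Treat the literal key "default" as a match-everything pattern.
        if pattern == "default" or fnmatch(layer_name, pattern):
            setting.update(cfg)
    return setting

# Catch-all last (the current recipe's shape): it overrides the earlier enable.
bad = [
    ("*mlp*weight_quantizer", {"enable": True, "num_bits": "e2m1"}),
    ("default", {"enable": False}),
]
# Catch-all first (the proposed fix): specific entries re-enable their targets.
good = [
    ("default", {"enable": False}),
    ("*mlp*weight_quantizer", {"enable": True, "num_bits": "e2m1"}),
]

name = "model.layers.0.mlp.up_proj.weight_quantizer"
print(resolve(bad, name)["enable"])   # False: the trailing deny-all wins
print(resolve(good, name)["enable"])  # True: the specific enable wins
```

Under this model, the trailing `default` silently disables every quantizer enabled above it, which is why moving it to the top matters.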
♻️ Duplicate comments (2)
modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml (2)
18-19: ⚠️ Potential issue | 🟡 Minor

Metadata description is inconsistent with the actual quantization config.

Line 18 says NVFP4 uses static weights for MLP, but lines 24-50 configure `block_sizes.type: dynamic` and also target MoE (`*block_sparse_moe*`). Please align the description with the config behavior.

Proposed metadata fix:

```diff
-  description: FP8 per-tensor weight and activation for attention QKVO projections (W8A8), NVFP4 static weight and dynamic activation for MLP layers (W4A4), max calibration.
+  description: FP8 per-tensor weight and activation for attention QKVO projections (W8A8), NVFP4 dynamic weight and dynamic activation for MLP/MoE layers (W4A4), max calibration.
```
86-87: ⚠️ Potential issue | 🔴 Critical

`default: enable: false` ordering can disable all prior enables.

With precedence applied in list order, placing `default` at line 86 after the `enable: true` entries risks overriding them. Move `default` to the top of `quant_cfg` so specific patterns can re-enable targeted quantizers.

Proposed ordering fix:

```diff
 ptq_cfg:
   algorithm: max
   quant_cfg:
+    default:
+      enable: false
     # NVFP4 W4A4 for MLP / MoE layers
     '*mlp*weight_quantizer':
       num_bits: e2m1
@@
-    # Standard disables (routers, norms, lm_head, etc.)
-    default:
-      enable: false
+    # Standard disables (routers, norms, lm_head, etc.)
     '*block_sparse_moe.gate*':
       enable: false
```

Use this to confirm precedence/default handling against repo code and this recipe:

```shell
#!/bin/bash
set -euo pipefail
rg -n -C3 'Entries are applied \*\*in list order\*\*|if key == "default"|key = "\*"' modelopt/torch/quantization/config.py
cat -n modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml | sed -n '20,100p'
```
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0956c4d8-c96c-424e-975c-62cbdd03ec92
📒 Files selected for processing (1)
modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #1213      +/-   ##
==========================================
- Coverage   71.07%   66.00%   -5.08%
==========================================
  Files         353      353
  Lines       40430    40430
==========================================
- Hits        28735    26684    -2051
- Misses      11695    13746    +2051
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
```yaml
    max calibration.
ptq_cfg:
  algorithm: max
  quant_cfg:
```
`quant_cfg` now uses list format; please convert. See the doc:
https://nvidia.github.io/Model-Optimizer/guides/_quant_cfg.html
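The linked guide is not reproduced in this thread, so the exact list schema is an assumption; one plausible shape, purely illustrative, keeps each pattern/config pair as its own list entry so the order (and thus precedence) is explicit:

```yaml
quant_cfg:
  - default:
      enable: false
  - '*mlp*weight_quantizer':
      num_bits: e2m1
      block_sizes:
        type: dynamic
```

Consult the linked documentation for the actual entry structure before converting.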
What does this PR do?
Type of change: New feature
Adds a new built-in PTQ recipe `fp8_qkvo-nvfp4_mlp` that combines two quantization strategies.

This mixed-precision recipe targets a balance between model quality and inference performance: keeping attention projections at higher precision (FP8) while aggressively quantizing MLP layers (NVFP4). Standard components (routers, norms, lm_head, BatchNorm, etc.) are left unquantized.
Usage
Testing
`load_recipe("general/ptq/fp8_qkvo-nvfp4_mlp")`

Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: N/A

Additional Information
Recipe file:
`modelopt_recipes/general/ptq/fp8_qkvo-nvfp4_mlp.yml`

Summary by CodeRabbit