Merged
46 changes: 44 additions & 2 deletions modelopt/torch/export/quant_utils.py
@@ -236,6 +236,31 @@ def get_scaling_factor(quantizer: TensorQuantizer) -> torch.Tensor:
    return scaling_factor


def _ensure_weight_quantizer_calibrated(
    weight_quantizer: TensorQuantizer, weight: torch.Tensor, module_name: str = ""
) -> None:
    """Calibrate the weight quantizer if amax is not set.

    This is a lazy calibration pattern used during export when weight quantizers
    may not have been calibrated during the main calibration phase.

    Args:
        weight_quantizer: The weight quantizer to calibrate.
        weight: The weight tensor to use for calibration.
        module_name: Optional module name for better warning messages.
    """
    if not hasattr(weight_quantizer, "_amax") or weight_quantizer._amax is None:
        warn(
            f"Weight quantizer{f' for {module_name}' if module_name else ''} was not calibrated; "
            f"computing amax from the weights. This may occur if some experts were not "
            f"activated during calibration (expected for MoE models); try increasing --calib_size."
        )
        weight_quantizer.reset_amax()
        enable_stats_collection(weight_quantizer)
        weight_quantizer(weight)
        finish_stats_collection(weight_quantizer)


Comment on lines 239 to 262
Contributor
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

vLLM NVFP4 quantization activation scales requirements

💡 Result:

For vLLM NVFP4 (FP4) quantization, the activation-scale requirements are:

  • You must have per-tensor “global” activation scales computed offline via calibration data (i.e., NVFP4 requires a calibration dataset to determine these global activation scales). The per-group (group size = 16) local activation scales are generated dynamically at inference time. [1]
  • Group size is fixed at 16 (you can’t change it). [2]
  • Activation quantization only runs on NVIDIA Blackwell-class GPUs (SM100+); on older GPUs, vLLM will run weight-only (no activation quantization). [1]
  • vLLM’s Marlin FP4 path does not support pairing NVFP4 weights with INT8/FP8 activations. [3]

Sources: [1] [2] [3]


🌐 Web query:

TensorRT-LLM NVFP4 quantization input scales deployment

💡 Result:

In TensorRT-LLM, NVFP4 “input/activation scales” are not something you pass at serving time; they’re produced/embedded during quantization/calibration and then used by the engine at runtime.

  • Weights (W) in NVFP4: TensorRT requires per-block scales with block size 16 for NVFP4 weight quantization, and those scales live with the quantized weights/checkpoint/engine artifacts. [1]
  • Activations (A) in NVFP4: TensorRT uses dynamic, per-block (block size 16) activation quantization. Practically, the engine computes a per-block scale from the current activation block at runtime. [1]
  • What TensorRT-LLM calls “activation global scale”: For NVFP4 activations, TensorRT supports Dynamic Double Quantization, where the per-block activation scales are themselves quantized (to FP8), and that scale-quantization needs an offline-calibrated per-tensor scalar (often described as a “global scale”, globalSf). This globalSf is what you calibrate offline and then deploy with the engine—again, not provided per request. [2]
  • Deployment workflow (where scales come from): Use a quantization flow (e.g., NeMo/ModelOpt → TensorRT-LLM export) that includes a calibration step to obtain algorithm-specific scaling factors, producing a TensorRT-LLM-ready checkpoint/dir used to build the serving engine. [3]
  • How you “enable” NVFP4 in TRT-LLM code/config: TensorRT-LLM has an NVFP4 quantization mode flag (e.g., QuantMode.NVFP4), which is what causes the runtime to use the NVFP4 quantized kernels/scales. [4]

Implication for serving: your client still sends normal (e.g., FP16/BF16) inputs; the TensorRT(-LLM) engine applies NVFP4 activation quantization internally using the calibrated globalSf plus runtime-computed per-block scales. [1][2]

Sources: [1] NVIDIA TensorRT “Working with Quantized Types” (NVFP4 block size + activation scheme) https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html, [2] same doc section on Dynamic Double Quantization / globalSf (NVFP4) https://docs.nvidia.com/deeplearning/tensorrt/10.13.0/inference-library/work-quantized-types.html, [3] NeMo Quantization workflow (calibration produces scaling factors; output used to build TRT-LLM engine) https://docs.nvidia.com/nemo-framework/user-guide/25.09/model-optimization/quantization/quantization.html, [4] TensorRT-LLM quantization mode flags (includes NVFP4) https://nvidia.github.io/TensorRT-LLM/_modules/tensorrt_llm/quantization/mode.html
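The double-quantization scale math described above can be sketched in a few lines. This is an illustrative sketch only: the constants and helper names (`FP4_MAX`, `FP8_MAX`, `global_scale_from_amax`, `quantized_block_scale`) are assumptions for exposition, not the TensorRT-LLM API.

```python
FP4_MAX = 6.0    # largest magnitude representable in FP4E2M1
FP8_MAX = 448.0  # largest magnitude representable in FP8E4M3


def global_scale_from_amax(tensor_amax: float) -> float:
    """Offline-calibrated per-tensor scalar (the 'globalSf' role)."""
    # Chosen so that per-block scales (block_amax / FP4_MAX), once divided
    # by this global scale, land in the FP8-representable range.
    return tensor_amax / (FP8_MAX * FP4_MAX)


def quantized_block_scale(block_amax: float, global_sf: float) -> float:
    """Runtime per-block scale, expressed in units of the global scale."""
    return (block_amax / FP4_MAX) / global_sf
```

Under this convention, a block whose amax equals the calibrated tensor amax produces a stored scale of exactly 448, the top of the FP8 range.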


🌐 Web query:

NVFP4 weight-only quantization activation amax

💡 Result:

  • NVFP4 (TensorRT) is FP4E2M1 with per-block scaling (block size = 16), with data clipped to [-6, 6] during quantization. Weights can be quantized to NVFP4, and activations can be quantized too, but activation quantization is dynamic (computed at inference time). [1]
  • If you are doing NVFP4 weight-only quantization, then activations are not quantized, so there is no “activation amax” parameter to precompute/provide (amax is only needed when you actually quantize activations). [1]
  • When activations are NVFP4-quantized, TensorRT’s dynamic quantization computes the per-block scale from the block’s amax = max(abs(x)) at runtime (conceptually):

    scale = max_i(|x_i|) / qTypeMax, where qTypeMax = 6 for FP4E2M1

    so “activation amax” corresponds to that runtime max-abs statistic. [2]

Sources:
[1] NVIDIA TensorRT Docs — Working with Quantized Types (NVFP4 scheme + activation quantization requirement) (docs.nvidia.com)
[2] NVIDIA TensorRT Docs — Working with Quantized Types (dynamic quantization scale formula / max-abs) (docs.nvidia.com)
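The runtime scale computation above is simple enough to sketch in pure Python. This is a conceptual illustration of the formula (block size 16, qTypeMax = 6 for FP4E2M1); the function and parameter names are assumptions, not any framework's API.

```python
def nvfp4_block_scales(values, block_size=16, qtype_max=6.0):
    """Per-block dynamic scale = amax(block) / qtype_max, computed at runtime."""
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)  # the runtime max-abs statistic
        scales.append(amax / qtype_max)
    return scales
```

For example, a tensor whose first 16-element block peaks at 3.0 and whose second block peaks at 6.0 yields per-block scales of 0.5 and 1.0.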



Activation scales are required by deployment frameworks, but only if activation quantization is enabled.

Both vLLM and TensorRT-LLM require offline-calibrated activation scales (global scales/globalSf) when NVFP4 activation quantization is used. However, for weight-only NVFP4 deployment, activation scales are not needed.

This fix addresses weight quantizer calibration, which is sufficient for weight-only NVFP4. For MoE models where experts may not be activated during calibration, missing activation scales would be problematic only if activation quantization is required. If the Kimi-K2 checkpoint uses weight-only NVFP4, this limitation is not a concern.

🤖 Prompt for AI Agents
In `@modelopt/torch/export/quant_utils.py` around lines 239 - 256, The current
helper _ensure_weight_quantizer_calibrated should only produce weight scales and
must not attempt to produce activation scales; ensure it remains weight-only by
keeping the stats collection scoped to the provided weight_quantizer (use
enable_stats_collection(weight_quantizer) /
finish_stats_collection(weight_quantizer) as shown) and do not add any global or
activation quantizer calibration here; if activation quantization support is
needed, add a separate explicit code path elsewhere that checks an "activation
quantization enabled" flag and performs offline activation calibration (do not
rely on this weight-only helper to populate activation/global scales).
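The lazy, weight-only calibration pattern the prompt describes can be reduced to a toy illustration: calibrate only when amax is missing, and keep the work scoped to the single quantizer passed in. The `Quantizer` class below is a stand-in for exposition, not modelopt's `TensorQuantizer`.

```python
class Quantizer:
    """Minimal stand-in: tracks only an amax statistic."""

    def __init__(self):
        self._amax = None  # unset until calibration runs

    def observe(self, weight):
        # Collect the max-abs statistic from the given tensor.
        self._amax = max(abs(w) for w in weight)


def ensure_calibrated(quantizer, weight):
    """Lazily compute amax from the weight only if it was never calibrated."""
    if getattr(quantizer, "_amax", None) is None:
        quantizer.observe(weight)  # scoped to this one quantizer
    return quantizer._amax
```

A second call with different data is a no-op, since the quantizer already holds an amax; that is the property that makes the helper safe to call unconditionally during export.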


def get_activation_scaling_factor(
    module: nn.Module, input_quantizer_name: str = "input_quantizer"
) -> torch.Tensor:
@@ -279,6 +304,10 @@ def get_weight_scaling_factor(module: nn.Module, weight_name: str = "weight") ->
        QUANTIZATION_NVFP4_SVDQUANT,
        QUANTIZATION_W4A8_NVFP4_FP8,
    ]:
        # Calibrate weight quantizer if amax is not set
        module_name = f"{type(module).__name__}.{weight_name}"
        _ensure_weight_quantizer_calibrated(weight_quantizer, weight, module_name)

        if quantization_format == QUANTIZATION_W4A8_NVFP4_FP8:
            # weight_scaling_factor_2 for w4a8 needs to be amax/448, so that the wsf is in range 448/6.
            # This is because the kernel dequantizes weight to fp8, which is in range 448.
@@ -307,13 +336,26 @@ def get_weight_scaling_factor_2(module: nn.Module, weight_name: str = "weight") ->
     if weight_quantizer is None:
         return None

-    if get_quantization_format(module) in [
+    quantization_format = get_quantization_format(module)
+
+    # Calibrate weight quantizer if amax is not set for all NVFP4 variants
+    if quantization_format in [
         QUANTIZATION_NVFP4,
         QUANTIZATION_NVFP4_AWQ,
         QUANTIZATION_NVFP4_SVDQUANT,
         QUANTIZATION_W4A8_NVFP4_FP8,
     ]:
+        weight = getattr(module, weight_name)
+        module_name = f"{type(module).__name__}.{weight_name}"
+        _ensure_weight_quantizer_calibrated(weight_quantizer, weight, module_name)
+
+    if quantization_format in [
+        QUANTIZATION_NVFP4,
+        QUANTIZATION_NVFP4_AWQ,
+        QUANTIZATION_NVFP4_SVDQUANT,
+    ]:
         return NVFP4QTensor.get_weights_scaling_factor_2_from_quantizer(weight_quantizer)
-    elif get_quantization_format(module) == QUANTIZATION_W4A8_NVFP4_FP8:
+    elif quantization_format == QUANTIZATION_W4A8_NVFP4_FP8:
         # weight_scaling_factor_2 for w4a8 needs to be amax/448, so that the wsf is in range 448/6.
         # This is because the kernel dequantizes weight to fp8, which is in range 448.
         return weight_quantizer._amax.float() / 448.0
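The range claim in the w4a8 comment above checks out with a few lines of plain-Python arithmetic (illustrative only; variable names are assumptions, and no modelopt code is involved):

```python
# With wsf2 = amax / 448, the per-block weight scale (block_amax / 6) / wsf2
# peaks at exactly 448 / 6 when the block amax equals the tensor amax.
amax = 10.0                      # arbitrary example tensor-level amax
wsf2 = amax / 448.0              # weight_scaling_factor_2
wsf_max = (amax / 6.0) / wsf2    # largest possible per-block scale
assert abs(wsf_max - 448.0 / 6.0) < 1e-9
```

The example amax cancels out, which is the point: the bound 448/6 holds regardless of the tensor's actual magnitude.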