
hook up gptq prototype to nvfp4 #4302

Merged
vkuzo merged 3 commits into main from gh/vkuzo/248/head
Apr 22, 2026

Conversation

vkuzo (Contributor) commented Apr 20, 2026

Summary:

For now, this is a numerical reference which hooks up nvfp4 and verifies
that a minimal unit test (random data + toy model) works as expected:
the NVFP4 + GPTQ loss is significantly lower than the baseline loss.

Future TODOs:

  1. validate on e2e model
  2. optimize dense performance
  3. add moe support (will require custom fwd to ensure every expert sees
    calibration data)

Test Plan:

> pytest test/prototype/gptq/ -s -k nvfp4
...
test/prototype/gptq/test_gptqv2.py GPTQ loss: 0.1582, Naive loss: 0.9259
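For context, a minimal sketch of what such a comparison looks like (toy model and random data; rtn_qdq below is a hypothetical stand-in for naive round-to-nearest quantize-dequantize, not the PR's actual quantizer):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
linear = torch.nn.Linear(64, 64, bias=False)
x = torch.randn(256, 64)
y_ref = linear(x)

def rtn_qdq(w: torch.Tensor) -> torch.Tensor:
    # hypothetical symmetric int4-like round-to-nearest baseline
    scale = w.abs().amax(dim=-1, keepdim=True) / 7.0
    return torch.round(w / scale).clamp(-8, 7) * scale

# the naive loss; a GPTQ pass additionally adjusts not-yet-quantized
# columns to compensate for accumulated error, so its loss is lower
naive_loss = F.mse_loss(x @ rtn_qdq(linear.weight).t(), y_ref)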

vkuzo requested a review from jerryzh168 as a code owner, April 20, 2026 20:52
pytorch-bot commented Apr 20, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4302

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 4c86363 with merge base b3e0db2:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label, Apr 20, 2026
Comment thread: torchao/prototype/gptq/gptq_example.py (outdated)
vkuzo added the module: not user facing label, Apr 21, 2026
block_size=group_size,
orig_dtype=W_t.dtype,
per_tensor_scale=nvfp4_global_scale,
# TODO(future): get act_per_tensor_scale from calibration data?
Contributor:

it's fine we don't have this use case yet I think

combined_scale_swizzled,
block_size=group_size,
orig_dtype=W_t.dtype,
per_tensor_scale=nvfp4_global_scale,
Contributor:

although we do have use case of nvfp4_global_scale being generated from static quantization (can be added later)
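For reference, the dynamic nvfp4_global_scale under discussion is typically derived from the weight amax. A minimal sketch, assuming the standard NVFP4 constants (fp8 e4m3 block-scale max of 448, fp4 e2m1 element max of 6); torchao's actual helper may differ in details:

import torch

F8E4M3_MAX = 448.0  # max representable value of the fp8 e4m3 block-scale dtype
F4_E2M1_MAX = 6.0   # max representable value of the fp4 e2m1 element dtype

def nvfp4_global_scale_from_amax(w: torch.Tensor) -> torch.Tensor:
    # dynamic per-tensor scale, chosen so the largest per-block scale
    # still fits into fp8 e4m3 range after the two-level scaling
    return w.abs().amax().float() / (F8E4M3_MAX * F4_E2M1_MAX)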

Comment thread: torchao/prototype/gptq/api.py
jerryzh168 (Contributor) commented:

@claude review

claude bot commented Apr 21, 2026

Claude finished @jerryzh168's task in 1m 35s.


Code Review: PR #4302 — hook up gptq prototype to nvfp4

Overall this is a well-structured integration. The approach of pre-computing the global scale before the GPTQ loop and freezing block-wise scales per group is sound and follows the NVFP4 two-level scaling design correctly. A few items to address:

Bugs

  1. raise AssertionError("unsupported") used for input validation: torchao/prototype/gptq/api.py:329 and :446, plus test/prototype/gptq/test_gptqv2.py:595 all raise AssertionError("unsupported"). AssertionError is a valid Python exception, so this isn't a crash bug, but raising it directly is unusual when the surrounding pattern uses assert statements. Consider raise ValueError("unsupported config type: ...") instead for clearer error semantics, since these are input validation checks rather than logic assertions.

  2. Missing else clause in per-column quantize loop: At api.py:464-469, the elif isinstance(base_config, NVFP4DynamicActivationNVFP4WeightConfig) branch does not have a trailing else: raise ..., unlike the group qparams block at lines 445-446. This means if someone adds a new config type to GPTQConfig.base_config but forgets to update the inner loop, dq will be undefined and cause a confusing NameError. A sketch of the fix follows this list.
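
A minimal sketch of the fix for item 2 (the NVFP4 config class and _nvfp4_with_precalculated_scales_qdq are names from the PR; the other branch and helper names are hypothetical, for illustration only):

def qdq_dispatch(base_config, w_col):
    # every isinstance dispatch chain ends in an explicit raise, so dq
    # can never be silently left undefined when a new config type is added
    if isinstance(base_config, Int4WeightOnlyConfig):  # hypothetical branch
        dq = _int4_qdq(w_col)  # hypothetical helper
    elif isinstance(base_config, NVFP4DynamicActivationNVFP4WeightConfig):
        dq = _nvfp4_with_precalculated_scales_qdq(w_col)  # helper from the PR
    else:
        raise ValueError(f"unsupported config type: {type(base_config)}")
    return dq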

Design / Correctness Questions

  1. Agreeing with jerryzh168's review: assert use_dynamic_per_tensor_scale. At api.py:321-327, the NVFP4 branch computes nvfp4_global_scale from the full weight tensor amax, which corresponds to the dynamic per-tensor scale path. If use_dynamic_per_tensor_scale=False (static quantization with pre-calibrated scales), this code path would compute the wrong scale. Adding an assertion like assert base_config.use_dynamic_per_tensor_scale would be a good safety guard here to prevent silent correctness issues when static scale support is added later (see the sketch after this list).

  2. _nvfp4_with_precalculated_scales_qdq operates on (N, 1) columns — no reshape to 3D — This function is called with w_t of shape (N, 1) during the per-column GPTQ loop (api.py:465-469). Unlike _nvfp4_with_precalculated_scales_q which reshapes to (N, K//16, 16), the _qdq variant works directly on 2D. This is correct since it's quantizing a single scalar per row, but worth a brief comment noting this intentional difference.

  3. nvfp4_quantize is called just to get scales, discarding data_lp — At api.py:434-443, the code calls the full nvfp4_quantize to obtain block scales, then immediately del _data_lp. The TODO is already noted. This is fine for a prototype but worth tracking since the full quantize involves packing operations that are wasted work.
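
A minimal sketch of the guard from item 1 (class and attribute names are from the review; the surrounding code is illustrative, not the PR's exact structure):

if isinstance(base_config, NVFP4DynamicActivationNVFP4WeightConfig):
    # the amax-derived global scale is only correct for the dynamic
    # per-tensor-scale path; fail loudly until static scales are wired up
    assert base_config.use_dynamic_per_tensor_scale, (
        "static per-tensor scales not yet supported in the GPTQ prototype"
    )
    nvfp4_global_scale = W_t.abs().amax().float() / (448.0 * 6.0)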

Code Duplication

  1. Shared logic between _qdq and _q functions: _nvfp4_with_precalculated_scales_qdq (api.py:206-228) and _nvfp4_with_precalculated_scales_q (api.py:231-256) share the scaling computation (reciprocal_scale = (1.0 / per_tensor_scale) / scaled_block_scales_fp32), and both duplicate logic from nvfp4_quantize in nvfp4_tensor.py. The TODO at line 243 already calls this out; just flagging that extracting a shared _compute_nvfp4_reciprocal_scale helper (sketched below) could reduce this to a one-liner when the cleanup happens.
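
A minimal sketch of the extracted helper (the name follows the review's suggestion and does not exist in the PR):

import torch

def _compute_nvfp4_reciprocal_scale(
    per_tensor_scale: torch.Tensor,
    scaled_block_scales_fp32: torch.Tensor,
) -> torch.Tensor:
    # the shared computation quoted above, factored into one place so
    # the _qdq and _q paths cannot drift apart
    return (1.0 / per_tensor_scale) / scaled_block_scales_fp32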

Test Coverage

  1. SQNR threshold for NVFP4 (>15) is notably lower than int4 (>25) and int8 (>30): test_gptqv2.py:593. This is expected given fp4's lower precision, but worth a brief inline comment explaining why the bar is lower, so future readers don't wonder if it's a regression (see the metric sketch after this list).

  2. Test copy.deepcopy(model) result unused: test_gptqv2.py:560 creates a deep copy that's immediately discarded. This appears to be pre-existing in the test, not introduced by this PR, but it's worth noting.
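
For reference, a minimal sketch of the kind of metric these thresholds gate (standard SQNR definition; the test's exact helper may differ):

import torch

def sqnr_db(ref: torch.Tensor, approx: torch.Tensor) -> torch.Tensor:
    # signal-to-quantization-noise ratio in decibels; higher is better,
    # and fp4's coarser grid naturally lands lower than int4 or int8
    noise = ref - approx
    return 10 * torch.log10(ref.pow(2).mean() / noise.pow(2).mean())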

Minor

  1. combined_scale transpose dance (api.py:509-511): torch.cat(group_qparams, dim=0).reshape(K // group_size, N).t().contiguous(). The TODO explains this. The current approach works but is fragile if group_qparams ordering changes. The suggested fix of pre-allocating scales and writing directly is a good one (sketched after this list).

  2. Imports are unconditional — The nvfp4 imports at api.py:22-40 are always imported, even when NVFP4 isn't being used. If mx_formats has heavy dependencies that could fail on systems without SM >= 10.0, consider lazy imports (though this is minor for prototype code).
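
A minimal sketch of the pre-allocation approach from item 1 (the per-group scale computation here is a placeholder, not the PR's actual qparams math):

import torch

def collect_group_scales(W_t: torch.Tensor, group_size: int) -> torch.Tensor:
    # pre-allocate the (N, K // group_size) scale matrix and write each
    # group's column directly, avoiding the fragile
    # cat + reshape + t() + contiguous() dance after the loop
    N, K = W_t.shape
    scales = torch.empty(N, K // group_size, dtype=torch.float32, device=W_t.device)
    for g in range(K // group_size):
        group = W_t[:, g * group_size : (g + 1) * group_size]
        scales[:, g] = group.abs().amax(dim=-1)  # placeholder scale computation
    return scales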


Good work overall. The core GPTQ integration logic follows the algorithm correctly with the pre-computed global scale approach. The main actionable items are the missing else clause (Bugs item 2) and the use_dynamic_per_tensor_scale assertion (Design item 1, per jerryzh168's review).

return torch.cat(dequant_chunks, dim=-1)


def _nvfp4_with_precalculated_scales_qdq(
Contributor:

should this share code with _nvfp4_with_precalculated_scales_q so it's less error-prone? They should use the same quantization code, right?

Contributor Author (vkuzo):

the mathematical code is pretty simple and the shape operations are different (the qdq input is always shaped [N, 1] and the q input is always shaped [N, k_slice], which requires different broadcasting behavior); I think it's ok to keep them separate for simplicity
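
A minimal sketch of the shape difference described above (the round-based quantize-dequantize here is a placeholder for the actual fp4 math):

import torch

N, k_slice, block = 8, 32, 16

# qdq path: input is always [N, 1]; one pre-selected block scale per
# row broadcasts against the single column directly
w_col = torch.randn(N, 1)
col_scales = torch.rand(N, 1) + 0.1
dq_col = torch.round(w_col / col_scales) * col_scales

# q path: input is always [N, k_slice]; scales are per 16-element
# block, so the slice is reshaped to 3D before broadcasting
w_slice = torch.randn(N, k_slice)
block_scales = torch.rand(N, k_slice // block) + 0.1
w_3d = w_slice.reshape(N, k_slice // block, block)
dq_slice = (torch.round(w_3d / block_scales.unsqueeze(-1))
            * block_scales.unsqueeze(-1)).reshape(N, k_slice)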

Contributor:

Sounds OK to keep them separate for readability; my main worry was about the consistency of the two implementations. Is there a test to make sure these two code paths use the same quantization code?

Contributor Author (vkuzo):

I agree ^ is useful; I'm just punting it until later. There is a TODO in the code to track it. The numerical tests we already have would also capture any divergence indirectly.

jerryzh168 (Contributor) left a comment:

LGTM, please see comments inline

vkuzo merged commit b444bd0 into main, Apr 22, 2026
54 of 57 checks passed