
Python bindings for block quantization #5591

Merged
protonu merged 11 commits into main from pbasu_bq_py
Dec 2, 2025

Conversation

@protonu
Collaborator

@protonu protonu commented Nov 25, 2025

This adds Python bindings for the block quantization operator.
We add Python tests that quantize to nvfp4 and validate against the rowwise 1D quantization in Transformer Engine.

  • To improve precision, the numerics in the runtime function were modified and the clamping functions removed.
  • We move the code that swizzles the block scales into a utility function.
  • We copied some code from TE to dequantize nvfp4 to fp32. This is only used in test validation.
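For intuition, dequantizing nvfp4 follows the usual e2m1 decode-then-scale pattern. The sketch below is a hedged illustration, not the helper copied from TE; the function name and layout assumptions (unpacked 4-bit codes stored one per uint8, one scale per contiguous block) are mine.

```python
import numpy as np

# FP4 (e2m1) magnitude code points; the fourth bit of a code is the sign.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def dequantize_e2m1(codes, block_scales, block_size=16):
    """Decode unpacked 4-bit e2m1 codes and apply per-block scales."""
    sign = np.where(codes & 0x8, -1.0, 1.0)   # high bit of the nibble is sign
    mag = E2M1_VALUES[codes & 0x7]            # low 3 bits index the magnitude
    vals = (sign * mag).reshape(-1, block_size)
    return (vals * block_scales.reshape(-1, 1)).ravel()
```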

@protonu
Collaborator Author

protonu commented Nov 25, 2025

!test

@github-actions

github-actions bot commented Nov 25, 2025

Review updated until commit 9181f95

Description

  • Add Python bindings for block quantization operator with nv_block_quantize function

  • Implement swizzle block scales functionality as utility function for improved precision

  • Add validation to ensure swizzled scales when block scaling factor has allocation domain

  • Update CUDA kernel with improved numerics (multiplication instead of division, removed clamping)

  • Add comprehensive Python tests comparing against Transformer Engine NVFP4 quantization

Changes walkthrough

Relevant files
Enhancement
10 files
ops.cpp
Add Python binding for block quantization operation           
+42/-0   
arith.cpp
Update blockQuantize function with swizzle parameter         
+8/-1     
utils.cpp
Implement swizzleBlockScales utility function                       
+24/-0   
internal_nodes.cpp
Add swizzled_scales parameter to BlockQuantizationOp         
+3/-1     
block_quantization_kernels.cu
Update CUDA kernel with improved numerics                               
+11/-15 
python_translate.cpp
Add BlockQuantizationOp translation to Python                       
+25/-0   
narrow_precision.py
Add FP4 dequantization and swizzling utility functions     
+48/-0   
internal_nodes.h
Update BlockQuantizationOp header with swizzled_scales     
+6/-1     
utils.h
Add swizzleBlockScales function declaration                           
+5/-0     
arith.h
Update blockQuantize function signature                                   
+1/-0     
Bug fix
1 file
validation.cpp
Add validation for swizzled scales requirement                     
+5/-0     
Tests
2 files
test_narrow_precision.py
Add Python tests for block quantization vs Transformer Engine
+95/-0   
test_low_precision_recipe.cpp
Update C++ tests to use swizzle utility function                 
+79/-57 

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Restrictive validation logic

The new validation check requires swizzled scales whenever a block scaling factor has an allocation domain. This might be overly restrictive and could prevent valid use cases where users want allocation domains without swizzling. Consider if this constraint is necessary or if it should be made more flexible.

NVF_ERROR_EQ(
    bqop->isSwizzledScales(),
    true,
    "Block scaling factor with allocation domain requires swizzled "
    "scales.");
Rigid domain dimension check

The swizzleBlockScales function enforces a strict 2D loop domain requirement. This limitation might exclude valid tensor configurations that could benefit from block scaling swizzling. Consider making this validation more flexible to handle different tensor dimensionalities.

NVF_ERROR(
    tv && tv->getLoopDomain().size() == 2,
    "we can only swizzle 2D block scales tvs");
Incomplete error handling in TE comparison test

The test_nv_block_quantization_vs_te function catches exceptions during TE quantization but only prints an error message and returns None. The test continues execution and will likely fail with a cryptic error later. Consider adding explicit pytest.skip or pytest.xfail for cases where TE is not available, and ensure the test provides clear feedback about missing dependencies.

except Exception as e:
    print(f"\nError during quantization: {e}")
    import traceback

    traceback.print_exc()
    print("NOTE: This requires an NVIDIA Blackwell GPU and TE >= 1.6.")
    return None
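One way to implement the suggestion above is to skip eagerly with a clear message. This is a hedged sketch; the helper name is hypothetical and not from the PR, and pytest is imported lazily so the sketch still raises a clear error even where pytest is unavailable.

```python
def quantize_with_te(x):
    """Fail fast with an explicit skip instead of returning None and
    letting the test die later with a cryptic error."""
    try:
        import transformer_engine.pytorch  # noqa: F401
    except ImportError as err:
        import pytest  # imported lazily; only needed on the skip path
        pytest.skip(f"Transformer Engine unavailable: {err}")
    # ... actual TE quantization would go here; omitted in this sketch.
    raise NotImplementedError("sketch only")
```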

@protonu protonu requested a review from jjsjann123 November 25, 2025 21:47
@protonu protonu marked this pull request as ready for review November 25, 2025 21:47
@greptile-apps
Contributor

greptile-apps bot commented Nov 25, 2025

Greptile Overview

Greptile Summary

This PR adds Python bindings for the block quantization operator, enabling Python users to quantize tensors to NVFP4 format with block-wise scaling. The implementation aligns with Transformer Engine's rowwise 1D quantization approach.

Key Changes:

  • Added fd.ops.nv_block_quantize() Python API that returns (quantized_tensor, block_scales)
  • Updated quantization numerics to match TE behavior: replaced division with multiplication, removed manual clamping (relying on FP8 conversion for clamping), changed formula from global_scale / scaled_max to multiplication-based approach
  • Extracted block scale swizzling logic into reusable ir_utils::swizzleBlockScales() utility function
  • Added swizzle_scales parameter throughout the stack (Python bindings → C++ API → IR nodes → runtime)
  • Implemented comprehensive Python tests validating against TE's NVFP4 quantization with <10% mismatch tolerance
  • Added validation ensuring block scales with allocation domain require swizzled_scales=true

Numerics Changes:
The runtime kernel now computes scaled_max = block_max * global_scale * (1/6) then converts to FP8 and back, computing the reciprocal as global_scale / scaled_max. This differs from the old approach which divided then clamped explicitly. The FP8 conversion now handles clamping implicitly.
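In plain Python, the data flow described above looks roughly like the sketch below. The FP8 (e4m3) round-trip is approximated by an identity, so this only illustrates the arithmetic, not bit-exact kernel output; all names are illustrative.

```python
import numpy as np

E2M1_MAX = 6.0  # largest magnitude representable in FP4 (e2m1)

def quantize_block(block, global_scale):
    block_max = np.max(np.abs(block))
    scaled_max = block_max * global_scale * (1.0 / E2M1_MAX)
    # Kernel: __float2e4m3 then __e4m32float; clamping now happens
    # implicitly in that conversion. Approximated by identity here.
    decoded_scale = scaled_max
    # Reciprocal brings values into the representable FP4 range.
    recip = global_scale / decoded_scale if decoded_scale != 0.0 else 0.0
    scaled_vals = block * recip  # now in [-6, 6], ready for __float2e2m1
    return scaled_vals, decoded_scale
```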

Confidence Score: 4/5

  • This PR is safe to merge with minor risk - the numerics changes align with Transformer Engine's proven approach and are well-tested
  • Score reflects well-structured implementation with proper testing, but numerics changes to remove clamping require careful validation in production. The division-by-zero concern raised in previous threads was confirmed as matching TE behavior (developer confirmed this copies TE). Code is well-organized with good separation of concerns (utility functions, proper parameter threading). Python bindings are correctly implemented with proper docstrings and return value handling.
  • Pay close attention to runtime/block_quantization_kernels.cu - the numerics changes remove explicit clamping and change the scaling formula, which could affect edge cases with all-zero blocks or extreme values

Important Files Changed

File Analysis

Filename Score Overview
python/python_direct/ops.cpp 4/5 Added Python bindings for nv_block_quantize, exposing block quantization to Python with proper parameter handling
runtime/block_quantization_kernels.cu 4/5 Updated quantization numerics to match Transformer Engine - removed clamping, changed division to multiplication, improved precision
csrc/ops/arith.cpp 5/5 Added swizzle_scales parameter support to blockQuantize function with proper delegation to utility function
csrc/ir/utils.cpp 5/5 Extracted block scale swizzling logic into reusable swizzleBlockScales utility function
tests/python/direct/test_narrow_precision.py 4/5 Added comprehensive test comparing nvfuser block quantization against Transformer Engine reference implementation

Sequence Diagram

sequenceDiagram
    participant User as Python User
    participant Binding as ops.cpp (Python Binding)
    participant API as arith.cpp (blockQuantize)
    participant Utils as ir_utils (swizzleBlockScales)
    participant IR as BlockQuantizationOp
    participant Runtime as block_quantization_kernels.cu
    
    User->>Binding: fd.ops.nv_block_quantize(input, global_scale, swizzle_scales, block_size)
    Binding->>API: blockQuantize(input, global_scale, block_size, swizzle_scales)
    API->>API: Create quantized_tensor and block_scales TensorViews
    
    alt swizzle_scales == true
        API->>Utils: swizzleBlockScales(block_scales)
        Utils->>Utils: Apply split/merge operations for 128x4 swizzle pattern
    end
    
    API->>IR: Create BlockQuantizationOp with swizzle flag
    IR->>Runtime: Execute block_quantize_to_nvfp4 kernel
    
    Runtime->>Runtime: Compute block_max across threads
    Runtime->>Runtime: scaled_max = block_max * global_scale * (1/6)
    Runtime->>Runtime: Convert to FP8: __float2e4m3(scaled_max)
    Runtime->>Runtime: Convert back to FP32: __e4m32float(clamped_max_fp8)
    Runtime->>Runtime: Compute reciprocal: global_scale / scaled_max
    Runtime->>Runtime: Scale values: vec_in[i] * scaled_max
    Runtime->>Runtime: Quantize to FP4: __float2e2m1(scaled_vals)
    
    Runtime-->>IR: Return quantized_tensor and block_scales
    IR-->>API: Return BlockQuantizationResults
    API-->>Binding: Return (quantized_tensor, block_scales)
    Binding-->>User: Return py::tuple(quantized_tensor, block_scales)
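The 128x4 swizzle step in the diagram can be pictured with a dense reshape/permute. The sketch below shows the tiled layout commonly used for NVFP4 block scales (128 rows split as 4x32); it is my reconstruction under those assumptions, while nvFuser's swizzleBlockScales expresses the same reordering via split/merge on the 2D loop domain.

```python
import numpy as np

def swizzle_block_scales(scales):
    """Reorder (M, K) block scales into a 128x4 tiled layout.

    Assumes M % 128 == 0 and K % 4 == 0, i.e. inputs already padded.
    """
    m, k = scales.shape
    assert m % 128 == 0 and k % 4 == 0, "sketch assumes padded inputs"
    # Split 128 rows into 4 groups of 32, then interleave with column tiles.
    t = scales.reshape(m // 128, 4, 32, k // 4, 4)
    return t.transpose(0, 3, 2, 1, 4).reshape(m, k)
```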

@greptile-apps greptile-apps bot left a comment

8 files reviewed, 1 comment


Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@greptile-apps greptile-apps bot left a comment

8 files reviewed, 1 comment



auto dtype = bqop->quantizedOutput()->as<TensorView>()->dtype();
// If the block scales tensor has an allocation, then it had
// to have been swizzled.
Collaborator

this might not be the case for sharded tensor though.

Collaborator

Having said that, I think we have some assumptions on the status of Fusion that we try to print out. Wondering if @rdspring1 has some input on this one.

Collaborator

What about adding a isSwizzled field to BlockQuantizationOp?

Collaborator Author

I was thinking about this as well.
One thing that worried me: would we need to ensure that the field and the allocation domain of the TV are always in sync? @jjsjann123, let me know what you think.

Collaborator

ensure that the field and allocation domain of the TV is always in sync?

If we have an isSwizzled field, I think we would only need it for replay. i.e. any further modification done during scheduling can be safely ignored.

I think the only messy bits are for sharded TVs, but we don't yet support sharding/scheduling in replay, so maybe leaving a comment on that would be fine for now.

@protonu protonu requested a review from rdspring1 November 26, 2025 02:09
Co-authored-by: jjsjann123 <jiej@nvidia.com>
@greptile-apps greptile-apps bot left a comment

8 files reviewed, no comments


@greptile-apps greptile-apps bot left a comment

8 files reviewed, 1 comment


@protonu protonu requested a review from jjsjann123 November 26, 2025 02:44
@protonu
Collaborator Author

protonu commented Nov 26, 2025

!test

@greptile-apps greptile-apps bot left a comment

8 files reviewed, no comments


@greptile-apps greptile-apps bot left a comment

13 files reviewed, no comments


@protonu
Collaborator Author

protonu commented Dec 1, 2025

!test

@greptile-apps greptile-apps bot left a comment

13 files reviewed, no comments


@protonu
Collaborator Author

protonu commented Dec 1, 2025

!test

@protonu
Collaborator Author

protonu commented Dec 1, 2025

!test

@greptile-apps greptile-apps bot left a comment

13 files reviewed, no comments


@protonu
Collaborator Author

protonu commented Dec 2, 2025

!test

@greptile-apps greptile-apps bot left a comment

13 files reviewed, no comments


@jjsjann123 jjsjann123 left a comment

Happy with the updated field on block quantized op. Stamping to unblock.

bqop->isSwizzledScales(),
true,
"Block scaling factor with allocation domain requires swizzled "
"scales.");
Collaborator

note: for multi-gpu, where allocation domain would be used for sharding, we might run into issues here. But we can fix that at a later point.

@protonu protonu merged commit 9bd8c34 into main Dec 2, 2025
62 checks passed
@protonu protonu deleted the pbasu_bq_py branch December 2, 2025 17:15
protonu added a commit that referenced this pull request Dec 2, 2025
Stacked on top of #5591.

This removes old validation code in favor of the new nvfp4 dequantization
that was added in the above-mentioned PR.
No tests needed.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: jjsjann123 <jiej@nvidia.com>
