Conversation
!test
Review updated until commit 9181f95

Description
| Category | Relevant files |
|---|---|
| Enhancement | 10 files |
| Bug fix | 1 file |
| Tests | |
PR Reviewer Guide
Here are some key observations to aid the review process:
- 🧪 PR contains tests
- ⚡ Recommended focus area for review: restrictive validation logic
Greptile Summary

This PR adds Python bindings for the block quantization operator, enabling Python users to quantize tensors to NVFP4 format with block-wise scaling. The implementation aligns with Transformer Engine's rowwise 1D quantization approach.

Confidence Score: 4/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as Python User
    participant Binding as ops.cpp (Python Binding)
    participant API as arith.cpp (blockQuantize)
    participant Utils as ir_utils (swizzleBlockScales)
    participant IR as BlockQuantizationOp
    participant Runtime as block_quantization_kernels.cu
    User->>Binding: fd.ops.nv_block_quantize(input, global_scale, swizzle_scales, block_size)
    Binding->>API: blockQuantize(input, global_scale, block_size, swizzle_scales)
    API->>API: Create quantized_tensor and block_scales TensorViews
    alt swizzle_scales == true
        API->>Utils: swizzleBlockScales(block_scales)
        Utils->>Utils: Apply split/merge operations for 128x4 swizzle pattern
    end
    API->>IR: Create BlockQuantizationOp with swizzle flag
    IR->>Runtime: Execute block_quantize_to_nvfp4 kernel
    Runtime->>Runtime: Compute block_max across threads
    Runtime->>Runtime: scaled_max = block_max * global_scale * (1/6)
    Runtime->>Runtime: Convert to FP8: __float2e4m3(scaled_max)
    Runtime->>Runtime: Convert back to FP32: __e4m32float(clamped_max_fp8)
    Runtime->>Runtime: Compute reciprocal: global_scale / scaled_max
    Runtime->>Runtime: Scale values: vec_in[i] * scaled_max
    Runtime->>Runtime: Quantize to FP4: __float2e2m1(scaled_vals)
    Runtime-->>IR: Return quantized_tensor and block_scales
    IR-->>API: Return BlockQuantizationResults
    API-->>Binding: Return (quantized_tensor, block_scales)
    Binding-->>User: Return py::tuple(quantized_tensor, block_scales)
```
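The kernel steps in the diagram can be sketched in plain Python. This is a simplified illustration of the scaling numerics only: the FP8 (e4m3) round-trip of the block scale and the actual FP4 (e2m1) rounding are omitted, values are merely clamped to the FP4 range, and `block_quantize_sketch` is a hypothetical name, not the kernel API.

```python
# Simplified sketch of the per-block NVFP4 scaling numerics from the
# sequence diagram. The FP8 round-trip of the block scale and the real
# FP4 rounding are skipped; values are only clamped to the FP4 range.

FP4_MAX = 6.0  # largest magnitude representable in e2m1 (FP4)

def block_quantize_sketch(values, global_scale, block_size=16):
    quantized, block_scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        block_max = max(abs(v) for v in block)
        # scaled_max = block_max * global_scale * (1/6)
        scaled_max = block_max * global_scale / FP4_MAX
        block_scales.append(scaled_max)
        # reciprocal: global_scale / scaled_max (in the kernel this
        # reciprocal is what multiplies the inputs before FP4 conversion)
        recip = global_scale / scaled_max if scaled_max else 0.0
        quantized.extend(
            max(-FP4_MAX, min(FP4_MAX, v * recip)) for v in block
        )
    return quantized, block_scales

q, s = block_quantize_sketch([0.5, -3.0, 1.5, 2.0], 1.0, block_size=4)
# the block maximum (|-3.0|) maps to the FP4 maximum magnitude of 6.0
```

Note how the block maximum always lands on the edge of the FP4 range, which is exactly why the factor 1/6 appears in the scale computation.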
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
```cpp
auto dtype = bqop->quantizedOutput()->as<TensorView>()->dtype();
// If the block scales tensor has an allocation, then it had
// to have been swizzled.
```
This might not be the case for a sharded tensor, though.
Having said that, I think we have some assumptions on the status of Fusion that we try to print out. Wondering if @rdspring1 has some input on this one.
What about adding an isSwizzled field to BlockQuantizationOp?
I was thinking about this as well.
One thing that worries me is whether we would need to ensure that the field and the allocation domain of the TV are always in sync. @jjsjann123, let me know what you think.
> ensure that the field and allocation domain of the TV is always in sync?
If we have an isSwizzled field, I think we would only need it for replay. i.e. any further modification done during scheduling can be safely ignored.
I think the only messy bits are for sharded TVs, but we don't yet support sharding/scheduling in replay, so maybe leaving a comment on that would be fine for now.
Co-authored-by: jjsjann123 <jiej@nvidia.com>
!test

!test

!test

!test

!test
jjsjann123 left a comment:
Happy with the updated field on block quantized op. Stamping to unblock.
```cpp
bqop->isSwizzledScales(),
true,
"Block scaling factor with allocation domain requires swizzled "
"scales.");
```
note: for multi-gpu, where allocation domain would be used for sharding, we might run into issues here. But we can fix that at a later point.
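For reference, the 128x4 swizzle of the block-scales tensor that this check guards can be sketched in plain Python. This assumes the common NVFP4 scale-factor tiling in which each 128-row by 4-column tile is linearized as [32][4][4] (offset `(r % 32) * 16 + (r // 32) * 4 + c` within a tile); the exact split/merge order used by `swizzleBlockScales` may differ, so treat the index arithmetic here as an assumption.

```python
# Hypothetical sketch of a 128x4 swizzle of a block-scales matrix.
# Assumes each 128x4 tile is linearized as [32 rows][4 row-groups][4 cols];
# the real split/merge order in swizzleBlockScales may differ.

def swizzle_128x4(scales):
    """Flatten an M x N scales matrix (M % 128 == 0, N % 4 == 0)
    into swizzled order."""
    m, n = len(scales), len(scales[0])
    assert m % 128 == 0 and n % 4 == 0
    out = []
    for rt in range(m // 128):          # row tiles of 128
        for ct in range(n // 4):        # column tiles of 4
            for r0 in range(32):        # r % 32
                for r1 in range(4):     # r // 32
                    for c in range(4):  # column within the tile
                        out.append(
                            scales[rt * 128 + r1 * 32 + r0][ct * 4 + c])
    return out

# 128x4 matrix whose entry encodes its position as row * 4 + col
mat = [[r * 4 + c for c in range(4)] for r in range(128)]
sw = swizzle_128x4(mat)
```

The point of the layout is that consecutive output entries interleave rows 0, 32, 64, and 96 of a tile, which matches what the tensor-core scale loads expect.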
Stacked on top of #5591. This removes old validation code in favor of the new nvfp4 dequantization added in the above-mentioned PR. No tests needed.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: jjsjann123 <jiej@nvidia.com>
This adds Python bindings for the block quantization operator.
We add Python tests for block quantization to nvfp4, comparing against the rowwise 1D quantization in Transformer Engine.
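Conceptually, such a test dequantizes the (quantized values, block scales) pair and compares the result against a reference. A minimal sketch of that inversion, keeping the block scale in FP32 and using illustrative names (not the actual test helpers):

```python
# Minimal sketch of nvfp4 dequantization: each quantized value is
# multiplied by its block's scale and divided by the global scale,
# inverting q = v * (global_scale / scaled_max). Names are illustrative.

def dequantize_sketch(quantized, block_scales, global_scale, block_size=16):
    return [
        q * block_scales[i // block_size] / global_scale
        for i, q in enumerate(quantized)
    ]

# Inverting the scaling for one block of four values with block scale 0.5:
x = dequantize_sketch([1.0, -6.0, 3.0, 4.0], [0.5], 1.0, block_size=4)
# x == [0.5, -3.0, 1.5, 2.0]
```

In a real comparison against Transformer Engine, equality would be checked up to the FP4 rounding error rather than exactly.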