Blockwise float8 quantizer and quantized tensor class by kwyss-nvidia · Pull Request #1513 · NVIDIA/TransformerEngine

kwyss-nvidia · 2025-02-27T00:28:04Z

Description

Adds pytorch and C++ quantizer and quantized tensor classes for a subchannel quantization scheme.

The classes are configurable for 128x128 blocksize and 1x128 blocksize via setting block_scaling_dim == 2,1 respectively.

Scale tensors are stored in a format emenable for matrix multiplication, however the integration of matmul is deferred as a separate story.

Fusions of quantization and DBIAS or activation functions are not yet implemented, and the dequantization is currently implemented in torch.

Tests for quantization are included in C++ and pytorch layers, with exact comparison to reference quantizer behavior as well as an attempt to hit interesting branches through the API such as tensor creation in pytorch and CPP and dequantization of row and columnwise usage.

Two CUDA kernels for quantization are included.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Pytorch/C++ Quantizer class
Pytorch/C++ Quantized Tensor class
Quantization CUDA kernels for 1x128 and 128x128 block size.
C++ testing of nvte_quantize API
python testing of quantization via tex.quantize

Checklist that can arguably can be deferred for a future MR:

Tasks that have a dependency on a GEMM and are not included.

GEMM implementation in general_gemm
Recipe Setup
Layer-wise numerical testing
Distributed numerical testing

GEMM integration is separate MR at #1545

Test Instructions

Python tests:

pytest tests/pytorch/test_float8blockwisetensor.py
pytest tests/pytorch/test_float8_blockwise_scaling_exact.py

C++ tests:

TE_PATH=<where_is_TE>/ bash qa/L0_cppunittest/test.sh
# Wait for the build to complete.
# To run specific tests
./tests/cpp/build/operator/test_operator --gtest_filter='*FusedCastFloat8*wiseTestSuite*'

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

zhongbozhu · 2025-02-27T00:36:35Z

Great to see this PR!

Can you leave some description about how to run your unit tests? Thank you.

transformer_engine/pytorch/tensor/_internal/mxfp8_tensor_base.py

transformer_engine/common/include/transformer_engine/transformer_engine.h

transformer_engine/common/transpose/cast_transpose.h

transformer_engine/common/transpose/compute_scale.cuh

transformer_engine/common/transpose/quantize_transpose_square_blockwise.cu

transformer_engine/common/transpose/quantize_transpose_vector_blockwise.cu

transformer_engine/common/util/cast_kernels.cuh

kwyss-nvidia · 2025-03-15T00:31:57Z

Benchmark results:

With larger code size:

is_1d_kernel  return_transpose           shape   timing_us
0           True              True     (256, 1024)   10.150487
1           True              True    (4096, 3072)   30.890434
2           True              True    (4096, 4096)   44.860952
3           True              True    (4096, 5440)   70.505325
4           True              True   (16384, 1024)   44.764331
5           True              True   (16384, 3072)  125.236950
6           True              True   (16384, 6144)  245.317183
7           True              True  (16384, 12288)  486.603893
8           True              True  (16384, 24576)  977.890830
9           True             False     (256, 1024)    9.345088
10          True             False    (4096, 3072)   17.145933
11          True             False    (4096, 4096)   32.242464
12          True             False    (4096, 5440)   49.772345
13          True             False   (16384, 1024)   32.224652
14          True             False   (16384, 3072)   88.912155
15          True             False   (16384, 6144)  170.590762
16          True             False  (16384, 12288)  335.485549
17          True             False  (16384, 24576)  669.053740
18         False              True     (256, 1024)   10.150071
19         False              True    (4096, 3072)   28.667756
20         False              True    (4096, 4096)   42.826061
21         False              True    (4096, 5440)   56.505605
22         False              True   (16384, 1024)   43.093660
23         False              True   (16384, 3072)  120.820711
24         False              True   (16384, 6144)  238.482608
25         False              True  (16384, 12288)  473.993761
26         False              True  (16384, 24576)  953.123620
27         False             False     (256, 1024)    9.361475
28         False             False    (4096, 3072)   16.812136
29         False             False    (4096, 4096)   31.802621
30         False             False    (4096, 5440)   40.984200
31         False             False   (16384, 1024)   31.765506
32         False             False   (16384, 3072)   85.275423
33         False             False   (16384, 6144)  164.587606
34         False             False  (16384, 12288)  327.348712
35         False             False  (16384, 24576)  655.115865

With reduced code size:

is_1d_kernel  return_transpose           shape   timing_us
0           True              True     (256, 1024)   33.380988
1           True              True    (4096, 3072)   31.081486
2           True              True    (4096, 4096)   45.144120
3           True              True    (4096, 5440)   70.915230
4           True              True   (16384, 1024)   44.990452
5           True              True   (16384, 3072)  125.563130
6           True              True   (16384, 6144)  245.544268
7           True              True  (16384, 12288)  486.944105
8           True              True  (16384, 24576)  977.990200
9           True             False     (256, 1024)    9.466304
10          True             False    (4096, 3072)   16.890583
11          True             False    (4096, 4096)   32.243207
12          True             False    (4096, 5440)   52.081308
13          True             False   (16384, 1024)   32.251191
14          True             False   (16384, 3072)   88.969267
15          True             False   (16384, 6144)  170.647831
16          True             False  (16384, 12288)  335.637203
17          True             False  (16384, 24576)  669.034820
18         False              True     (256, 1024)   34.060038
19         False              True    (4096, 3072)   30.163735
20         False              True    (4096, 4096)   43.273994
21         False              True    (4096, 5440)   56.714637
22         False              True   (16384, 1024)   43.521530
23         False              True   (16384, 3072)  121.285914
24         False              True   (16384, 6144)  238.833705
25         False              True  (16384, 12288)  474.244582
26         False              True  (16384, 24576)  953.438990
27         False             False     (256, 1024)    9.508913
28         False             False    (4096, 3072)   17.661353
29         False             False    (4096, 4096)   33.111296
30         False             False    (4096, 5440)   41.143637
31         False             False   (16384, 1024)   33.102805
32         False             False   (16384, 3072)   85.891990
33         False             False   (16384, 6144)  165.198970
34         False             False  (16384, 12288)  326.038147
35         False             False  (16384, 24576)  654.584625

There are a few cold start differences, but most of the difference looks negligible to me.

kwyss-nvidia · 2025-03-19T22:48:21Z

Just rebased onto origin/main. An implementation of nvte_shape for these scaling modes was needed to work with upstream changes. The required compatibility is added to this MR.

tests/cpp/test_common.cu

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

kwyss-nvidia · 2025-04-03T20:46:19Z

/te-ci pytorch

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

transformer_engine/common/transformer_engine.cpp

transformer_engine/common/include/transformer_engine/cast.h

transformer_engine/pytorch/csrc/extensions/quantizer.cpp

tests/pytorch/references/blockwise_quantizer_reference.py

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

ptrendx · 2025-04-03T23:13:55Z

transformer_engine/pytorch/tensor/float8_blockwise_tensor.py

+        scale_shape = self.get_scale_shape(shape, columnwise=False)
+        scale_inv = torch.empty(
+            scale_shape,
+            dtype=torch.float32,
+            device=device,
+        )


I see that we pad the scales, good. Are we sure that the torch.empty is enough here or should we make sure that the padding is zeroed out? My concern is that, while TMA handles the boundary conditions for data (by zeroing out the output), if the GEMM does not apply the scale conditionally, you could still end up with the NaN (if the uninitialized memory in the scale turns out to be Inf and so you do Inf * 0). We should double check with cuBLAS.

I have checked with Roman and zero padding is not required.

ptrendx · 2025-04-03T23:14:09Z

transformer_engine/pytorch/tensor/float8_blockwise_tensor.py

+            columnwise_scale_inv = torch.empty(
+                columnwise_scale_shape,
+                dtype=torch.float32,
+                device=device,
+            )


Same comment as scale_inv.

Zero padding not required.

transformer_engine/pytorch/tensor/float8_blockwise_tensor.py

tests/pytorch/test_float8_blockwise_scaling_exact.py

tests/pytorch/references/ref_per_tensor_cs.py

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

kwyss-nvidia · 2025-04-04T01:31:55Z

/te-ci pytorch

timmoon10

LGTM, pending CI

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 · 2025-04-04T02:58:58Z

/te-ci L1

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 · 2025-04-04T04:51:28Z

/te-ci

* Blockwise float8 quantizer and quantized tensor class. The classes are configurable for 128x128 blocksize and 1x128 blocksize via setting block_scaling_dim == 2,1 respectively. Scale tensors are stored in a format emenable for matrix multiplication, however the integration of matmul is deferred as a separate story. Fusions of quantization and DBIAS or activation functions are not yet implemented, and the dequantization is currently implemented in torch. Tests for quantization are included in C++ and pytorch layers, with exact comparison to reference quantizer behavior as well as an attempt to hit interesting branches through the API such as tensor creation in pytorch and CPP and dequantization of row and columnwise usage. Two CUDA kernels for quantization are included, and are direct ports of equivalents in the kitchen repository, where a subchannel recipe has been used for end to end training. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Apply linting changes. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Alignment for 1D scaling for GEMM edge case. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * MR feedback. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Change API name. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Fix merge conflict with name change. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Use common tensor map API. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Change API to use two scaling mode enums. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Fix typo. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Update some call sites. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Tests for torch tensor API surface. Since the quantized tensor is a tensor subclass, these tests exercise torch hooks. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Reuse scale calculation between quantizer refs. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Save memory by dropping reference to saved tensors. Issues previously observed are solved. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Remove constexpr parameters from kernel. Code size is reduced with fewer constexpr params. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Merge conflict from rebase. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Add shape implementations for block scaling. nvte_shape was added upstream. Logic added for block scaled fp8. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Move benchmark to te_playground Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Remove amax_epsilon and pow_2_scales from tensor. Hardcodes the default values. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Lint changes. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Fixup MR changes that broke. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Safer ifdef in kernel. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Documentation prose. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Reuse compute_scale function from Current Scaling. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Bugfix on inf_value scale refactor. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Remove qopt calls from test. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Update pytest list. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Add copyright to reference scale calc. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Use ptx.cuh functions instead of cde. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Update shape logic with allocation and reuse shape. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Usage defaults MR feedback. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Copyright and header guard. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Updating torch dispatch code. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Fix exception type. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Use TypeInfo Signed-off-by: Keith Wyss <kwyss@nvidia.com> * MR feedback. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Update CS scale update test to use updated ref impl Signed-off-by: Tim Moon <tmoon@nvidia.com> * Update JAX scaling mode enum Signed-off-by: Tim Moon <tmoon@nvidia.com> * Skip tests on Lovelace Signed-off-by: Tim Moon <tmoon@nvidia.com> --------- Signed-off-by: Keith Wyss <kwyss@nvidia.com> Signed-off-by: Tim Moon <tmoon@nvidia.com> Co-authored-by: Tim Moon <tmoon@nvidia.com> Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Blockwise float8 quantizer and quantized tensor class. The classes are configurable for 128x128 blocksize and 1x128 blocksize via setting block_scaling_dim == 2,1 respectively. Scale tensors are stored in a format emenable for matrix multiplication, however the integration of matmul is deferred as a separate story. Fusions of quantization and DBIAS or activation functions are not yet implemented, and the dequantization is currently implemented in torch. Tests for quantization are included in C++ and pytorch layers, with exact comparison to reference quantizer behavior as well as an attempt to hit interesting branches through the API such as tensor creation in pytorch and CPP and dequantization of row and columnwise usage. Two CUDA kernels for quantization are included, and are direct ports of equivalents in the kitchen repository, where a subchannel recipe has been used for end to end training. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Apply linting changes. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Alignment for 1D scaling for GEMM edge case. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * MR feedback. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Change API name. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Fix merge conflict with name change. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Use common tensor map API. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Change API to use two scaling mode enums. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Fix typo. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Update some call sites. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Tests for torch tensor API surface. Since the quantized tensor is a tensor subclass, these tests exercise torch hooks. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Reuse scale calculation between quantizer refs. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Save memory by dropping reference to saved tensors. Issues previously observed are solved. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Remove constexpr parameters from kernel. Code size is reduced with fewer constexpr params. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Merge conflict from rebase. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Add shape implementations for block scaling. nvte_shape was added upstream. Logic added for block scaled fp8. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Move benchmark to te_playground Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Remove amax_epsilon and pow_2_scales from tensor. Hardcodes the default values. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Lint changes. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Fixup MR changes that broke. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Safer ifdef in kernel. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Documentation prose. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Reuse compute_scale function from Current Scaling. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Bugfix on inf_value scale refactor. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Remove qopt calls from test. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Update pytest list. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Add copyright to reference scale calc. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Use ptx.cuh functions instead of cde. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Update shape logic with allocation and reuse shape. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Usage defaults MR feedback. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Copyright and header guard. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Updating torch dispatch code. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Fix exception type. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Use TypeInfo Signed-off-by: Keith Wyss <kwyss@nvidia.com> * MR feedback. Signed-off-by: Keith Wyss <kwyss@nvidia.com> * Update CS scale update test to use updated ref impl Signed-off-by: Tim Moon <tmoon@nvidia.com> * Update JAX scaling mode enum Signed-off-by: Tim Moon <tmoon@nvidia.com> * Skip tests on Lovelace Signed-off-by: Tim Moon <tmoon@nvidia.com> --------- Signed-off-by: Keith Wyss <kwyss@nvidia.com> Signed-off-by: Tim Moon <tmoon@nvidia.com> Co-authored-by: Tim Moon <tmoon@nvidia.com> Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by: Peter Dykas <wdykas@nvidia.com>

ptrendx reviewed Feb 27, 2025

View reviewed changes

transformer_engine/pytorch/tensor/_internal/mxfp8_tensor_base.py Outdated Show resolved Hide resolved

kwyss-nvidia force-pushed the kwyss/subchannel_quantize_dequantize branch 4 times, most recently from df8c853 to 339b2a6 Compare March 6, 2025 19:21

kwyss-nvidia mentioned this pull request Mar 6, 2025

Subchannel Block quantized GEMM #1545

Merged

12 tasks

kwyss-nvidia force-pushed the kwyss/subchannel_quantize_dequantize branch from 339b2a6 to a2c9cbc Compare March 6, 2025 19:41