Blockwise float8 quantizer and quantized tensor class#1513

Merged
timmoon10 merged 40 commits into NVIDIA:main from kwyss-nvidia:kwyss/subchannel_quantize_dequantize on Apr 4, 2025
Conversation

@kwyss-nvidia
Collaborator

@kwyss-nvidia kwyss-nvidia commented Feb 27, 2025

Description

Adds pytorch and C++ quantizer and quantized tensor classes for a subchannel quantization scheme.

The classes are configurable for a 128x128 block size and a 1x128 block size by setting block_scaling_dim to 2 or 1, respectively.

Scale tensors are stored in a format amenable to matrix multiplication; however, the matmul integration is deferred to a separate story.
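To make the scheme concrete, here is a hedged pure-Python reference for the per-block scale computation. The function name, the list-based layout, and the e4m3 constant are illustrative only; the PR implements this with CUDA kernels operating on torch tensors.

```python
FP8_E4M3_MAX = 448.0  # largest representable magnitude in float8 e4m3

def block_scales(matrix, block_rows, block_cols):
    """Compute one scale per (block_rows x block_cols) tile.

    block_rows = block_cols = 128 corresponds to block_scaling_dim == 2;
    block_rows = 1, block_cols = 128 corresponds to block_scaling_dim == 1.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    scales = []
    for r0 in range(0, n_rows, block_rows):
        row = []
        for c0 in range(0, n_cols, block_cols):
            # amax over the tile, handling ragged edge blocks.
            amax = max(
                abs(matrix[r][c])
                for r in range(r0, min(r0 + block_rows, n_rows))
                for c in range(c0, min(c0 + block_cols, n_cols))
            )
            # The scale maps the block's amax onto the FP8 dynamic range.
            row.append(FP8_E4M3_MAX / amax if amax > 0 else 1.0)
        scales.append(row)
    return scales
```

With block_rows == block_cols covering the whole tensor, this degenerates to per-tensor current scaling, which is why the block sizes are the only configuration knobs.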

Fusions of quantization and DBIAS or activation functions are not yet implemented, and the dequantization is currently implemented in torch.
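Since dequantization currently lives on the torch side, the operation is conceptually simple: broadcast each stored scale_inv entry over its block and multiply. A minimal sketch with illustrative names (not the PR's actual API):

```python
def dequantize(qdata, scale_inv, block_rows, block_cols):
    """Recover approximate values from quantized data and per-block scale_inv.

    qdata holds the (already decoded-to-float) quantized values; scale_inv
    holds one inverse scale per (block_rows x block_cols) tile.
    """
    n_rows, n_cols = len(qdata), len(qdata[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            # Integer division maps each element to its owning block's scale.
            out[r][c] = qdata[r][c] * scale_inv[r // block_rows][c // block_cols]
    return out
```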

Tests for quantization are included at the C++ and pytorch layers, with exact comparison against reference quantizer behavior. The tests also attempt to hit interesting branches through the API, such as tensor creation in pytorch and C++ and dequantization with row-wise and columnwise usage.

Two CUDA kernels for quantization are included.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Pytorch/C++ Quantizer class
  • Pytorch/C++ Quantized Tensor class
  • Quantization CUDA kernels for 1x128 and 128x128 block size.
  • C++ testing of nvte_quantize API
  • python testing of quantization via tex.quantize
  • Basic Quantizer
    • 2D with tests
    • 1D with tests
    • CPP bitwise tests
    • Generalized shape coverage
  • Python Bitwise tests for Quantizer
  • Columnwise Test Coverage
    • Remove row-wise usage and check dequantize
  • Create Tensor in C++ test coverage
    • 1D
    • 2D

Checklist that can arguably be deferred to a future MR:

  •  Pytorch API Surface
    • get/set data
    • Operations other than quant/dequant
    • View/Reshape
  • Fused DBIAS/Activation
  • Dequantize in C++

Tasks that depend on a GEMM and are not included:

  • GEMM implementation in general_gemm
  • Recipe Setup
  • Layer-wise numerical testing
  • Distributed numerical testing

GEMM integration is a separate MR: #1545.

Test Instructions

Python tests:

pytest tests/pytorch/test_float8blockwisetensor.py
pytest tests/pytorch/test_float8_blockwise_scaling_exact.py

C++ tests:

TE_PATH=<where_is_TE>/ bash qa/L0_cppunittest/test.sh
# Wait for the build to complete.
# To run specific tests
./tests/cpp/build/operator/test_operator --gtest_filter='*FusedCastFloat8*wiseTestSuite*'

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@zhongbozhu
Collaborator

Great to see this PR!

Can you leave some description about how to run your unit tests? Thank you.

@kwyss-nvidia kwyss-nvidia force-pushed the kwyss/subchannel_quantize_dequantize branch 4 times, most recently from df8c853 to 339b2a6 on March 6, 2025 19:21
@kwyss-nvidia kwyss-nvidia mentioned this pull request Mar 6, 2025 (12 tasks)
@kwyss-nvidia kwyss-nvidia force-pushed the kwyss/subchannel_quantize_dequantize branch from 339b2a6 to a2c9cbc on March 6, 2025 19:41
@kwyss-nvidia kwyss-nvidia force-pushed the kwyss/subchannel_quantize_dequantize branch from a2c9cbc to 9710013 on March 10, 2025 23:08
@kwyss-nvidia kwyss-nvidia reopened this Mar 10, 2025
@kwyss-nvidia kwyss-nvidia force-pushed the kwyss/subchannel_quantize_dequantize branch 2 times, most recently from 36c600f to 1f24246 on March 11, 2025 01:33
@kwyss-nvidia kwyss-nvidia force-pushed the kwyss/subchannel_quantize_dequantize branch from 47b3dd5 to b4482a4 on March 12, 2025 21:36
@kwyss-nvidia
Collaborator Author

Benchmark results:

With larger code size:

is_1d_kernel  return_transpose           shape   timing_us
0           True              True     (256, 1024)   10.150487
1           True              True    (4096, 3072)   30.890434
2           True              True    (4096, 4096)   44.860952
3           True              True    (4096, 5440)   70.505325
4           True              True   (16384, 1024)   44.764331
5           True              True   (16384, 3072)  125.236950
6           True              True   (16384, 6144)  245.317183
7           True              True  (16384, 12288)  486.603893
8           True              True  (16384, 24576)  977.890830
9           True             False     (256, 1024)    9.345088
10          True             False    (4096, 3072)   17.145933
11          True             False    (4096, 4096)   32.242464
12          True             False    (4096, 5440)   49.772345
13          True             False   (16384, 1024)   32.224652
14          True             False   (16384, 3072)   88.912155
15          True             False   (16384, 6144)  170.590762
16          True             False  (16384, 12288)  335.485549
17          True             False  (16384, 24576)  669.053740
18         False              True     (256, 1024)   10.150071
19         False              True    (4096, 3072)   28.667756
20         False              True    (4096, 4096)   42.826061
21         False              True    (4096, 5440)   56.505605
22         False              True   (16384, 1024)   43.093660
23         False              True   (16384, 3072)  120.820711
24         False              True   (16384, 6144)  238.482608
25         False              True  (16384, 12288)  473.993761
26         False              True  (16384, 24576)  953.123620
27         False             False     (256, 1024)    9.361475
28         False             False    (4096, 3072)   16.812136
29         False             False    (4096, 4096)   31.802621
30         False             False    (4096, 5440)   40.984200
31         False             False   (16384, 1024)   31.765506
32         False             False   (16384, 3072)   85.275423
33         False             False   (16384, 6144)  164.587606
34         False             False  (16384, 12288)  327.348712
35         False             False  (16384, 24576)  655.115865

With reduced code size:

is_1d_kernel  return_transpose           shape   timing_us
0           True              True     (256, 1024)   33.380988
1           True              True    (4096, 3072)   31.081486
2           True              True    (4096, 4096)   45.144120
3           True              True    (4096, 5440)   70.915230
4           True              True   (16384, 1024)   44.990452
5           True              True   (16384, 3072)  125.563130
6           True              True   (16384, 6144)  245.544268
7           True              True  (16384, 12288)  486.944105
8           True              True  (16384, 24576)  977.990200
9           True             False     (256, 1024)    9.466304
10          True             False    (4096, 3072)   16.890583
11          True             False    (4096, 4096)   32.243207
12          True             False    (4096, 5440)   52.081308
13          True             False   (16384, 1024)   32.251191
14          True             False   (16384, 3072)   88.969267
15          True             False   (16384, 6144)  170.647831
16          True             False  (16384, 12288)  335.637203
17          True             False  (16384, 24576)  669.034820
18         False              True     (256, 1024)   34.060038
19         False              True    (4096, 3072)   30.163735
20         False              True    (4096, 4096)   43.273994
21         False              True    (4096, 5440)   56.714637
22         False              True   (16384, 1024)   43.521530
23         False              True   (16384, 3072)  121.285914
24         False              True   (16384, 6144)  238.833705
25         False              True  (16384, 12288)  474.244582
26         False              True  (16384, 24576)  953.438990
27         False             False     (256, 1024)    9.508913
28         False             False    (4096, 3072)   17.661353
29         False             False    (4096, 4096)   33.111296
30         False             False    (4096, 5440)   41.143637
31         False             False   (16384, 1024)   33.102805
32         False             False   (16384, 3072)   85.891990
33         False             False   (16384, 6144)  165.198970
34         False             False  (16384, 12288)  326.038147
35         False             False  (16384, 24576)  654.584625

There are a few cold start differences, but most of the difference looks negligible to me.
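The cold-start entries can be isolated with a warm-up pass before timing. A minimal sketch of the kind of timing loop behind the tables above; the quantize launch itself is stood in for by an arbitrary callable, and the function name is hypothetical:

```python
import time

def time_us(fn, iters=100):
    """Average wall-clock time of fn in microseconds over iters calls."""
    fn()  # warm-up call, so one-time (cold start) costs are excluded
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1e6
```

Note that timing real CUDA launches this way also requires a device synchronization before reading the clock, which is omitted here.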

@kwyss-nvidia kwyss-nvidia force-pushed the kwyss/subchannel_quantize_dequantize branch 3 times, most recently from 91a6721 to 909dac7 on March 19, 2025 22:41
@kwyss-nvidia
Collaborator Author

Just rebased onto origin/main. An implementation of nvte_shape for these scaling modes was needed to work with upstream changes. The required compatibility is added to this MR.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
@kwyss-nvidia kwyss-nvidia force-pushed the kwyss/subchannel_quantize_dequantize branch from 27c9188 to 18f19bb on April 3, 2025 20:38
@kwyss-nvidia
Collaborator Author

/te-ci pytorch

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Comment on lines +185 to +190
scale_shape = self.get_scale_shape(shape, columnwise=False)
scale_inv = torch.empty(
    scale_shape,
    dtype=torch.float32,
    device=device,
)
Member

I see that we pad the scales, good. Are we sure that the torch.empty is enough here or should we make sure that the padding is zeroed out? My concern is that, while TMA handles the boundary conditions for data (by zeroing out the output), if the GEMM does not apply the scale conditionally, you could still end up with the NaN (if the uninitialized memory in the scale turns out to be Inf and so you do Inf * 0). We should double check with cuBLAS.

Collaborator Author
I have checked with Roman and zero padding is not required.
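The reviewer's Inf * 0 concern can be illustrated in plain Python: if a padded scale entry were uninitialized and happened to decode as Inf, multiplying it against TMA's zero-filled out-of-bounds data would yield NaN. Since cuBLAS was confirmed not to read the padding, torch.empty (no zeroing) is sufficient here.

```python
import math

# TMA zero-fills out-of-bounds data; an uninitialized scale entry could
# in principle hold any bit pattern, including Inf.
padded_data = 0.0
garbage_scale = math.inf

# IEEE 754: 0.0 * Inf is NaN, which would poison the GEMM output if the
# scale were applied unconditionally to the padded region.
product = padded_data * garbage_scale
assert math.isnan(product)
```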

Comment on lines +200 to +204
columnwise_scale_inv = torch.empty(
    columnwise_scale_shape,
    dtype=torch.float32,
    device=device,
)
Member
Same comment as scale_inv.

Collaborator Author
Zero padding not required.

@timmoon10 timmoon10 self-requested a review April 3, 2025 23:23
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
@kwyss-nvidia
Collaborator Author

/te-ci pytorch

Collaborator

@timmoon10 timmoon10 left a comment

LGTM, pending CI

@timmoon10
Collaborator

/te-ci L1

timmoon10 and others added 3 commits April 4, 2025 04:36
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator

/te-ci

@timmoon10 timmoon10 merged commit 1bbeab1 into NVIDIA:main Apr 4, 2025
12 checks passed
lhb8125 pushed a commit to lhb8125/TransformerEngine that referenced this pull request Apr 8, 2025
* Blockwise float8 quantizer and quantized tensor class.

The classes are configurable for 128x128 blocksize
and 1x128 blocksize via setting block_scaling_dim == 2,1 respectively.

Scale tensors are stored in a format amenable to matrix multiplication;
however, the matmul integration is deferred to a separate story.

Fusions of quantization and DBIAS or activation functions are not yet
implemented, and the dequantization is currently implemented in torch.

Tests for quantization are included in C++ and pytorch layers, with
exact comparison to reference quantizer behavior as well as an attempt
to hit interesting branches through the API such as tensor creation
in pytorch and CPP and dequantization of row and columnwise usage.

Two CUDA kernels for quantization are included, and are direct ports
of equivalents in the kitchen repository, where a subchannel recipe
has been used for end to end training.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Apply linting changes.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Alignment for 1D scaling for GEMM edge case.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR feedback.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Change API name.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix merge conflict with name change.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use common tensor map API.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Change API to use two scaling mode enums.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix typo.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update some call sites.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Tests for torch tensor API surface.

Since the quantized tensor is a tensor
subclass, these tests exercise torch hooks.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reuse scale calculation between quantizer refs.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Save memory by dropping reference to saved tensors.

Issues previously observed are solved.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove constexpr parameters from kernel.

Code size is reduced with fewer constexpr params.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Merge conflict from rebase.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add shape implementations for block scaling.

nvte_shape was added upstream. Logic added
for block scaled fp8.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Move benchmark to te_playground

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove amax_epsilon and pow_2_scales from tensor.

Hardcodes the default values.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Lint changes.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fixup MR changes that broke.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Safer ifdef in kernel.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Documentation prose.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reuse compute_scale function from Current Scaling.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Bugfix on inf_value scale refactor.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove qopt calls from test.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update pytest list.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add copyright to reference scale calc.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use ptx.cuh functions instead of cde.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update shape logic with allocation and reuse shape.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Usage defaults MR feedback.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Copyright and header guard.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Updating torch dispatch code.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix exception type.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use TypeInfo

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR feedback.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update CS scale update test to use updated ref impl

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update JAX scaling mode enum

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Skip tests on Lovelace

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
wdykas pushed a commit to wdykas/TransformerEngine that referenced this pull request Apr 14, 2025
6 participants