Subchannel Block quantized GEMM by kwyss-nvidia · Pull Request #1545 · NVIDIA/TransformerEngine

kwyss-nvidia · 2025-03-06T19:28:22Z

Description

Integrates GEMM scaling modes for subchannel/block quantization.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

GEMM dispatch in generic_gemm for scaling modes of GEMMs.
Tests for GEMM numerics
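
For readers unfamiliar with subchannel scaling, the following is a minimal pure-Python sketch of what a block-scaled GEMM computes. It is not the code in this PR. The names `BLOCK_K`, `pow2_scale`, `quantize_1d_blocks`, and `block_scaled_gemm` are hypothetical, the block length is shrunk to 4 for readability, `round()` stands in for the FP8 cast, and only the case where both operands use 1D blocks along K is shown.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite float8 e4m3 value
BLOCK_K = 4           # hypothetical block length along K, kept small for illustration


def pow2_scale(amax):
    """Largest power-of-two s such that amax * s <= FP8_E4M3_MAX."""
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.floor(math.log2(FP8_E4M3_MAX / amax))


def quantize_1d_blocks(mat):
    """Quantize each 1 x BLOCK_K tile of a row-major [rows][K] matrix.

    Returns (q, scale_inv). round() is an assumption standing in for the FP8 cast.
    """
    q, scale_inv = [], []
    for row in mat:
        q_row, inv_row = [], []
        for k0 in range(0, len(row), BLOCK_K):
            tile = row[k0:k0 + BLOCK_K]
            s = pow2_scale(max(abs(v) for v in tile))
            q_row.extend(float(round(v * s)) for v in tile)
            inv_row.append(1.0 / s)  # one inverse scale per tile
        q.append(q_row)
        scale_inv.append(inv_row)
    return q, scale_inv


def block_scaled_gemm(qa, a_inv, qb, b_inv):
    """Compute y[i][j] = sum_k a[i][k] * b[j][k] from quantized operands.

    Each K block's partial sum is rescaled by the product of the two
    per-block inverse scales before it is accumulated.
    """
    y = [[0.0] * len(qb) for _ in qa]
    for i, (a_row, a_s) in enumerate(zip(qa, a_inv)):
        for j, (b_row, b_s) in enumerate(zip(qb, b_inv)):
            for blk in range(len(a_s)):
                k0 = blk * BLOCK_K
                partial = sum(
                    x * w
                    for x, w in zip(a_row[k0:k0 + BLOCK_K], b_row[k0:k0 + BLOCK_K])
                )
                y[i][j] += partial * a_s[blk] * b_s[blk]
    return y
```

Because the scales in this sketch are powers of two, inputs that are themselves small dyadic values survive quantization exactly, which is the kind of property a zero-tolerance numerics test can rely on.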

Previous bias tests were flaky due to a known issue in upstream cuBLAS. Tested with zero tolerance against a recent build.

Would like to enable BGRADB.

Depends on quantization changes in related MR: #1513

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

kwyss-nvidia · 2025-03-12T00:41:10Z

@ptrendx here is a mirror of the review with only the GEMM-related changes in scope: kwyss-nvidia#1

GEMM test cases included in pytorch integration. Signed-off-by: Keith Wyss <kwyss@nvidia.com>

kwyss-nvidia · 2025-04-04T18:03:27Z

/te-ci

transformer_engine/common/gemm/cublaslt_gemm.cu

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

kwyss-nvidia · 2025-04-04T21:46:28Z

/te-ci

transformer_engine/common/gemm/cublaslt_gemm.cu

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

tests/pytorch/test_float8_blockwise_gemm_exact.py

Configure A and B matrices separately. Have separate code path for each scaling mode. Signed-off-by: Tim Moon <tmoon@nvidia.com>

for more information, see https://pre-commit.ci

timmoon10 · 2025-04-06T01:16:06Z

/te-ci L1

kwyss-nvidia · 2025-04-07T16:49:58Z

Looking into diagnosing the CI test failures:

OperatorTest/CTDBiasDGeluTestSuite.TestCTDBiasDgelu/float32Xfloat8e5m2X256X65536 - A100 cppunittest
test_numerics.py test_comm_gemm_overlap.py - H100 pytorch distributed unittest

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

timmoon10 · 2025-04-07T21:05:07Z

/te-ci pytorch

timmoon10

LGTM

ptrendx · 2025-04-07T23:14:41Z

tests/pytorch/test_float8_blockwise_gemm_exact.py

+        torch.testing.assert_close(y, y_ref, atol=atol, rtol=rtol)
+
+
+def cublas_gemm_test_constraint_enforced(


What is the reason for this test? Maybe I'm reading this wrong but it seems to enforce that cuBLAS does not support some parameters - is this to raise awareness once cuBLAS actually starts supporting them?

kwyss-nvidia: If we haven't verified the results of a branch, it seems better for that branch to return a descriptive error than to silently succeed with possibly bad data. This test checks that the GEMM API returns an error for the cases it shouldn't be called with.
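
The pattern described in this reply can be sketched as follows. This is an illustration only: `block_scaled_gemm_stub`, `UnsupportedGemmConfig`, and `gemm_constraint_enforced` are hypothetical names, and the stub stands in for the real GEMM entry point. The set of accepted epilogues is taken from the `NVTE_CHECK` quoted in this review.

```python
# Epilogues accepted by the NVTE_CHECK quoted in this review.
SUPPORTED_EPILOGUES = {
    "CUBLASLT_EPILOGUE_DEFAULT",
    "CUBLASLT_EPILOGUE_BIAS",
    "CUBLASLT_EPILOGUE_DGELU",
}


class UnsupportedGemmConfig(ValueError):
    """Hypothetical error type for configurations that were never verified."""


def block_scaled_gemm_stub(epilogue):
    """Hypothetical stand-in for the GEMM call: fail loudly on unverified paths."""
    if epilogue not in SUPPORTED_EPILOGUES:
        raise UnsupportedGemmConfig(
            f"Epilogue {epilogue} is outside the tested block-scaling functionality"
        )
    return "ok"


def gemm_constraint_enforced(epilogue):
    """Return True only if the call is rejected with a descriptive error."""
    try:
        block_scaled_gemm_stub(epilogue)
    except UnsupportedGemmConfig as err:
        return "Epilogue" in str(err)
    return False
```

A negative test then asserts `gemm_constraint_enforced(...)` for each configuration that should be rejected, so that a future change which starts accepting one of them has to update the test deliberately.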

transformer_engine/common/gemm/cublaslt_gemm.cu

ptrendx · 2025-04-07T23:26:19Z

transformer_engine/common/gemm/cublaslt_gemm.cu

+      (inputA->scaling_mode == NVTE_BLOCK_SCALING_2D)) {
+    NVTE_CHECK((epilogue == CUBLASLT_EPILOGUE_DEFAULT || epilogue == CUBLASLT_EPILOGUE_BIAS ||
+                epilogue == CUBLASLT_EPILOGUE_DGELU),
+               "Epilogue requested outside of the available and tested cuBLAS functionality for "


Is there an available but untested functionality :-)?

kwyss-nvidia: Not as far as I know (yet). ;)

* Add GEMM logic for blockwise quantized tensors. GEMM test cases included in pytorch integration. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update NVTE_BLOCK_SCALING for GEMM. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Gate feature on CUDA 12.9 Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Gemm typo. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Remove unecessary type converter change. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Reflect epilogue availability and test supported epilogues. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* GEMM simplifications from recipe branch. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Format py code. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update GEMM DGelu tests to match support depending on output dtype. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Force pow2Scales in GEMM Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Add GEMM test to pytorch test suite. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Add copyright to GEMM test. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update import for GEMM test. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Add license. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update test gemm supported predicate. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Use sgemm like interfaces and naming. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Rewrite GEMM comment. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* MR Feedback. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Refactor GEMM param canonicalization Configure A and B matrices separately. Have separate code path for each scaling mode. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Prune number of tests. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
---------
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch from 3d8547b to b40a601 Compare March 6, 2025 19:30

kwyss-nvidia mentioned this pull request Mar 6, 2025

Blockwise float8 quantizer and quantized tensor class #1513

Merged

34 tasks

kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch 8 times, most recently from 4560226 to 1194699 Compare March 11, 2025 20:28

This was referenced Mar 11, 2025

Blockwise scaling linear quantization recipe #1559

Merged

Mirror of GEMM MR from source repo to produce smaller diff kwyss-nvidia/TransformerEngine#1

Open

kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch 4 times, most recently from eee37bf to ce4ca80 Compare March 17, 2025 17:24

kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch from ce4ca80 to 5ebc93a Compare March 19, 2025 22:42

kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch 6 times, most recently from cd3e414 to f1e9e62 Compare April 4, 2025 01:17

kwyss-nvidia added 6 commits April 4, 2025 09:14

Add GEMM logic for blockwise quantized tensors.

fbcbcb0

GEMM test cases included in pytorch integration. Signed-off-by: Keith Wyss <kwyss@nvidia.com>

Update NVTE_BLOCK_SCALING for GEMM.

522ffbe

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

Gate feature on CUDA 12.9

d7e1fce

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

Gemm typo.

f212c81

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

Remove unecessary type converter change.

48b2d57

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

Reflect epilogue availability and test supported epilogues.

5761589

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

Add license.

7d5b5d9

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

zhongbozhu reviewed Apr 4, 2025

transformer_engine/common/gemm/cublaslt_gemm.cu (outdated, resolved)

Update test gemm supported predicate.

efdf8e0

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

timmoon10 reviewed Apr 4, 2025

timmoon10 self-requested a review April 4, 2025 22:15

kwyss-nvidia added 2 commits April 4, 2025 17:58

Use sgemm like interfaces and naming.

a9f209a

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

Rewrite GEMM comment.

861c870

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch from 32799ab to 861c870 Compare April 5, 2025 00:59

timmoon10 reviewed Apr 5, 2025

transformer_engine/common/gemm/cublaslt_gemm.cu (outdated, resolved)

transformer_engine/common/gemm/cublaslt_gemm.cu (outdated, resolved)

timmoon10 self-requested a review April 5, 2025 01:21

MR Feedback.

ada6438

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

timmoon10 reviewed Apr 5, 2025

tests/pytorch/test_float8_blockwise_gemm_exact.py (resolved)

timmoon10 self-requested a review April 5, 2025 04:02

timmoon10 and others added 3 commits April 6, 2025 01:11

Refactor GEMM param canonicalization

e484269

Configure A and B matrices separately. Have separate code path for each scaling mode. Signed-off-by: Tim Moon <tmoon@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

9f0707e

for more information, see https://pre-commit.ci

Merge branch 'main' into kwyss/cublas_gemm_github_mr

cf36b99

kwyss-nvidia and others added 2 commits April 7, 2025 13:20

Prune number of tests.

f3123cf

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

Merge branch 'main' into kwyss/cublas_gemm_github_mr

4e4c59e

timmoon10 approved these changes Apr 7, 2025

timmoon10 merged commit db2aaa9 into NVIDIA:main Apr 7, 2025
11 of 12 checks passed

ptrendx reviewed Apr 7, 2025

timmoon10 mentioned this pull request Apr 7, 2025

[PyTorch] Debug GEMM refactor #1652

Merged

13 tasks

huanghua1994 mentioned this pull request Apr 8, 2025

[JAX] grouped_gemm() uses variadic arguments #1658

Merged

13 tasks


4 participants
