
Enable reuse of dummy wgrad tensor#1651

Merged
ksivaman merged 5 commits into NVIDIA:main from vasunvidia:dummy_wgrads on Apr 8, 2025

Conversation

@vasunvidia (Collaborator)

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ksivaman (Member) left a comment


LGTM

ksivaman and others added 5 commits on April 7, 2025 at 17:57
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
@ksivaman (Member) commented Apr 7, 2025

/te-ci pytorch L0 L1

Comment on lines 691 to 701

```diff
 if getattr(weight, "zero_out_wgrad", False):
-    wgrad = torch.zeros(
-        weight.main_grad.shape,
-        dtype=weight.dtype,
-        device=torch.cuda.current_device(),
-        requires_grad=False,
+    wgrad = get_dummy_wgrad(
+        list(weight.main_grad.shape),
+        weight.dtype,
+        zero=True,
     )
 else:
-    wgrad = torch.empty(
-        weight.main_grad.shape,
-        dtype=weight.dtype,
-        device=torch.cuda.current_device(),
-        requires_grad=False,
+    wgrad = get_dummy_wgrad(
+        list(weight.main_grad.shape),
+        weight.dtype,
     )
```
(Collaborator)

We could clean this up:

Suggested change

```diff
-if getattr(weight, "zero_out_wgrad", False):
-    wgrad = get_dummy_wgrad(
-        list(weight.main_grad.shape),
-        weight.dtype,
-        zero=True,
-    )
-else:
-    wgrad = get_dummy_wgrad(
-        list(weight.main_grad.shape),
-        weight.dtype,
-    )
+wgrad = get_dummy_wgrad(
+    list(weight.main_grad.shape),
+    weight.dtype,
+    zero=getattr(weight, "zero_out_wgrad", False),
+)
```

We could do a similar change in LayerNormLinear.
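For orientation, here is a minimal sketch of what a reusable dummy-wgrad helper along these lines could look like. This is an illustrative assumption, not the code merged in the PR: the module-level cache dict, the cache key, and the `device` parameter are invented for the example. The idea is that a single placeholder buffer per (shape, dtype) is handed out repeatedly instead of allocating a fresh throwaway tensor on every backward pass.

```python
import torch

# Hypothetical module-level cache; the actual implementation may differ.
_dummy_wgrad_cache: dict = {}


def get_dummy_wgrad(
    shape: list, dtype: torch.dtype, zero: bool = False, device: str = "cuda"
) -> torch.Tensor:
    """Return a shared placeholder wgrad buffer, reused across calls."""
    key = (tuple(shape), dtype, device)
    if key not in _dummy_wgrad_cache:
        _dummy_wgrad_cache[key] = torch.empty(
            shape, dtype=dtype, device=device, requires_grad=False
        )
    wgrad = _dummy_wgrad_cache[key]
    if zero:
        # The buffer is shared, so callers that expect zeros (zero_out_wgrad)
        # must re-zero it: a previous user may have written into it.
        wgrad.zero_()
    return wgrad
```

Because the buffer is shared, any caller that reads its contents must request `zero=True`; callers that only need a correctly-shaped placeholder can skip the memset.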

```python
    return _multi_stream_cublas_workspace


def get_dummy_wgrad(shape: list, dtype: torch.dtype, zero=False) -> torch.Tensor:
```
(Collaborator)

This could be simplified with lru_cache.
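A hedged sketch of the reviewer's `lru_cache` idea; this is not the code in the PR, and the `device` parameter is an assumption added so the example is self-contained. Note that `lru_cache` requires hashable arguments, so the list shape must be converted to a tuple before hitting the cached function:

```python
from functools import lru_cache

import torch


@lru_cache(maxsize=None)
def _cached_dummy_wgrad(shape: tuple, dtype: torch.dtype, device: str) -> torch.Tensor:
    # One buffer per (shape, dtype, device); cached for the process lifetime.
    return torch.empty(shape, dtype=dtype, device=device, requires_grad=False)


def get_dummy_wgrad(
    shape: list, dtype: torch.dtype, zero: bool = False, device: str = "cuda"
) -> torch.Tensor:
    wgrad = _cached_dummy_wgrad(tuple(shape), dtype, device)
    if zero:
        # The cached buffer is shared, so it must be re-zeroed on each request.
        wgrad.zero_()
    return wgrad
```

One caveat with this approach: an unbounded `lru_cache` pins every distinct (shape, dtype) buffer for the lifetime of the process, which is usually fine for a fixed set of layer shapes but worth keeping in mind.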

```diff
 send_dst = cp_global_ranks[(rank + 1) % cp_size * cp_size_a2a + rank_a2a]
 recv_src = cp_global_ranks[(rank - 1) % cp_size * cp_size_a2a + rank_a2a]
-batch_p2p_comm = int(os.getenv("NVTE_BATCH_MHA_P2P_COMM", "0")) or (cp_size == 2)
+batch_p2p_comm = int(os.getenv("NVTE_BATCH_MHA_P2P_COMM", "0"))
```
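To make the behavioral difference in that one-line change concrete, here is an illustrative before/after sketch (the helper function names are invented; only the expressions come from the diff). Before the change, batched P2P communication was forced on whenever `cp_size == 2` regardless of the environment variable; after it, the feature is purely opt-in via `NVTE_BATCH_MHA_P2P_COMM`:

```python
import os


def batch_p2p_before(cp_size: int) -> bool:
    # Old behavior: auto-enabled for context-parallel size 2.
    return bool(int(os.getenv("NVTE_BATCH_MHA_P2P_COMM", "0"))) or (cp_size == 2)


def batch_p2p_after(cp_size: int) -> bool:
    # New behavior: controlled solely by the environment variable.
    return bool(int(os.getenv("NVTE_BATCH_MHA_P2P_COMM", "0")))
```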
@timmoon10 (Collaborator) commented Apr 7, 2025

What's the motivation for this test change? It seems orthogonal to the functional changes.

@ksivaman ksivaman merged commit ba5dc5d into NVIDIA:main Apr 8, 2025
11 of 12 checks passed
wdykas pushed a commit to wdykas/TransformerEngine that referenced this pull request Apr 14, 2025
* Use dummy wgrads for lower memory consumption

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Bug fix to avoid sharing gradients.

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Disable automatic use of batch_p2p_comm for CP2

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Change weight to origin_weight for LN_LINEAR

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>
ptrendx pushed a commit that referenced this pull request May 1, 2025
(same commit message as above)
3 participants