[PyTorch] Integration test for Megatron-LM #1329

Merged

timmoon10 merged 7 commits into NVIDIA:main from timmoon10:debug-mcore-norm on Nov 21, 2024

Conversation

@timmoon10 (Collaborator)

Description

#1033 broke Megatron-LM's wrappers for the LayerNorm and RMSNorm modules.

To help detect these issues in the future, I've also added an integration test that runs Megatron-LM to train a very small GPT model.
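
For a sense of what such a test exercises, a heavily simplified launch might look like the sketch below. This is not the contents of the actual test; every path and flag is an assumption about Megatron-LM's pretrain_gpt.py interface, and the tokenizer and optimizer arguments a real run requires are omitted.

# Illustrative sketch only, assuming Megatron-LM's pretrain_gpt.py
# entry point; flags and values are assumptions, not the real test.
import subprocess

subprocess.run(
    [
        "torchrun", "--nproc_per_node=1", "pretrain_gpt.py",
        "--num-layers", "2",
        "--hidden-size", "128",
        "--num-attention-heads", "8",
        "--seq-length", "128",
        "--max-position-embeddings", "128",
        "--micro-batch-size", "2",
        "--train-iters", "10",
        "--transformer-impl", "transformer_engine",  # exercise TE modules
        "--mock-data",  # synthetic dataset, assumed flag
    ],
    check=True,  # surface any crash (e.g. broken norm wrappers) as a failure
)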

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor
  • Testing

Changes

  • Handle deprecated hidden_size arg in LayerNorm and RMSNorm modules (see the sketch after this list)
  • Allow LayerNorm and RMSNorm operations to be initialized on CPU
  • Add Megatron-LM integration test
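
To make the first change concrete, a deprecation shim of roughly this shape could map the old Megatron-style keyword onto the module's current argument. The helper and argument names below are hypothetical, not Transformer Engine's actual code.

import warnings

def _resolve_hidden_size(normalized_shape=None, hidden_size=None):
    # Hypothetical helper: accept the deprecated Megatron-style
    # `hidden_size` kwarg and map it onto `normalized_shape`.
    if hidden_size is not None:
        warnings.warn(
            "`hidden_size` is deprecated, use `normalized_shape`",
            DeprecationWarning,
        )
        if normalized_shape is not None:
            raise ValueError(
                "Got both `normalized_shape` and deprecated `hidden_size`"
            )
        normalized_shape = hidden_size
    if normalized_shape is None:
        raise ValueError("A hidden dimension must be provided")
    return normalized_shape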

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Tim Moon <tmoon@nvidia.com>
timmoon10 added the "bug" label on Nov 13, 2024
timmoon10 requested a review from ksivaman on Nov 13, 2024
timmoon10 mentioned this pull request on Nov 13, 2024
Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 (Collaborator, Author)

Pipeline 20338114

@timmoon10 (Collaborator, Author) commented on Nov 15, 2024

Pipeline 20444324 is green

Comment on lines +138 to +142
if requires_grad != x.requires_grad:
    if requires_grad:
        x.requires_grad_()
    else:
        x = x.detach()
@timmoon10 (Collaborator, Author):

This fixes a te.Sequential bug that was exposed by Mcore. When running in eval mode, we want x.requires_grad=False so that the op knows it doesn't need to prepare for that grad. However, PyTorch complains if you change a tensor's requires_grad from True to False when the tensor is not a leaf in the autograd graph. Detaching the tensor works around this case.
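
A minimal standalone sketch of that PyTorch behavior (the tensor shape and values here are illustrative):

import torch

leaf = torch.randn(4, requires_grad=True)
x = 2 * leaf  # non-leaf: produced by an autograd op

# Flipping the flag in place fails on non-leaf tensors:
#   x.requires_grad_(False)
#   -> RuntimeError: you can only change requires_grad flags of leaf variables
# Detaching instead returns a tensor cut out of the autograd graph:
x = x.detach()
assert not x.requires_grad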

# Check tensor dims
weight = self.weight
weight_dims = tuple(weight.size())
input_dims = tuple(input_.size())
Collaborator:

Apparently torch.Size is a subclass of tuple, so the tuple conversion is probably not needed.
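
A quick standalone check of that claim:

import torch

size = torch.randn(2, 3).size()
assert isinstance(size, tuple)  # torch.Size subclasses tuple
assert size == (2, 3)           # and compares equal to a plain tuple
# so weight.size() can be used in dim checks without wrapping in tuple(...)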


# Check tensor dims
weight = self.weight
weight_dims = tuple(weight.size())
Collaborator:

no need to tupleize

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 merged commit 6b98768 into NVIDIA:main Nov 21, 2024
timmoon10 added a commit that referenced this pull request Nov 21, 2024
* Handle deprecated `hidden_size` arg in norm modules

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Support initializing norm ops on CPU

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add integration test for Megatron-LM

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Rename Mcore integration test

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Handle case in RMSNorm where hidden dim is not provided

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Labels: 1.13.0, bug