[PyTorch] Debug CUDA graph support with operation-based API by timmoon10 · Pull Request #1117 · NVIDIA/TransformerEngine

timmoon10 · 2024-08-16T01:53:27Z

Description

This PR debugs CUDA graph support with the operation-based API (see #707). The CUDA graph logic is similar to the module-based API.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Changes

Debug CUDA graph support with operation-based API
Refactor CUDA graph tests

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Tim Moon <tmoon@nvidia.com>

for more information, see https://pre-commit.ci

timmoon10 · 2024-08-16T01:56:21Z

/te-ci pytorch

transformer_engine/pytorch/graph.py

ptrendx · 2024-09-18T22:48:26Z

transformer_engine/pytorch/ops/op.py

+        if fp8_recipe is None:
+            fp8_recipe = FP8GlobalStateManager.get_fp8_recipe()
+        if fp8_recipe is None:
+            fp8_recipe = get_default_fp8_recipe()


Hmmmm, this second if looks like logic that should be inside get_fp8_recipe in the FP8GlobalStateManager.

Also, since this is an internal function, couldn't we just always ask for a valid recipe here and deal with getting it in the caller?

This case shouldn't happen in any of our current use-cases (FP8GlobalStateManager.get_fp8_recipe() is set within fp8_autocast, fp8_recipe is provided within make_graphed_callables), but it seems delicate to rely on that assumption.
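A minimal sketch of the lookup order discussed in this thread: an explicit recipe wins, then the global state, and the global getter itself falls back to a default so callers never need a second `None` check. The names `Recipe`, `GlobalState`, `get_default_recipe`, and `resolve_recipe` are hypothetical stand-ins written for illustration. They are not the Transformer Engine classes or functions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recipe:
    """Hypothetical stand-in for an FP8 recipe object."""

    name: str


def get_default_recipe() -> Recipe:
    """Hypothetical default recipe used when nothing else is configured."""
    return Recipe(name="default")


class GlobalState:
    """Hypothetical stand-in for a global FP8 state manager."""

    _recipe: Optional[Recipe] = None

    @classmethod
    def set_recipe(cls, recipe: Optional[Recipe]) -> None:
        cls._recipe = recipe

    @classmethod
    def get_recipe(cls) -> Recipe:
        # The fallback lives in the getter, so it always returns a valid recipe.
        if cls._recipe is None:
            return get_default_recipe()
        return cls._recipe


def resolve_recipe(recipe: Optional[Recipe] = None) -> Recipe:
    """Prefer an explicitly passed recipe, otherwise ask the global state."""
    if recipe is None:
        recipe = GlobalState.get_recipe()
    return recipe
```

With this shape, a caller that passes nothing outside of any configured context still receives a usable recipe, which avoids relying on the assumption that one of the two sources is always set.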

ptrendx · 2024-09-18T22:51:26Z

transformer_engine/pytorch/ops/op.py

            if curr_len == amax_history_len:
                continue
+
+            # Reallocate amax history


Could this be its own function?

I've tried to keep this logic similar to how it's handled in the modules:

TransformerEngine/transformer_engine/pytorch/module/base.py

Line 410 in 0ee5ccd

def adjust_amax_history_length(self, length: int, fwd: Optional[bool] = None) -> None:

I think it would be nice to consolidate this logic in fp8.py and reuse it for both modules and operations, but that's probably best done in a pure refactor PR.
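For readers unfamiliar with the reallocation step, here is a rough sketch of what a shared helper might look like. It uses plain Python lists as a stand-in for the amax history tensor (rows are history steps, columns are scaling factors). Growing appends zero rows, which matches the `pad=(0, 0, 0, amax_history_len - curr_len)` argument in the diff under the `torch.nn.functional.pad` convention. Shrinking by keeping the first rows is an assumption here, and `resize_amax_history` is a hypothetical name rather than an existing function.

```python
from typing import List

# Rows are history steps, columns are scaling factors.
History = List[List[float]]


def resize_amax_history(history: History, new_len: int) -> History:
    """Return a history with exactly new_len rows.

    Growing appends zero rows at the end. Shrinking keeps the first
    new_len rows (an assumption made for this sketch).
    """
    if new_len < 0:
        raise ValueError("new_len must be non-negative")
    curr_len = len(history)
    if curr_len == new_len:
        # Nothing to do, so the existing buffer is reused as-is.
        return history
    if curr_len > new_len:
        # Copy the kept rows so the result does not alias the old buffer.
        return [row[:] for row in history[:new_len]]
    num_cols = len(history[0]) if history else 0
    padding = [[0.0] * num_cols for _ in range(new_len - curr_len)]
    return [row[:] for row in history] + padding
```

Note that any resize returns a new buffer, so anything else holding a reference to the old one has to be updated as well.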

ptrendx · 2024-09-18T22:52:00Z

transformer_engine/pytorch/ops/op.py

                        pad=(0, 0, 0, amax_history_len - curr_len),
                    )

+            # Update global buffers for amax reductions


This does not look like a graph-specific thing - was the lack of this in the previous code a bug?

Yep, if the amax history length changes then I don't expect amax reductions to be handled correctly.

Return default recipe from FP8GlobalStateManager.get_fp8_recipe if needed.
Expand error message when failing to load FP8 state after capturing CUDA graph.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 · 2024-09-20T03:16:47Z

/te-ci pytorch


timmoon10 · 2024-09-24T19:25:29Z

/te-ci pytorch

timmoon10 · 2024-10-02T01:47:48Z

/te-ci pytorch

timmoon10 · 2024-10-09T17:41:27Z

/te-ci pytorch


timmoon10 · 2024-11-05T00:56:06Z

/te-ci pytorch

timmoon10 · 2024-11-05T17:27:55Z

Merging with approval from @ptrendx and @ksivaman.

timmoon10 added 2 commits August 15, 2024 17:22

Debug CUDA graph support with operation-based API

d771ca5

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Refactoring CUDA graph tests

ade0c02

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 added the bug Something isn't working label Aug 16, 2024

timmoon10 requested a review from ksivaman August 16, 2024 01:53

[pre-commit.ci] auto fixes from pre-commit.com hooks

e5d40a6

for more information, see https://pre-commit.ci

timmoon10 marked this pull request as ready for review August 16, 2024 01:56

timmoon10 mentioned this pull request Sep 10, 2024

[WIP] [PyTorch] Proof-of-concept for using operation-based API in modules #1173

Closed

13 tasks

ptrendx reviewed Sep 18, 2024


transformer_engine/pytorch/graph.py (outdated)

ptrendx reviewed Sep 18, 2024


timmoon10 added 3 commits September 19, 2024 19:35

Merge branch 'main' into cuda-graph-ops

b1972cf

Review suggestions from @ptrendx

7d04de5

Return default recipe from FP8GlobalStateManager.get_fp8_recipe if needed.
Expand error message when failing to load FP8 state after capturing CUDA graph.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Avoid unnecessary recursion when saving/loading FP8 state

805abc1

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 requested a review from ptrendx September 20, 2024 03:16

timmoon10 and others added 3 commits September 24, 2024 12:07

Merge branch 'main' into cuda-graph-ops

69f66d0

Fix circular import

11e6b45

Signed-off-by: Tim Moon <tmoon@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

f612e10

for more information, see https://pre-commit.ci

Merge branch 'main' into cuda-graph-ops

bdaa5ed

Merge branch 'main' into cuda-graph-ops

3bf06b3

timmoon10 added a commit to timmoon10/TransformerEngine that referenced this pull request Oct 9, 2024

Rebase NVIDIA#1117

ca17ac2

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 and others added 2 commits October 18, 2024 16:32

Merge branch 'main' into cuda-graph-ops

04587ac

Merge branch 'main' into cuda-graph-ops

2bd7911

timmoon10 merged commit 50b22da into NVIDIA:main Nov 5, 2024
