[PyTorch] Debug CUDA graph support with operation-based API #1117

Merged
timmoon10 merged 13 commits into NVIDIA:main from timmoon10:cuda-graph-ops
Nov 5, 2024
@timmoon10
Collaborator

Description

This PR debugs CUDA graph support with the operation-based API (see #707). The CUDA graph logic is similar to the module-based API.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

  • Debug CUDA graph support with operation-based API
  • Refactor CUDA graph tests

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 added the bug Something isn't working label Aug 16, 2024
@timmoon10 timmoon10 requested a review from ksivaman August 16, 2024 01:53
@timmoon10
Collaborator Author

/te-ci pytorch

@timmoon10 timmoon10 marked this pull request as ready for review August 16, 2024 01:56
if fp8_recipe is None:
    fp8_recipe = FP8GlobalStateManager.get_fp8_recipe()
if fp8_recipe is None:
    fp8_recipe = get_default_fp8_recipe()
Member

Hmmmm, this second if looks like logic that should be inside get_fp8_recipe in the FP8GlobalStateManager.

Member

Also, since this is an internal function, couldn't we just always require a valid recipe here and deal with getting it in the caller?

Collaborator Author

This case shouldn't happen in any of our current use-cases (FP8GlobalStateManager.get_fp8_recipe() is set within fp8_autocast, fp8_recipe is provided within make_graphed_callables), but it seems delicate to rely on that assumption.
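
The fallback discussed in this thread can be sketched in isolation. The snippet below is only an illustration under stated assumptions, not the Transformer Engine implementation: `Recipe`, `StateManager`, `set_fp8_recipe`, and `resolve_recipe` are hypothetical stand-ins, and only the names `get_fp8_recipe` and `get_default_fp8_recipe` are taken from the diff above. It shows the getter itself returning a default recipe, so that callers never have to handle `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recipe:
    """Hypothetical stand-in for an FP8 recipe object."""

    amax_history_len: int = 16  # illustrative value, not a library default


def get_default_fp8_recipe() -> Recipe:
    """Stand-in for the helper of the same name quoted in the diff."""
    return Recipe()


class StateManager:
    """Hypothetical stand-in for FP8GlobalStateManager."""

    # Assumption: the recipe is only set while an autocast region is active.
    _recipe: Optional[Recipe] = None

    @classmethod
    def set_fp8_recipe(cls, recipe: Optional[Recipe]) -> None:
        cls._recipe = recipe

    @classmethod
    def get_fp8_recipe(cls) -> Recipe:
        # Fall back to the default inside the getter so callers never see None.
        if cls._recipe is None:
            return get_default_fp8_recipe()
        return cls._recipe


def resolve_recipe(fp8_recipe: Optional[Recipe] = None) -> Recipe:
    """Prefer an explicitly passed recipe, else ask the state manager."""
    if fp8_recipe is None:
        fp8_recipe = StateManager.get_fp8_recipe()
    return fp8_recipe
```

With this shape the caller needs only one `None` check, and the "no recipe set" case is handled in a single place.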

if curr_len == amax_history_len:
    continue

# Reallocate amax history
Member

Could this be its own function?

Collaborator Author

I've tried to keep this logic similar to how it's handled in the modules:

def adjust_amax_history_length(self, length: int, fwd: Optional[bool] = None) -> None:

I think it would be nice to consolidate this logic in fp8.py and reuse it for both modules and operations, but that's probably best done in a pure refactor PR.
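
As a rough illustration of what such a consolidated helper might do, here is a minimal pure-Python sketch that resizes an amax history stored as a list of rows (one row per history step). The function name `resize_amax_history` is hypothetical, and the truncation branch is an assumption, since the diff excerpt in this thread only shows the padding case. Appending zero rows mirrors what `torch.nn.functional.pad` does on a 2D tensor with `pad=(0, 0, 0, n)`.

```python
from typing import List

# Rows are history steps, columns are per-tensor amax values.
History = List[List[float]]


def resize_amax_history(history: History, new_len: int) -> History:
    """Return a copy of `history` with exactly `new_len` rows."""
    if new_len < 0:
        raise ValueError("new_len must be non-negative")
    curr_len = len(history)
    copied = [row[:] for row in history]
    if curr_len == new_len:
        # Nothing to do, matching the early `continue` in the diff.
        return copied
    if curr_len > new_len:
        # Assumption: shrinking keeps the leading rows and drops the rest.
        return copied[:new_len]
    # Growing: append zero rows at the end of the history dimension.
    num_cols = len(history[0]) if history else 0
    padding = [[0.0] * num_cols for _ in range(new_len - curr_len)]
    return copied + padding
```

For example, growing a two-row history to four rows appends two rows of zeros, while shrinking it to one row keeps only the first row.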

pad=(0, 0, 0, amax_history_len - curr_len),
)

# Update global buffers for amax reductions
Member

@ptrendx ptrendx Sep 18, 2024

This does not look like a graph-specific thing. Was the lack of this in the previous code a bug?

Collaborator Author

Yep, if the amax history length changes then I don't expect amax reductions to be handled correctly.

Return default recipe from FP8GlobalStateManager.get_fp8_recipe if needed. Expand error message when failing to load FP8 state after capturing CUDA graph.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator Author

/te-ci pytorch

@timmoon10 timmoon10 requested a review from ptrendx September 20, 2024 03:16
@timmoon10
Collaborator Author

/te-ci pytorch

@timmoon10
Collaborator Author

/te-ci pytorch

@timmoon10
Collaborator Author

/te-ci pytorch

timmoon10 added a commit to timmoon10/TransformerEngine that referenced this pull request Oct 9, 2024
Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator Author

/te-ci pytorch

@timmoon10
Collaborator Author

Merging with approval from @ptrendx and @ksivaman.

@timmoon10 timmoon10 merged commit 50b22da into NVIDIA:main Nov 5, 2024
Labels

bug Something isn't working
