[PyTorch] Remove special handling for FP8 params in FP8 recipe infrastructure #1326
Merged
ksivaman merged 4 commits into NVIDIA:main on Nov 14, 2024
Conversation
Signed-off-by: Tim Moon <tmoon@nvidia.com>
for more information, see https://pre-commit.ci
Collaborator (Author)
/te-ci pytorch L1 L3
Member
/te-ci pytorch
Collaborator (Author)
The convergence tests in pipeline 20334396 timed out, but all the tests that did run passed.
ksivaman approved these changes on Nov 14, 2024
Description
#1142 exposed a very subtle bug that caused non-deterministic test failures in test_fusible_ops_with_userbuffers.py.
Bug description
test_fusible_ops_with_userbuffers.py runs multiple test cases at a time because launching a parallel job is expensive, so it constructs and destroys multiple TE models with FP8 parameters. Python IDs may be reused after an object is deallocated, so the Python ID of an FP8 tensor is sometimes reused. However, Float8Tensor.post_optimizer_step_fwd_amax_reduction uses Python IDs to decide whether to perform amax reductions and FP8 scale updates. I observed that this was causing FP8 scale updates at odd times, which corrupted Userbuffers (UB) buffers, which caused hangs. 🫠
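The failure mode is easy to reproduce in plain Python. Below is a minimal sketch of the pitfall (the cache and helper names are hypothetical, not TE code): keying state on id() is fragile because CPython may recycle an object's id() after it is garbage-collected.

```python
# Minimal sketch of the id()-reuse pitfall (hypothetical names, not TE code).
# CPython may recycle an object's id() after the object is garbage-collected,
# so a cache keyed on id() can mistake a brand-new tensor for an old FP8 param.
import torch

seen_param_ids = set()  # hypothetical cache keyed on id(), like the old callback


def register_fp8_param(param: torch.Tensor) -> None:
    seen_param_ids.add(id(param))


def needs_scale_update(tensor: torch.Tensor) -> bool:
    # Spuriously returns True for any tensor that happens to reuse a stale id()
    return id(tensor) in seen_param_ids


old_param = torch.empty(16)
register_fp8_param(old_param)
stale_id = id(old_param)
del old_param                 # freed; CPython may hand out the same id() again

new_tensor = torch.empty(16)  # unrelated tensor, possibly at the recycled id
if id(new_tensor) == stale_id:
    print("stale id reused:", needs_scale_update(new_tensor))  # prints True
```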
In short, the problem comes from this weird callback in Float8Tensor (transformer_engine/pytorch/tensor/float8_tensor.py, line 77 as of commit 2643ba1).
This hack was added in #575 so that we would properly update FP8 scales for FP8 params after the optimizer step. However, we've made improvements since then.
Thus, there's no need to do an FP8 scale update for the weights immediately after the optimizer step. We just need to do it sometime before the next optimizer step, with no change in numerics. In fact, these FP8 scales already participate in the forward-pass amax reduction and scale update, so dropping the post-step update removes redundant work and reduces runtime overhead. It also makes Float8Tensor saner and less tightly coupled with the FP8 recipe infrastructure.
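For context, here is a minimal training-loop sketch (assuming standard TE APIs such as fp8_model_init, fp8_autocast, and DelayedScaling; shapes and hyperparameters are illustrative) of where the FP8 weight scales are now updated: as part of the regular forward-pass amax reduction, rather than in a callback fired right after optimizer.step().

```python
# Illustrative sketch (not code from this PR): FP8 weight scales/amaxes are
# handled by the normal forward-pass amax reduction under fp8_autocast, so no
# separate scale update is required right after optimizer.step().
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

with te.fp8_model_init(enabled=True):      # parameters stored as Float8Tensor
    model = te.Linear(768, 768, device="cuda")

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for _ in range(3):
    x = torch.randn(32, 768, device="cuda")
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        out = model(x)                     # weight amaxes join this reduction
    out.sum().backward()
    optimizer.step()                       # no post-step FP8 scale update here
    optimizer.zero_grad()
```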
Type of change
Changes
Checklist: