
Handle the scaling factor when amax is too tiny that leads to an infinite scale (#786)

Merged: timmoon10 merged 14 commits into NVIDIA:main from jinzex:scaling_factor_for_tiny_amax on May 1, 2024
Conversation


@jinzex jinzex commented Apr 16, 2024

When amax is so tiny that the scale becomes infinite in FP32, we set the scale to the maximum finite FP32 value. In this case the tensor's amax won't map exactly to the FP8 max representable value, but to something slightly below it; this is the best we can do.

cc @Oleg-Goncharov
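As a minimal sketch of the idea above (pure Python, with float64 standing in for the kernel's FP32 arithmetic; the names and the `fp8_max` default are illustrative, not the actual Transformer Engine code):

```python
# Sketch of the scale update described above. Names are illustrative,
# not Transformer Engine's API; float64 stands in for FP32 arithmetic.
FP32_MAX = 3.402823466e38  # largest finite float32 value

def compute_scale(amax: float, fp8_max: float = 448.0) -> float:
    """Map amax to the FP8 max representable value, clamping the scale."""
    if amax <= 0.0:
        return 1.0
    scale = fp8_max / amax
    # In FP32, fp8_max / amax overflows to inf when amax is tiny.
    # Clamp to the FP32 max so the scale stays finite; amax then maps
    # slightly below the FP8 max representable value instead of onto it.
    return min(scale, FP32_MAX)
```

With this clamp, a normal amax such as 1.0 still yields the exact scale (448.0 here), while a tiny amax such as 1e-40, whose naive scale would overflow FP32, yields FP32_MAX instead of inf.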

jinzex added 3 commits April 16, 2024 15:37
Handle the scaling factor when amax is too tiny that leads to an infinite scale

Signed-off-by: Jinze Xue <jinzex@nvidia.com>
Signed-off-by: Jinze Xue <jinzex@nvidia.com>
Signed-off-by: Jinze Xue <jinzex@nvidia.com>
jinzex and others added 5 commits April 23, 2024 16:53
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>
Signed-off-by: Jinze Xue <jinzex@nvidia.com>

jinzex commented Apr 24, 2024

@timmoon10 Thanks, Tim, for your review and suggestions. All of them have been applied.


jinzex commented Apr 24, 2024

@ksivaman Currently test_recipe.py is not included in qa/L0_pytorch_unittest/test.sh, and test_amax_and_scale_update fails for the case with is_first_microbatch=False. Could you suggest how to fix it?

@ksivaman

@jinzex Could you add test_recipe.py to the qa folder so that it is included in our tests? We should also stop freezing scaling factors during the is_first_microbatch tests, since that behavior was removed in #575. I suspect this is also the reason for the failures you observe.


jinzex commented Apr 25, 2024

@ksivaman Thanks for your reply! Do you mean removing the is_first_microbatch=False case?

    @pytest.mark.parametrize("amax_history_len", [1, 31, 1024])
    @pytest.mark.parametrize("amax_compute_algo", ["max", "most_recent"])
    @pytest.mark.parametrize("is_first_microbatch", [None, True, False])
    def test_amax_and_scale_update(
        self,
        amax_history_len: int,
        amax_compute_algo: str,
        is_first_microbatch: Optional[bool],
        margin: int = 2,
    ):

For example, change it to

    @pytest.mark.parametrize("amax_history_len", [1, 31, 1024])
    @pytest.mark.parametrize("amax_compute_algo", ["max", "most_recent"])
    @pytest.mark.parametrize("is_first_microbatch", [None, True])
    def test_amax_and_scale_update(
        self,
        amax_history_len: int,
        amax_compute_algo: str,
        is_first_microbatch: Optional[bool],
        margin: int = 2,
    ):

@ksivaman

I meant that in this line, we should set update_weight_scale_inv to True and check the results!

add test_recipe.py to qa/L0_pytorch_unittest/test.sh; fix unittest for is_first_microbatch=False

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

jinzex commented Apr 25, 2024

Thanks for the suggestion! That fixed the test. The changes have been committed.


@timmoon10 timmoon10 left a comment


I don't think we should change test_recipe.py. The test failure exposes a valid bug introduced by the recipe change in #575. In particular, by updating the weight scales in every microbatch step, we might change the FP8 scale in a step where we do not change the FP8 data.

This bug is beyond the scope of this PR though, and otherwise this PR looks good. I think a better solution is to merge this PR without making any changes to the tests and we will fix that bug in an upcoming PR.
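To illustrate the bug Tim describes with toy numbers (round-to-nearest quantization as a stand-in, not the actual TE kernels): FP8 data is effectively stored as x * scale and dequantized with scale_inv, so if the scale is refreshed in a step where the quantized data is not, dequantization silently uses a mismatched factor.

```python
# Toy illustration (assumed round-to-nearest quantization, not TE's kernels):
# updating the scale without re-quantizing the data corrupts dequantization.
def quantize(x: float, scale: float) -> float:
    return round(x * scale)  # stored "FP8" value (integer stand-in)

def dequantize(q: float, scale_inv: float) -> float:
    return q * scale_inv

old_scale = 64.0
q = quantize(0.5, old_scale)        # data frozen for this microbatch: q == 32
new_scale = 128.0                   # scale updated anyway in the same step
assert dequantize(q, 1.0 / old_scale) == 0.5   # consistent pair: correct
assert dequantize(q, 1.0 / new_scale) == 0.25  # mismatched pair: wrong value
```

The last line shows the failure mode: the stored data was produced with old_scale, so dequantizing with the refreshed scale halves the recovered value.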

@timmoon10

/te-ci pytorch

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

jinzex commented Apr 26, 2024

@timmoon10 Thanks Tim for the comment! The changes to update_weight_scale_inv have been reverted.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 self-requested a review April 26, 2024 17:43

@timmoon10 timmoon10 left a comment


LGTM. This will expose some test failures from an existing bug (see #786 (review)) and I see some unrelated linter failures (see #816).

@timmoon10

/te-ci pytorch

@jinzex jinzex deleted the scaling_factor_for_tiny_amax branch May 6, 2024 16:10
pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request May 23, 2024
Handle the scaling factor when amax is too tiny that leads to an infinite scale (NVIDIA#786)

* Handle the scaling factor when amax is too tiny that leads to an infinite scale

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* revert formatting changes

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* fix comments

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* Apply review suggestion

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>

* Apply review suggestion

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>

* Apply review suggestion

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>

* apply review suggestion

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* add test_recipe.py to qa/L0_pytorch_unittest/test.sh; fix unittest for is_first_microbatch=False

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* revert changes to update_weight_scale_inv

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* Debug test failures

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Jinze Xue <jinzex@nvidia.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Jinze Xue <jinzex@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
