
Handle the scaling factor when amax is too tiny that leads to an infinite scale (#786)

Merged: timmoon10 merged 14 commits into NVIDIA:main from jinzex:scaling_factor_for_tiny_amax on May 1, 2024
Conversation


@jinzex jinzex commented Apr 16, 2024

When amax is so tiny that the scale becomes infinite in FP32, we set the scale to the maximum finite FP32 value. In this case the tensor's amax won't map exactly to the FP8 max representable value, but to something slightly below it; this is the best we can do.

cc @Oleg-Goncharov
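As a minimal sketch of the idea above (pure Python, with float64 standing in for the kernel's FP32 arithmetic; the names and the `fp8_max` default are illustrative, not the actual Transformer Engine code):

```python
# Sketch of the scale update described above. Names are illustrative,
# not Transformer Engine's API; float64 stands in for FP32 arithmetic.
FP32_MAX = 3.402823466e38  # largest finite float32 value

def compute_scale(amax: float, fp8_max: float = 448.0) -> float:
    """Map amax to the FP8 max representable value, clamping the scale."""
    if amax <= 0.0:
        return 1.0
    scale = fp8_max / amax
    # In FP32, fp8_max / amax overflows to inf when amax is tiny.
    # Clamp to the FP32 max so the scale stays finite; amax then maps
    # slightly below the FP8 max representable value instead of onto it.
    return min(scale, FP32_MAX)
```

With this clamp, a normal amax such as 1.0 still yields the exact scale (448.0 here), while a tiny amax such as 1e-40, whose naive scale would overflow FP32, yields FP32_MAX instead of inf.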

jinzex added 3 commits April 16, 2024 15:37
Handle the scaling factor when amax is too tiny that leads to an infinite scale

Signed-off-by: Jinze Xue <jinzex@nvidia.com>
Signed-off-by: Jinze Xue <jinzex@nvidia.com>
Signed-off-by: Jinze Xue <jinzex@nvidia.com>
jinzex and others added 5 commits April 23, 2024 16:53
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>
Signed-off-by: Jinze Xue <jinzex@nvidia.com>

jinzex commented Apr 24, 2024

@timmoon10 Thanks, Tim, for your review and suggestions. All of them have been applied.


jinzex commented Apr 24, 2024

@ksivaman Currently test_recipe.py is not included in qa/L0_pytorch_unittest/test.sh, and test_amax_and_scale_update fails for the case with is_first_microbatch=False. Could you suggest how to fix it?

@ksivaman

@jinzex Could you add test_recipe.py to the qa folder so that it is included in our tests? We should also stop freezing scaling factors during the is_first_microbatch tests, since that behavior was removed in #575. I suspect this is also the reason for the failures you observe.


jinzex commented Apr 25, 2024

@ksivaman Thanks for your reply! Do you mean removing the is_first_microbatch=False case?

    @pytest.mark.parametrize("amax_history_len", [1, 31, 1024])
    @pytest.mark.parametrize("amax_compute_algo", ["max", "most_recent"])
    @pytest.mark.parametrize("is_first_microbatch", [None, True, False])
    def test_amax_and_scale_update(
        self,
        amax_history_len: int,
        amax_compute_algo: str,
        is_first_microbatch: Optional[bool],
        margin: int = 2,
    ):

For example, change it to

    @pytest.mark.parametrize("amax_history_len", [1, 31, 1024])
    @pytest.mark.parametrize("amax_compute_algo", ["max", "most_recent"])
    @pytest.mark.parametrize("is_first_microbatch", [None, True])
    def test_amax_and_scale_update(
        self,
        amax_history_len: int,
        amax_compute_algo: str,
        is_first_microbatch: Optional[bool],
        margin: int = 2,
    ):

@ksivaman

I meant that in this line, we should set update_weight_scale_inv to True and check the results!

add test_recipe.py to qa/L0_pytorch_unittest/test.sh; fix unittest for is_first_microbatch=False

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

jinzex commented Apr 25, 2024

Thanks for the suggestion! That fixed the test. The changes have been committed.


@timmoon10 timmoon10 left a comment


I don't think we should change test_recipe.py. The test failure exposes a valid bug introduced by the recipe change in #575. In particular, by updating the weight scales in every microbatch step, we might change the FP8 scale in a step where we do not change the FP8 data.

This bug is beyond the scope of this PR though, and otherwise this PR looks good. I think a better solution is to merge this PR without making any changes to the tests and we will fix that bug in an upcoming PR.
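To illustrate the bug Tim describes with toy numbers (round-to-nearest quantization as a stand-in, not the actual TE kernels): FP8 data is effectively stored as x * scale and dequantized with scale_inv, so if the scale is refreshed in a step where the quantized data is not, dequantization silently uses a mismatched factor.

```python
# Toy illustration (assumed round-to-nearest quantization, not TE's kernels):
# updating the scale without re-quantizing the data corrupts dequantization.
def quantize(x: float, scale: float) -> float:
    return round(x * scale)  # stored "FP8" value (integer stand-in)

def dequantize(q: float, scale_inv: float) -> float:
    return q * scale_inv

old_scale = 64.0
q = quantize(0.5, old_scale)        # data frozen for this microbatch: q == 32
new_scale = 128.0                   # scale updated anyway in the same step
assert dequantize(q, 1.0 / old_scale) == 0.5   # consistent pair: correct
assert dequantize(q, 1.0 / new_scale) == 0.25  # mismatched pair: wrong value
```

The last line shows the failure mode: the stored data was produced with old_scale, so dequantizing with the refreshed scale halves the recovered value.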

@timmoon10

/te-ci pytorch

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

jinzex commented Apr 26, 2024

@timmoon10 Thanks Tim for the comment! The changes to update_weight_scale_inv have been reverted.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 self-requested a review April 26, 2024 17:43

@timmoon10 timmoon10 left a comment


LGTM. This will expose some test failures from an existing bug (see #786 (review)) and I see some unrelated linter failures (see #816).

@timmoon10

/te-ci pytorch

@jinzex jinzex deleted the scaling_factor_for_tiny_amax branch May 6, 2024 16:10
pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request May 23, 2024
Handle the scaling factor when amax is too tiny that leads to an infinite scale (NVIDIA#786)

* Handle the scaling factor when amax is too tiny that leads to an infinite scale

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* revert formatting changes

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* fix comments

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* Apply review suggestion

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>

* Apply review suggestion

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>

* Apply review suggestion

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>

* apply review suggestion

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* add test_recipe.py to qa/L0_pytorch_unittest/test.sh; fix unittest for is_first_microbatch=False

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* revert changes to update_weight_scale_inv

Signed-off-by: Jinze Xue <jinzex@nvidia.com>

* Debug test failures

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Jinze Xue <jinzex@nvidia.com>
Signed-off-by: Jinze Xue <155670984+jinzex@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Jinze Xue <jinzex@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
