[PyTorch] Use dummy amax for Float8Tensor cast #693
Conversation
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Right now, any change to the values in `Float8Tensor` will update the `amax_history` in `fp8_meta`. This is necessary to automatically update the amax after in-place operations (e.g. the optimizer step), and it makes it easier to reason about `fp8_meta` (`Float8Tensor` treats the contents of `fp8_meta` as ground truth). This happens in `Float8Tensor.to_float8`:
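Roughly, the behavior is as in the sketch below. This is a hedged illustration only: `fp8_meta` is modeled here as a plain dict holding an `amax_history` tensor, not Transformer Engine's real structure, and the function name is a stand-in.

```python
import torch

# Hedged sketch of the pattern described above, not Transformer Engine's actual
# code. `fp8_meta` is modeled as a dict with an `amax_history` tensor of shape
# [history_len, num_tensors]; the real layout differs.
def to_float8_with_amax_update(tensor, fp8_meta, fp8_meta_index, scale):
    # The cast also records the tensor's current amax in the shared fp8_meta
    # buffers, so fp8_meta stays the ground truth for the next scale update.
    amax = tensor.abs().max().float()
    fp8_meta["amax_history"][0, fp8_meta_index].copy_(amax)

    # Placeholder for the fused FP8 cast kernel: scale and clamp to the
    # representable E4M3 range (actual rounding to FP8 is omitted here).
    fp8_max = 448.0
    return torch.clamp(tensor.float() * scale, -fp8_max, fp8_max)


# The side effect on fp8_meta is the point being discussed.
fp8_meta = {"amax_history": torch.zeros(16, 4)}
data = torch.randn(8, 8)
_ = to_float8_with_amax_update(data, fp8_meta, fp8_meta_index=0, scale=torch.tensor(1.0))
print(fp8_meta["amax_history"][0, 0])  # now holds data.abs().max()
```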
It also happens in `Float8Tensor.__torch_dispatch__` with in-place operations.

This PR changes the behavior so `to_float8` ignores the `amax_history` in `fp8_meta`. I see some possibilities:

- We treat `fp8_meta` differently between `to_float8` and `__torch_dispatch__`. This is confusing.
- We change `__torch_dispatch__` so it also ignores the `amax_history` in `fp8_meta`. This means we no longer fuse the amax with the FP8 cast, but have to externally call an amax kernel after each in-place operation.
- If there is some localized bug when calling `to_float8`, we could pass in a dummy amax there instead of modifying `Float8Tensor` (see the sketch after this list).
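For the third option, a minimal sketch of what passing a dummy amax could look like, again using the simplified stand-in from above. The `amax_out` argument and buffer handling are hypothetical, not an existing Transformer Engine API.

```python
import torch

# Hypothetical sketch of option 3: the caller supplies a throwaway amax buffer,
# so the fused cast still has somewhere to write but fp8_meta's amax_history
# is left untouched.
def to_float8_with_dummy_amax(tensor, scale, amax_out=None):
    if amax_out is None:
        # Dummy buffer that is simply discarded after the cast.
        amax_out = torch.zeros(1, device=tensor.device)
    amax_out.copy_(tensor.abs().max().float())

    # Placeholder for the FP8 cast kernel, as in the earlier sketch.
    fp8_max = 448.0
    return torch.clamp(tensor.float() * scale, -fp8_max, fp8_max)


# fp8_meta is never written, so the initial weight amax does not enter the
# history (the concern raised in the review below).
data = torch.randn(8, 8)
_ = to_float8_with_dummy_amax(data, scale=torch.tensor(1.0))
```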
We should do the third one if possible. As far as I'm aware, the only place we call `to_float8` is at:
If we have to change `Float8Tensor`, I'd prefer if we reverted this branch and deleted these lines instead:

TransformerEngine/transformer_engine/pytorch/float8_tensor.py, lines 104 to 105 in b8eea8a

This makes it more obvious that `Float8Tensor` is ignoring the `amax_history` in `fp8_meta`. We also need to figure out and document our decision regarding `__torch_dispatch__`.
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
timmoon10 left a comment
This is much nicer than the previous approach. I'm not quite comfortable that the initial weight amax will never be included in the amax history. But I suppose it's a subtle point that only affects the first step and I'll approve if it makes #575 cleaner.
For reference, this also resolves the inconsistency in amax histories between weights and activations/grads, where the initial amax is included only in the weight history.
/te-ci pytorch