[PyTorch] Miscellanous fixes for FP8 DPA module by cyanguwa · Pull Request #804 · NVIDIA/TransformerEngine

cyanguwa · 2024-04-24T00:59:02Z

This PR

fixes cuDNN version extraction in unit tests due to versioning difference between cuDNN pre-9.0 and post-9.0,
allows for better compatibility with older checkpoints (pre-TE 1.6). Since TE 1.6, FusedAttention has been subclassed with TEBaseModule, and an _extra_state has been added to the module's state_dict. _extra_state contains FP8 meta data, but due to the subclassing, the addition of _extra_state to state_dict happens regardless of FP8 training or F16 training. This PR allows users to load older checkpoints (which do not have _extra_state for FusedAttention), as well as save and load new checkpoints as usual (which will contain _extra_state for FusedAttention).

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

…old checkpoints Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

ksivaman · 2024-04-24T01:34:35Z

I looked at this further post our sync, looks like tp_size/tp_group aren't used at all at the DPA/FusedAttention level. Can we simply remove/deprecate them? @cyanguwa

ksivaman

Added comment

cyanguwa · 2024-04-24T20:13:31Z

I looked at this further post our sync, looks like tp_size/tp_group aren't used at all at the DPA/FusedAttention level. Can we simply remove/deprecate them? @cyanguwa

We do use tp_group_initialized in prepare_forward. Also, if we don't keep track of tp_size/tp_group, how do we manage amax reduction for TP groups, or does fp8.py already handle that? @ksivaman

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

…with fp8_group Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

cyanguwa · 2024-04-24T21:49:03Z

/te-ci pytorch

ksivaman · 2024-04-24T21:52:52Z

With #575, the amax reduction is handled in the reduce_and_update_fp8_tensors function using the fp8_group passed into the autocast. So we don't store the tensor parallel group to handle it separately.

mikolajblaz · 2024-04-26T15:09:05Z

@cyanguwa Regarding checkpoints compatibility: not requiring _extra_state in state dict is good, although I believe it doesn't solve the problem on the application side in case of switching from one attention implementation to another. Would such interoperability of attention layers be possible? It would require _extra_state to live on the same level as the default attention implementation.

…xtra_state Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

cyanguwa · 2024-04-26T22:10:30Z

@mikolajblaz I've moved core_attention.fused_attention._extra_state to core_attention._extra_state in b94a1ee. Let me know if you have any thoughts/comments. Thanks.

Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

…re_attention; keep the test Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

cyanguwa · 2024-04-30T21:46:25Z

@ksivaman could you please help take another look?

cyanguwa · 2024-04-30T21:51:17Z

/te-ci pytorch

cyanguwa · 2024-04-30T22:03:35Z

I had some discussion with mikolajblaz offline and we decided to not pursue the move from core_attention.fused_attention._extra_state to core_attention._extra_state. The possible solutions all look unclean and may not even guarantee the loading of checkpoints is correct. This is because PyTorch relies heavily on the module structure and FusedAttention is currently a submodule of DotProductAttention, so it's very hard to manipulate state_dict()/load_state_dict() calls to get around this hierarchical structure. Will consider this another time.

https://github.com/pytorch/pytorch/blob/74b7c56517f97c5d813620da9a479417a564e8b4/torch/nn/modules/module.py#L2164

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

cyanguwa · 2024-05-01T17:58:02Z

/te-ci pytorch

ptrendx

LGTM

cyanguwa · 2024-05-01T21:41:10Z

/te-ci pytorch

ksivaman

LGTM

* initialize tp_group for FP8 DPA Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix cuDNN version in unit tests for cuDNN v9 Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add hook to ignore missing fused_attn._extra_states if training from old checkpoints Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove test and redundant implementation from last commit Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove warning message and replace with docstring Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove tp_size/tp_group in FusedAttention; amax reduction is handled with fp8_group Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * move core_attention.fused_attention._extra_state to core_attention._extra_state Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * simplify post_state_dict_hooks between FU and DPA Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add temporary test Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove previous attempts to move core_attention.fused_attention to core_attention; keep the test Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove the test Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable pylint self arg for hook which is required by hook Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>

* initialize tp_group for FP8 DPA Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix cuDNN version in unit tests for cuDNN v9 Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add hook to ignore missing fused_attn._extra_states if training from old checkpoints Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove test and redundant implementation from last commit Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove warning message and replace with docstring Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove tp_size/tp_group in FusedAttention; amax reduction is handled with fp8_group Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * move core_attention.fused_attention._extra_state to core_attention._extra_state Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * simplify post_state_dict_hooks between FU and DPA Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add temporary test Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove previous attempts to move core_attention.fused_attention to core_attention; keep the test Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove the test Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable pylint self arg for hook which is required by hook Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

cyanguwa added 5 commits April 23, 2024 21:02

initialize tp_group for FP8 DPA

2cf9067

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

fix cuDNN version in unit tests for cuDNN v9

df6fea0

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

add hook to ignore missing fused_attn._extra_states if training from …

ae3de42

…old checkpoints Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

remove test and redundant implementation from last commit

7532773

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

Merge branch 'main' into fp8_dpa/misc_fixes

b4105f4

ksivaman requested changes Apr 24, 2024

View reviewed changes

cyanguwa requested a review from ksivaman April 24, 2024 20:13

cyanguwa added 4 commits April 24, 2024 13:14

Merge branch 'main' into fp8_dpa/misc_fixes

4d45bba

remove warning message and replace with docstring

befc86d

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

remove tp_size/tp_group in FusedAttention; amax reduction is handled …

273cd4e

…with fp8_group Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

Merge branch 'main' into fp8_dpa/misc_fixes

8e168dc

Merge branch 'main' into fp8_dpa/misc_fixes

9bfd19e

cyanguwa added 2 commits April 26, 2024 21:57

move core_attention.fused_attention._extra_state to core_attention._e…

b94a1ee

…xtra_state Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

Merge branch 'main' into fp8_dpa/misc_fixes

5d35ff4

cyanguwa added 8 commits April 29, 2024 15:05

Merge branch 'main' into fp8_dpa/misc_fixes

3924565

Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>

Merge branch 'main' into fp8_dpa/misc_fixes

8f52de6

simplify post_state_dict_hooks between FU and DPA

84b3d78

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

add temporary test

8ed65cd

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

Merge branch 'NVIDIA:main' into fp8_dpa/misc_fixes

de0f072

remove previous attempts to move core_attention.fused_attention to co…

4635fdc

…re_attention; keep the test Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

remove the test

cd9777b

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

Merge branch 'main' into fp8_dpa/misc_fixes

2840f19

cyanguwa requested a review from ptrendx April 30, 2024 21:45

cyanguwa added 2 commits May 1, 2024 17:57

disable pylint self arg for hook which is required by hook

ab8a7d3

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

Merge branch 'main' into fp8_dpa/misc_fixes

9304d26

ptrendx approved these changes May 1, 2024

View reviewed changes

Merge branch 'main' into fp8_dpa/misc_fixes

b78163d

ksivaman approved these changes May 1, 2024

View reviewed changes

ksivaman merged commit 6459fd8 into NVIDIA:main May 2, 2024

ksivaman added the 1.6.0 label May 2, 2024

Conversation

cyanguwa commented Apr 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ksivaman commented Apr 24, 2024

Uh oh!

ksivaman left a comment

Choose a reason for hiding this comment

Uh oh!

cyanguwa commented Apr 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cyanguwa commented Apr 24, 2024

Uh oh!

ksivaman commented Apr 24, 2024

Uh oh!

mikolajblaz commented Apr 26, 2024

Uh oh!

cyanguwa commented Apr 26, 2024

Uh oh!

cyanguwa commented Apr 30, 2024

Uh oh!

cyanguwa commented Apr 30, 2024

Uh oh!

cyanguwa commented Apr 30, 2024

Uh oh!

cyanguwa commented May 1, 2024

Uh oh!

ptrendx left a comment

Choose a reason for hiding this comment

Uh oh!

cyanguwa commented May 1, 2024

Uh oh!

ksivaman left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

cyanguwa commented Apr 24, 2024 •

edited

Loading

cyanguwa commented Apr 24, 2024 •

edited

Loading