
Support using fp16 master weights and fp16/fp8 optimizer states in FusedAdam #1078

Merged
timmoon10 merged 3 commits into NVIDIA:main from kunlunl:mx_fp16 on Nov 1, 2024


Conversation

@kunlunl
Contributor

@kunlunl kunlunl commented Aug 5, 2024

Description

Add options to set the dtypes of the master weights, exp_avg, and exp_avg_sq in FusedAdam (a minimal usage sketch follows the Changes list below).

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

Please list the changes introduced in this PR:

  • Support using fp32/fp16 master weights
  • Support using fp32/fp16/fp8 exp_avg
  • Support using fp32/fp16/fp8 exp_avg_sq
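
For orientation, here is a minimal usage sketch. The keyword arguments (master_weights, master_weight_dtype, exp_avg_dtype, exp_avg_sq_dtype) and the accepted dtype values are assumptions inferred from the change list above, not the confirmed FusedAdam signature, so treat this as illustrative rather than definitive.

```python
# Hypothetical usage sketch; the keyword arguments below are assumptions
# inferred from the change list, not the confirmed FusedAdam signature.
import torch
from transformer_engine.pytorch.optimizers import FusedAdam

model = torch.nn.Linear(1024, 1024, dtype=torch.bfloat16, device="cuda")

optimizer = FusedAdam(
    model.parameters(),
    lr=1e-4,
    master_weights=True,                # keep a separate master copy of each param
    master_weight_dtype=torch.float16,  # fp16 master weights instead of fp32
    exp_avg_dtype=torch.uint8,          # fp8 first moment, stored as uint8 plus a scale
    exp_avg_sq_dtype=torch.float16,     # fp16 second moment
)

loss = model(torch.randn(8, 1024, dtype=torch.bfloat16, device="cuda")).sum()
loss.backward()
optimizer.step()
```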

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@kunlunl kunlunl changed the title from "Add MX-FP16" to "Support using fp16 master weights and fp16/fp8 optimizer states in FusedAdam" on Aug 6, 2024
@kunlunl kunlunl changed the title from "Support using fp16 master weights and fp16/fp8 optimizer states in FusedAdam" to "Draft: Support using fp16 master weights and fp16/fp8 optimizer states in FusedAdam" on Aug 6, 2024
@kunlunl kunlunl changed the title from "Draft: Support using fp16 master weights and fp16/fp8 optimizer states in FusedAdam" to "Support using fp16 master weights and fp16/fp8 optimizer states in FusedAdam" on Oct 29, 2024
Signed-off-by: kunlunl <kunlunl@nvidia.com>
@kunlunl
Contributor Author

kunlunl commented Oct 30, 2024

@timmoon10 Hello, I noticed that no one has commented on this PR for a long time. Could you please take a look, or help find someone to review it?

Collaborator

@timmoon10 timmoon10 left a comment


Overall this looks good. It would be more general if we disentangled the state dtypes and state scaling (e.g. why not have scaled FP32 states or unscaled BF16 states?), but this does cover the specific cases in the MS-AMP paper.

For future reference, this PR adapts logic from NVIDIA/apex#1771. This is a proof of concept with several opportunities for future improvement:

  • TE kernel for computing absmax and scale (a rough PyTorch sketch of this scale/unscale round trip follows this list)
  • Fusing scale/unscale within the Adam kernel
  • Reduce memory usage in the optimizer step, perhaps by processing params in chunks
  • Reduce memory usage in checkpointing, perhaps by storing checkpoint buffers on the CPU
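
To make the first two bullets concrete, below is a rough, unfused reference for the scale/unscale round trip of an FP8 optimizer state in plain PyTorch. It assumes per-tensor scaling and the torch.float8_e4m3fn format, neither of which is specified in this thread; the proposed TE kernel would fuse this work instead of doing it in Python.

```python
# Illustrative only: an unfused, per-tensor scale/unscale round trip for an
# FP8 optimizer state. The format (float8_e4m3fn) and scaling scheme are
# assumptions, not necessarily what the PR implements.
import torch

FP8_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def quantize_state(state_fp32: torch.Tensor):
    """Compute absmax and scale, then store the state in FP8."""
    absmax = state_fp32.abs().max().clamp(min=1e-12)
    scale = FP8_MAX / absmax
    state_fp8 = (state_fp32 * scale).to(torch.float8_e4m3fn)
    return state_fp8, scale

def dequantize_state(state_fp8: torch.Tensor, scale: torch.Tensor):
    """Recover an FP32 view of the state before the Adam update."""
    return state_fp8.to(torch.float32) / scale

# Round trip for a mock exp_avg tensor
exp_avg = torch.randn(1024, device="cuda")
exp_avg_fp8, scale = quantize_state(exp_avg)
exp_avg_restored = dequantize_state(exp_avg_fp8, scale)
```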

Collaborator


This removes the use-case where the master weights are provided externally (added in #977). I personally like this change since it makes things cleaner, but will it have an effect on Mcore integration? Pinging @Wong4j.

Contributor Author


Yes, I'm aware of this problem. I talked with @Wong4j offline and invited him to review this PR.
His MR in MCore (fusing dtype casting) has not been merged yet, so I moved the dtype-casting fusion into a new MCore MR together with this precision-aware optimizer.

@timmoon10
Collaborator

/te-ci pytorch

@yaox12
Member

yaox12 commented Oct 31, 2024

/te-ci pytorch

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Kunlun Li <94586211+kunlunl@users.noreply.github.com>
@timmoon10
Collaborator

/te-ci pytorch

Collaborator

@timmoon10 timmoon10 left a comment


LGTM, pending CI and confirmation from @Wong4j that this won't break Mcore integration.

@Wong4j
Contributor

Wong4j commented Nov 1, 2024

LGTM.
@timmoon10 This design is better. My Mcore PR is not merged yet, so it won't break Mcore integration.

@timmoon10 timmoon10 merged commit 05c0fb0 into NVIDIA:main Nov 1, 2024
@Ethan-yt
Copy link

Ethan-yt commented Mar 20, 2025

Hello @kunlunl @timmoon10,
I am using Megatron-LM to train models. After this PR, loading a model together with its optimizer state uses more memory and sometimes hits an OOM.

Memory:
enter load_checkpoint: 20G
generate_state_dict: 34G (+14G)
_load_base_checkpoint(): 47G (+14G)
optimizer.load_state_dict(state_dict['optimizer']): 61G (+14G) (did not increase before this PR)
exit load_checkpoint: 35G
