
Improve communication overlapping in FP8 distributed optimizer#8221

Merged
timmoon10 merged 14 commits into NVIDIA-NeMo:main from timmoon10:distopt-fp8-perf-optim on Feb 8, 2024

Conversation

@timmoon10
Collaborator

What does this PR do ?

When training GPT, the Apex distributed Adam optimizer overlaps its first parameter all-gather with the optimizer step. With this PR, that optimization applies to both FP8 and non-FP8 models.

Collection: NLP

Changelog

  • Wait until all distopt buckets finish the optimizer step before updating FP8 scaling factors (see the sketch after this list)
  • Support distopt buckets with multiple dtype configs
  • Put GPT "leftover params" (i.e. embeddings, layer norm params, biases) in the same distopt bucket as the first layer's params
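
A minimal sketch of the scheduling change in the first item, assuming hypothetical bucket and FP8-state objects (none of the names below come from the actual Apex or NeMo APIs):

```python
# Hypothetical sketch: run the Adam step and FP8 cast bucket by bucket,
# launching each bucket's param all-gather immediately so it overlaps with
# the remaining buckets' work, and defer the FP8 amax reduction /
# scaling-factor update until every bucket has finished its step.

def distopt_step_with_deferred_fp8_update(buckets, fp8_meta):
    for bucket in buckets:
        bucket.optimizer_step()        # Adam update for this bucket's param shard
        bucket.cast_params_to_fp8()    # cast updated params, recording amaxes
        # Start this bucket's param all-gather right away so the communication
        # overlaps with the optimizer step of the following buckets.
        bucket.start_param_all_gather()
    # Only after the last bucket has been processed are amaxes reduced and the
    # FP8 scaling factors recomputed, instead of doing this once per bucket.
    fp8_meta.reduce_amaxes()
    fp8_meta.update_scaling_factors()
```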

Usage

Run GPT, e.g. with the config at https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml.

Enable FP8 support with model.fp8=True, FP8 parameters with model.fp8_params=True, the distributed optimizer with model.optim.name=distributed_fused_adam, and overlapped param all-gathers with model.optim.overlap_param_sync=True.
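
As a small illustration of how these options nest under the model section of the config (this is only a sketch of the override structure, expressed with OmegaConf, which NeMo uses for its YAML configs; it is not how the training script is launched):

```python
# Sketch of the overrides from the PR description, nested as they appear in
# megatron_gpt_config.yaml. On the command line these are passed as
# Hydra-style overrides, e.g. model.fp8=True model.optim.overlap_param_sync=True.
from omegaconf import OmegaConf

overrides = OmegaConf.create(
    {
        "model": {
            "fp8": True,                # enable FP8 support
            "fp8_params": True,         # keep parameters in FP8
            "optim": {
                "name": "distributed_fused_adam",  # Apex distributed optimizer
                "overlap_param_sync": True,        # overlap param all-gathers
            },
        },
    }
)
print(OmegaConf.to_yaml(overrides))
```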

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 added the feature (request/PR for a new feature) and NLP labels Jan 23, 2024
@timmoon10 timmoon10 requested a review from ericharper January 23, 2024 01:50
@github-actions github-actions bot added the core (Changes to NeMo Core) label Jan 23, 2024
timmoon10 and others added 2 commits January 22, 2024 17:52
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
@timmoon10
Collaborator Author

jenkins

@timmoon10
Collaborator Author

jenkins

@timmoon10
Collaborator Author

jenkins

@timmoon10 timmoon10 requested a review from erhoo82 February 6, 2024 03:59
@ericharper ericharper requested a review from dimapihtar February 8, 2024 18:01
timmoon10 and others added 2 commits February 8, 2024 10:43
Avoid unnecessary FP8 weight transposes.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@github-actions github-actions bot added the CI label Feb 8, 2024
@timmoon10
Collaborator Author

jenkins

Collaborator

@dimapihtar dimapihtar left a comment


LGTM. Thank you!

@timmoon10 timmoon10 merged commit c84121a into NVIDIA-NeMo:main Feb 8, 2024
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request Feb 15, 2024
Improve communication overlapping in FP8 distributed optimizer (NVIDIA-NeMo#8221)

* Only reduce amaxes after fp8 cast for last distopt bucket
* Handle case with FP8 and contiguous param buffer
* Support distopt buckets with mixed dtypes
* Fix bug where fp8 casts were being skipped
* Debug FP8 params with contiguous param buffer
* Separate non-FP8 params into leftover distopt bucket
* Debug FP8 params with contiguous param buffer
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* Make sure to update FP8 transpose cache
* Update Apex commit (avoids unnecessary FP8 weight transposes)

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
vasunvidia pushed a commit to vasunvidia/NeMo that referenced this pull request Feb 19, 2024
vasunvidia added a commit to vasunvidia/NeMo that referenced this pull request Feb 19, 2024
layalir added a commit to layalir/NeMo that referenced this pull request Feb 28, 2024
layalir added a commit to layalir/NeMo that referenced this pull request Feb 29, 2024
ftxj pushed a commit to ftxj/NeMo that referenced this pull request Feb 29, 2024
minitu pushed a commit to minitu/NeMo that referenced this pull request Mar 7, 2024
pablo-garay pushed a commit that referenced this pull request Mar 19, 2024
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024

Labels

CI, core (Changes to NeMo Core), feature (request/PR for a new feature), NLP
