
Distributed optimizer support for experimental FP8 tensors #7469

Closed

timmoon10 wants to merge 6 commits into NVIDIA-NeMo:main from timmoon10:fp8-distopt

Conversation


timmoon10 (Collaborator) commented on Sep 20, 2023

What does this PR do ?

This PR integrates with experimental Float8Tensors from the Transformer Engine float8tensor_experiments branch, which allows the model to store only FP8 weight matrices. The distributed optimizer keeps an FP32 master copy of the weights and performs the param all-gathers in FP8.
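
To make the data flow concrete, here is a minimal conceptual sketch. It is not the actual Float8Tensor or NeMo code; the quantization helper, int8 payload, and local per-tensor scale are simplifying assumptions. The optimizer updates an FP32 master shard, casts it to an 8-bit payload, and only that payload crosses the network in the all-gather.

```python
# Conceptual sketch only: FP8 is emulated with an int8 payload + a scale so
# the communication-volume point is concrete. Not the Float8Tensor API.
import torch
import torch.distributed as dist


def quantize_shard(master_shard: torch.Tensor):
    """Cast an FP32 master-weight shard to an 8-bit payload plus a scale."""
    amax = master_shard.abs().max().clamp(min=1e-12)
    scale = 127.0 / amax
    payload = (master_shard * scale).round().clamp(-127, 127).to(torch.int8)
    return payload, scale


def fp8_param_all_gather(master_shard: torch.Tensor, world_size: int):
    """All-gather 8-bit shards so every rank ends up with the full weight.

    The optimizer step updates the FP32 master shard; only the 8-bit payload
    is gathered, roughly 1/4 of the bytes of an FP32 all-gather.
    """
    payload, scale = quantize_shard(master_shard)
    gathered = [torch.empty_like(payload) for _ in range(world_size)]
    dist.all_gather(gathered, payload)  # 8-bit traffic instead of 32-bit
    # A real implementation shares one scale across ranks (via an amax
    # reduction) so all shards dequantize consistently; a single local scale
    # is used here only to keep the sketch short.
    return torch.cat(gathered).float() / scale
```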

Collection: NLP

Changelog

  • Add logic to initialize GPT with FP8 weight matrices
  • Add distributed optimizer support for FP8 weight matrices, including FP8 param all-gathers

Usage

Enable FP8 support:
https://github.com/NVIDIA/NeMo/blob/19a3b7015fe353199af97903df1814e3a470b503/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml#L169

Use Megatron-core model:
https://github.com/NVIDIA/NeMo/blob/19a3b7015fe353199af97903df1814e3a470b503/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml#L49

Set the optimizer to distributed_fused_adam:

https://github.com/NVIDIA/NeMo/blob/f8be40b75ee1f8437b56fcc9602dc2aaddfb0643/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml#L228
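
Put together, a hedged sketch of the same three settings applied programmatically; the key names (model.fp8, model.mcore_gpt, model.optim.name) are assumed from the linked megatron_gpt_config.yaml rather than introduced by this PR.

```python
# Sketch: overriding the three relevant megatron_gpt_config.yaml fields with
# OmegaConf. Key names are assumed from the config file linked above.
from omegaconf import OmegaConf

cfg = OmegaConf.load("examples/nlp/language_modeling/conf/megatron_gpt_config.yaml")
cfg.model.fp8 = True                              # enable FP8 support
cfg.model.mcore_gpt = True                        # use the Megatron-core GPT model
cfg.model.optim.name = "distributed_fused_adam"   # use the distributed optimizer
```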

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Pinging @sudhakarsingh27.

Additional Information

sudhakarsingh27 and others added 5 commits on September 7, 2023
@github-actions github-actions bot added the core (Changes to NeMo Core) and NLP labels on Sep 20, 2023
    bucket_id = fragment.bucket_id
    bucket_start, bucket_end = fragment.bucket_range
    param_start, param_end = fragment.param_range
    if param_end <= param_start or bucket_id not in self._params_buckets:

Check notice (Code scanning / CodeQL): Unused local variable

Variable state_bucket is not used.
    HAVE_TE_FP8TENSOR = False
    try:
        from transformer_engine.pytorch import Float8Tensor
        from transformer_engine.pytorch.cpp_extensions import cast_to_fp8

Check notice (Code scanning / CodeQL): Empty except

'except' clause does nothing but pass and there is no explanatory comment.
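
For context, such an import guard typically takes the shape sketched below; catching the import errors explicitly and adding an explanatory comment is one way to address the notice. This is an illustration, not necessarily the exact code in this PR.

```python
# Sketch of the usual import-guard pattern the notice refers to: if the
# experimental Transformer Engine branch is not installed, FP8 support is
# simply disabled rather than raising at import time.
HAVE_TE_FP8TENSOR = False
try:
    from transformer_engine.pytorch import Float8Tensor
    from transformer_engine.pytorch.cpp_extensions import cast_to_fp8

    HAVE_TE_FP8TENSOR = True
except (ImportError, ModuleNotFoundError):
    # Transformer Engine (or its experimental Float8Tensor) is unavailable;
    # leave HAVE_TE_FP8TENSOR = False and fall back to non-FP8 code paths.
    pass
```
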
@github-actions

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label Oct 16, 2023
@github-actions

This PR was closed because it has been inactive for 7 days since being marked as stale.


Labels

core (Changes to NeMo Core), NLP, stale
