Make fused normalization functions backward-compatible #1760

Merged
crcrpar merged 1 commit into NVIDIA:master from timmoon10:memory-efficient-layer-norm-bugfix on Jan 1, 2024

Conversation

@timmoon10 (Contributor) commented on Dec 21, 2023

#1715 makes breaking API changes to some fused normalization functions, in particular adding memory_efficient as a positional argument. This PR makes memory_efficient a keyword argument to ensure backward compatibility.
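As a rough sketch of the pattern (illustrative only; the function name and surrounding signature here are assumptions, not Apex's exact API), the new option gets a default value and is expected to be passed by keyword, so legacy positional call sites are unaffected:

```python
import torch.nn.functional as F

# Illustrative sketch only, not Apex's actual implementation: the new
# `memory_efficient` option is appended after the existing parameters
# with a default, so callers that pass only the original positional
# arguments keep working.
def fused_layer_norm_affine(input, weight, bias, normalized_shape,
                            eps=1e-6, memory_efficient=False):
    # In the real fused kernels, `memory_efficient` would select a backward
    # pass that recomputes activations instead of saving them; here it only
    # demonstrates the signature.
    return F.layer_norm(input, normalized_shape, weight, bias, eps)
```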

This change is motivated by the fact that Megatron-LM uses the old API:
https://github.com/NVIDIA/Megatron-LM/blob/2bc6cd307a11423928c675f741e79e03df23e721/megatron/core/fusions/fused_layer_norm.py#L147
This prevents NeMo from upgrading from the 23.09 to the 23.11 PyTorch container. See NVIDIA-NeMo/NeMo#7909 (comment).

Feedback would be appreciated. An alternative approach is to update Megatron-LM, but this seems simpler. Pinging @RuiWang1998.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
timmoon10 added a commit to timmoon10/NeMo that referenced this pull request Dec 21, 2023
See NVIDIA/apex#1760.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@RuiWang1998 (Contributor)

Hi @timmoon10,

I had thought people might not be using the Function directly, and forgot about Megatron. I believe it would be best to submit another PR to Megatron-LM in tandem with this one: Megatron-DeepSpeed already has this feature (deepspeedai/Megatron-DeepSpeed#277), and it would be great if Megatron-LM had it as well.

@timmoon10 (Contributor, Author)

@RuiWang1998 That's nifty; it'll be convenient to just reuse that existing work.

These two approaches aren't mutually exclusive, so I don't see any harm in merging this PR. This change won't break newer code that uses memory_efficient.
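For instance, continuing the hypothetical sketch above, both call styles coexist:

```python
import torch

hidden = 64
x = torch.randn(2, 8, hidden)
weight = torch.ones(hidden)
bias = torch.zeros(hidden)

# Legacy-style call (as in Megatron-LM): positional arguments only.
y_old = fused_layer_norm_affine(x, weight, bias, (hidden,), 1e-5)

# Newer call that opts into the memory-efficient path by keyword.
y_new = fused_layer_norm_affine(x, weight, bias, (hidden,), 1e-5,
                                memory_efficient=True)
```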

@crcrpar merged commit c07a4cf into NVIDIA:master on Jan 1, 2024
ericharper added a commit to NVIDIA-NeMo/NeMo that referenced this pull request Jan 12, 2024
* Add distopt support for FP8 params and BF16 optimizer state

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Removed unused import

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update PyTorch container in Jenkins pipeline

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use custom container with Apex bugfixes

See NVIDIA/apex#1760.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Upgrade to PyTorch 23.11 container

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update Apex commit

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
minitu pushed a commit to minitu/NeMo that referenced this pull request Jan 19, 2024 (same commit message as above).
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request Feb 15, 2024 (same commit message as above, with an additional sign-off from Sasha Meister <ameister@nvidia.com>).
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024 (same commit message as above).