Distributed optimizer infrastructure for FP8 parameters #1723
Merged
crcrpar merged 2 commits into NVIDIA:master on Sep 29, 2023
Conversation
Signed-off-by: Tim Moon <tmoon@nvidia.com>
fbc1ab4 to c71321f
crcrpar reviewed Sep 12, 2023
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
crcrpar approved these changes Sep 29, 2023
minitu pushed a commit to minitu/apex that referenced this pull request Sep 29, 2023
* Add distopt support for param syncs with non-floating-point dtypes
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Update apex/contrib/optimizers/distributed_fused_adam.py
  Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
crcrpar added a commit that referenced this pull request Sep 30, 2023
* Add update_scale_hysteresis
* Fix compile errors
* Massively reduce LayerNorm/RMSNorm GPU memory usage in modern networks by tricking torch autograd (#1715)
  * input grad checks out
  * adding clamp gamma
  * Both old and proposed implementation checks out
  * 2 tests not yet passed due to numerical issues
  * mem_eff works
  * fast-layer-norm done
  * Moving mem-eff to templates
  * Relax tolerance for memory efficient backward
  * Fix backward api of python
* Distributed optimizer infrastructure for FP8 parameters (#1723)
  * Add distopt support for param syncs with non-floating-point dtypes
    Signed-off-by: Tim Moon <tmoon@nvidia.com>
  * Update apex/contrib/optimizers/distributed_fused_adam.py
    Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
* Add unit test
* Fix comment in unit test
* Remove unnecessary bits
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Jaemin Choi <jaeminc@nvidia.com>
Co-authored-by: Rui Wang <rui@helixon.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
This PR does some refactoring that will enable distributed optimizer support for FP8 parameters in NeMo. It adds the option to do parameter all-gathers in integer dtypes and adds two member functions, `_check_params_shard_dtypes` and `_param_copy_fragments`, to handle casting into and out of the all-gather buffer. For now, these functions either do a direct cast for floating-point dtypes or copy the most significant bytes for other dtypes. I plan to override these functions in the NeMo derived class so that it casts parameters to FP8, performs the all-gather in UINT8, and unpacks into a custom FP8 tensor class.

This PR depends on #1719 and #1721.
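As a rough illustration of the copy behavior described above, here is a minimal standalone sketch. The helper name `copy_into_allgather_buffer` and its signature are hypothetical (it is not the actual `_param_copy_fragments` implementation), and it assumes little-endian byte order and contiguous buffers:

```python
import torch

def copy_into_allgather_buffer(src: torch.Tensor, buf: torch.Tensor) -> None:
    """Hypothetical sketch of copying a parameter fragment into an all-gather buffer.

    Mirrors the behavior described above:
      - floating-point sources are cast directly into the buffer dtype
      - other dtypes keep only the most significant bytes of each element
    Assumes little-endian byte order, so the most significant bytes are the
    trailing bytes of each element, and that buf is contiguous.
    """
    assert src.numel() == buf.numel()
    if src.is_floating_point():
        # Direct cast, e.g. FP32 -> BF16
        buf.copy_(src)
        return
    src_size, buf_size = src.element_size(), buf.element_size()
    assert buf_size <= src_size
    # Reinterpret each element as its raw bytes: shape (numel, element_size)
    src_bytes = src.contiguous().view(torch.uint8).view(src.numel(), src_size)
    buf_bytes = buf.view(torch.uint8).view(buf.numel(), buf_size)
    # Keep only the trailing (most significant) bytes of each element
    buf_bytes.copy_(src_bytes[:, src_size - buf_size:])

# Example: pack int64 parameter data into a UINT8 all-gather buffer
params = torch.tensor([1, 2, 3, 4], dtype=torch.int64)
buffer = torch.empty(4, dtype=torch.uint8)
copy_into_allgather_buffer(params, buffer)
```

An overriding class (e.g. the planned NeMo derived class) would replace this byte-level copy with an FP8 cast on the way into the UINT8 buffer and an unpack into the custom FP8 tensor class on the way out.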