[PyTorch] Experimental FP8 tensor class #452
Conversation
/te-ci
/te-ci
/te-ci pytorch
Force-pushed from de20156 to 4315115
Co-authored-by: Tim Moon <tmoon@nvidia.com> Co-authored-by: Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Force-pushed from 67f7cd3 to b6bfddb
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
/te-ci pytorch
/te-ci pytorch
/te-ci pytorch
| handled outside this class. If a tensor is initialized with an FP8
| metadata object, it extracts the information it needs so it isn't
| affected by later changes in the FP8 metadata (although its design
| does cause us to leak some subtle side-effects into FP8 metadata).
This doc is not really correct since we are holding a view to the meta, right?
Ops using the tensor class's __torch_dispatch__ are insensitive to external changes in the meta since we cache scale_inv. However, all bets are off when we extract _data and pass it to external ops like tex.fp8_gemm.
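To make the caching behavior described above concrete, here is a minimal sketch (illustrative only, not TransformerEngine's actual Float8Tensor; the class and attribute names are made up, and it relies on PyTorch's native torch.float8_e4m3fn dtype from 2.1+). scale_inv is copied out of the metadata object at construction, so anything routed through __torch_dispatch__ keeps seeing the cached value, while code that pulls out _data and reads a live scale from the metadata bypasses the cache:

```python
import torch
from torch.utils._pytree import tree_map

class MiniFP8Tensor(torch.Tensor):
    """Toy wrapper: FP8 payload in _data plus a scale_inv cached at construction."""

    def __new__(cls, data: torch.Tensor, meta: dict):
        self = torch.Tensor._make_wrapper_subclass(
            cls, data.shape, dtype=torch.float32, device=data.device
        )
        self._data = data                            # raw FP8 payload (e.g. float8_e4m3fn)
        self._scale_inv = meta["scale_inv"].clone()  # cached copy, detached from the meta
        return self

    def dequantize(self) -> torch.Tensor:
        # Uses the cached scale_inv, so later changes to the meta are invisible here.
        return self._data.float() * self._scale_inv

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Any torch op hitting this subclass sees dequantized FP32 values.
        unwrap = lambda t: t.dequantize() if isinstance(t, MiniFP8Tensor) else t
        return func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))

meta = {"scale_inv": torch.tensor([0.5])}
t = MiniFP8Tensor(torch.randn(4).to(torch.float8_e4m3fn), meta)
meta["scale_inv"].fill_(100.0)               # external update to the FP8 metadata
print(torch.sum(t))                          # still uses the cached 0.5
print(t._data.float() * meta["scale_inv"])   # raw-data path sees the new value
```

In this sketch, mutating meta["scale_inv"] after construction does not change what ops going through __torch_dispatch__ compute, which is the insensitivity described above; only the raw-_data path (analogous to handing _data to an external op like tex.fp8_gemm) observes the change.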
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Force-pushed from 8ff9e05 to dfcbcf1
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Handle case where transpose cache is updated externally. Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Easier for multiple tensors to share, e.g. detached tensors. Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Force-pushed from 718d284 to 94848da
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…10/TransformerEngine into float8tensor_experiments
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
ptrendx left a comment:
Approving as experimental. We will iterate upon this in the next release.
* Experimental FP8 tensor Co-authored-by: Tim Moon <tmoon@nvidia.com> Co-authored-by: Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Add fp8 tensor to ci test Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* review comments and tests Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Minor changes Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Default to FP8 usage Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix docs Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Naming changes Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* minor fix Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix transpose caching Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Debug transpose caching: handle case where transpose cache is updated externally. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Rename FP8GlobalStateManager.with_fp8_parameters Signed-off-by: Tim Moon <tmoon@nvidia.com>
* remove Float8Tensor from import API Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Avoid caching FP8 transposes if not required Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix import error in FP8 tensor tests Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix transpose caching and checkpointing bug Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Improve caching and fix distopt case Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Update transformer_engine/pytorch/float8_tensor.py Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Remove recursive logic Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix cache reset bug Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Store FP8 attributes in dict: easier for multiple tensors to share, e.g. detached tensors. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Make sure scale_inv is 1D tensor Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Make sure scale_inv is 1D tensor Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fixes and detach recipe Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Set default fp8 data type Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
| * full model training using optimizer with master weights, where the high
| precision copies of weights are already present in the optimizer.
How does this look in practice? If the model will be initialized directly with fp8 weights, how does the optimizer get high-precision copies?
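For illustration, one way the master-weight flow being asked about can look (a minimal sketch, assuming the optimizer is allowed to build its own FP32 masters by upcasting the FP8 weights once at construction; the class below is made up and is not Apex DistributedFusedAdam's actual implementation):

```python
import torch

class MasterWeightSGD:
    """Toy optimizer that owns the high-precision copies of FP8 parameters."""

    def __init__(self, params, lr=1e-3):
        self.lr = lr
        self.params = list(params)
        # Upcast once here: the model never needs to hold FP32 weights itself.
        self.masters = [p.detach().float().clone() for p in self.params]

    @torch.no_grad()
    def step(self):
        for p, m in zip(self.params, self.masters):
            if p.grad is None:
                continue
            m -= self.lr * p.grad.float()   # the update happens in full precision
            p.copy_(m)                      # write back (re-quantize) into the FP8 param
```

The reply below points at the optimizer integration this class actually targets.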
This FP8 tensor class is based on the implementation at https://github.com/facebookexperimental/protoquant/tree/fp8_poc and is primarily oriented toward enabling efficient FP8 support in Apex's DistributedFusedAdam. See NVIDIA-NeMo/NeMo#7469 and NVIDIA-NeMo/NeMo#7565. CC @sudhakarsingh27 @ksivaman
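For background, the bookkeeping such a tensor class wraps boils down to a quantized payload plus its inverse scale. A small self-contained sketch using PyTorch's native torch.float8_e4m3fn dtype and simple per-tensor scaling (the function names and scaling recipe here are illustrative, not TransformerEngine's API):

```python
import torch

def quantize_fp8(x: torch.Tensor):
    """Per-tensor FP8 (E4M3) quantization; returns the payload and scale_inv."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max     # ~448 for E4M3
    amax = x.abs().max().clamp(min=1e-12)
    scale = fp8_max / amax                             # map amax onto the FP8 range
    data = (x * scale).to(torch.float8_e4m3fn)         # quantized payload
    scale_inv = (1.0 / scale).reshape(1)               # kept as a 1-D tensor
    return data, scale_inv

def dequantize_fp8(data: torch.Tensor, scale_inv: torch.Tensor) -> torch.Tensor:
    return data.float() * scale_inv

x = torch.randn(16, 16)
data, scale_inv = quantize_fp8(x)
err = (dequantize_fp8(data, scale_inv) - x).abs().max()
print(f"max quantization error: {err:.4f}")
```

Keeping scale_inv as a 1-D tensor echoes the "Make sure scale_inv is 1D tensor" commits in the log above.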