[PyTorch] Support pickling Float8Tensor #529

Merged
timmoon10 merged 13 commits into NVIDIA:main from timmoon10:float8tensor-pickle
Dec 7, 2023

Conversation

@timmoon10
Collaborator

We've experienced some problems when trying to checkpoint FP8 models (NVIDIA-NeMo/NeMo#7909 (comment)). The root cause is that we cast FP8 params to higher precision when checkpointing TE modules:

state[key] = val.from_float8()

This messes with some of the bookkeeping for checkpointing in Megatron-core, e.g. figuring out the corresponding tensors in the model and optimizer state_dicts. I've modified the behavior so that pickling Float8Tensors saves the FP8 data, dtype, and scale-inv (but not fp8_meta). This fixes the error for me when I run some quick tests.
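Roughly, the idea looks like the following toy sketch. This is not the actual Transformer Engine implementation; the class, the attribute names (_data, _fp8_dtype, _scale_inv), and the choice of __reduce_ex__ are assumptions used only to illustrate saving the raw FP8 state instead of upcasting:

```python
import pickle
import torch

class ToyFloat8Tensor:
    """Toy stand-in for Float8Tensor, illustrating the pickling scheme only."""

    def __init__(self, data, fp8_dtype, scale_inv, dtype):
        self._data = data            # raw FP8 payload stored as uint8
        self._fp8_dtype = fp8_dtype  # e.g. "E4M3" or "E5M2"
        self._scale_inv = scale_inv  # per-tensor scale inverse
        self.dtype = dtype           # nominal high-precision dtype
        self._fp8_meta = None        # deliberately not pickled

    def __reduce_ex__(self, protocol):
        # Pickle the FP8 data, FP8 dtype, and scale-inv; drop fp8_meta so the
        # unpickled tensor carries no stale scaling state.
        return (
            self.__class__,
            (self._data, self._fp8_dtype, self._scale_inv, self.dtype),
        )

# Round trip: the raw FP8 payload survives without an intermediate upcast.
t = ToyFloat8Tensor(
    torch.randint(0, 255, (4,), dtype=torch.uint8), "E4M3", torch.ones(1), torch.float32
)
t2 = pickle.loads(pickle.dumps(t))
assert torch.equal(t._data, t2._data) and torch.equal(t._scale_inv, t2._scale_inv)
```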

This is built on top of #524. Closes #524.

Avoid FP8 casts when copying between Float8Tensors. Make make_like a class function.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10
Collaborator Author

After discussion with @sudhakarsingh27, I think the cleanest approach for handling fp8_meta when loading a checkpoint is to modify Float8Tensor.copy_ so that it copies just _data and _scale_inv when copying from another Float8Tensor (see the sketch below). This also makes these copies faster (avoiding an extra write and read in higher precision) and avoids introducing any additional rounding error.
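A rough sketch of the intended copy_ behavior; the function and helper names here are hypothetical, not the actual TE code:

```python
import torch

def float8_copy_(dst, src):
    """Hypothetical sketch of the Float8Tensor.copy_ change; not the TE source."""
    if type(src) is type(dst):
        # Float8Tensor -> Float8Tensor: copy the raw FP8 bytes and scale-inv
        # directly, skipping the decode-to-high-precision / re-encode round trip
        # (and any dependence on dst's possibly-uninitialized fp8_meta).
        dst._data.copy_(src._data)
        dst._scale_inv.copy_(src._scale_inv)
    else:
        # High-precision tensor -> Float8Tensor: fall back to an FP8 cast using
        # dst's current scaling state (hypothetical helper name).
        dst.cast_from_tensor_(src)
    return dst
```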

I'll go into more detail. We generally expect users to use torch.nn.Module.state_dict/torch.nn.Module.load_state_dict when checkpointing models. When loading a checkpoint, you typically initialize a model with junk weights, unpickle a file to get a state dict, copy the weight values from the state dict into the model, and then discard the state dict. Since we're going to throw away the unpickled Float8Tensors, it's fine if they don't have an fp8_meta as long as the corresponding model weights do.

However, the parameters are copied from the state dict before any extra state like fp8_meta (see the implementation of torch.nn.Module.load_state_dict). The existing implementation of Float8Tensor.copy_ tried to use the FP8 scale from fp8_meta (requiring a cast to high precision and back to FP8), but we have no reason to expect the initial scale is any good, and there could be numerical/convergence problems. This PR's change to copy_ works around this, since the _scale_inv for the loaded weight is presumably good. By the time we start training, we can expect that the Float8Tensor's fp8_meta has been properly configured, and it will take precedence for any future FP8 casts (e.g. in the optimizer step).

I'm not completely at ease, since there could be other use cases where we want an unpickled Float8Tensor to have access to fp8_meta (CPU offloading?), but I'm not aware of them and it's probably best not to overengineer a solution.
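For reference, this is the generic PyTorch checkpoint flow the reasoning above assumes; a plain nn.Linear stands in for a TE module, and there is nothing TE-specific here:

```python
import torch

# Save: state_dict values are pickled. With this PR, Float8Tensor params would
# keep their FP8 payload instead of being upcast first.
model = torch.nn.Linear(16, 16)
torch.save(model.state_dict(), "checkpoint.pt")

# Load: the model starts with junk weights, then load_state_dict copies each
# loaded value into the corresponding parameter (param.copy_(loaded_value))
# before restoring any extra state such as fp8_meta, which is why copy_ must
# not rely on fp8_meta being meaningful yet.
model = torch.nn.Linear(16, 16)
state_dict = torch.load("checkpoint.pt")
model.load_state_dict(state_dict)
```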

@timmoon10
Collaborator Author

/te-ci pytorch

@timmoon10 marked this pull request as draft December 4, 2023 07:40
timmoon10 and others added 2 commits December 5, 2023 17:31
Debugged pickling and copy functions.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 marked this pull request as ready for review December 5, 2023 17:33
@timmoon10
Collaborator Author

/te-ci pytorch

Collaborator

@sudhakarsingh27 left a comment


Lgtm!

@timmoon10 deleted the float8tensor-pickle branch February 2, 2024 01:32

Labels

bug Something isn't working
