[PyTorch] Debug checkpointing with te.Sequential by timmoon10 · Pull Request #1629 · NVIDIA/TransformerEngine

timmoon10 · 2025-04-01T01:44:41Z

Description

TE 2.0 changed how the fusible ops handle FP8 state (they now use quantizers rather than fp8_meta dicts), but their get_extra_state/set_extra_state functions were not updated to match. This PR updates these functions so that they are similar to the corresponding module get_extra_state/set_extra_state functions:

TransformerEngine/transformer_engine/pytorch/module/base.py, line 556 in be055eb:

def get_extra_state(self) -> torch.Tensor:

TransformerEngine/transformer_engine/pytorch/module/base.py, line 616 in be055eb:

def set_extra_state(self, state: torch.Tensor) -> None:

I've also added a unit test.
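For readers unfamiliar with these hooks, here is a minimal sketch of how get_extra_state/set_extra_state carry non-parameter state through a PyTorch checkpoint. It is not the code from this PR or from TE: ToyQuantizedOp, its scale and amax_history attributes, and the serialization scheme are all hypothetical, and torch.nn.Sequential stands in for te.Sequential. The only things taken from the description above are the two hook signatures, which return and accept a torch.Tensor.

```python
import io

import torch


class ToyQuantizedOp(torch.nn.Module):
    """Hypothetical op whose only state lives outside parameters/buffers."""

    def __init__(self) -> None:
        super().__init__()
        # Plain attributes standing in for quantizer state (assumed names)
        self.scale = torch.ones(1)
        self.amax_history = torch.zeros(4)

    def get_extra_state(self) -> torch.Tensor:
        # Serialize the state to bytes, then pack the bytes into a uint8 tensor
        buffer = io.BytesIO()
        torch.save({"scale": self.scale, "amax_history": self.amax_history}, buffer)
        return torch.frombuffer(bytearray(buffer.getvalue()), dtype=torch.uint8)

    def set_extra_state(self, state: torch.Tensor) -> None:
        if state is None or state.numel() == 0:
            return  # Nothing was saved for this op
        # Unpack the uint8 tensor back into bytes and deserialize
        loaded = torch.load(io.BytesIO(bytes(state.tolist())))
        self.scale.copy_(loaded["scale"])
        self.amax_history.copy_(loaded["amax_history"])


# torch.nn.Sequential is used here as a stand-in for te.Sequential
src = torch.nn.Sequential(ToyQuantizedOp(), ToyQuantizedOp())
src[0].scale.fill_(0.5)
src[1].amax_history.copy_(torch.arange(4.0))

# state_dict() calls get_extra_state() on each op that overrides it
checkpoint = src.state_dict()

# load_state_dict() hands each saved tensor back to set_extra_state()
dst = torch.nn.Sequential(ToyQuantizedOp(), ToyQuantizedOp())
dst.load_state_dict(checkpoint)
```

PyTorch stores each op's extra state under a key ending in `_extra_state`, so the checkpoint here holds one entry per op. If the two hooks disagree about the layout of the serialized state, loading fails or silently restores the wrong values, which is the kind of mismatch a save/load round-trip test is meant to catch.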

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Update checkpointing functions in fusible ops with changes from TE 2.0
Add tests for checkpointing fusible ops

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Tim Moon <tmoon@nvidia.com>


timmoon10 · 2025-04-01T01:47:13Z

/te-ci pytorch

* Debug checkpointing with te.Sequential
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Peter Dykas <wdykas@nvidia.com>

Debug checkpointing with te.Sequential

4d03b6f

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 added the bug label (Something isn't working) on Apr 1, 2025

[pre-commit.ci] auto fixes from pre-commit.com hooks

8edec96

for more information, see https://pre-commit.ci

timmoon10 merged commit 0da6044 into NVIDIA:main Apr 9, 2025
20 of 23 checks passed

timmoon10 merged 2 commits into NVIDIA:main from timmoon10:debug-te-sequential-checkpoint