[PyTorch] Debug amax reductions in eval mode and async amax reductions #728
Closed
timmoon10 wants to merge 6 commits into NVIDIA:main
Conversation
Do not update backward FP8 scales when in eval mode. Make sure to finish async amax reductions before scale update. Signed-off-by: Tim Moon <tmoon@nvidia.com>
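To illustrate the eval-mode fix, here is a minimal, hypothetical sketch (invented names, not Transformer Engine's actual code) of how a custom autograd function can capture the module's training mode at forward time so its backward pass can skip FP8 scale updates in eval mode. Note that `torch.no_grad()` and eval mode are independent: `no_grad` disables gradient tracking, while `module.eval()` only flips `module.training`.

```python
import torch

class _FP8AwareLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp, weight, is_training):
        ctx.save_for_backward(inp, weight)
        # Capture training mode now; it cannot be queried from backward.
        ctx.is_training = is_training
        return inp @ weight.t()

    @staticmethod
    def backward(ctx, grad_output):
        inp, weight = ctx.saved_tensors
        if ctx.is_training:
            # Only update amax history / FP8 scales when actually training.
            pass
        grad_input = grad_output @ weight
        grad_weight = grad_output.t() @ inp
        return grad_input, grad_weight, None  # no grad for the flag
```

A module's forward would call `_FP8AwareLinear.apply(x, self.weight, self.training)`, so that running backward under `model.eval()` leaves the scales untouched even though gradients are still computed.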
timmoon10 (Collaborator, Author):
/te-ci pytorch
Member:
Note: #575 already removes the async amax reduction and addresses these fixes (including overhauling the current system for amax reduction/update). Given that both would land in the same release, shall we close this? @timmoon10
Signed-off-by: Tim Moon <tmoon@nvidia.com>
…ction Signed-off-by: Tim Moon <tmoon@nvidia.com>
timmoon10 (Collaborator, Author):
/te-ci pytorch
Force-pushed from 2235c17 to 05d7861
Signed-off-by: Tim Moon <tmoon@nvidia.com>
This PR fixes two bugs related to amax reductions:
1. Backward FP8 scales were updated even in evaluation mode. Confusing `no_grad` and evaluation mode seems to be a common mistake in PyTorch. This PR fixes this by checking whether a module is in training mode in its backward pass, similar to how we do it in the forward pass.
2. Async amax reductions were not guaranteed to finish before the FP8 scale update, so the update could read a partially reduced amax. This PR makes sure the reduction has completed before the scales are updated.

I've attempted to keep this PR small since #575 touches a lot of the amax reduction logic. In the future, I think it would be worthwhile to rework the async amax reductions, since they currently don't achieve much overlap (the reduction is launched when entering `fp8_autocast` and synchronized before the first TE module's `forward`).
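As a rough illustration of the second fix, here is a hedged sketch, not TE's actual API: it assumes `torch.distributed` is initialized and uses the common delayed-scaling formula scale = fp8_max / (amax * 2^margin). The function names and the `fp8_max`/`margin` defaults are illustrative.

```python
import torch
import torch.distributed as dist

def launch_amax_reduction(amax: torch.Tensor):
    # async_op=True returns immediately with a work handle; the reduction
    # may still be in flight on the communication stream.
    return dist.all_reduce(amax, op=dist.ReduceOp.MAX, async_op=True)

def update_fp8_scale(scale, amax, handle, fp8_max=448.0, margin=0.0):
    if handle is not None:
        handle.wait()  # the fix: finish the reduction before using amax
    scale.copy_(fp8_max / (amax * 2.0**margin))
```

Blocking on the handle only at the point of the scale update is what makes the reduction usefully asynchronous; synchronizing it before the first module's `forward`, as the current scheme does, leaves almost no window for overlap.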