
[PyTorch] Don't use autograd hook for bwd reduction #781

Merged
ksivaman merged 1 commit into NVIDIA:main from ksivaman:fix_hang_for_non_cg_path
Apr 15, 2024

Conversation

@ksivaman
Member

The use of torch.autograd.graph.register_multi_grad_hook, introduced in #575, is the suspected cause of hangs in certain workloads. This PR switches to a different design that achieves the reduction of amaxes for gradient tensors by calling the reduction directly from the backward pass of the modules when needed.

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
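A minimal sketch of the idea (not the actual TransformerEngine code): instead of registering a multi-grad hook on the gradient tensors and letting autograd fire it, the amax reduction is invoked directly inside the module's backward. The `_LinearWithAmaxReduction` op and the `_reduce_amax` helper below are hypothetical placeholders used only for illustration.

```python
import torch


def _reduce_amax(amax: torch.Tensor) -> None:
    # Hypothetical stand-in for the FP8 amax reduction; in a distributed
    # run this could be a MAX all-reduce over the amax buffer.
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        torch.distributed.all_reduce(amax, op=torch.distributed.ReduceOp.MAX)


class _LinearWithAmaxReduction(torch.autograd.Function):
    """Toy linear op that reduces the grad-output amax in its own backward."""

    @staticmethod
    def forward(ctx, inp, weight, amax_buffer):
        ctx.save_for_backward(inp, weight)
        ctx.amax_buffer = amax_buffer
        return inp @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        inp, weight = ctx.saved_tensors
        # Record and reduce the gradient amax right here, instead of relying
        # on a torch.autograd.graph.register_multi_grad_hook callback.
        ctx.amax_buffer.copy_(grad_out.abs().amax())
        _reduce_amax(ctx.amax_buffer)
        return grad_out @ weight, grad_out.t() @ inp, None


# Usage sketch:
x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(16, 8, requires_grad=True)
amax = torch.zeros(1)
out = _LinearWithAmaxReduction.apply(x, w, amax)
out.sum().backward()  # amax now holds the (reduced) grad-output amax
```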
@ksivaman ksivaman requested review from cyanguwa and timmoon10 on April 15, 2024, 19:16
Collaborator

@cyanguwa cyanguwa left a comment


LGTM

@ksivaman
Member Author

/te-ci pytorch

@ksivaman ksivaman merged commit f69e45b into NVIDIA:main on Apr 15, 2024
pggPL pushed a commit to pggPL/TransformerEngine that referenced this pull request on May 23, 2024
Don't use autograd hook for bwd reduction

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>