
PyTorch-compatible backward API#7665

Merged
tohtana merged 39 commits into deepspeedai:master from tohtana:tohtana/backward_non_scalar
Nov 19, 2025

Conversation

@tohtana
Collaborator

@tohtana tohtana commented Nov 3, 2025

Currently, DeepSpeed's backward API imposes more constraints than PyTorch's standard backward API.
Here is the usage as described in the documentation:

```python
loss = model_engine(batch)
model_engine.backward(loss)
```

In this example:

  1. The API only accepts a scalar loss value.
  2. You must call the engine's backward API.

In contrast, in standard PyTorch, you can do:

```python
output = model(batch)
output.backward(out_grad)
```

There are several use cases that rely on this flexibility. For example, combining multiple models or using loss functions defined separately from the main model.
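For instance, in plain PyTorch a loss assembled from two separately defined modules can be driven by a single `loss.backward()` (a minimal sketch; the module shapes are arbitrary):

```python
import torch

# Two independently defined modules contributing to one shared loss.
encoder = torch.nn.Linear(4, 4)
head = torch.nn.Linear(4, 1)

x = torch.randn(2, 4)
loss = head(encoder(x)).pow(2).mean()  # loss built outside either module
loss.backward()                        # gradients flow into both modules

assert encoder.weight.grad is not None
assert head.weight.grad is not None
```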

If you attempt the same pattern with a DeepSpeed engine, some preprocessing and postprocessing steps will be silently skipped, which can lead to incorrect results.

The [documentation](https://deepspeed.readthedocs.io/en/latest/training.html#jointly-training-models-with-shared-loss) explains that you can call `_backward_epilogue` manually (and possibly `backward_prologue` as well). However, these calls are easy for users to miss, and passing a non-scalar gradient is still not supported.

This PR introduces the same `.backward()` behavior as PyTorch, allowing `.backward()` to be called directly on tensors and supporting non-scalar outputs.
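The non-scalar semantics match PyTorch's own: calling `backward()` on a non-scalar tensor requires an explicit upstream gradient. A minimal plain-PyTorch illustration:

```python
import torch

x = torch.ones(3, requires_grad=True)
out = x * 2.0                        # non-scalar output
out.backward(torch.ones_like(out))   # supply the upstream gradient
# d(out)/dx = 2 for each element, so x.grad == [2., 2., 2.]
```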

To implement post-backward hooks, we had to use some torch internal APIs; see [comments](https://github.com/deepspeedai/DeepSpeed/blob/73f7ff1aab9d1387eb7dd4eca7453a25024533f4/deepspeed/runtime/engine.py#L424) for more details. When the internal APIs are not available, the DeepSpeed engine only accepts the traditional `model_engine.backward(loss)` call.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@sfc-gh-truwase
Collaborator

@tohtana, this is a very exciting usability improvement. Please remember to update the documentation.

tohtana and others added 18 commits November 6, 2025 16:24
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana tohtana marked this pull request as ready for review November 14, 2025 02:13
@tohtana
Collaborator Author

tohtana commented Nov 14, 2025

@sfc-gh-truwase I think this PR is now ready for review, though the latest change in HF Transformers causes an error with test_zero_nesting_init.py::TestNestedParallelInit::test_nested_parallel_init.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana tohtana enabled auto-merge (squash) November 19, 2025 00:01
@tohtana tohtana merged commit 53e91a0 into deepspeedai:master Nov 19, 2025
12 checks passed
rraminen pushed a commit to rraminen/DeepSpeed that referenced this pull request Dec 1, 2025
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: rraminen <rraminen@amd.com>
sfc-gh-truwase pushed a commit that referenced this pull request Jan 20, 2026
The new backward API introduced in #7665 broke the nested backward call used in the pipeline engine, causing the tests in unit/checkpoint/test_pipeline.py to hang. This PR fixes the issue by manually setting a flag so that backward hooks are called properly.
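The flag-based guard can be sketched in plain Python (class and attribute names are hypothetical, not DeepSpeed's actual implementation): post-backward hooks must fire exactly once even when `backward()` re-enters itself.

```python
# Hedged sketch: guard a re-entrant backward() with a flag so the
# epilogue hook runs only for the outermost call.
class EngineSketch:
    def __init__(self):
        self._in_backward = False
        self.hook_calls = 0

    def backward(self, nested=False):
        outermost = not self._in_backward
        self._in_backward = True
        try:
            if nested:
                self.backward()  # re-entrant call, as in the pipeline engine
        finally:
            if outermost:
                self._in_backward = False
                self.hook_calls += 1  # epilogue hook runs once per outer call

engine = EngineSketch()
engine.backward(nested=True)
assert engine.hook_calls == 1  # hooks fired once despite the nesting
```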

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
phalani-paladugu pushed a commit to phalani-paladugu/DeepSpeed that referenced this pull request Jan 29, 2026

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>
delock pushed a commit that referenced this pull request Apr 22, 2026
… step (#7981)

## Summary

ZeRO-1/2 + `offload_optimizer` + `gradient_accumulation_steps=1` with
multiple `engine.backward()` calls per optimizer step (via
`set_gradient_accumulation_boundary()`, formalized in #7665) silently
drops all but the last backward's gradient.

`copy_grads_in_partition` only called
`async_accumulate_grad_in_cpu_via_gpu` under `if
gradient_accumulation_steps > 1`, so with `ga_steps=1` intermediate
backwards' reduced grads were never stored. The boundary
`async_inplace_copy_grad_to_fp32_buffer_from_gpu` then overwrote (not
added) the fp32 buffer with the last chunk only.

ZeRO-3 + offload and non-offload ZeRO-1/2 are unaffected.

## Fix

Replace the `ga > 1` gate with one that fires exactly when a CPU
accumulator is needed:

```python
if self.micro_step_id > 0 or not self.is_gradient_accumulation_boundary:
    self.async_accumulate_grad_in_cpu_via_gpu(param)
```

- `ga_steps=1` + single `backward()` → skipped. No CPU buffer, no extra
copy. Fast path preserved.
- `ga_steps=1` + multi-backward → accumulates correctly across calls.
- `ga_steps>1` → identical to prior behaviour.
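The behaviors above can be restated as a standalone predicate (a sketch; only the boolean condition is taken from the diff, the function name is invented):

```python
def needs_cpu_accumulator(micro_step_id: int, is_boundary: bool) -> bool:
    # Accumulate into the CPU buffer whenever this backward is not the
    # first of the step, or more backward calls will follow before step().
    return micro_step_id > 0 or not is_boundary

# ga_steps=1, single backward at the boundary -> fast path, no CPU buffer
assert not needs_cpu_accumulator(0, True)
# ga_steps=1, intermediate backward of a multi-backward step -> accumulate
assert needs_cpu_accumulator(0, False)
# later micro-steps always accumulate
assert needs_cpu_accumulator(1, True)
```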

## Measurement

2x H100, 3-layer MLP, Adam, lr=1e-3, N=4 backwards/step, ga_steps=1
Max param diff vs no-offload reference:

|        | fp32                     | bf16     |
| ------ | ------------------------ | -------- |
| Before | 2.00e-03 (wrong, ≈2×lr)  | —        |
| After  | 7.45e-09 (noise)         | 0.00e+00 |

## Tests

New `tests/unit/v1/zero/test_zero2_offload_multi_backward.py`,
parametrized over ZeRO-1/2:
multi-backward offload matches no-offload / single-backward unchanged /
multi-step state-leak guard / single-backward allocates no CPU buffer
(perf guard) / `ga_steps>1` + offload unchanged (#7967 regression
guard).

---------

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>