
PyTorch-compatible backward API#7665

Merged
tohtana merged 39 commits into deepspeedai:master from tohtana:tohtana/backward_non_scalar
Nov 19, 2025

Conversation

@tohtana
Collaborator

@tohtana tohtana commented Nov 3, 2025

Currently, DeepSpeed's backward API imposes more constraints than PyTorch's standard backward API.
Here is the usage as described in the documentation:

```python
loss = model_engine(batch)
model_engine.backward(loss)
```

In this example:

  1. The API only accepts a scalar loss value.
  2. You must call the engine's backward API.

In contrast, in standard PyTorch, you can do:

```python
output = model(batch)
output.backward(out_grad)
```

There are several use cases that rely on this flexibility. For example, combining multiple models or using loss functions defined separately from the main model.
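For instance, in plain PyTorch a loss assembled from two separately defined modules can be driven by a single `loss.backward()` (a minimal sketch; the module shapes are arbitrary):

```python
import torch

# Two independently defined modules contributing to one shared loss.
encoder = torch.nn.Linear(4, 4)
head = torch.nn.Linear(4, 1)

x = torch.randn(2, 4)
loss = head(encoder(x)).pow(2).mean()  # loss built outside either module
loss.backward()                        # gradients flow into both modules

assert encoder.weight.grad is not None
assert head.weight.grad is not None
```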

If you attempt the same pattern with a DeepSpeed engine, some preprocessing and postprocessing steps will be silently skipped, which can lead to incorrect results.

The [documentation](https://deepspeed.readthedocs.io/en/latest/training.html#jointly-training-models-with-shared-loss) explains that you can call `_backward_epilogue` manually (and possibly `backward_prologue` as well). However, these calls are easy for users to miss, and passing a non-scalar gradient is still not supported.

This PR introduces the same `.backward()` behavior as PyTorch, allowing `.backward()` to be called directly on tensors and supporting non-scalar outputs.
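The non-scalar semantics match PyTorch's own: calling `backward()` on a non-scalar tensor requires an explicit upstream gradient. A minimal plain-PyTorch illustration:

```python
import torch

x = torch.ones(3, requires_grad=True)
out = x * 2.0                        # non-scalar output
out.backward(torch.ones_like(out))   # supply the upstream gradient
# d(out)/dx = 2 for each element, so x.grad == [2., 2., 2.]
```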

To implement post-backward hooks, we had to use some torch internal APIs; see [comments](https://github.com/deepspeedai/DeepSpeed/blob/73f7ff1aab9d1387eb7dd4eca7453a25024533f4/deepspeed/runtime/engine.py#L424) for more details. When the internal APIs are not available, the DeepSpeed engine only accepts the traditional `model_engine.backward(loss)` call.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@sfc-gh-truwase
Collaborator

@tohtana, this is a very exciting usability improvement. Please remember to update the documentation.

tohtana and others added 18 commits November 6, 2025 16:24
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana tohtana marked this pull request as ready for review November 14, 2025 02:13
@tohtana
Collaborator Author

tohtana commented Nov 14, 2025

@sfc-gh-truwase I think this PR is now ready for review, though the latest change in HF Transformers causes an error with test_zero_nesting_init.py::TestNestedParallelInit::test_nested_parallel_init.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana tohtana enabled auto-merge (squash) November 19, 2025 00:01
@tohtana tohtana merged commit 53e91a0 into deepspeedai:master Nov 19, 2025
12 checks passed
rraminen pushed a commit to rraminen/DeepSpeed that referenced this pull request Dec 1, 2025
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: rraminen <rraminen@amd.com>
sfc-gh-truwase pushed a commit that referenced this pull request Jan 20, 2026
The new backward API introduced in #7665 broke the nested backward call used in the pipeline engine, causing the tests in unit/checkpoint/test_pipeline.py to hang. This PR fixes the issue by manually setting a flag so that backward hooks are called properly.
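The flag-based guard can be sketched in plain Python (class and attribute names are hypothetical, not DeepSpeed's actual implementation): post-backward hooks must fire exactly once even when `backward()` re-enters itself.

```python
# Hedged sketch: guard a re-entrant backward() with a flag so the
# epilogue hook runs only for the outermost call.
class EngineSketch:
    def __init__(self):
        self._in_backward = False
        self.hook_calls = 0

    def backward(self, nested=False):
        outermost = not self._in_backward
        self._in_backward = True
        try:
            if nested:
                self.backward()  # re-entrant call, as in the pipeline engine
        finally:
            if outermost:
                self._in_backward = False
                self.hook_calls += 1  # epilogue hook runs once per outer call

engine = EngineSketch()
engine.backward(nested=True)
assert engine.hook_calls == 1  # hooks fired once despite the nesting
```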

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
phalani-paladugu pushed a commit to phalani-paladugu/DeepSpeed that referenced this pull request Jan 29, 2026

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>
delock pushed a commit that referenced this pull request Apr 22, 2026
… step (#7981)

## Summary

ZeRO-1/2 + `offload_optimizer` + `gradient_accumulation_steps=1` with
multiple `engine.backward()` calls per optimizer step (via
`set_gradient_accumulation_boundary()`, formalized in #7665) silently
drops all but the last backward's gradient.

`copy_grads_in_partition` only called
`async_accumulate_grad_in_cpu_via_gpu` under `if
gradient_accumulation_steps > 1`, so with `ga_steps=1` intermediate
backwards' reduced grads were never stored. The boundary
`async_inplace_copy_grad_to_fp32_buffer_from_gpu` then overwrote (not
added) the fp32 buffer with the last chunk only.

ZeRO-3 + offload and non-offload ZeRO-1/2 are unaffected.

## Fix

Replace the `ga > 1` gate with one that fires exactly when a CPU
accumulator is needed:

```python
if self.micro_step_id > 0 or not self.is_gradient_accumulation_boundary:
    self.async_accumulate_grad_in_cpu_via_gpu(param)
```

- `ga_steps=1` + single `backward()` → skipped. No CPU buffer, no extra
copy. Fast path preserved.
- `ga_steps=1` + multi-backward → accumulates correctly across calls.
- `ga_steps>1` → identical to prior behaviour.
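The behaviors above can be restated as a standalone predicate (a sketch; only the boolean condition is taken from the diff, the function name is invented):

```python
def needs_cpu_accumulator(micro_step_id: int, is_boundary: bool) -> bool:
    # Accumulate into the CPU buffer whenever this backward is not the
    # first of the step, or more backward calls will follow before step().
    return micro_step_id > 0 or not is_boundary

# ga_steps=1, single backward at the boundary -> fast path, no CPU buffer
assert not needs_cpu_accumulator(0, True)
# ga_steps=1, intermediate backward of a multi-backward step -> accumulate
assert needs_cpu_accumulator(0, False)
# later micro-steps always accumulate
assert needs_cpu_accumulator(1, True)
```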

## Measurement

2x H100, 3-layer MLP, Adam, lr=1e-3, N=4 backwards/step, ga_steps=1
Max param diff vs no-offload reference:

|        | fp32                     | bf16     |
| ------ | ------------------------ | -------- |
| Before | 2.00e-03 (wrong, ≈2×lr)  | —        |
| After  | 7.45e-09 (noise)         | 0.00e+00 |

## Tests

New `tests/unit/v1/zero/test_zero2_offload_multi_backward.py`,
parametrized over ZeRO-1/2:
multi-backward offload matches no-offload / single-backward unchanged /
multi-step state-leak guard / single-backward allocates no CPU buffer
(perf guard) / `ga_steps>1` + offload unchanged (#7967 regression
guard).

---------

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>