PyTorch-compatible backward API #7665
Merged: tohtana merged 39 commits into deepspeedai:master on Nov 19, 2025

Conversation
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Collaborator
@tohtana, this is a very exciting usability improvement. Please remember to update the documentation.
Collaborator (Author)
@sfc-gh-truwase I think this PR is now ready for review, though the latest change on HF Transformers causes an error with …
sfc-gh-truwase approved these changes on Nov 18, 2025
rraminen pushed a commit to rraminen/DeepSpeed that referenced this pull request on Dec 1, 2025
Currently, DeepSpeed's backward API is more constrained than PyTorch's standard backward API.
Here is the usage as described in the documentation:
```python
loss = model_engine(batch)
model_engine.backward(loss)
```
In this example:
1. Only a (scalar) loss value is accepted.
2. The engine's backward API must be called.

In contrast, standard PyTorch allows:
```python
output = model(batch)
output.backward(out_grad)
```
There are several use cases that rely on this flexibility, such as combining
multiple models or using loss functions defined separately from the main
model.
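In plain PyTorch, for instance, two separate models can be trained jointly through a loss defined outside either one, and a single backward call populates gradients in both. A minimal sketch of that pattern (model names and shapes are illustrative, not taken from this PR):

```python
import torch

# Two independent models joined by a loss computed outside either one.
torch.manual_seed(0)
encoder = torch.nn.Linear(4, 8)
decoder = torch.nn.Linear(8, 4)

x = torch.randn(2, 4)
target = torch.randn(2, 4)

# Shared loss defined separately from the models.
loss = torch.nn.functional.mse_loss(decoder(encoder(x)), target)

# One backward on the shared loss fills gradients in both models;
# this is the flexibility an engine-level backward(loss) constrains.
loss.backward()
```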
If you attempt the same pattern with a DeepSpeed engine, some
preprocessing and postprocessing steps will be silently skipped, which
can lead to incorrect results.
The
[document](https://deepspeed.readthedocs.io/en/latest/training.html#jointly-training-models-with-shared-loss)
explains that `_backward_epilogue` can be called manually (possibly
`backward_prologue` as well). However, these calls are easy for users to
miss, and passing a non-scalar gradient is still not supported.
This PR introduces the same `.backward()` behavior as PyTorch, allowing
`.backward()` to be called directly on tensors and supporting non-scalar
outputs.
To implement post-backward hooks, we had to use some torch internal
APIs. See
[comments](https://github.com/deepspeedai/DeepSpeed/blob/73f7ff1aab9d1387eb7dd4eca7453a25024533f4/deepspeed/runtime/engine.py#L424)
for more details. When the internal APIs are not available, the DeepSpeed
engine only accepts the traditional `model_engine.backward(loss)` form.
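For reference, the PyTorch semantics being matched here are those of `Tensor.backward(gradient=...)` on a non-scalar output. A minimal sketch of that behavior, with no DeepSpeed engine involved:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2  # non-scalar output

# For non-scalar tensors, backward() needs an explicit output gradient;
# ones_like(y) makes this equivalent to differentiating y.sum().
y.backward(torch.ones_like(y))

print(x.grad)  # tensor([2., 2., 2.])
```

This is the call shape (`output.backward(out_grad)`) that the engine's backward now accepts directly on the output tensor.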
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: rraminen <rraminen@amd.com>
sfc-gh-truwase pushed a commit that referenced this pull request on Jan 20, 2026
The new backward API introduced in #7665 broke the nested backward call used in the pipeline engine. Tests in `unit/checkpoint/test_pipeline.py` hang because of this issue. This PR fixes the issue by manually setting a flag to properly call backward hooks.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
phalani-paladugu pushed a commit to phalani-paladugu/DeepSpeed that referenced this pull request on Jan 29, 2026
The new backward API introduced in deepspeedai#7665 broke the nested backward call used in the pipeline engine. Tests in `unit/checkpoint/test_pipeline.py` hang because of this issue. This PR fixes the issue by manually setting a flag to properly call backward hooks.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>
delock pushed a commit that referenced this pull request on Apr 22, 2026
… step (#7981)

## Summary

ZeRO-1/2 + `offload_optimizer` + `gradient_accumulation_steps=1` with multiple `engine.backward()` calls per optimizer step (via `set_gradient_accumulation_boundary()`, formalized in #7665) silently drops all but the last backward's gradient.

`copy_grads_in_partition` only called `async_accumulate_grad_in_cpu_via_gpu` under `if gradient_accumulation_steps > 1`, so with `ga_steps=1` intermediate backwards' reduced grads were never stored. The boundary `async_inplace_copy_grad_to_fp32_buffer_from_gpu` then overwrote (not added to) the fp32 buffer with the last chunk only. ZeRO-3 + offload and non-offload ZeRO-1/2 are unaffected.

## Fix

Replace the `ga > 1` gate with one that fires exactly when a CPU accumulator is needed:

```python
if self.micro_step_id > 0 or not self.is_gradient_accumulation_boundary:
    self.async_accumulate_grad_in_cpu_via_gpu(param)
```

- `ga_steps=1` + single `backward()` → skipped. No CPU buffer, no extra copy. Fast path preserved.
- `ga_steps=1` + multi-backward → accumulates correctly across calls.
- `ga_steps>1` → identical to prior behaviour.

## Measurement

2x H100, 3-layer MLP, Adam, lr=1e-3, N=4 backwards/step, ga_steps=1. Max param diff vs no-offload reference:

|        | fp32                            | bf16     |
| ------ | ------------------------------- | -------- |
| Before | 2.00e-03 (wrong, around 2 x lr) | —        |
| After  | 7.45e-09 (noise)                | 0.00e+00 |

## Tests

New `tests/unit/v1/zero/test_zero2_offload_multi_backward.py`, parametrized over ZeRO-1/2: multi-backward offload matches no-offload / single-backward unchanged / multi-step state-leak guard / single-backward allocates no CPU buffer (perf guard) / `ga_steps>1` + offload unchanged (#7967 regression guard).

---------

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
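The effect of the gate change can be illustrated with a dependency-free sketch: the old `ga_steps > 1` condition never fires for `ga_steps=1` multi-backward, while the new condition fires for every backward that needs CPU accumulation. The function names below mirror the commit's description, but the code is a simplified stand-in, not DeepSpeed's implementation:

```python
# Simplified stand-in for the ZeRO-1/2 offload accumulation gate (#7981);
# illustrative only, not DeepSpeed code.

def old_gate(ga_steps, micro_step_id, is_boundary):
    # Pre-fix condition: CPU accumulation only when ga_steps > 1.
    return ga_steps > 1

def new_gate(ga_steps, micro_step_id, is_boundary):
    # Post-fix condition: accumulate whenever a CPU accumulator is needed.
    return micro_step_id > 0 or not is_boundary

# ga_steps=1 with N=3 backwards per optimizer step; only the last
# backward sits on the gradient accumulation boundary.
calls = [(0, False), (1, False), (2, True)]  # (micro_step_id, is_boundary)
old = [old_gate(1, m, b) for m, b in calls]
new = [new_gate(1, m, b) for m, b in calls]
print(old)  # [False, False, False] -> intermediate grads were dropped
print(new)  # [True, True, True]    -> grads accumulate across calls

# Single-backward fast path (ga_steps=1, one boundary backward): the new
# gate stays off, so no CPU buffer is allocated.
print(new_gate(1, 0, True))  # False
```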