Fix ZeRO-1/2 CPU-offloaded gradient loss with multiple backward() per step by roycho96 · Pull Request #7981 · deepspeedai/DeepSpeed

roycho96 · 2026-04-20T15:39:20Z

Summary

ZeRO-1/2 + offload_optimizer + gradient_accumulation_steps=1 with multiple engine.backward() calls per optimizer step (via set_gradient_accumulation_boundary(), formalized in #7665) silently drops all but the last backward's gradient.

copy_grads_in_partition only called async_accumulate_grad_in_cpu_via_gpu under if gradient_accumulation_steps > 1, so with ga_steps=1 intermediate backwards' reduced grads were never stored. The boundary async_inplace_copy_grad_to_fp32_buffer_from_gpu then overwrote (not added) the fp32 buffer with the last chunk only.

ZeRO-3 + offload and non-offload ZeRO-1/2 are unaffected.

Fix

Replace the ga > 1 gate with one that fires exactly when a CPU accumulator is needed:

if self.micro_step_id > 0 or not self.is_gradient_accumulation_boundary:
    self.async_accumulate_grad_in_cpu_via_gpu(param)

ga_steps=1 + single backward() → skipped. No CPU buffer, no extra copy. Fast path preserved.
ga_steps=1 + multi-backward → accumulates correctly across calls.
ga_steps>1 → identical to prior behaviour.

Measurement

2x H100, 3-layer MLP, Adam, lr=1e-3, N=4 backwards/step, ga_steps=1
Max param diff vs no-offload reference:

	fp32	bf16
Before	2.00e-03 (wrong, around 2 x lr)	—
After	7.45e-09 (noise)	0.00e+00

Tests

New tests/unit/v1/zero/test_zero2_offload_multi_backward.py, parametrized over ZeRO-1/2:
multi-backward offload matches no-offload / single-backward unchanged / multi-step state-leak guard / single-backward allocates no CPU buffer (perf guard) / ga_steps>1 + offload unchanged (#7967 regression guard).

… step Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>

delock · 2026-04-21T07:24:24Z

Hi @roycho96 can you fix formatting? Thanks!

roycho96 · 2026-04-21T09:58:17Z

Hi @roycho96 can you fix formatting? Thanks!

Done! Thank you for the review!

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>

Fix ZeRO-1/2 CPU-offloaded gradient loss with multiple backward() per…

70e4e69

… step Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>

roycho96 requested review from loadams, tjruwase and tohtana as code owners April 20, 2026 15:39

delock approved these changes Apr 21, 2026

View reviewed changes

fix formatting

efd10ee

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>

roycho96 force-pushed the fix/zero2-offload-ga1-multi-backward branch from 95d73e2 to efd10ee Compare April 21, 2026 09:59

delock merged commit aeb10bb into deepspeedai:master Apr 22, 2026
9 checks passed

roycho96 deleted the fix/zero2-offload-ga1-multi-backward branch April 22, 2026 06:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix ZeRO-1/2 CPU-offloaded gradient loss with multiple backward() per step#7981

Fix ZeRO-1/2 CPU-offloaded gradient loss with multiple backward() per step#7981
delock merged 2 commits intodeepspeedai:masterfrom
roycho96:fix/zero2-offload-ga1-multi-backward

roycho96 commented Apr 20, 2026

Uh oh!

delock commented Apr 21, 2026

Uh oh!

roycho96 commented Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

roycho96 commented Apr 20, 2026

Summary

Fix

Measurement

Tests

Uh oh!

delock commented Apr 21, 2026

Uh oh!

roycho96 commented Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants