Merged
2 changes: 2 additions & 0 deletions deepspeed/runtime/pipe/engine.py
@@ -52,6 +52,8 @@ def __init__(self, *super_args, **super_kwargs):
super().__init__(*super_args, **super_kwargs)
assert isinstance(self.module, PipelineModule), "model must be a PipelineModule"

assert self.zero_optimization_stage() < 2, "ZeRO-2 and ZeRO-3 are incompatible with pipeline parallelism"

# We schedule the all-reduces, so disable it in super().backward()
self.enable_backward_allreduce = False
assert not self.elasticity_enabled(), "Elasticity is not currently supported" \
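The new assertion in `engine.py` rejects any configuration that combines the pipeline engine with ZeRO-2 or ZeRO-3. The rule it enforces can be sketched in plain Python; `validate_pipeline_config` below is a hypothetical helper for illustration, not a DeepSpeed API, though the config keys follow DeepSpeed's config schema.

```python
# Minimal sketch of the compatibility rule the new assertion enforces.
# `validate_pipeline_config` is a hypothetical helper, not a DeepSpeed API.

def validate_pipeline_config(ds_config: dict) -> None:
    """Reject configs that combine pipeline parallelism with ZeRO-2/3."""
    stage = ds_config.get("zero_optimization", {}).get("stage", 0)
    if stage >= 2:
        raise ValueError(
            "ZeRO-2 and ZeRO-3 are incompatible with pipeline parallelism")

validate_pipeline_config({"zero_optimization": {"stage": 1}})  # accepted
```

ZeRO-1 (optimizer state partitioning) remains allowed because it does not shard gradients or parameters across the data-parallel group, which pipeline scheduling relies on.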
3 changes: 3 additions & 0 deletions deepspeed/runtime/pipe/module.py
@@ -112,6 +112,9 @@ def forward(self, inputs):
x = layer(x)
return x

.. note::
    Pipeline parallelism is not compatible with ZeRO-2 and ZeRO-3.

Args:
layers (Iterable): A sequence of layers defining pipeline structure. Can be a ``torch.nn.Sequential`` module.
num_stages (int, optional): The degree of pipeline parallelism. If not specified, ``topology`` must be provided.
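The ``num_stages`` argument documented above determines how the layer sequence is divided among pipeline stages. A pure-Python sketch of contiguous partitioning is shown below; ``partition_layers`` is an illustrative helper, not part of DeepSpeed's API, and DeepSpeed's real partitioner also supports other balancing policies (e.g. by parameter count).

```python
# Sketch: split a sequence of layers into contiguous per-stage chunks.
# `partition_layers` is illustrative only, not DeepSpeed's implementation.

def partition_layers(num_layers: int, num_stages: int) -> list:
    """Split `num_layers` layers into `num_stages` contiguous ranges."""
    base, extra = divmod(num_layers, num_stages)
    parts, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)  # spread the remainder
        parts.append(range(start, start + size))
        start += size
    return parts

# 10 layers over 4 stages -> chunk sizes [3, 3, 2, 2]
sizes = [len(r) for r in partition_layers(10, 4)]
```

Each worker then instantiates only the layers in its own range, which is what makes the note above about ZeRO compatibility relevant: the pipeline engine owns the placement of parameters.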
6 changes: 0 additions & 6 deletions docs/_tutorials/pipeline.md
@@ -276,15 +276,9 @@ For example, a machine with 16 GPUs must have as much local CPU memory as 16 times the model size.

DeepSpeed provides a `LayerSpec` class that delays the construction of
modules until the model layers have been partitioned across workers.
Then each worker will allocate only the layers it's assigned to. So, continuing the
example from the previous paragraph, a machine with 16 GPUs using `LayerSpec` will
need to allocate a total of only 1x the model size in CPU memory, not 16x.

Here is an example of the abbreviated AlexNet model, but expressed only
with `LayerSpec`s. Note that the syntax is almost unchanged: `nn.ReLU(inplace=True)`
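The delayed-construction idea behind `LayerSpec` can be sketched with a toy class: record the layer's class and constructor arguments up front, and build the actual module only on the worker that owns the layer. `LazySpec` below is purely illustrative and is not DeepSpeed's implementation.

```python
# Toy sketch of delayed construction (the idea behind `LayerSpec`).
# Nothing is allocated until build() runs on the owning worker.

class LazySpec:
    def __init__(self, typename, *args, **kwargs):
        self.typename = typename  # class to construct later
        self.args = args
        self.kwargs = kwargs

    def build(self):
        # Called only after partitioning, on the worker assigned this layer.
        return self.typename(*self.args, **self.kwargs)

spec = LazySpec(list, range(3))  # cheap: stores class + args only
layer = spec.build()             # allocation happens here -> [0, 1, 2]
```

In the real API the stored class would be a module type such as `nn.Conv2d`, so the expensive parameter tensors are only materialized on the one worker that needs them.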
23 changes: 13 additions & 10 deletions docs/code-docs/source/optimizers.rst
@@ -1,24 +1,27 @@
Optimizers
===================
==========

DeepSpeed offers high-performance implementations of the ``Adam`` optimizer on CPU, and of the ``FusedAdam``, ``FusedLamb``, and ``OneBitAdam`` optimizers on GPU.

Adam (CPU)
----------------------------
----------

.. autoclass:: deepspeed.ops.adam.DeepSpeedCPUAdam


FusedAdam (GPU)
----------------------------
---------------

.. autoclass:: deepspeed.ops.adam.FusedAdam


FusedLamb (GPU)
----------------------------
---------------

.. autoclass:: deepspeed.ops.lamb.FusedLamb


OneBitAdam (GPU)
----------------

.. autoclass:: deepspeed.runtime.fp16.onebit.adam.OnebitAdam
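In practice these optimizers are usually selected through the DeepSpeed config rather than constructed directly. The dict below is a minimal sketch of such a config; the field names follow DeepSpeed's config schema, but treat the exact hyperparameter values as illustrative.

```python
# Sketch: selecting an optimizer via the DeepSpeed config.
# Values are illustrative; "OneBitAdam" or "Lamb" select the other optimizers.

ds_config = {
    "train_batch_size": 8,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-3, "betas": [0.9, 0.999], "eps": 1e-8},
    },
}
```

This dict would then be passed to ``deepspeed.initialize`` via its ``config`` argument, letting the runtime instantiate the requested optimizer.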