From 3bd034c25793bba305ab3ed206eced49b1ab9a5f Mon Sep 17 00:00:00 2001 From: Shaden Smith Date: Mon, 19 Apr 2021 00:21:19 -0700 Subject: [PATCH 1/3] zinf tutorial --- docs/_tutorials/zero.md | 73 +++++++++++++++++++++++++++++++---------- 1 file changed, 56 insertions(+), 17 deletions(-) diff --git a/docs/_tutorials/zero.md b/docs/_tutorials/zero.md index 82c3414ff44f..a8132fcf0c1e 100644 --- a/docs/_tutorials/zero.md +++ b/docs/_tutorials/zero.md @@ -14,6 +14,8 @@ ZeRO leverages the aggregate computation and memory resources of data parallelis * **Stage 3**: The 16-bit model parameters are partitioned across the processes. ZeRO will automatically collect and partition them during the forward and backward passes. +In addition, ZeRO-3 includes the *infinity offload engine* to form ZeRO-Infinity ([paper](https://arxiv.org/abs/2104.07857)), which can offload to both CPU and NVMe memory for huge memory savings. + ## Training environment We use the DeepSpeed [Megatron-LM](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM) GPT-2 code for this exercise. You can step through the Megatron-LM [tutorial](/tutorials/megatron/) to familiarize yourself with the code. We will train the models in this tutorial on [NVIDIA Tesla V100-SXM3 Tensor Core GPUs](https://www.nvidia.com/en-us/data-center/v100/) with 32GB RAM. @@ -108,36 +110,33 @@ Here is a screenshot of nvidia-smi showing GPU activity during training: ### Training trillion-scale models with ZeRO-Infinity -Stage 3 can be enabled in the JSON configuration. A full description of these -configurations is available [here](/docs/config-json/#zero-optimizations-for-fp16-training). +ZeRO Stage 3 can be enabled in the JSON configuration. A full description of +these configurations is available [here](/docs/config-json/#zero-optimizations-for-fp16-training). 
+ + ```json "zero_optimization": { "stage": 3, - "cpu_offload": true, - "cpu_offload_params": true, "contiguous_gradients": true, "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_prefetch_bucket_size": 1e7, "stage3_param_persistence_threshold": 1e5, "reduce_bucket_size": 1e7, - "sub_group_size": 1e9 - } + "sub_group_size": 1e9, + "offload_optimizer": { + "device": "cpu" + }, + "offload_param": { + "device": "cpu" + } } ``` -#### Registering external parameters with ZeRO-3 - -**Deprecated:** -DeepSpeed version `0.3.15` introduced automatic external parameter -registration and this step is no longer needed. -{: .notice--info} - - #### Allocating Massive Megatron-LM Models @@ -175,6 +174,46 @@ for more details. self.init_method(self.position_embeddings.weight) ``` +#### Memory-centric tiling +ZeRO-Infinity includes a replacement for `Linear` layers that further reduces memory. +The `deepspeed.zero.TiledLinearReturnBias` module exploits the data fetch and release +pattern of ZeRO-3 to reduce the working memory requirements by breaking down +a large operator into smaller tiles that can be executed sequentially. + +We optionally tile the model parallel linear layers found in each Transformer layer. Note +that model parallelism and tiling can be combined by specifying the corresponding +base class when building the layer. + +We include the changes for one example from Megatron-LM's [ParallelMLP](https://github.com/microsoft/DeepSpeedExamples/blob/bdf8e59aede8c8e0577e8d4d557298ca8515268f/Megatron-LM-v1.1.5-ZeRO3/megatron/model/transformer.py#L82). Three more +model-parallel layers in `transformer.py` proceed similarly. 
+ + +```diff + +@@ -1,6 +1,9 @@ +-self.dense_h_to_4h = mpu.ColumnParallelLinear( ++self.dense_h_to_4h = deepspeed.zero.TiledLinearReturnBias( + args.hidden_size, + 4 * args.hidden_size, ++ in_splits=args.tile_factor, ++ out_splits=4*args.tile_factor, ++ linear_cls=mpu.ColumnParallelLinear, + gather_output=False, + init_method=init_method, + skip_bias_add=True) +``` + +Note that we scale `in_splits` and `out_splits` proportionally with `input_size` and `output_size`. This +results in tiles of fixed size `[hidden/tile_factor, hidden/tile_factor]`. + +#### Registering external parameters + +**Deprecated:** +DeepSpeed version `0.3.15` introduced automatic external parameter +registration and this step is no longer needed. +{: .notice--info} + + ## Extracting weights If you need to take the pretrained weights out of Deepspeed here is what you can do for getting fp16 weights: @@ -182,14 +221,14 @@ If you need to take the pretrained weights out of Deepspeed here is what you can - under ZeRO-2 `state_dict` contains the fp16 model weights and these can be saved normally with `torch.save`. - under ZeRO-3 `state_dict` contains just the placeholders since the model weights are partitioned across multiple GPUs. If you want to get to these weights enable: -``` +```json "zero_optimization": { "stage3_gather_fp16_weights_on_model_save": true }, ``` And then save the model using: -``` +```python if self.deepspeed: self.deepspeed.save_fp16_model(output_dir, output_file) ``` @@ -201,7 +240,7 @@ You can use this method to save ZeRO-2 weights as well. If you'd like to get the fp32 weights, we supply a special script that can do offline consolidation. It requires no configuration files or GPUs. 
Here is an example of its usage: -``` +``` bash $ cd /path/to/checkpoints_dir $ ./zero_to_fp32.py global_step1 pytorch_model.bin Processing zero checkpoint at global_step1 From 2a886e170ee4ae0afe5f1f13f7493d03b66fcaa0 Mon Sep 17 00:00:00 2001 From: Shaden Smith Date: Mon, 19 Apr 2021 07:35:56 -0700 Subject: [PATCH 2/3] more megatron integration docs --- docs/_tutorials/zero.md | 73 ++++++++++++++++++++++++++++------------- 1 file changed, 51 insertions(+), 22 deletions(-) diff --git a/docs/_tutorials/zero.md b/docs/_tutorials/zero.md index a8132fcf0c1e..217160400e29 100644 --- a/docs/_tutorials/zero.md +++ b/docs/_tutorials/zero.md @@ -12,7 +12,7 @@ ZeRO leverages the aggregate computation and memory resources of data parallelis * **Stage 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states. -* **Stage 3**: The 16-bit model parameters are partitioned across the processes. ZeRO will automatically collect and partition them during the forward and backward passes. +* **Stage 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes. In addition, ZeRO-3 includes the *infinity offload engine* to form ZeRO-Infinity ([paper](https://arxiv.org/abs/2104.07857)), which can offload to both CPU and NVMe memory for huge memory savings. @@ -110,30 +110,48 @@ Here is a screenshot of nvidia-smi showing GPU activity during training: ### Training trillion-scale models with ZeRO-Infinity -ZeRO Stage 3 can be enabled in the JSON configuration. A full description of -these configurations is available [here](/docs/config-json/#zero-optimizations-for-fp16-training). +ZeRO-3, the third stage of ZeRO, partitions the full model state (i.e., +weights, gradients, and optimizer states) to scale memory savings linearly +with the degree of data parallelism. 
ZeRO-3 can be enabled in the JSON +configuration. A full description of these configurations is available +[here](/docs/config-json/#zero-optimizations-for-fp16-training). +#### Offloading to CPU and NVMe with ZeRO-Infinity -```json +ZeRO-Infinity uses DeepSpeed's infinity offload engine to offload the full +model state to CPU or NVMe memory, allowing for even larger model sizes. Offloading +can be enabled inside the DeepSpeed configuration: + +```diff +@@ -6,5 +6,11 @@ "zero_optimization": { "stage": 3, "contiguous_gradients": true, "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, - "stage3_prefetch_bucket_size": 1e7, - "stage3_param_persistence_threshold": 1e5, - "reduce_bucket_size": 1e7, - "sub_group_size": 1e9, - "offload_optimizer": { - "device": "cpu" - }, - "offload_param": { - "device": "cpu" - } -} + "stage3_prefetch_bucket_size": 1e7, + "stage3_param_persistence_threshold": 1e5, + "reduce_bucket_size": 1e7, +- "sub_group_size": 1e9 ++ "sub_group_size": 1e9, ++ "offload_optimizer": { ++ "device": "cpu" ++ }, ++ "offload_param": { ++ "device": "cpu" ++ } + } ``` +**ZeRO-Infinity vs ZeRO-Offload:** +DeepSpeed first included offloading capabilities with ZeRO-Offload, +a system for offloading optimizer and gradient states to CPU memory +within ZeRO-2. ZeRO-Infinity is the next generation of offloading +capabilities accessible to ZeRO-3. ZeRO-Infinity is able to offload +more data than ZeRO-Offload and has more effective bandwidth utilization +and overlapping of computation and communication. +{: .notice--info} @@ -157,7 +175,7 @@ for more details. model = GPT2Model(num_tokentypes=0, parallel_output=True) ``` -2. Gather the position embeddings weight for initialization. DeepSpeed will automatically +2. Gather the embeddings weight for initialization. DeepSpeed will automatically gather a module's parameters during its constructor and for its forward and backward pass. 
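For readers following along, the offload settings above can also be assembled in Python before handing the configuration to DeepSpeed. The sketch below is a plain-Python illustration (the `make_zero3_config` helper is ours, not a DeepSpeed API; the `deepspeed.initialize` call is omitted since it requires a GPU environment) that builds the same ZeRO-3 configuration and switches the offload target between CPU and NVMe:

```python
# Illustrative helper (not part of DeepSpeed): builds a ZeRO-3 config dict
# mirroring the JSON above. Pass device="nvme" plus an nvme_path to target
# NVMe instead of CPU memory.

def make_zero3_config(device="cpu", nvme_path=None):
    offload = {"device": device}
    if device == "nvme" and nvme_path is not None:
        offload["nvme_path"] = nvme_path
    return {
        "zero_optimization": {
            "stage": 3,
            "contiguous_gradients": True,
            "stage3_max_live_parameters": 1e9,
            "stage3_max_reuse_distance": 1e9,
            "stage3_prefetch_bucket_size": 1e7,
            "stage3_param_persistence_threshold": 1e5,
            "reduce_bucket_size": 1e7,
            "sub_group_size": 1e9,
            # Offload both optimizer states and parameters.
            "offload_optimizer": dict(offload),
            "offload_param": dict(offload),
        }
    }

config = make_zero3_config()
```

This dict would then be passed as the DeepSpeed configuration (alongside the model and optimizer) when initializing the engine.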
However, additional accesses must coordinate with DeepSpeed to ensure that parameter data is gathered and subsequently partitioned. If the tensor is modified, the `modifier_rank` @@ -172,24 +190,35 @@ for more details. modifier_rank=0): # Initialize the position embeddings. self.init_method(self.position_embeddings.weight) + + ... + + self.tokentype_embeddings = torch.nn.Embedding(...) + with deepspeed.zero.GatheredParameters(self.tokentype_embeddings.weight, + modifier_rank=0): + # Initialize the token-type embeddings. + self.init_method(self.tokentype_embeddings.weight) ``` #### Memory-centric tiling ZeRO-Infinity includes a replacement for `Linear` layers that further reduces memory. -The `deepspeed.zero.TiledLinearReturnBias` module exploits the data fetch and release -pattern of ZeRO-3 to reduce the working memory requirements by breaking down -a large operator into smaller tiles that can be executed sequentially. - We optionally tile the model parallel linear layers found in each Transformer layer. Note that model parallelism and tiling can be combined by specifying the corresponding -base class when building the layer. +base class when building the layer. +The `deepspeed.zero.TiledLinear` module exploits the data fetch and release +pattern of ZeRO-3 to reduce the working memory requirements by breaking down +a large operator into smaller tiles that can be executed sequentially. We include the changes for one example from Megatron-LM's [ParallelMLP](https://github.com/microsoft/DeepSpeedExamples/blob/bdf8e59aede8c8e0577e8d4d557298ca8515268f/Megatron-LM-v1.1.5-ZeRO3/megatron/model/transformer.py#L82). Three more model-parallel layers in `transformer.py` proceed similarly. +The model parallel layers of Megatron-LM have a special form in which the +additive `bias` of the layer is delayed and instead returned from `forward()` +to be fused with a later operator. 
DeepSpeed's +`deepspeed.zero.TiledLinearReturnBias` subclass of `TiledLinear` additionally +forwards the returned `bias` parameter without accumulating it. ```diff - @@ -1,6 +1,9 @@ -self.dense_h_to_4h = mpu.ColumnParallelLinear( +self.dense_h_to_4h = deepspeed.zero.TiledLinearReturnBias( From 9a453f324588232d3ea1624d714a752b7c445fc3 Mon Sep 17 00:00:00 2001 From: Shaden Smith Date: Mon, 19 Apr 2021 09:12:48 -0700 Subject: [PATCH 3/3] ZInf + tiling docs --- deepspeed/runtime/zero/tiling.py | 7 +- docs/code-docs/source/zero3.rst | 135 ++++++++++++++++++++----------- 2 files changed, 91 insertions(+), 51 deletions(-) diff --git a/deepspeed/runtime/zero/tiling.py b/deepspeed/runtime/zero/tiling.py index c6f912500741..d78fc81515e4 100644 --- a/deepspeed/runtime/zero/tiling.py +++ b/deepspeed/runtime/zero/tiling.py @@ -216,9 +216,10 @@ def copy_params_from(self, other): self.bias.copy_(other.bias) .. note:: - If ZeRO-3 is enabled, this is a collective operation and the updated parameters of - data-parallel rank 0 will be visibly on all ranks. See - :class:`deepspeed.zero.GatheredParameters` for more information. + If ZeRO-3 is enabled, this is a collective operation and the + updated parameters of data-parallel rank 0 will be visible on all + ranks. See :class:`deepspeed.zero.GatheredParameters` for more + information. Args: diff --git a/docs/code-docs/source/zero3.rst b/docs/code-docs/source/zero3.rst index 0192a69b5bb3..daced77d9093 100644 --- a/docs/code-docs/source/zero3.rst +++ b/docs/code-docs/source/zero3.rst @@ -1,5 +1,5 @@ -ZeRO-3 Offload -############## +ZeRO +#### The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer @@ -8,13 +8,31 @@ replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency. 
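The delayed-bias pattern used by Megatron-LM's model parallel layers, where `forward()` returns the bias alongside the output for later fusion, can be sketched in a few lines. This is a toy, framework-free illustration (the function names are ours; it is not the Megatron-LM or DeepSpeed code):

```python
# Toy sketch of the delayed-bias pattern: the layer returns its bias
# together with the un-biased output, so a later operator can fuse the
# bias addition with its own computation.

def linear_return_bias(x, weight, bias):
    # Matrix-vector product without adding the bias.
    out = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    return out, bias

def fused_bias_relu(out, bias):
    # A downstream operator fuses the bias add into its activation.
    return [max(0.0, o + b) for o, b in zip(out, bias)]
```

A subclass that tiles such a layer only needs to forward the returned bias unchanged, which is the role `TiledLinearReturnBias` plays for `TiledLinear`.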
-ZeRO-Offload further increases memory efficiency by offloading the -optimizer's states and computations to the CPU. The model parameters can also -be offloaded for even more memory savings! +#. **ZeRO Stage 1**: The optimizer states (e.g., for `Adam optimizer `_, 32-bit weights, and the first and second moment estimates) are partitioned across the processes, so that each process updates only its partition. + +#. **ZeRO Stage 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states. + +#. **ZeRO Stage 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes. + +In addition, ZeRO-3 includes the *infinity offload engine* to form +ZeRO-Infinity (`paper <https://arxiv.org/abs/2104.07857>`_), which can offload +all model states to both CPU and NVMe memory for huge memory savings. + + +For a deep dive into our algorithms, please see our `papers `_ on `ZeRO +`_, `ZeRO-Offload +`_, +and `ZeRO-Infinity `_. + +.. note:: + DeepSpeed first included offloading capabilities with **ZeRO-Offload**, a + system for offloading optimizer and gradient states to CPU memory within + ZeRO-2. **ZeRO-Infinity** is the next generation of offloading + capabilities, accessible to ZeRO-3. ZeRO-Infinity has all of the savings + of ZeRO-Offload, plus is able to offload more of the model weights and has + more effective bandwidth utilization and overlapping of computation and + communication. -For more information on our algorithms, please see our papers on `ZeRO -`_ and `ZeRO-Offload -`_. Getting Started @@ -28,14 +46,15 @@ our `config guide `_ to instruct :meth:`deepspeed.initialize` to build the optimizer for you. -Example ZeRO-3 Offload Configurations -===================================== + +Example ZeRO-3 Configurations +============================= #. 
Use ZeRO to partition the optimizer states (stage 1), gradients (stage 2), and parameters (stage 3). @@ -46,8 +65,6 @@ Example ZeRO-3 Offload Configurations { "zero_optimization": { "stage": 3, - "overlap_comm": true - }, "fp16": { "enabled": true @@ -68,14 +85,13 @@ Example ZeRO-3 Offload Configurations } -#. Additionally offload the optimizer states and computations to the CPU. +#. Additionally offload the optimizer states and computations to the CPU with ZeRO-Infinity. .. code-block:: python { "zero_optimization": { "stage": 3, - "overlap_comm": true "offload_optimizer": { "device": "cpu" } @@ -91,7 +107,6 @@ Example ZeRO-3 Offload Configurations { "zero_optimization": { "stage": 3, - "overlap_comm": true "offload_optimizer": { "device": "cpu" } @@ -103,14 +118,13 @@ Example ZeRO-3 Offload Configurations } -#. Save even MORE memory by offloading to NVMe (if available): +#. Save even MORE memory by offloading to NVMe (if available on your system): .. code-block:: python { "zero_optimization": { "stage": 3, - "overlap_comm": true "offload_optimizer": { "device": "nvme", "nvme_path": "/nvme_data" } @@ -134,6 +148,9 @@ granularity of (sub)module ``forward()`` methods. The backward pass is handled similarly. This strategy has two underlying assumptions: #. The forward and backward passes of submodules must individually fit in device memory. + If this is not the case, :class:`deepspeed.zero.TiledLinear` implements + **memory-centric tiling** and works with ZeRO-3 to break linear layers + into a sequence of smaller submodules that can fit in memory. #. A module's parameters are only accessed within its own ``__init__`` and ``forward()`` methods. Otherwise, DeepSpeed must be instructed to collect and re-partition the parameter. @@ -153,6 +170,7 @@ you can simply allocate your model in our context: model = MyLargeModel() +.. autoclass:: deepspeed.zero.Init :members: @@ -185,46 +203,56 @@ parameters are accessed outside of the module that created them.
To do so, use Registering External Parameters =============================== -Consider the following pattern common in language models such as GPT: +ZeRO-3 will automatically collect and partition the model parameters as they +are needed during the forward and backward passes. However, in some cases a +parameter may be used outside of its module's forward pass. We call these +*external* parameters. ZeRO-3 can coordinate these parameters if they are +registered either automatically or manually. -.. code-block:: python - class LanguageModel(torch.nn.Module): - ... - def forward(self, inputs): - embeds = self.embeddings(inputs) - ... - logits = compute_logits(output, self.embeddings.weight) - ... +.. note:: + DeepSpeed version ``0.3.15`` includes automatic external parameter + discovery and registration to support the most common cases. Parameters + can still be manually registered if they cannot be automatically + detected. -The tensor ``embeddings.weight`` is used in both ``embeddings.forward()`` and -``compute_logits()``. We call ``embeddings.weight`` an *external* parameter -because it is used in the training loop outside of its owning module's -forward pass. DeepSpeed will coordinate external parameters if they are -registered prior to the first forward pass. +DeepSpeed can automatically detect the following external parameter scenarios: -Consider the following pattern common in language models such as GPT: -.. code-block:: python +#. Parameter access: consider the following pattern common in language models such as GPT: - class LanguageModel(torch.nn.Module): - ... - def forward(self, inputs): - embeds = self.embeddings(inputs) - ... - logits = compute_logits(output, self.embeddings.weight) - ... + The tensor ``embeddings.weight`` is used in both ``embeddings.forward()`` and + ``compute_logits()``. We call ``embeddings.weight`` an *external* parameter + because it is used in the training loop outside of its owning module's + forward pass. 
-The tensor ``embeddings.weight`` is used in both ``embeddings.forward()`` and -``compute_logits()``. We call ``embeddings.weight`` an *external* parameter -because it is used in the training loop outside of its owning module's -forward pass. DeepSpeed will coordinate external parameters if they are -registered prior to the first forward pass. + .. code-block:: python + + class LanguageModel(torch.nn.Module): + ... + def forward(self, inputs): + embeds = self.embeddings(inputs) + ... + logits = compute_logits(output, self.embeddings.weight) + ... + + +#. Returning a parameter: + + ``CustomLinear`` returns both an output and its own ``bias`` parameter. DeepSpeed + will detect the external ``bias`` parameter and register it with submodules that + use ``CustomLinear``. + + .. code-block:: python + + class CustomLinear(torch.nn.Linear): + def forward(self, *input): + output = super().forward(*input) + return output, self.bias + -.. note:: - Most models should not need to manually register parameters. .. autofunction:: deepspeed.zero.register_external_parameter @@ -234,5 +262,16 @@ registered prior to the first forward pass. Memory-Centric Tiling --------------------- +To reduce the working memory requirements of DL training for large models, +ZeRO-Infinity includes a technique called *memory-centric tiling* that exploits +the data fetch and release pattern of ZeRO-3 by breaking down a large operator +into smaller tiles that can be executed sequentially. When combined with ZeRO-3, +the parameters and gradients of each tile can be fetched and released one at a +time, reducing the working memory in proportion to the number of tiles. +Therefore, ZeRO-Infinity can support operators of arbitrary sizes without +refactoring them for model parallelism to fit in limited GPU memory. + + .. autoclass:: deepspeed.zero.TiledLinear :members:
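To make the tiling idea concrete, here is a toy, framework-free sketch (ours, not the `TiledLinear` implementation): splitting a matrix-vector product into output tiles that are computed one at a time produces the same result as the full operator, while only one tile's weights need to be resident at any moment.

```python
# Toy sketch of memory-centric tiling: compute a matrix-vector product
# tile by tile so only one tile of the weight matrix is "live" at a time.

def matvec(W, x):
    # W is a list of rows; x is a vector.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def tiled_matvec(W, x, out_splits):
    # Split the output dimension into out_splits tiles and process them
    # sequentially, mimicking how ZeRO-3 can fetch and release each tile.
    tile = len(W) // out_splits
    out = []
    for i in range(out_splits):
        rows = W[i * tile:(i + 1) * tile]  # only this tile is resident
        out.extend(matvec(rows, x))
    return out
```

With `out_splits` tiles, peak weight residency drops by roughly a factor of `out_splits`, which is why tiling lets ZeRO-3 handle operators that would not otherwise fit in device memory.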