197 changes: 158 additions & 39 deletions Intro101/README.md

# Training a Masked Language Model with PyTorch and DeepSpeed

In this tutorial, we will create and train a Transformer encoder on the Masked Language Modeling (MLM) task. Then we will show the changes necessary to integrate DeepSpeed and highlight some of the advantages of doing so.

Table of contents
=================

<!--toc-start-->
* [(1) Training a Transformer Encoder (BERT / RoBERTa) model for MLM](#1-training-a-transformer-encoder-bert--roberta-model-for-mlm)
* [1.0 Some Good Practices](#10-some-good-practices)
* [1.1 The Masked Language Modeling Task](#11-the-masked-language-modeling-task)
* [1.2 Creating a Transformer model](#12-creating-a-transformer-model)
* [1.3 Training the Model](#13-training-the-model)
* [(2) Integrating DeepSpeed For More Efficient Training](#2-integrating-deepspeed-for-more-efficient-training)
* [2.0 Core DeepSpeed Code Changes](#20-core-deepspeed-code-changes)
* [2.1 Launching Training](#21-launching-training)
* [2.2 Mixed Precision Training (fp16)](#22-mixed-precision-training-fp16)
* [2.3 Zero Redundancy Optimizer (ZeRO)](#23-zero-redundancy-optimizer-zero)
* [References](#references)
<!--toc-end-->

## 1. Training a Transformer Encoder (BERT / RoBERTa) model for MLM

The parameters are explained in more detail in the docstring of `train`.

---

## 2. Integrating DeepSpeed For More Efficient Training

In this next section we'll add DeepSpeed to the model presented in Section 1 and turn on several features.

## 2.0 Core DeepSpeed Code Changes

Please see the [Writing DeepSpeed Models](https://www.deepspeed.ai/getting-started/#writing-deepspeed-models) instructions on modifying an existing model to use DeepSpeed. We will also rely heavily on the [DeepSpeed API documentation](https://deepspeed.readthedocs.io/en/latest/) and the [config JSON documentation](https://www.deepspeed.ai/docs/config-json/) going forward.

Please install DeepSpeed via `pip install deepspeed` if you haven't already done so. After installing, you can check your current version and other environment information via `ds_report`. For this tutorial we assume a DeepSpeed version >= 0.5.4 and a torch version >= 1.6. Please upgrade via `pip install --upgrade deepspeed` if you are running an older version of DeepSpeed.
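
For example (assuming `pip` installs into the same environment as your PyTorch install):

```bash
# Install DeepSpeed (or upgrade an existing install), then print the environment report.
pip install --upgrade deepspeed
ds_report
```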

### Add deepspeed.initialize + config

Our first task is to identify where to add `deepspeed.initialize()` to the existing code in order to use the DeepSpeed training engine. Please see the [deepspeed.initialize API documentation](https://deepspeed.readthedocs.io/en/latest/initialize.html#training-initialization) for more details. This needs to be done after the model has been created and before the training loop has started. Most of our edits will be inside the `train` function inside [train_bert.py](./train_bert.py).

After the model is created and before the optimizer is created we want to add the following lines:

```python
ds_config = {
    "train_micro_batch_size_per_gpu": batch_size,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-4
        }
    },
}
model, _, _, _ = deepspeed.initialize(model=model,
                                      model_parameters=model.parameters(),
                                      config=ds_config)
```

This will create the DeepSpeed training engine based on the previously instantiated model and the new `ds_config` dictionary. We can now also remove the previous lines of code that created an Adam optimizer; this will now be done by the DeepSpeed engine. Note that you can optionally create your own optimizer and pass it into `deepspeed.initialize`; however, DeepSpeed can make further performance optimizations by instantiating its own optimizers.
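
For completeness, here is a sketch of what passing a client-side optimizer could look like (illustrative only; the rest of this tutorial lets DeepSpeed build the optimizer, and in this variant the `"optimizer"` section would typically be removed from `ds_config`):

```python
import torch
import deepspeed

# `model` and `ds_config` come from the surrounding tutorial code; here we
# build a client-side optimizer instead of letting DeepSpeed construct one.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# DeepSpeed then wraps this optimizer rather than instantiating its own
# (so its fused/CPU Adam optimizations are not applied).
model, optimizer, _, _ = deepspeed.initialize(model=model,
                                              optimizer=optimizer,
                                              config=ds_config)
```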

### Update the training-loop

Next we want to update our training-loop to use the new model engine with the following changes (a minimal sketch of the updated loop follows this list):

* `optimizer.zero_grad()` can be removed
  * The DeepSpeed engine will do this for you at the right time.
* Replace `loss.backward()` with `model.backward(loss)`
  * The engine will properly scale the loss when certain features (e.g., fp16, gradient accumulation) are in use.
* Replace `optimizer.step()` with `model.step()`
  * The optimizer step is now handled by the engine, which dispatches to the right optimizer depending on the enabled features.
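
Putting these together, a minimal sketch of how the loop body changes (the iteration variables and the exact forward call are placeholders for this illustration, not the tutorial's actual code):

```python
for step, batch in enumerate(data_iterator, start=start_step):
    # Forward pass: the engine wraps the original module and is called the
    # same way the model was called before.
    loss = model(**batch)

    # Replaces loss.backward(); the engine scales the loss as needed when
    # features like fp16 or gradient accumulation are enabled.
    model.backward(loss)

    # Replaces optimizer.zero_grad() + optimizer.step(); the engine dispatches
    # to whichever optimizer it constructed.
    model.step()
```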

### Update checkpoint save and load

Immediately after our new `deepspeed.initialize` call you will see a checkpoint load, and in the training-loop you will see a few checkpoint save calls. DeepSpeed handles the complexities of checkpointing for you, so we can simplify these code paths in the following way. Please refer to the [model checkpoint API documentation](https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html) for more details.

__Checkpoint saving__: DeepSpeed will construct and save the `state_dict` for you, so we can replace the *two* checkpoint-saving snippets (i.e., the `state_dict` construction and the `torch.save` call) with the snippet below. The `client_state` being passed in here is an example of state outside the view of DeepSpeed that will be saved with the checkpoint.

```python
model.save_checkpoint(save_dir=exp_dir, client_state={'checkpoint_step': step})
```

__Checkpoint loading__: Checkpoint loading happens right before the training-loop starts. It invokes the `load_model_checkpoint` function, which consists of around 30 lines of code. We can replace the `load_model_checkpoint(load_checkpoint_dir, model, optimizer)` call with the following snippet:

```python
_, client_state = model.load_checkpoint(load_dir=load_checkpoint_dir)
checkpoint_step = client_state['checkpoint_step']
```
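
If the checkpoint directory can be empty (e.g., on a first run), you may want to guard this call. The sketch below *assumes* `load_checkpoint` returns `None` for the load path when no checkpoint is found; please verify against the model-checkpointing API documentation:

```python
load_path, client_state = model.load_checkpoint(load_dir=load_checkpoint_dir)
# Assumption: load_path is None when no checkpoint exists in load_checkpoint_dir.
checkpoint_step = client_state['checkpoint_step'] if load_path is not None else 0
```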

## 2.1 Launching Training

We are now ready to launch our training! As a convenience, DeepSpeed provides its own launcher that is seamlessly compatible with internal clusters at MSFT (e.g., ITP). You can now try running your model on your available GPU(s) with the command below. By default this will attempt to run data-parallel training across all available GPUs on the current machine, plus any external machines listed in your `/job/hostfile`. Please read [more details about the DeepSpeed launcher](https://www.deepspeed.ai/getting-started/#launching-deepspeed-training) on our website.

```bash
deepspeed train_bert.py --checkpoint_dir .
```
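
If you want to restrict which devices are used, the launcher also accepts resource flags. The values below are illustrative; see the launcher documentation for the full set of options:

```bash
# Run on a single GPU of the local machine.
deepspeed --num_gpus=1 train_bert.py --checkpoint_dir .

# Or pin training to specific devices, e.g. GPUs 0 and 1 on this host.
deepspeed --include localhost:0,1 train_bert.py --checkpoint_dir .
```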

---
📌 **Note:** If using the DeepSpeed launcher you should not pass `--local_rank` explicitly; the launcher will set it for you, in the same way as if you had launched with `torch.distributed.launch` from PyTorch.

---

---
## 2.2 Mixed Precision Training (fp16)

Now that we are set up to use the DeepSpeed engine with our model, we can start trying out a few different features of DeepSpeed. One such feature is mixed precision training, which utilizes half precision (floating-point 16, or fp16) data types. If you want to learn more about how mixed precision training works, please refer to the Mixed Precision Training paper [[3]](https://arxiv.org/pdf/1710.03740v3.pdf) from Baidu and NVIDIA.

To enable this mode in DeepSpeed we need to update our `ds_config` before the engine is created. Please see [fp16 training options](https://www.deepspeed.ai/docs/config-json/#fp16-training-options) in the config documentation for more information. In our case let's simply enable it by adding the following to our `ds_config` dictionary:

```python
"fp16": {
    "enabled": True
}
```

Switching from fp32 to fp16 halves the GPU memory used by the *model parameters*; however, the overall GPU memory reduction is not as simple. Since fp16 has half the bits of fp32, it cannot represent the same range and precision, which can result in numeric instabilities during training. In most cases we can get around these instabilities by keeping some states in fp16 while others remain in fp32 (see Section 3 in [[3]](https://arxiv.org/pdf/1710.03740v3.pdf) if you'd like to learn more).

The primary reason to utilize fp16 training is *Tensor Cores*. If you are training with NVIDIA V100 or A100 GPUs, they include Tensor Cores, which in some cases can accelerate computation by as much as 8x if certain conditions are met. One of the most important conditions is that your model parameters are stored as fp16. For more details on the other conditions and tips for better utilizing these cores, please see NVIDIA's guide on [Tips for Optimizing GPU Performance Using Tensor Cores](https://developer.nvidia.com/blog/optimizing-gpu-performance-tensor-cores/).

---
📌 **Note:** At the start of training you will probably see several log messages about loss scaling and overflows; this is normal. In order for fp16 training to be numerically stable, we utilize a common technique called "loss scaling" (similar to Section 3.2 in [[3]](https://arxiv.org/pdf/1710.03740v3.pdf)). This attempts to find a scaling value that mitigates gradient over/under-flows during training.

---
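
If you want more control over this behavior, the `fp16` section of the config exposes loss-scaling knobs. The values below are illustrative; see the fp16 config documentation for the authoritative option names and defaults:

```python
"fp16": {
    "enabled": True,
    "loss_scale": 0,            # 0 selects dynamic loss scaling
    "initial_scale_power": 16,  # initial dynamic scale = 2**16
    "loss_scale_window": 1000,  # stable steps before raising the scale
    "hysteresis": 2,            # delay (in overflow steps) before lowering the scale
    "min_loss_scale": 1
}
```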

---
## 2.3 Zero Redundancy Optimizer (ZeRO)

ZeRO leverages the aggregate computation and memory resources of data parallelism to reduce the memory and compute requirements of each device (GPU) used for model training. ZeRO reduces the memory consumption of each GPU by partitioning the various model training states (weights, gradients, and optimizer states) across the available devices (GPUs and CPUs) in the distributed training hardware. Concretely, ZeRO is being implemented as incremental stages of optimizations, where optimizations in earlier stages are available in the later stages. To deep dive into ZeRO, please see our three papers [[4](https://arxiv.org/pdf/1910.02054.pdf), [5](https://www.usenix.org/system/files/atc21-ren-jie.pdf), [6](https://arxiv.org/abs/2104.07857)] that explore different optimizations in this space. We will focus on two features of ZeRO here, ZeRO Stage 1 and ZeRO-Offload. For further information, please refer to our [tutorial deep diving ZeRO](https://www.deepspeed.ai/tutorials/zero/) and our [tutorial deep diving ZeRO Offload](https://www.deepspeed.ai/tutorials/zero-offload/) on our website.

* ZeRO Stage 1: The optimizer states (e.g., for the Adam optimizer, the 32-bit weights and the first and second moment estimates) are partitioned across the processes, so that each process updates only its partition (a rough memory estimate follows this list).
* ZeRO-Offload: Supports efficiently offloading optimizer memory and computation from the GPU to the host CPU. ZeRO-Offload enables large models with up to 13 billion parameters to be trained on a single GPU.
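
As a rough back-of-the-envelope illustration of why Stage 1 helps, following the accounting for mixed-precision Adam in [[4]](https://arxiv.org/pdf/1910.02054.pdf) (a sketch; actual savings depend on the model, optimizer, and configuration):

```python
def approx_bytes_per_param(num_gpus: int, zero_stage_1: bool) -> float:
    """Approximate per-GPU training-state memory, in bytes per model parameter."""
    fp16_params, fp16_grads = 2, 2       # held on every GPU
    optimizer_states = 4 + 4 + 4         # fp32 master weights + Adam first/second moments
    if zero_stage_1:
        optimizer_states /= num_gpus     # Stage 1 partitions optimizer states across GPUs
    return fp16_params + fp16_grads + optimizer_states

print(approx_bytes_per_param(8, zero_stage_1=False))  # 16.0 bytes/param on every GPU
print(approx_bytes_per_param(8, zero_stage_1=True))   # 5.5 bytes/param with Stage 1 on 8 GPUs
```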

To enable ZeRO Stage 1 in DeepSpeed we need to again update our `ds_config` before the engine is created. Please see [ZeRO optimizations](https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training) in the DeepSpeed config documentation for more information. In our case let's simply enable stage 1 by adding the following to our `ds_config` dictionary:

```python
"zero_optimization": {
    "stage": 1
}
```

We can now re-run our training with ZeRO stage 1 enabled and should see a per-GPU memory reduction as we scale up the total number of GPUs. Typically you can use this extra GPU memory to either scale up your model size or scale up your per-GPU training batch size. However, if we only have 1 GPU available, we probably want to enable ZeRO-Offload to allow us to train larger models. Please update your `ds_config` to include the following:

```python
"zero_optimization": {
    "stage": 1,
    "offload_optimizer": {
        "device": "cpu"
    }
}
```
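
For reference, with everything from this tutorial enabled, the complete `ds_config` would now look roughly like the following (a sketch that simply merges the snippets above; adjust it to whatever you actually enabled):

```python
ds_config = {
    "train_micro_batch_size_per_gpu": batch_size,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-4
        }
    },
    "fp16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "cpu"
        }
    },
}
```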

This config now allows us to train a much larger model than we previously could. For example, on a single P40 GPU with 24GB of memory we are unable to train a 2 billion parameter model (i.e., `--num_layers 24 --h_dim 4096`); however, with ZeRO-Offload we now can!

```bash
deepspeed train_bert.py --checkpoint_dir . --num_layers 24 --h_dim 4096
```

---
📌 **Note:** Earlier on, when we set up `deepspeed.initialize`, we chose not to explicitly pass an optimizer and instead let the DeepSpeed engine instantiate one for us. This is especially useful now that we are using ZeRO-Offload. DeepSpeed includes a highly optimized version of Adam that executes purely on CPU. This means that DeepSpeed will detect if you are using ZeRO-Offload with Adam and switch to the optimized CPUAdam variant.

---

## References
> <a id="1">[1]</a>
[A. Vaswani et al. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17)](https://arxiv.org/pdf/1706.03762.pdf)

> <a id="2">[2]</a>
[J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT'19)](https://aclanthology.org/N19-1423.pdf)

> <a id="3">[3]</a>
[P. Micikevicius et al. Mixed Precision Training (ICLR'18)](https://arxiv.org/pdf/1710.03740v3.pdf)

> <a id="4">[4]></a>
[S. Rajbhandari, J. Rasley, O. Ruwase, Y. He. ZeRO: memory optimizations toward training trillion parameter models. (SC‘20)](https://arxiv.org/pdf/1910.02054.pdf)

> <a id="5">[5]</a>
[J. Ren, S. Rajbhandari, R. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, Y. He. ZeRO-Offload: Democratizing Billion-Scale Model Training. (ATC'21)](https://www.usenix.org/system/files/atc21-ren-jie.pdf)

> <a id="1">[6]</a>
[S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, Y. He. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning (SC'21)](https://arxiv.org/abs/2104.07857)
1 change: 1 addition & 0 deletions Intro101/train_bert.py

The single addition is a log line inside `train`, reporting the total number of model parameters right after the model is created:

```python
    )
    model = model.to(device)
    logger.info("Model Creation Done")
    logger.info(f"Total number of model parameters: {sum([p.numel() for p in model.parameters()]):,d}")
    ################################
    ###### Create Optimizer #######
    ################################
```