From a81c115d652c314623e949f8abe456d48930ee37 Mon Sep 17 00:00:00 2001 From: Stas Bekman Date: Sat, 20 Feb 2021 14:05:02 -0800 Subject: [PATCH] make bold+italic work without escaping _ --- docs/_pages/config-json.md | 116 ++++++++++++++++----------------- docs/_tutorials/onebit-adam.md | 2 +- 2 files changed, 59 insertions(+), 59 deletions(-) diff --git a/docs/_pages/config-json.md b/docs/_pages/config-json.md index 8d99627c03cd..198e868568ae 100755 --- a/docs/_pages/config-json.md +++ b/docs/_pages/config-json.md @@ -4,40 +4,40 @@ title: "DeepSpeed Configuration JSON" ### Batch Size Related Parameters -**Note:** configuring ***train\_batch\_size*** is required. +**Note:** configuring **train_batch_size** is required. {: .notice--warning} -***train\_batch\_size***: [integer] +**train_batch_size**: [integer] | Value | Example | | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | -| The effective training batch size. This is the amount of data samples that leads to one step of model update. ***train\_batch\_size*** is aggregated by the batch size that a single GPU processes in one forward/backward pass (a.k.a., ***train\_step\_batch\_size***), the gradient accumulation steps (a.k.a., ***gradient\_accumulation\_steps***), and the number of GPUs. | `32` | +| The effective training batch size. This is the amount of data samples that leads to one step of model update. **train_batch_size** is aggregated by the batch size that a single GPU processes in one forward/backward pass (a.k.a., **train_step_batch_size**), the gradient accumulation steps (a.k.a., **gradient_accumulation_steps**), and the number of GPUs. 
| `32` | -***train\_micro\_batch\_size\_per\_gpu***: [integer] +**train_micro_batch_size_per_gpu**: [integer] | Description | Default | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------ | -| Batch size to be processed by one GPU in one step (without gradient accumulation). When specified, ***gradient\_accumulation\_steps*** is automatically calculated using ***train\_batch\_size*** and number of GPUs. Should not be concurrently specified with ***gradient\_accumulation\_steps*** in the configuration JSON. | ***train\_batch\_size*** value | +| Batch size to be processed by one GPU in one step (without gradient accumulation). When specified, **gradient_accumulation_steps** is automatically calculated using **train_batch_size** and number of GPUs. Should not be concurrently specified with **gradient_accumulation_steps** in the configuration JSON. | **train_batch_size** value | -***gradient\_accumulation\_steps***: [integer] +**gradient_accumulation_steps**: [integer] | Description | Default | | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | -| Number of training steps to accumulate gradients before averaging and applying them. 
This feature is sometimes useful to improve scalability since it results in less frequent communication of gradients between steps. Another impact of this feature is the ability to train with larger batch sizes per GPU. When specified, ***train\_step\_batch\_size*** is automatically calculated using ***train\_batch\_size*** and number of GPUs. Should not be concurrently specified with ***train\_step\_batch\_size*** in the configuration JSON. | `1` | +| Number of training steps to accumulate gradients before averaging and applying them. This feature is sometimes useful to improve scalability since it results in less frequent communication of gradients between steps. Another impact of this feature is the ability to train with larger batch sizes per GPU. When specified, **train_step_batch_size** is automatically calculated using **train_batch_size** and number of GPUs. Should not be concurrently specified with **train_step_batch_size** in the configuration JSON. | `1` | ### Optimizer Parameters -***optimizer***: [dictionary] +**optimizer**: [dictionary] | Fields | Value | Example | | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------- | | type | The optimizer name. DeepSpeed natively supports **Adam**, **AdamW**, **OneBitAdam**, and **Lamb** optimizers and will import other optimizers from [torch](https://pytorch.org/docs/stable/optim.html). | `"Adam"` | | params | Dictionary of parameters to instantiate optimizer. The parameter names must match the optimizer constructor signature (e.g., for [Adam](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam)). 
| `{"lr": 0.001, "eps": 1e-8}` | - Example of ***optimizer*** with Adam + Example of **optimizer** with Adam ```json "optimizer": { @@ -60,7 +60,7 @@ The Adam optimizer also supports the following two params keys/values in additio | torch\_adam | Use torch's implementation of adam instead of our fused adam implementation | false | | adam\_w\_mode | Apply L2 regularization (also known as AdamW) | true | - Another example of ***optimizer*** with 1-bit Adam specific parameters is as follows. + Another example of **optimizer** with 1-bit Adam specific parameters is as follows. ```json "optimizer": { @@ -81,14 +81,14 @@ The Adam optimizer also supports the following two params keys/values in additio ### Scheduler Parameters -***scheduler***: [dictionary] +**scheduler**: [dictionary] | Fields | Value | Example | | ------ | ---------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- | | type | The scheduler name. See [here](https://deepspeed.readthedocs.io/en/latest/deepspeed.pt.html) for list of support schedulers. | `"WarmupLR"` | | params | Dictionary of parameters to instantiate scheduler. The parameter names should match scheduler constructor signature. 
| `{"warmup_min_lr": 0, "warmup_max_lr": 0.001}` | -Example of ***scheduler*** +Example of **scheduler** ```json "scheduler": { @@ -103,25 +103,25 @@ Example of ***scheduler*** ### Communication options -***fp32\_allreduce***: [boolean] +**fp32_allreduce**: [boolean] | Description | Default | | -------------------------------------------------------------- | ------- | | During gradient averaging perform allreduce with 32 bit values | `false` | -***prescale\_gradients***: [boolean] +**prescale_gradients**: [boolean] | Description | Default | | -------------------------------------- | ------- | | Scale gradients before doing allreduce | `false` | -***gradient_predivide_factor***: [float] +**gradient_predivide_factor**: [float] | Description | Default | | ------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | | Before gradient averaging predivide gradients by a specified factor, can sometimes help with fp16 stability when scaling to large numbers of GPUs | `1.0` | -***sparse\_gradients***: [boolean] +**sparse_gradients**: [boolean] | Description | Default | | ------------------------------------------------------------------------------------------------------------------------ | ------- | @@ -132,7 +132,7 @@ Example of ***scheduler*** **Note:** this mode cannot be combined with the `amp` mode described below. 
{: .notice--warning} -***fp16***: [dictionary] +**fp16**: [dictionary] | Description | Default | | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | @@ -149,48 +149,48 @@ Example of ***scheduler*** } ``` -***fp16:enabled***: [boolean] +**fp16:enabled**: [boolean] | Description | Default | | -------------------------------------------------------------------------------------- | ------- | -| ***enabled*** is a **fp16** parameter indicating whether or not FP16 training enabled. | `false` | +| **enabled** is a **fp16** parameter indicating whether or not FP16 training enabled. | `false` | -***fp16:loss\_scale***: [float] +**fp16:loss_scale**: [float] | Description | Default | | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | -| ***loss\_scale*** is a ***fp16*** parameter representing the loss scaling value for FP16 training. The default value of 0.0 results in dynamic loss scaling, otherwise the value will be used for static fixed loss scaling. | `0.0` | +| **loss_scale** is a **fp16** parameter representing the loss scaling value for FP16 training. The default value of 0.0 results in dynamic loss scaling, otherwise the value will be used for static fixed loss scaling. 
| `0.0` | -***fp16:initial\_scale\_power***: [integer] +**fp16:initial_scale_power**: [integer] | Description | Default | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | -| ***initial\_scale\_power*** is a **fp16** parameter representing the power of the initial dynamic loss scale value. The actual loss scale is computed as 2***initial\_scale\_power***. | `32` | +| **initial_scale_power** is a **fp16** parameter representing the power of the initial dynamic loss scale value. The actual loss scale is computed as 2**initial_scale_power**. | `32` | -***fp16:loss\_scale\_window***: [integer] +**fp16:loss_scale_window**: [integer] | Description | Default | | --------------------------------------------------------------------------------------------------------------------------------- | ------- | -| ***loss\_scale\_window*** is a **fp16** parameter representing the window over which to raise/lower the dynamic loss scale value. | `1000` | +| **loss_scale_window** is a **fp16** parameter representing the window over which to raise/lower the dynamic loss scale value. | `1000` | -***fp16:hysteresis***: [integer] +**fp16:hysteresis**: [integer] | Description | Default | | ---------------------------------------------------------------------------------------------- | ------- | -| ***hysteresis*** is a **fp16** parameter representing the delay shift in dynamic loss scaling. | `2` | +| **hysteresis** is a **fp16** parameter representing the delay shift in dynamic loss scaling. | `2` | -***fp16:min\_loss\_scale***: [integer] +**fp16:min_loss_scale**: [integer] | Description | Default | | -------------------------------------------------------------------------------------------------- | ------- | -| ***min\_loss\_scale*** is a **fp16** parameter representing the minimum dynamic loss scale value. 
| `1000` | +| **min_loss_scale** is a **fp16** parameter representing the minimum dynamic loss scale value. | `1000` | ### Automatic mixed precision (AMP) training options **Note:** this mode cannot be combined with the `fp16` mode described above. In addition this mode is not currently compatible with ZeRO. {: .notice--warning} -***amp***: [dictionary] +**amp**: [dictionary] | Description | Default | | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | @@ -205,11 +205,11 @@ Example of ***scheduler*** } ``` -***amp:enabled***: [boolean] +**amp:enabled**: [boolean] | Description | Default | | ---------------------------------------------------------------------------------------- | ------- | -| ***enabled*** is an **amp** parameter indicating whether or not AMP training is enabled. | `false` | +| **enabled** is an **amp** parameter indicating whether or not AMP training is enabled. | `false` | ***amp params***: [various] @@ -219,7 +219,7 @@ Example of ***scheduler*** ### Gradient Clipping -***gradient\_clipping***: [float] +**gradient_clipping**: [float] | Description | Default | | ----------------------------------- | ------- | @@ -243,55 +243,55 @@ Enabling and configuring ZeRO memory optimizations } ``` -***zero\_optimization***: [dictionary] +**zero_optimization**: [dictionary] | Description | Default | | --------------------------------------------------------------------------------------------------------- | ------- | | Enable ZeRO memory optimization wrapper for FP16 Training. Currently compatible only with Adam optimizer. 
| `false` | -***stage***: [integer] +**stage**: [integer] | Description | Default | | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | | Chooses different stages of ZeRO Optimizer. Stage 0, 1, and 2 refer to disabled, optimizer state partitioning, and optimizer+gradient state partitiong, respectively. | `0` | -***allgather_partitions***: [boolean] +**allgather_partitions**: [boolean] | Description | Default | | ------------------------------------------------------------------------------------------------------------------------------------------------ | ------- | | Chooses between allgather collective or a series of broadcast collectives to gather updated parameters from all the GPUs at the end of each step | `true` | -***allgather_bucket_size***: [boolean] +**allgather_bucket_size**: [boolean] | Description | Default | | ------------------------------------------------------------------------------------------------------------ | ------- | | Number of elements allgathered at a time. 
Limits the memory required for the allgather for large model sizes | `5e8` | -***overlap_comm***: [boolean] +**overlap_comm**: [boolean] | Description | Default | | ---------------------------------------------------------------------------- | ------- | | Attempts to overlap the reduction of the gradients with backward computation | `false` | -***reduce_scatter***: [boolean] +**reduce_scatter**: [boolean] | Description | Default | | ----------------------------------------------------------------------- | ------- | | Uses reduce or reduce scatter instead of allreduce to average gradients | `true` | -***reduce_bucket_size***: [boolean] +**reduce_bucket_size**: [boolean] | Description | Default | | ------------------------------------------------------------------------------------------------------------------- | ------- | | Number of elements reduced/allreduced at a time. Limits the memory required for the allgather for large model sizes | `5e8` | -***contiguous_gradients***: [boolean] +**contiguous_gradients**: [boolean] | Description | Default | | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | | Copies the gradients to a contiguous buffer as they are produced. Avoids memory fragmentation during backward pass. Only useful when running very large models. 
| `False` | -***cpu_offload***: [boolean] +**cpu_offload**: [boolean] | Description | Default | | ------------------------------------------------------------------------------------------------------------------------ | ------- | @@ -300,19 +300,19 @@ Enabling and configuring ZeRO memory optimizations ### Logging -***steps\_per\_print***: [integer] +**steps_per_print**: [integer] | Description | Default | | ------------------------------ | ------- | | Print train loss every N steps | `10` | -***wall\_clock\_breakdown***: [boolean] +**wall_clock_breakdown**: [boolean] | Description | Default | | ----------------------------------------------------------------------- | ------- | | Enable timing of the latency of forward/backward/update training phases | `false` | -***dump_state***: [boolean] +**dump_state**: [boolean] | Description | Default | | -------------------------------------------------------------------- | ------- | @@ -330,31 +330,31 @@ Enabling and configuring ZeRO memory optimizations } } ``` -***enabled***: [boolean] +**enabled**: [boolean] | Description | Default | | --------------------------- | ------- | | Enables the flops profiler. | `false` | -***profile\_step***: [integer] +**profile_step**: [integer] | Description | Default | | --------------------------------------------------------------------------------------------------------------- | ------- | | The global training step at which to profile. Note that warm up steps are needed for accurate time measurement. | `1` | -***module\_depth***: [integer] +**module_depth**: [integer] | Description | Default | | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | | The depth of the model at which to print the aggregated module information. When set to `-1`, it prints information on the innermost modules (with the maximum depth). 
| `-1` | -***top\_modules***: [integer] +**top_modules**: [integer] | Description | Default | | ---------------------------------------------------------------------------- | ------- | | Limits the aggregated profile output to the number of top modules specified. | `3` | -***detailed***: [boolean] +**detailed**: [boolean] | Description | Default | | -------------------------------------------- | ------- | @@ -371,39 +371,39 @@ Enabling and configuring ZeRO memory optimizations "profile": false } ``` -***partition\_activations***: [boolean] +**partition_activations**: [boolean] | Description | Default | | ------------------------------------------------------------- | ------- | | Enables partition activation when used with model parallelism | `false` | -***cpu\_checkpointing***: [boolean] +**cpu_checkpointing**: [boolean] | Description | Default | | --------------------------------------------------------------------------- | ------- | | Offloads partitioned activations to CPU if partition_activations is enabled | `false` | -***contiguous\_memory\_optimization***: [boolean] +**contiguous_memory_optimization**: [boolean] | Description | Default | | -------------------------------------------------------------------- | ------- | | Copies partitioned activations so that they are contiguous in memory | `false` | -***number_checkpoints***: [integer] +**number_checkpoints**: [integer] | Description | Default | | -------------------------------------------------------------------------------------------------------- | ------- | | Total number of activation checkpoints used to allocate memory buffer for contiguous_memoty_optimization | `None` | -***synchronize\_checkpoint\_boundary***: [boolean] +**synchronize_checkpoint_boundary**: [boolean] | Description | Default | | ------------------------------------------------------------- | ------- | | Inserts torch.cuda.synchronize() at each checkpoint boundary. 
| `false` | -***profile***: [boolean] +**profile**: [boolean] | Description | Default | | --------------------------------------------------------------- | ------- | @@ -411,7 +411,7 @@ Enabling and configuring ZeRO memory optimizations ### Sparse Attention -***sparse\_attention***: [dictionary] +**sparse_attention**: [dictionary] | Fields | Value | Example | | -------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | @@ -429,7 +429,7 @@ Enabling and configuring ZeRO memory optimizations | global\_block\_end\_indices | A list of integers determining end indices of global window blocks. By default this is not used. But if it is set, it must have the same size of global\_block\_indices parameter, and combining this two parameters, for each index i, blocks from global\_block\_indices[i] to global\_block\_end\_indices[i], exclusive, are considered as global attention; used in `"variable"` and `"bslongformer"` modes. | None | | num\_sliding\_window\_blocks | An integer determining the number of blocks in sliding local attention window; used in `"bigbird"` and `"bslongformer"` modes. | 3 | - Example of ***sparse\_attention*** + Example of **sparse_attention** ```json "sparse_attention": { diff --git a/docs/_tutorials/onebit-adam.md b/docs/_tutorials/onebit-adam.md index c8eee07586aa..3f365c92bb6b 100644 --- a/docs/_tutorials/onebit-adam.md +++ b/docs/_tutorials/onebit-adam.md @@ -162,7 +162,7 @@ Table 1. 
Fine-tuning configuration
 
 ### 2.3 Performance Results for BingBertSQuAD Fine-tuning
 
-***Accuracy:***
+**Accuracy:**
 The results are summarized in the table below. The total batch size is set to 96 and training is conducted on 32 GPUs for 2 epochs. A set of parameters (seeds and learning rates) were tried and the best ones were selected. We fixed the learning rate to 3e-5. The table below shows the F1 and the EM scores we achieved that are on-par or better than the [HuggingFace results](https://github.com/huggingface/transformers/tree/master/examples/question-answering).