1 change: 1 addition & 0 deletions docs/sidebars.json
Original file line number Diff line number Diff line change
@@ -43,6 +43,7 @@
"label": "Features",
"collapsed": true,
"items": [
"features/mixed_precision_training_with_booster",
"features/mixed_precision_training",
"features/gradient_accumulation_with_booster",
"features/gradient_accumulation",
20 changes: 8 additions & 12 deletions docs/source/en/basics/define_your_config.md
@@ -2,9 +2,6 @@

Author: Guangyang Lu, Shenggui Li, Siqi Mai

> ⚠️ The information on this page is outdated and will be deprecated. Please check [Booster API](../basics/booster_api.md) for more information.


**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
@@ -24,8 +21,7 @@ In this tutorial, we will cover how to define your configuration file.
## Configuration Definition

In a configuration file, there are two types of variables. One serves as feature specification and the other serves
as hyper-parameters. All feature-related variables are reserved keywords. For example, if you want to use 1D tensor parallelism, you need to use the variable name `parallel` in the config file and follow a pre-defined format.

### Feature Specification

@@ -37,14 +33,13 @@ To illustrate the use of config file, we use mixed precision training as an example here. You need to
follow the steps below.

1. create a configuration file (e.g. `config.py`; the file name can be anything)
2. define the hybrid parallelism configuration in the config file. For example, in order to use 1D tensor parallelism, you can just write the lines below into your config file.

```python
parallel = dict(
    data=1,
    pipeline=1,
    tensor=dict(size=2, mode='1d'),
)
```

@@ -57,7 +52,7 @@ the current directory.
colossalai.launch(config='./config.py', ...)
```

In this way, Colossal-AI knows what features you want to use and will inject this feature.

### Global Hyper-parameters

@@ -83,3 +78,4 @@ colossalai.launch(config='./config.py', ...)
print(gpc.config.BATCH_SIZE)

```
<!-- doc-test-command: echo "define_your_config.md does not need test" -->
3 changes: 2 additions & 1 deletion docs/source/en/features/mixed_precision_training.md
@@ -1,4 +1,4 @@
# Auto Mixed Precision Training (Outdated)

Author: Chuanrui Wang, Shenggui Li, Yongbin Li

@@ -365,3 +365,4 @@ Use the following command to start the training scripts. You can change `--nproc_per_node` to use a different number of GPUs.
```python
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py --config config/config_AMP_torch.py
```
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training.py -->
251 changes: 251 additions & 0 deletions docs/source/en/features/mixed_precision_training_with_booster.md
@@ -0,0 +1,251 @@
# Auto Mixed Precision Training (Latest)

Author: [Mingyan Jiang](https://github.com/jiangmingyan)

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Training Booster](../basics/booster_api.md)

**Related Paper**
- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)


## Introduction

AMP stands for automatic mixed precision training.
In Colossal-AI, we have incorporated different implementations of mixed precision training:

1. torch.cuda.amp
2. apex.amp
3. naive amp


| Colossal-AI | Tensor Parallel Support | Pipeline Parallel Support | fp16 Casting Scope |
| ----------- | ----------------------- | ------------------------- | ------------------ |
| AMP_TYPE.TORCH | ✅ | ❌ | Model parameters, activations and gradients are downcast to fp16 during forward and backward propagation |
| AMP_TYPE.APEX | ❌ | ❌ | More fine-grained; you can choose from opt levels O0, O1, O2, O3 |
| AMP_TYPE.NAIVE | ✅ | ✅ | Model parameters, forward and backward operations are all downcast to fp16 |

The first two rely on the original implementations in PyTorch (version 1.6 and above) and NVIDIA Apex.
The last method is similar to Apex O2 level.
Among these methods, Apex AMP is not compatible with tensor parallelism.
This is because tensors are split across devices in tensor parallelism, so communication among different processes is required to check whether inf or nan occurs anywhere in the model weights.
We modified the torch AMP implementation so that it is now compatible with tensor parallelism.

> ❌️ fp16 and ZeRO are not compatible
>
> ⚠️ Pipeline parallelism only supports naive AMP currently

We recommend using torch AMP, as it generally gives better accuracy than naive AMP when pipeline parallelism is not used.
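The cross-process overflow check mentioned above can be sketched in plain Python: each rank inspects its own weight shard for inf/nan, and the per-rank flags are combined with a logical OR, which is what an `all_reduce` with a MAX op over the flags achieves in a real distributed setting. This is only an illustration of the idea, not Colossal-AI's actual implementation.

```python
import math

def shard_has_overflow(shard):
    """Local check: does this rank's weight shard contain inf or nan?"""
    return any(math.isinf(v) or math.isnan(v) for v in shard)

def global_overflow(shards):
    # Stand-in for an all_reduce(MAX) over per-rank overflow flags:
    # if any rank saw a non-finite value, every rank must skip the step.
    return any(shard_has_overflow(s) for s in shards)

# Two "ranks", each holding a slice of the tensor-parallel weights.
shards = [[0.1, -2.5], [float('inf'), 3.0]]
print(global_overflow(shards))  # True: rank 1's shard overflowed
```

This is why apex AMP, which only checks locally, cannot be used with tensor parallelism out of the box.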

## Table of Contents

In this tutorial we will cover:

1. [AMP introduction](#amp-introduction)
2. [AMP in Colossal-AI](#amp-in-colossal-ai)
3. [Hands-on Practice](#hands-on-practice)

## AMP Introduction

Automatic Mixed Precision training is a mixture of FP16 and FP32 training.

The half-precision floating point format (FP16) has lower arithmetic complexity and higher compute efficiency. Besides, fp16 requires half of the storage needed by fp32 and saves memory & network bandwidth, which makes more memory available for larger batch sizes and model sizes.

However, there are other operations, like reductions, which require the dynamic range of fp32 to avoid numeric overflow/underflow. That is why we introduce automatic mixed precision: it attempts to match each operation to its appropriate data type, reducing the memory footprint and improving training efficiency.
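The underflow problem, and the loss-scaling remedy that AMP implementations apply, can be illustrated with plain NumPy (a toy illustration, not Colossal-AI code):

```python
import numpy as np

# fp16 has a much narrower dynamic range than fp32: tiny gradient
# values that fp32 represents fine simply underflow to zero in fp16.
small_grad = 1e-8
print(np.float16(small_grad))          # underflows to 0.0

# Loss scaling works around this: multiply the loss (and hence the
# gradients) by a large factor before the backward pass, then divide
# it back out before the optimizer step.
scale = 2.0 ** 16
scaled = np.float16(small_grad * scale)
print(scaled)                          # now representable in fp16
print(float(scaled) / scale)           # close to the original value
```

Keeping a fp32 master copy of the weights for the optimizer step, as AMP implementations do, addresses the matching precision problem on the accumulation side.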

<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/URzLJ3MPeDQbtck.png"/>
<figcaption>Illustration of an ordinary AMP (figure from <a href="https://arxiv.org/abs/2108.05818">PatrickStar paper</a>)</figcaption>
</figure>

## AMP in Colossal-AI

We support three AMP training methods and allow users to train with AMP with no code changes. If you want to train with AMP, just pass `mixed_precision='fp16'` when you instantiate the `Booster`. Currently the booster supports torch AMP; the other two (apex AMP, naive AMP) are still started via `colossalai.initialize`. If you need them, please refer to [this page](./mixed_precision_training.md). Support for `bf16` and `fp8` will be added next.

### Start with Booster
Instantiate `Booster` with `mixed_precision="fp16"`, then you can train with torch AMP.
<!--- doc-test-ignore-start -->
```python
"""
Mapping:
'fp16': torch amp
'fp16_apex': apex amp,
'bf16': bf16,
'fp8': fp8,
'fp16_naive': naive amp
"""
from colossalai.booster import Booster
booster = Booster(mixed_precision='fp16',...)
```
<!--- doc-test-ignore-end -->
Or you can create a `FP16TorchMixedPrecision` object, for example:
<!--- doc-test-ignore-start -->
```python
from colossalai.mixed_precision import FP16TorchMixedPrecision
mixed_precision = FP16TorchMixedPrecision(
init_scale=2.**16,
growth_factor=2.0,
backoff_factor=0.5,
growth_interval=2000)
booster = Booster(mixed_precision=mixed_precision,...)
```
<!--- doc-test-ignore-end -->
The same goes for the other AMP types.
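The four arguments shown above follow PyTorch's `GradScaler` semantics. A minimal sketch of the dynamic loss-scaling update rule they control (illustrative only, not the actual library code):

```python
class ToyLossScaler:
    """Illustrative dynamic loss scaler following GradScaler's update rule."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow):
        if found_overflow:
            # Overflow: shrink the scale and restart the growth counter.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                # A long run of finite gradients: try a larger scale.
                self.scale *= self.growth_factor
                self._good_steps = 0

scaler = ToyLossScaler(growth_interval=3)
for overflow in [False, False, False, True]:
    scaler.update(overflow)
print(scaler.scale)  # grew once (x2), then backed off (x0.5): back to 2**16
```

A larger `growth_interval` makes the scale more conservative; a `backoff_factor` closer to 1 recovers more slowly from overflows.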


### Torch AMP Configuration

{{ autodoc:colossalai.booster.mixed_precision.FP16TorchMixedPrecision }}

### Apex AMP Configuration

For this mode, we rely on the Apex implementation for mixed precision training.
We support this plugin because it allows for finer control over the granularity of mixed precision.
For example, O2 level (optimization level 2) will keep batch normalization in fp32.

If you are looking for more details, please refer to the [Apex Documentation](https://nvidia.github.io/apex/).

{{ autodoc:colossalai.booster.mixed_precision.FP16ApexMixedPrecision }}

### Naive AMP Configuration

In Naive AMP mode, we achieve mixed precision training while maintaining compatibility with complex tensor and pipeline parallelism.
This AMP mode will cast all operations into fp16.
The following code block shows the mixed precision API for this mode.

{{ autodoc:colossalai.booster.mixed_precision.FP16NaiveMixedPrecision }}

When using `colossalai.booster`, you are required to first instantiate a model, an optimizer and a criterion.
The output model is converted to an AMP model with smaller memory consumption.
If your input model is already too large to fit in a GPU, please instantiate your model weights in `dtype=torch.float16`.
Otherwise, try smaller models or check out more parallel training techniques!
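For instance, casting a module's weights to fp16 at instantiation time halves the parameter memory. A sketch with a toy `torch.nn` layer (for a real model you would build it this way before passing it to the booster):

```python
import torch

# A toy layer standing in for a large model: .half() casts the
# weights to fp16 before they ever occupy fp32-sized GPU memory.
layer = torch.nn.Linear(1024, 1024).half()

fp16_bytes = layer.weight.element_size()  # 2 bytes per fp16 parameter
print(layer.weight.dtype, fp16_bytes)
```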


## Hands-on Practice

Now we will introduce the use of AMP with Colossal-AI. In this practice, we will use Torch AMP as an example.

### Step 1. Import libraries in train.py

Create a `train.py` and import the necessary dependencies. Remember to install `scipy` and `timm` by running
`pip install timm scipy`.

```python
import os
from pathlib import Path

import torch
from timm.models import vit_base_patch16_224
from titans.utils import barrier_context
from torchvision import datasets, transforms

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin
from colossalai.logging import get_dist_logger
from colossalai.nn.lr_scheduler import LinearWarmupLR
```

### Step 2. Initialize Distributed Environment

We then need to initialize the distributed environment. For demonstration purposes, we use `launch_from_torch`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md)
for other initialization methods.

```python
# initialize distributed setting
parser = colossalai.get_default_parser()
args = parser.parse_args()

# launch from torch
colossalai.launch_from_torch(config=dict())

```

### Step 3. Create training components

Build your model, optimizer, loss function, lr scheduler and dataloaders. Note that the root path of the dataset is
obtained from the environment variable `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
to a path on your machine. Data will be automatically downloaded to the root path.

```python
# define the constants
NUM_EPOCHS = 2
BATCH_SIZE = 128

# build model
model = vit_base_patch16_224(drop_rate=0.1)

# build dataloader
train_dataset = datasets.Caltech101(
root=Path(os.environ['DATA']),
download=True,
transform=transforms.Compose([
transforms.Resize(256),
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
Gray2RGB(),
transforms.Normalize([0.5, 0.5, 0.5],
[0.5, 0.5, 0.5])
]))

# build optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=0.1)

# build loss
criterion = torch.nn.CrossEntropyLoss()

# lr_scheduler
lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=NUM_EPOCHS)
```

### Step 4. Inject AMP Feature

Create a `MixedPrecision` object (if needed) and a `TorchDDPPlugin` object, then call `booster.boost` to convert the training components to run with FP16.

```python
plugin = TorchDDPPlugin()
train_dataloader = plugin.prepare_dataloader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
booster = Booster(mixed_precision='fp16', plugin=plugin)

# if you need to customize the config, do like this
# >>> from colossalai.mixed_precision import FP16TorchMixedPrecision
# >>> mixed_precision = FP16TorchMixedPrecision(
# >>> init_scale=2.**16,
# >>> growth_factor=2.0,
# >>> backoff_factor=0.5,
# >>> growth_interval=2000)
# >>> plugin = TorchDDPPlugin()
# >>> booster = Booster(mixed_precision=mixed_precision, plugin=plugin)

# boost model, optimizer, criterion, dataloader, lr_scheduler
model, optimizer, criterion, train_dataloader, lr_scheduler = booster.boost(model, optimizer, criterion, train_dataloader, lr_scheduler)
```

### Step 5. Train with Booster

Use the booster in a normal training loop.

```python
model.train()
for epoch in range(NUM_EPOCHS):
for img, label in train_dataloader:
img = img.cuda()
label = label.cuda()
optimizer.zero_grad()
output = model(img)
loss = criterion(output, label)
booster.backward(loss, optimizer)
optimizer.step()
lr_scheduler.step()
```

### Step 6. Invoke Training Scripts

Use the following command to start the training scripts. You can change `--nproc_per_node` to use a different number of GPUs.

```shell
colossalai run --nproc_per_node 1 train.py --config config/config.py
```
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training_with_booster.py -->
19 changes: 9 additions & 10 deletions docs/source/zh-Hans/basics/define_your_config.md
@@ -2,8 +2,6 @@

Author: Guangyang Lu, Shenggui Li, Siqi Mai

> ⚠️ The information on this page is outdated and will be deprecated. Please check the [Booster API](../basics/booster_api.md) page for updates.

**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
@@ -20,7 +18,7 @@

## Configuration Definition

In a configuration file, there are two types of variables. One serves as feature specification and the other serves as hyper-parameters. All feature-related variables are reserved keywords. For example, if you want to use `1D` tensor parallelism, you need to use the variable name `parallel` in the config file and follow a pre-defined format.

### Feature Specification

@@ -29,13 +27,13 @@ Colossal-AI provides a series of features to speed up training. Each feature is
To illustrate the use of a config file, we use mixed precision training as an example here. You need to follow the steps below.

1. Create a configuration file (e.g. `config.py`; you can give it any file name).
2. Define the hybrid parallelism configuration in the config file. For example, to use `1D` tensor parallelism, just write the lines below into your config file.

```python
parallel = dict(
    data=1,
    pipeline=1,
    tensor=dict(size=2, mode='1d'),
)
```

@@ -47,7 +45,7 @@
colossalai.launch(config='./config.py', ...)
```

In this way, Colossal-AI knows what features you want to use and will inject the features you need.

### Global Hyper-parameters

@@ -71,3 +69,4 @@ colossalai.launch(config='./config.py', ...)
print(gpc.config.BATCH_SIZE)

```
<!-- doc-test-command: echo "define_your_config.md does not need test" -->
3 changes: 2 additions & 1 deletion docs/source/zh-Hans/features/mixed_precision_training.md
@@ -1,4 +1,4 @@
# Auto Mixed Precision Training (Outdated)

Author: Chuanrui Wang, Shenggui Li, Yongbin Li

@@ -342,3 +342,4 @@ for epoch in range(gpc.config.NUM_EPOCHS):
```python
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py --config config/config_AMP_torch.py
```
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training.py -->