7 changes: 4 additions & 3 deletions docs/sidebars.json
@@ -26,14 +26,15 @@
"collapsed": true,
"items": [
"basics/command_line_tool",
"basics/define_your_config",
"basics/launch_colossalai",
"basics/booster_api",
"basics/booster_plugins",
"basics/define_your_config",
"basics/initialize_features",
"basics/engine_trainer",
"basics/configure_parallelization",
"basics/model_checkpoint",
"basics/colotensor_concept",
"basics/booster_api"
"basics/colotensor_concept"
]
},
{
64 changes: 64 additions & 0 deletions docs/source/en/basics/booster_plugins.md
@@ -0,0 +1,64 @@
# Booster Plugins

Author: [Hongxin Liu](https://github.com/ver217)

**Prerequisite:**
- [Booster API](./booster_api.md)

## Introduction

As mentioned in [Booster API](./booster_api.md), we can use booster plugins to customize parallel training. In this tutorial, we will introduce how to use them.

We currently provide the following plugins:

- [Low Level Zero Plugin](#low-level-zero-plugin): It wraps `colossalai.zero.low_level.LowLevelZeroOptimizer` and can be used to train models with ZeRO data parallelism. It only supports ZeRO stage 1 and stage 2.
- [Gemini Plugin](#gemini-plugin): It wraps [Gemini](../features/zero_with_chunk.md), which implements ZeRO-3 with chunk-based and heterogeneous memory management.
- [Torch DDP Plugin](#torch-ddp-plugin): It is a wrapper of `torch.nn.parallel.DistributedDataParallel` and can be used to train models with data parallelism.
- [Torch FSDP Plugin](#torch-fsdp-plugin): It is a wrapper of `torch.distributed.fsdp.FullyShardedDataParallel` and can be used to train models with ZeRO data parallelism.

More plugins are coming soon.
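
Whichever plugin you choose, the usage pattern is the same: construct the plugin, pass it to `Booster`, and boost your training objects as described in [Booster API](./booster_api.md). The following is a minimal sketch (not part of this PR) using the Torch DDP plugin; the toy model, optimizer, and loss are placeholders, and it assumes the process group is launched with `colossalai.launch_from_torch`.

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

# Initialize the distributed environment (e.g. launched via `colossalai run` or `torchrun`).
colossalai.launch_from_torch(config={})

# Toy model, optimizer, and loss; replace them with your own.
model = torch.nn.Linear(32, 8).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

# Pick a plugin and hand it to the booster.
plugin = TorchDDPPlugin()
booster = Booster(plugin=plugin)
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)

# A training step looks as usual, except that the backward pass goes through the booster.
inputs = torch.randn(16, 32).cuda()
targets = torch.randn(16, 8).cuda()
loss = criterion(model(inputs), targets)
booster.backward(loss, optimizer)
optimizer.step()
optimizer.zero_grad()
```

Switching to another plugin only changes the `plugin = ...` line; the boost and training loop stay the same.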

## Plugins

### Low Level Zero Plugin

This plugin implements ZeRO-1 and ZeRO-2 (with or without CPU offloading), using `reduce` and `gather` to synchronize gradients and weights.

ZeRO-1 can be regarded as a better substitute for Torch DDP: it is more memory efficient and faster, and it can easily be used in hybrid parallelism.

ZeRO-2 does not support local gradient accumulation. You can still accumulate gradients if you insist, but doing so will not reduce communication cost, so combining ZeRO-2 with pipeline parallelism is not a good idea.

{{ autodoc:colossalai.booster.plugin.LowLevelZeroPlugin }}
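
As a hedged sketch of how this plugin fits into the usage pattern above, the example below boosts a toy model with ZeRO stage 2; the `stage` argument selects stage 1 or 2, the other constructor arguments are left at their defaults, and the toy model and optimizer are placeholders.

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import LowLevelZeroPlugin

colossalai.launch_from_torch(config={})

# Toy model, optimizer, and loss; replace them with your own.
model = torch.nn.Linear(32, 8).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

# `stage=2` shards optimizer states and gradients; use `stage=1` for ZeRO-1.
plugin = LowLevelZeroPlugin(stage=2)
booster = Booster(plugin=plugin)
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)
```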

We have tested compatibility with a number of well-known models; the following models may not be supported:

- `timm.models.convit_base`
- DLRM and DeepFM models in `torchrec`
- `diffusers.VQModel`
- `transformers.AlbertModel`
- `transformers.AlbertForPreTraining`
- `transformers.BertModel`
- `transformers.BertForPreTraining`
- `transformers.GPT2DoubleHeadsModel`

Compatibility problems will be fixed in the future.

### Gemini Plugin

This plugin implements ZeRO-3 with chunk-based and heterogeneous memory management. It can train large models without much loss in speed. It also does not support local gradient accumulation. More details can be found in the [Gemini doc](../features/zero_with_chunk.md).

{{ autodoc:colossalai.booster.plugin.GeminiPlugin }}
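
A minimal sketch of boosting a toy model with Gemini is shown below; it assumes `colossalai.nn.optimizer.HybridAdam` as the optimizer and leaves the plugin's chunk and memory-placement options at their defaults.

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch(config={})

# Toy model; Gemini manages parameter placement after boosting.
model = torch.nn.Linear(32, 8)
optimizer = HybridAdam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

plugin = GeminiPlugin()  # default chunk and placement settings
booster = Booster(plugin=plugin)
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)
```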

### Torch DDP Plugin

More details can be found in the [PyTorch docs](https://pytorch.org/docs/main/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel).

{{ autodoc:colossalai.booster.plugin.TorchDDPPlugin }}

### Torch FSDP Plugin

> ⚠ This plugin is not available when the torch version is lower than 1.12.0.

More details can be found in the [PyTorch docs](https://pytorch.org/docs/main/fsdp.html).

{{ autodoc:colossalai.booster.plugin.TorchFSDPPlugin }}
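
Below is a minimal sketch of boosting a toy model with this plugin, using its default constructor arguments (and requiring torch >= 1.12.0 as noted above); the optimizer passed to `booster.boost` is assumed to be rebuilt by the plugin against the sharded parameters, and the toy model and optimizer are placeholders.

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchFSDPPlugin

colossalai.launch_from_torch(config={})

# Toy model and optimizer; replace them with your own.
model = torch.nn.Linear(32, 8).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

plugin = TorchFSDPPlugin()  # default FSDP settings
booster = Booster(plugin=plugin)
model, optimizer, _, _, _ = booster.boost(model, optimizer)
```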
64 changes: 64 additions & 0 deletions docs/source/zh-Hans/basics/booster_plugins.md
@@ -0,0 +1,64 @@
# Booster Plugins

Author: [Hongxin Liu](https://github.com/ver217)

**Prerequisite:**
- [Booster API](./booster_api.md)

## Introduction

As mentioned in [Booster API](./booster_api.md), we can use booster plugins to customize parallel training. In this tutorial, we will introduce how to use them.

We currently provide the following plugins:

- [Low Level Zero Plugin](#low-level-zero-plugin): It wraps `colossalai.zero.low_level.LowLevelZeroOptimizer` and can be used to train models with ZeRO data parallelism. It only supports ZeRO stage 1 and stage 2.
- [Gemini Plugin](#gemini-plugin): It wraps [Gemini](../features/zero_with_chunk.md), which implements ZeRO-3 with chunk-based and heterogeneous memory management.
- [Torch DDP Plugin](#torch-ddp-plugin): It is a wrapper of `torch.nn.parallel.DistributedDataParallel` and can be used to train models with data parallelism.
- [Torch FSDP Plugin](#torch-fsdp-plugin): It is a wrapper of `torch.distributed.fsdp.FullyShardedDataParallel` and can be used to train models with ZeRO data parallelism.

More plugins are coming soon.

## Plugins

### Low Level Zero Plugin

This plugin implements ZeRO-1 and ZeRO-2 (with or without CPU offloading), using `reduce` and `gather` to synchronize gradients and weights.

ZeRO-1 can be regarded as a better substitute for Torch DDP: it is more memory efficient and faster, and it can easily be used in hybrid parallelism.

ZeRO-2 does not support local gradient accumulation. You can still accumulate gradients if you insist, but doing so will not reduce communication cost, so combining ZeRO-2 with pipeline parallelism is not a good idea.

{{ autodoc:colossalai.booster.plugin.LowLevelZeroPlugin }}

We have tested compatibility with a number of well-known models; the following models may not be supported:

- `timm.models.convit_base`
- DLRM and DeepFM models in `torchrec`
- `diffusers.VQModel`
- `transformers.AlbertModel`
- `transformers.AlbertForPreTraining`
- `transformers.BertModel`
- `transformers.BertForPreTraining`
- `transformers.GPT2DoubleHeadsModel`

Compatibility problems will be fixed in the future.

### Gemini Plugin

This plugin implements ZeRO-3 with chunk-based and heterogeneous memory management. It can train large models without much loss in speed. It also does not support local gradient accumulation. More details can be found in the [Gemini doc](../features/zero_with_chunk.md).

{{ autodoc:colossalai.booster.plugin.GeminiPlugin }}

### Torch DDP Plugin

More details can be found in the [PyTorch docs](https://pytorch.org/docs/main/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel).

{{ autodoc:colossalai.booster.plugin.TorchDDPPlugin }}

### Torch FSDP Plugin

> ⚠ This plugin is not available when the torch version is lower than 1.12.0.

More details can be found in the [PyTorch docs](https://pytorch.org/docs/main/fsdp.html).

{{ autodoc:colossalai.booster.plugin.TorchFSDPPlugin }}