From ed5d525d812e4cbf2f811c18b7d0cb0765d88921 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Tue, 9 May 2023 17:09:01 +0800 Subject: [PATCH 01/30] [booster] update booster tutorials#3717 --- docs/source/en/basics/colossalai_booster.md | 124 +++++++++++++++++ .../zh-Hans/basics/colossalai_booster.md | 125 ++++++++++++++++++ 2 files changed, 249 insertions(+) create mode 100644 docs/source/en/basics/colossalai_booster.md create mode 100644 docs/source/zh-Hans/basics/colossalai_booster.md diff --git a/docs/source/en/basics/colossalai_booster.md b/docs/source/en/basics/colossalai_booster.md new file mode 100644 index 000000000000..fc33e8cbe039 --- /dev/null +++ b/docs/source/en/basics/colossalai_booster.md @@ -0,0 +1,124 @@ +# colossal-ai booster + +**Prerequisite:** +- [Distributed Training](../concepts/distributed_training.md) +- [Colossal-AI Overview](../concepts/colossalai_overview.md) + +## Introduction +In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to inject features into your training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, user can integrate their model with our parallelism features more friendly. Also calling `colossalai.booster` is the standard procedure before you run into your training loops. In the sections below, I will cover how `colossalai.booster` works and what we should take note of. + +### Plugin +

A plugin is an important component that manages the parallel configuration (e.g., the Gemini plugin encapsulates the Gemini acceleration solution). The currently supported plugins are as follows:

+ +***GeminiPlugin:***

This plugin wraps the Gemini acceleration solution, i.e., ZeRO with chunk-based memory management.

+ +***TorchDDPPlugin:***

This plugin wraps the DDP acceleration solution. It implements data parallelism at the module level and can run across multiple machines.

+ +***LowLevelZeroPlugin:***

This plugin wraps stages 1 and 2 of the Zero Redundancy Optimizer (ZeRO). Stage 1: shards optimizer states across data-parallel workers/GPUs. Stage 2: shards optimizer states and gradients across data-parallel workers/GPUs.
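To get an intuition for what each stage saves, here is a back-of-the-envelope memory estimate. It assumes mixed-precision training with an Adam-style optimizer and the 2 + 2 + 12 bytes-per-parameter breakdown from the ZeRO paper; these assumptions are for illustration only and are not part of the plugin API:

```python
def zero_bytes_per_gpu(num_params: int, world_size: int, stage: int) -> float:
    """Approximate per-GPU training memory (bytes) under ZeRO.

    Assumes fp16 parameters (2 bytes) and gradients (2 bytes), plus fp32
    master weights, momentum and variance (12 bytes) for an Adam-style optimizer.
    """
    params = 2 * num_params   # fp16 parameters, kept on every GPU
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 optimizer states
    if stage >= 1:            # stage 1: shard the optimizer states
        optim /= world_size
    if stage >= 2:            # stage 2: also shard the gradients
        grads /= world_size
    return params + grads + optim

# 1B parameters on 8 GPUs: 16 GB per GPU without ZeRO,
# 5.5 GB at stage 1, 3.75 GB at stage 2
for stage in (0, 1, 2):
    print(stage, zero_bytes_per_gpu(1_000_000_000, 8, stage) / 1e9)
```

Note that activations and temporary buffers are not included, so real numbers will be higher.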

### API of booster

Booster.__init__(...):
* Args:
  * device (str or torch.device): The device on which to run the training. Default: 'cuda'.
  * mixed_precision (str or MixedPrecision): The mixed precision mode used to run the training. Default: None. If the argument is a string, it can be 'fp16', 'fp16_apex', 'bf16', or 'fp8'. 'fp16' uses PyTorch AMP, while 'fp16_apex' uses NVIDIA Apex.
  * plugin (Plugin): The plugin used to run the training. Default: None.
* Return:
  * booster (Booster)

booster.boost(...): This function is called to boost objects (e.g. model, optimizer, criterion).
* Args:
  * model (nn.Module): The model to be boosted.
  * optimizer (Optimizer): The optimizer to be boosted.
  * criterion (Callable): The criterion to be boosted.
  * dataloader (DataLoader): The dataloader to be boosted.
  * lr_scheduler (LRScheduler): The lr_scheduler to be boosted.
* Return:
  * model, optimizer, criterion, dataloader, lr_scheduler

booster.backward(loss, optimizer): This function runs the backward pass.
* Args:
  * loss (torch.Tensor)
  * optimizer (Optimizer)

booster.no_sync(model): A context manager to disable gradient synchronization across processes.
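The main use of `no_sync` is gradient accumulation: skip the cross-process all-reduce on intermediate micro-batches and synchronize only on the step where the optimizer actually updates. The control flow can be sketched with a toy stand-in (the real API needs a distributed launch, so this illustrates the pattern rather than real ColossalAI code):

```python
from contextlib import contextmanager

sync_log = []  # records, per backward pass, whether gradients would be synchronized

@contextmanager
def no_sync():
    """Toy stand-in for booster.no_sync(model): inside this context the
    backward pass skips the cross-process gradient all-reduce."""
    sync_log.append("local")  # gradients only accumulate locally
    yield

def backward_with_sync():
    sync_log.append("all-reduce")  # gradients are synchronized across processes

accumulation_steps = 4
for micro_step in range(8):
    is_update_step = (micro_step + 1) % accumulation_steps == 0
    if not is_update_step:
        with no_sync():
            pass  # run forward/backward here without synchronization
    else:
        backward_with_sync()  # last micro-step: sync gradients, then optimizer.step()

print(sync_log.count("all-reduce"))  # 2 synchronizations over 8 micro-steps
```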
+ +booster.save_model(...): This function is called to save model checkpoints +* Args: + * model: nn.Module, + * checkpoint: str, + * prefix: str = None, + * shard: bool = False, # if saved as shards + * size_per_shard: int = 1024 # the max length of shard + +booster.load_model(...): +* Args: + * model: nn.Module, + * checkpoint: str, + * strict: bool = True + +booster.save_optimizer(...): This function is called to save optimizer checkpoints +* Args: + * optimizer: Optimizer, + * checkpoint: str, + * shard: bool = False, # if saved as shards + * size_per_shard: int = 1024 # the max length of shard + +booster.load_optimizer(...): +* Args: + * optimizer: Optimizer, + * checkpoint: str, + +booster.save_lr_scheduler(...): This function is called to save lr scheduler checkpoints +* Args: + * lr_scheduler: LRScheduler, + * checkpoint: str, + +booster.load_lr_scheduler(...): +* Args: + * lr_scheduler: LRScheduler, + * checkpoint: str, + +## usage +In a typical workflow, you need to launch distributed environment at the beginning of training script and create objects needed (such as models, optimizers, loss function, data loaders etc.) firstly, then call `colossalai.booster` to inject features into these objects, After that, you can use our booster API and these returned objects to continue the rest of your training processes. + +

A pseudo-code example is shown below:

+ +```python +import torch +from torch.optim import SGD +from torchvision.models import resnet18 + +import colossalai +from colossalai.booster import Booster +from colossalai.booster.plugin import TorchDDPPlugin + +def train(): + colossalai.launch(config=dict(), rank=rank, world_size=world_size, port=port, host='localhost') + plugin = TorchDDPPlugin() + booster = Booster(plugin=plugin) + model = resnet18() + criterion = lambda x: x.mean() + optimizer = SGD((model.parameters()), lr=0.001) + scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1) + model, optimizer, criterion, _, scheduler = booster.boost(model, optimizer, criterion, lr_scheduler=scheduler) + + x = torch.randn(4, 3, 224, 224) + x = x.to('cuda') + output = model(x) + loss = criterion(output) + booster.backward(loss, optimizer) + optimizer.clip_grad_by_norm(1.0) + optimizer.step() + scheduler.step() + + save_path = "./model" + booster.save_model(model, save_path, True, True, "", 10, use_safetensors=use_safetensors) + + new_model = resnet18() + booster.load_model(new_model, save_path) +``` + +if you want to run a example, [click here](../../../../examples/tutorial/new_api/cifar_resnet/README.md) + +[more design detailers](https://github.com/hpcaitech/ColossalAI/discussions/3046) diff --git a/docs/source/zh-Hans/basics/colossalai_booster.md b/docs/source/zh-Hans/basics/colossalai_booster.md new file mode 100644 index 000000000000..703fb484e3be --- /dev/null +++ b/docs/source/zh-Hans/basics/colossalai_booster.md @@ -0,0 +1,125 @@ +# booster 使用 + +**预备知识:** +- [分布式训练](../concepts/distributed_training.md) +- [Colossal-AI 总览](../concepts/colossalai_overview.md) + +## 简介 +在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好的将我们的并行策略整合到模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。 +在下面的章节中,我们将介绍 `colossalai.booster` 是如何工作的以及使用中我们要注意的细节。 + +### Plugin +

Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加速方案)。目前支持的插件如下:

+ +***GeminiPlugin:***

GeminiPlugin插件封装了 gemini 加速解决方案,即具有基于块的内存管理的 ZeRO优化方案。

+ +***TorchDDPPlugin:***

TorchDDPPlugin插件封装了DDP加速方案,实现了模块级别的数据并行,可以跨多机运行。

+ +***LowLevelZeroPlugin:***

LowLevelZeroPlugin插件封装了零冗余优化器的 1/2 阶段。阶段 1:跨数据并行工作器/GPU 的分片优化器状态。阶段 2:分片优化器状态 + 跨数据并行工作者/GPU 的梯度。
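为了直观理解这两个阶段各自节省的显存,下面给出一个粗略的估算示例。该示例假设采用混合精度训练和 Adam 类优化器,并沿用 ZeRO 论文中每个参数 2+2+12 字节的划分;这些假设仅用于说明,并非插件 API 的一部分:

```python
def zero_bytes_per_gpu(num_params: int, world_size: int, stage: int) -> float:
    """估算 ZeRO 各阶段下每张 GPU 的训练显存(字节)。

    假设 fp16 参数(2 字节)与 fp16 梯度(2 字节),以及 Adam 类优化器的
    fp32 主权重、动量和方差(共 12 字节)。
    """
    params = 2 * num_params   # fp16 参数,每张 GPU 都保留一份
    grads = 2 * num_params    # fp16 梯度
    optim = 12 * num_params   # fp32 优化器状态
    if stage >= 1:            # 阶段 1:切分优化器状态
        optim /= world_size
    if stage >= 2:            # 阶段 2:进一步切分梯度
        grads /= world_size
    return params + grads + optim

# 10 亿参数、8 张 GPU:无 ZeRO 时每卡约 16 GB,阶段 1 约 5.5 GB,阶段 2 约 3.75 GB
for stage in (0, 1, 2):
    print(stage, zero_bytes_per_gpu(1_000_000_000, 8, stage) / 1e9)
```

注意:该估算不包含激活值和临时缓冲区,实际显存会更高。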

+ +### API of booster +Booster.__init__(...): +* 参数: + * device (str or torch.device): 行训练的设备。默认值:'cuda'。 + * mixed_precision (str or MixedPrecision): 运行训练的混合精度。默认值:None。如果参数是字符串,则它可以是“fp16”、“fp16_apex”、“bf16”或“fp8”。“fp16”将使用 PyTorch AMP,而“fp16_apex”将使用 Nvidia Apex。 + * plugin (Plugin): 运行训练的插件。默认值:None。 + * booster (Booster) + + +booster.boost(...): 调用此函数来注入特性到对象中。 (例如模型、优化器、标准) +* 参数: + * model (nn.Module): 被注入的模型对象。 + * optimizer (Optimizer): 被注入的优化器对象。 + * criterion (Callable): 被注入的criterion对象。 + * dataloader (DataLoader): 被注入的dataloader对象. + * lr_scheduler (LRScheduler): 被注入的lr_scheduler对象. +* 返回值: + * model, optimizer, criterion, dataloader, lr_scheduler + +booster.backward(loss, optimizer): 调用该函数执行反向传播操作。 +* 参数: + * loss (torch.Tensor) + * optimizer (Optimizer) + +booster.no_sync(model) :返回一个上下文管理器,用于禁用跨进程的梯度同步。 + +booster.save_model(...): 调用此函数以保存模型。 +* 参数: + * model: nn.Module, + * checkpoint: str, + * prefix: str = None, + * shard: bool = False, # if saved as shards + * size_per_shard: int = 1024 # the max length of shard + +booster.load_model(...): 调用该函数加载模型。 +* 参数: + * model: nn.Module, + * checkpoint: str, + * strict: bool = True + +booster.save_optimizer(...): 调用此函数以保存优化器。 +* 参数: + * optimizer: Optimizer, + * checkpoint: str, + * shard: bool = False, # if saved as shards + * size_per_shard: int = 1024 # the max length of shard + +booster.load_optimizer(...): 调用此函数以加载优化器。 +* 参数: + * optimizer: Optimizer, + * checkpoint: str, + +booster.save_lr_scheduler(...): 调用此函数以保存学习率更新器。 +* 参数: + * lr_scheduler: LRScheduler, + * checkpoint: str, + +booster.load_lr_scheduler(...): 调用此函数以加载学习率更新器。 +* 参数: + * lr_scheduler: LRScheduler, + * checkpoint: str, + +## usage + +在使用colossalai训练时,首先需要在训练脚本的开头启动分布式环境,并创建需要使用的模型、优化器、损失函数、数据加载器等对象等。之后,调用`colossalai.booster` 将特征注入到这些对象中,您就可以使用我们的booster API去进行您接下来的训练流程。 + +

以下是一个伪代码示例,将展示如何使用我们的booster API进行模型训练:

+ +```python +import torch +from torch.optim import SGD +from torchvision.models import resnet18 + +import colossalai +from colossalai.booster import Booster +from colossalai.booster.plugin import TorchDDPPlugin + +def train(): + colossalai.launch(config=dict(), rank=rank, world_size=world_size, port=port, host='localhost') + plugin = TorchDDPPlugin() + booster = Booster(plugin=plugin) + model = resnet18() + criterion = lambda x: x.mean() + optimizer = SGD((model.parameters()), lr=0.001) + scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1) + model, optimizer, criterion, _, scheduler = booster.boost(model, optimizer, criterion, lr_scheduler=scheduler) + + x = torch.randn(4, 3, 224, 224) + x = x.to('cuda') + output = model(x) + loss = criterion(output) + booster.backward(loss, optimizer) + optimizer.clip_grad_by_norm(1.0) + optimizer.step() + scheduler.step() + + save_path = "./model" + booster.save_model(model, save_path, True, True, "", 10, use_safetensors=use_safetensors) + + new_model = resnet18() + booster.load_model(new_model, save_path) +``` + +如果您想运行一个可执行的例子, [请点击](../../../../examples/tutorial/new_api/cifar_resnet/README.md) + +[更多的设计细节请参考](https://github.com/hpcaitech/ColossalAI/discussions/3046) From 2a2e889a6f169f5bc73e0c6bd4a34f8e7818186d Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Tue, 9 May 2023 18:05:42 +0800 Subject: [PATCH 02/30] [booster] update booster tutorials#3717, fix --- docs/source/zh-Hans/features/1D_tensor_parallel.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/zh-Hans/features/1D_tensor_parallel.md b/docs/source/zh-Hans/features/1D_tensor_parallel.md index 2ddc27c7b50f..74954dac8f48 100644 --- a/docs/source/zh-Hans/features/1D_tensor_parallel.md +++ b/docs/source/zh-Hans/features/1D_tensor_parallel.md @@ -23,7 +23,7 @@ ```math \left[\begin{matrix} B_1 \\ B_2 \end{matrix} \right] ``` -这就是所谓的行并行方式.
+这就是所谓的行并行方式. 为了计算 ```math From 9362e150e2f529497dfb4814c54cb33dcd7a33c8 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Thu, 11 May 2023 14:46:43 +0800 Subject: [PATCH 03/30] [booster] update booster tutorials#3717, update setup doc --- docs/source/en/get_started/installation.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/en/get_started/installation.md b/docs/source/en/get_started/installation.md index 290879219074..93f9d074ead4 100644 --- a/docs/source/en/get_started/installation.md +++ b/docs/source/en/get_started/installation.md @@ -39,13 +39,13 @@ cd ColossalAI pip install -r requirements/requirements.txt # install colossalai -pip install . +CUDA_EXT=1 pip install . ``` -If you don't want to install and enable CUDA kernel fusion (compulsory installation when using fused optimizer): +If you don't want to install and enable CUDA kernel fusion (compulsory installation when using fused optimizer), just don't specify the `CUDA_EXT`: ```shell -CUDA_EXT=1 pip install . +pip install . 
``` From 52d7e930fed0223a0ca438ee5a5c7ccc63ff4d81 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Thu, 11 May 2023 14:47:15 +0800 Subject: [PATCH 04/30] [booster] update booster tutorials#3717, update setup doc --- docs/source/en/basics/launch_colossalai.md | 14 +++++++++++--- docs/source/zh-Hans/basics/launch_colossalai.md | 11 ++++++++++- docs/source/zh-Hans/get_started/installation.md | 6 +++--- 3 files changed, 24 insertions(+), 7 deletions(-) diff --git a/docs/source/en/basics/launch_colossalai.md b/docs/source/en/basics/launch_colossalai.md index be487f8539a5..334757ea75af 100644 --- a/docs/source/en/basics/launch_colossalai.md +++ b/docs/source/en/basics/launch_colossalai.md @@ -87,14 +87,13 @@ import colossalai args = colossalai.get_default_parser().parse_args() # launch distributed environment -colossalai.launch(config=, +colossalai.launch(config=args.config, rank=args.rank, world_size=args.world_size, host=args.host, port=args.port, backend=args.backend ) - ``` @@ -107,12 +106,21 @@ First, we need to set the launch method in our code. As this is a wrapper of the use `colossalai.launch_from_torch`. The arguments required for distributed environment such as rank, world size, host and port are all set by the PyTorch launcher and can be read from the environment variable directly. +config.py +```python +BATCH_SIZE = 512 +LEARNING_RATE = 3e-3 +WEIGHT_DECAY = 0.3 +NUM_EPOCHS = 2 +``` +train.py ```python import colossalai colossalai.launch_from_torch( - config=, + config="./config.py", ) +... ``` Next, we can easily start multiple processes with `colossalai run` in your terminal. 
Below is an example to run the code diff --git a/docs/source/zh-Hans/basics/launch_colossalai.md b/docs/source/zh-Hans/basics/launch_colossalai.md index ca927de578d5..54fe7221dc7a 100644 --- a/docs/source/zh-Hans/basics/launch_colossalai.md +++ b/docs/source/zh-Hans/basics/launch_colossalai.md @@ -93,12 +93,21 @@ PyTorch自带的启动器需要在每个节点上都启动命令才能启动多 首先,我们需要在代码里指定我们的启动方式。由于这个启动器是PyTorch启动器的封装,那么我们自然而然应该使用`colossalai.launch_from_torch`。 分布式环境所需的参数,如 rank, world size, host 和 port 都是由 PyTorch 启动器设置的,可以直接从环境变量中读取。 +config.py +```python +BATCH_SIZE = 512 +LEARNING_RATE = 3e-3 +WEIGHT_DECAY = 0.3 +NUM_EPOCHS = 2 +``` +train.py ```python import colossalai colossalai.launch_from_torch( - config=, + config="./config.py", ) +... ``` 接下来,我们可以轻松地在终端使用`colossalai run`来启动训练。下面的命令可以在当前机器上启动一个4卡的训练任务。 diff --git a/docs/source/zh-Hans/get_started/installation.md b/docs/source/zh-Hans/get_started/installation.md index 72f85393814f..8858ae0fa262 100755 --- a/docs/source/zh-Hans/get_started/installation.md +++ b/docs/source/zh-Hans/get_started/installation.md @@ -38,13 +38,13 @@ cd ColossalAI pip install -r requirements/requirements.txt # install colossalai -pip install . +CUDA_EXT=1 pip install . ``` -如果您不想安装和启用 CUDA 内核融合(使用融合优化器时强制安装): +如果您不想安装和启用 CUDA 内核融合(使用融合优化器时强制安装)您可以不添加`CUDA_EXT=1`: ```shell -NO_CUDA_EXT=1 pip install . +pip install . 
```

From 111315dae633bab98c06eca26ecc9202472c7614 Mon Sep 17 00:00:00 2001
From: Mingyan Jiang <1829166702@qq.com>
Date: Wed, 17 May 2023 13:37:31 +0800
Subject: [PATCH 05/30] [booster] update booster tutorials#3717, update setup doc

---
 docs/source/en/basics/booster_api.md | 87 ++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 docs/source/en/basics/booster_api.md

diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md
new file mode 100644
index 000000000000..fa05eb44c812
--- /dev/null
+++ b/docs/source/en/basics/booster_api.md
@@ -0,0 +1,87 @@
# Booster API

**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)

## Introduction
In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` in injecting features into your training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, users can integrate their models with our parallelism features more easily. Calling `colossalai.booster` is also the standard procedure before you enter your training loop. The sections below cover how `colossalai.booster` works and what to take note of.

### Plugin
A plugin is an important component that manages the parallel configuration (e.g., the Gemini plugin encapsulates the Gemini acceleration solution). The currently supported plugins are as follows:

***GeminiPlugin:*** This plugin wraps the Gemini acceleration solution, i.e., ZeRO with chunk-based memory management.

***TorchDDPPlugin:*** This plugin wraps the DDP acceleration solution. It implements data parallelism at the module level and can run across multiple machines.

***LowLevelZeroPlugin:*** This plugin wraps stages 1 and 2 of the Zero Redundancy Optimizer (ZeRO). Stage 1: shards optimizer states across data-parallel workers/GPUs.
Stage 2: shards optimizer states and gradients across data-parallel workers/GPUs.

### API of booster

{{ autodoc:colossalai.booster.Booster.__init__ }}

{{ autodoc:colossalai.booster.Booster.boost }}

{{ autodoc:colossalai.booster.Booster.backward }}

{{ autodoc:colossalai.booster.Booster.no_sync }}

{{ autodoc:colossalai.booster.Booster.save_model }}

{{ autodoc:colossalai.booster.Booster.load_model }}

{{ autodoc:colossalai.booster.Booster.save_optimizer }}

{{ autodoc:colossalai.booster.Booster.load_optimizer }}

{{ autodoc:colossalai.booster.Booster.save_lr_scheduler }}

{{ autodoc:colossalai.booster.Booster.load_lr_scheduler }}

## Usage
In a typical workflow, you first launch the distributed environment at the beginning of the training script and create the objects you need (such as models, optimizers, loss functions, and data loaders). Then you call `colossalai.booster` to inject features into these objects. After that, you can use the booster API and the returned objects to run the rest of your training loop.
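The checkpoint methods above accept `shard` and `size_per_shard` arguments. As a rough illustration of what sharded saving means, the sketch below greedily packs a state dict into shards of at most `size_per_shard` MB; this is an assumed packing scheme for illustration, not ColossalAI's actual implementation:

```python
def pack_into_shards(tensor_sizes_mb: dict, size_per_shard: int = 1024) -> list:
    """Greedily pack {tensor_name: size_in_MB} entries into shards whose
    total size stays at or under size_per_shard; an oversized single tensor
    gets its own shard. Illustrative only, not the library's real algorithm."""
    shards, current, current_size = [], {}, 0
    for name, size in tensor_sizes_mb.items():
        if current and current_size + size > size_per_shard:
            shards.append(current)  # close the shard that would overflow
            current, current_size = {}, 0
        current[name] = size
        current_size += size
    if current:
        shards.append(current)
    return shards

sizes = {"conv1.weight": 600, "layer1.weight": 600, "fc.weight": 300}
print(pack_into_shards(sizes, size_per_shard=1024))
# -> [{'conv1.weight': 600}, {'layer1.weight': 600, 'fc.weight': 300}]
```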
A pseudo-code example is shown below:

```python
import torch
from torch.optim import SGD
from torchvision.models import resnet18

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

def train():
    # rank, world_size and port are provided by your launcher in a real run
    colossalai.launch(config=dict(), rank=rank, world_size=world_size, port=port, host='localhost')
    plugin = TorchDDPPlugin()
    booster = Booster(plugin=plugin)
    model = resnet18()
    criterion = lambda x: x.mean()
    optimizer = SGD(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
    model, optimizer, criterion, _, scheduler = booster.boost(model, optimizer, criterion, lr_scheduler=scheduler)

    x = torch.randn(4, 3, 224, 224)
    x = x.to('cuda')
    output = model(x)
    loss = criterion(output)
    booster.backward(loss, optimizer)
    optimizer.clip_grad_by_norm(1.0)
    optimizer.step()
    scheduler.step()

    save_path = "./model"
    # save a sharded checkpoint with at most 10 MB per shard
    booster.save_model(model, save_path, shard=True, size_per_shard=10)

    new_model = resnet18()
    booster.load_model(new_model, save_path)
```

If you want to run an example, [click here](../../../../examples/tutorial/new_api/cifar_resnet/README.md)

[More design details](https://github.com/hpcaitech/ColossalAI/discussions/3046)


From 24987bd1cbfe2c2806c8d4ecd024225e6c1b9375 Mon Sep 17 00:00:00 2001
From: Mingyan Jiang <1829166702@qq.com>
Date: Wed, 17 May 2023 13:37:51 +0800
Subject: [PATCH 06/30] [booster] update booster tutorials#3717, update setup doc

---
 docs/source/zh-Hans/basics/booster_api.md | 125 ++++++++++++++++++++++
 1 file changed, 125 insertions(+)
 create mode 100644 docs/source/zh-Hans/basics/booster_api.md

diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md
new file mode 100644
index 000000000000..703fb484e3be
--- /dev/null
+++ b/docs/source/zh-Hans/basics/booster_api.md
@@ -0,0 +1,125 @@
# booster 使用

**预备知识:**
- 
[分布式训练](../concepts/distributed_training.md) +- [Colossal-AI 总览](../concepts/colossalai_overview.md) + +## 简介 +在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好的将我们的并行策略整合到模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。 +在下面的章节中,我们将介绍 `colossalai.booster` 是如何工作的以及使用中我们要注意的细节。 + +### Plugin +

Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加速方案)。目前支持的插件如下:

+ +***GeminiPlugin:***

GeminiPlugin插件封装了 gemini 加速解决方案,即具有基于块的内存管理的 ZeRO优化方案。

+ +***TorchDDPPlugin:***

TorchDDPPlugin插件封装了DDP加速方案,实现了模块级别的数据并行,可以跨多机运行。

+ +***LowLevelZeroPlugin:***

LowLevelZeroPlugin插件封装了零冗余优化器的 1/2 阶段。阶段 1:跨数据并行工作器/GPU 的分片优化器状态。阶段 2:分片优化器状态 + 跨数据并行工作者/GPU 的梯度。

+ +### API of booster +Booster.__init__(...): +* 参数: + * device (str or torch.device): 行训练的设备。默认值:'cuda'。 + * mixed_precision (str or MixedPrecision): 运行训练的混合精度。默认值:None。如果参数是字符串,则它可以是“fp16”、“fp16_apex”、“bf16”或“fp8”。“fp16”将使用 PyTorch AMP,而“fp16_apex”将使用 Nvidia Apex。 + * plugin (Plugin): 运行训练的插件。默认值:None。 + * booster (Booster) + + +booster.boost(...): 调用此函数来注入特性到对象中。 (例如模型、优化器、标准) +* 参数: + * model (nn.Module): 被注入的模型对象。 + * optimizer (Optimizer): 被注入的优化器对象。 + * criterion (Callable): 被注入的criterion对象。 + * dataloader (DataLoader): 被注入的dataloader对象. + * lr_scheduler (LRScheduler): 被注入的lr_scheduler对象. +* 返回值: + * model, optimizer, criterion, dataloader, lr_scheduler + +booster.backward(loss, optimizer): 调用该函数执行反向传播操作。 +* 参数: + * loss (torch.Tensor) + * optimizer (Optimizer) + +booster.no_sync(model) :返回一个上下文管理器,用于禁用跨进程的梯度同步。 + +booster.save_model(...): 调用此函数以保存模型。 +* 参数: + * model: nn.Module, + * checkpoint: str, + * prefix: str = None, + * shard: bool = False, # if saved as shards + * size_per_shard: int = 1024 # the max length of shard + +booster.load_model(...): 调用该函数加载模型。 +* 参数: + * model: nn.Module, + * checkpoint: str, + * strict: bool = True + +booster.save_optimizer(...): 调用此函数以保存优化器。 +* 参数: + * optimizer: Optimizer, + * checkpoint: str, + * shard: bool = False, # if saved as shards + * size_per_shard: int = 1024 # the max length of shard + +booster.load_optimizer(...): 调用此函数以加载优化器。 +* 参数: + * optimizer: Optimizer, + * checkpoint: str, + +booster.save_lr_scheduler(...): 调用此函数以保存学习率更新器。 +* 参数: + * lr_scheduler: LRScheduler, + * checkpoint: str, + +booster.load_lr_scheduler(...): 调用此函数以加载学习率更新器。 +* 参数: + * lr_scheduler: LRScheduler, + * checkpoint: str, + +## usage + +在使用colossalai训练时,首先需要在训练脚本的开头启动分布式环境,并创建需要使用的模型、优化器、损失函数、数据加载器等对象等。之后,调用`colossalai.booster` 将特征注入到这些对象中,您就可以使用我们的booster API去进行您接下来的训练流程。 + +

以下是一个伪代码示例,将展示如何使用我们的booster API进行模型训练:

+ +```python +import torch +from torch.optim import SGD +from torchvision.models import resnet18 + +import colossalai +from colossalai.booster import Booster +from colossalai.booster.plugin import TorchDDPPlugin + +def train(): + colossalai.launch(config=dict(), rank=rank, world_size=world_size, port=port, host='localhost') + plugin = TorchDDPPlugin() + booster = Booster(plugin=plugin) + model = resnet18() + criterion = lambda x: x.mean() + optimizer = SGD((model.parameters()), lr=0.001) + scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1) + model, optimizer, criterion, _, scheduler = booster.boost(model, optimizer, criterion, lr_scheduler=scheduler) + + x = torch.randn(4, 3, 224, 224) + x = x.to('cuda') + output = model(x) + loss = criterion(output) + booster.backward(loss, optimizer) + optimizer.clip_grad_by_norm(1.0) + optimizer.step() + scheduler.step() + + save_path = "./model" + booster.save_model(model, save_path, True, True, "", 10, use_safetensors=use_safetensors) + + new_model = resnet18() + booster.load_model(new_model, save_path) +``` + +如果您想运行一个可执行的例子, [请点击](../../../../examples/tutorial/new_api/cifar_resnet/README.md) + +[更多的设计细节请参考](https://github.com/hpcaitech/ColossalAI/discussions/3046) From c3d44adfdf84936ca6f4b82fc0846891eba222d5 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 13:38:58 +0800 Subject: [PATCH 07/30] [booster] update booster tutorials#3717, update setup doc --- docs/source/zh-Hans/basics/booster_api.md | 92 +++++++---------------- 1 file changed, 27 insertions(+), 65 deletions(-) diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md index 703fb484e3be..47903426f679 100644 --- a/docs/source/zh-Hans/basics/booster_api.md +++ b/docs/source/zh-Hans/basics/booster_api.md @@ -9,81 +9,41 @@ 在下面的章节中,我们将介绍 `colossalai.booster` 是如何工作的以及使用中我们要注意的细节。 ### Plugin -

Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加速方案)。目前支持的插件如下:

+Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加速方案)。目前支持的插件如下: -***GeminiPlugin:***

GeminiPlugin插件封装了 gemini 加速解决方案,即具有基于块的内存管理的 ZeRO优化方案。

+***GeminiPlugin:*** GeminiPlugin插件封装了 gemini 加速解决方案,即具有基于块的内存管理的 ZeRO优化方案。 -***TorchDDPPlugin:***

TorchDDPPlugin插件封装了DDP加速方案,实现了模块级别的数据并行,可以跨多机运行。

+***TorchDDPPlugin:*** TorchDDPPlugin插件封装了DDP加速方案,实现了模块级别的数据并行,可以跨多机运行。 -***LowLevelZeroPlugin:***

LowLevelZeroPlugin插件封装了零冗余优化器的 1/2 阶段。阶段 1:跨数据并行工作器/GPU 的分片优化器状态。阶段 2:分片优化器状态 + 跨数据并行工作者/GPU 的梯度。

+***LowLevelZeroPlugin:*** LowLevelZeroPlugin插件封装了零冗余优化器的 1/2 阶段。阶段 1:跨数据并行工作器/GPU 的分片优化器状态。阶段 2:分片优化器状态 + 跨数据并行工作者/GPU 的梯度。 ### API of booster -Booster.__init__(...): -* 参数: - * device (str or torch.device): 行训练的设备。默认值:'cuda'。 - * mixed_precision (str or MixedPrecision): 运行训练的混合精度。默认值:None。如果参数是字符串,则它可以是“fp16”、“fp16_apex”、“bf16”或“fp8”。“fp16”将使用 PyTorch AMP,而“fp16_apex”将使用 Nvidia Apex。 - * plugin (Plugin): 运行训练的插件。默认值:None。 - * booster (Booster) - - -booster.boost(...): 调用此函数来注入特性到对象中。 (例如模型、优化器、标准) -* 参数: - * model (nn.Module): 被注入的模型对象。 - * optimizer (Optimizer): 被注入的优化器对象。 - * criterion (Callable): 被注入的criterion对象。 - * dataloader (DataLoader): 被注入的dataloader对象. - * lr_scheduler (LRScheduler): 被注入的lr_scheduler对象. -* 返回值: - * model, optimizer, criterion, dataloader, lr_scheduler - -booster.backward(loss, optimizer): 调用该函数执行反向传播操作。 -* 参数: - * loss (torch.Tensor) - * optimizer (Optimizer) - -booster.no_sync(model) :返回一个上下文管理器,用于禁用跨进程的梯度同步。 - -booster.save_model(...): 调用此函数以保存模型。 -* 参数: - * model: nn.Module, - * checkpoint: str, - * prefix: str = None, - * shard: bool = False, # if saved as shards - * size_per_shard: int = 1024 # the max length of shard - -booster.load_model(...): 调用该函数加载模型。 -* 参数: - * model: nn.Module, - * checkpoint: str, - * strict: bool = True - -booster.save_optimizer(...): 调用此函数以保存优化器。 -* 参数: - * optimizer: Optimizer, - * checkpoint: str, - * shard: bool = False, # if saved as shards - * size_per_shard: int = 1024 # the max length of shard - -booster.load_optimizer(...): 调用此函数以加载优化器。 -* 参数: - * optimizer: Optimizer, - * checkpoint: str, - -booster.save_lr_scheduler(...): 调用此函数以保存学习率更新器。 -* 参数: - * lr_scheduler: LRScheduler, - * checkpoint: str, - -booster.load_lr_scheduler(...): 调用此函数以加载学习率更新器。 -* 参数: - * lr_scheduler: LRScheduler, - * checkpoint: str, + +{{ autodoc:colossalai.booster.Booster.__init__ }} + +{{ autodoc:colossalai.booster.Booster.boost }} + +{{ autodoc:colossalai.booster.Booster.backward }} + +{{ 
autodoc:colossalai.booster.Booster.no_sync }} + +{{ autodoc:colossalai.booster.Booster.save_model }} + +{{ autodoc:colossalai.booster.Booster.load_model }} + +{{ autodoc:colossalai.booster.Booster.save_optimizer }} + +{{ autodoc:colossalai.booster.Booster.load_optimizer }} + +{{ autodoc:colossalai.booster.Booster.save_lr_scheduler }} + +{{ autodoc:colossalai.booster.Booster.load_lr_scheduler }} ## usage 在使用colossalai训练时,首先需要在训练脚本的开头启动分布式环境,并创建需要使用的模型、优化器、损失函数、数据加载器等对象等。之后,调用`colossalai.booster` 将特征注入到这些对象中,您就可以使用我们的booster API去进行您接下来的训练流程。 -

以下是一个伪代码示例,将展示如何使用我们的booster API进行模型训练:

+以下是一个伪代码示例,将展示如何使用我们的booster API进行模型训练: ```python import torch @@ -123,3 +83,5 @@ def train(): 如果您想运行一个可执行的例子, [请点击](../../../../examples/tutorial/new_api/cifar_resnet/README.md) [更多的设计细节请参考](https://github.com/hpcaitech/ColossalAI/discussions/3046) + + From e8d7b9468006b1ebacd1593c64fcf47351587f2f Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 13:40:12 +0800 Subject: [PATCH 08/30] [booster] update booster tutorials#3717, update setup doc --- docs/sidebars.json | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/sidebars.json b/docs/sidebars.json index 44287c17eadf..2732704a5cab 100644 --- a/docs/sidebars.json +++ b/docs/sidebars.json @@ -32,7 +32,8 @@ "basics/engine_trainer", "basics/configure_parallelization", "basics/model_checkpoint", - "basics/colotensor_concept" + "basics/colotensor_concept", + "basics/booster_api" ] }, { From 68e84be98342ec03c9c5d74318b512604e805489 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 13:43:53 +0800 Subject: [PATCH 09/30] [booster] update booster tutorials#3717, rename colossalai booster.md --- docs/source/en/basics/colossalai_booster.md | 124 ----------------- .../zh-Hans/basics/colossalai_booster.md | 125 ------------------ 2 files changed, 249 deletions(-) delete mode 100644 docs/source/en/basics/colossalai_booster.md delete mode 100644 docs/source/zh-Hans/basics/colossalai_booster.md diff --git a/docs/source/en/basics/colossalai_booster.md b/docs/source/en/basics/colossalai_booster.md deleted file mode 100644 index fc33e8cbe039..000000000000 --- a/docs/source/en/basics/colossalai_booster.md +++ /dev/null @@ -1,124 +0,0 @@ -# colossal-ai booster - -**Prerequisite:** -- [Distributed Training](../concepts/distributed_training.md) -- [Colossal-AI Overview](../concepts/colossalai_overview.md) - -## Introduction -In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to inject features into your 
training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, user can integrate their model with our parallelism features more friendly. Also calling `colossalai.booster` is the standard procedure before you run into your training loops. In the sections below, I will cover how `colossalai.booster` works and what we should take note of. - -### Plugin -

A plugin is an important component that manages the parallel configuration (e.g., the Gemini plugin encapsulates the Gemini acceleration solution). The currently supported plugins are as follows:

- -***GeminiPlugin:***

This plugin wraps the Gemini acceleration solution, i.e., ZeRO with chunk-based memory management.

- -***TorchDDPPlugin:***

This plugin wraps the DDP acceleration solution. It implements data parallelism at the module level and can run across multiple machines.

- -***LowLevelZeroPlugin:***

This plugin wraps stages 1 and 2 of the Zero Redundancy Optimizer (ZeRO). Stage 1: shards optimizer states across data-parallel workers/GPUs. Stage 2: shards optimizer states and gradients across data-parallel workers/GPUs.

- -### API of booster -Booster.__init__(...): -* Args: - * device (str or torch.device): The device to run the training. Default: 'cuda'. - * mixed_precision (str or MixedPrecision): The mixed precision to run the training. Default: None.If the argument is a string, it can be 'fp16', 'fp16_apex', 'bf16', or 'fp8'.'fp16' would use PyTorch AMP while 'fp16_apex' would use Nvidia Apex. - * plugin (Plugin): The plugin to run the training. Default: None. -* Return: - * booster (Booster) - - -booster.boost(...): This function is called to boost objects. (e.g. model, optimizer, criterion). -* Args: - * model (nn.Module): The model to be boosted. - * optimizer (Optimizer): The optimizer to be boosted. - * criterion (Callable): The criterion to be boosted. - * dataloader (DataLoader): The dataloader to be boosted. - * lr_scheduler (LRScheduler): The lr_scheduler to be boosted. -* Return: - * model, optimizer, criterion, dataloader, lr_scheduler - -booster.backward(loss, optimizer): This function run the backward operation -* Args: - * loss (torch.Tensor) - * optimizer (Optimizer) - -booster.no_sync(model) :A context manager to disable gradient synchronizations across processes. 
- -booster.save_model(...): This function is called to save model checkpoints -* Args: - * model: nn.Module, - * checkpoint: str, - * prefix: str = None, - * shard: bool = False, # if saved as shards - * size_per_shard: int = 1024 # the max length of shard - -booster.load_model(...): -* Args: - * model: nn.Module, - * checkpoint: str, - * strict: bool = True - -booster.save_optimizer(...): This function is called to save optimizer checkpoints -* Args: - * optimizer: Optimizer, - * checkpoint: str, - * shard: bool = False, # if saved as shards - * size_per_shard: int = 1024 # the max length of shard - -booster.load_optimizer(...): -* Args: - * optimizer: Optimizer, - * checkpoint: str, - -booster.save_lr_scheduler(...): This function is called to save lr scheduler checkpoints -* Args: - * lr_scheduler: LRScheduler, - * checkpoint: str, - -booster.load_lr_scheduler(...): -* Args: - * lr_scheduler: LRScheduler, - * checkpoint: str, - -## usage -In a typical workflow, you need to launch distributed environment at the beginning of training script and create objects needed (such as models, optimizers, loss function, data loaders etc.) firstly, then call `colossalai.booster` to inject features into these objects, After that, you can use our booster API and these returned objects to continue the rest of your training processes. - -

A pseudo-code example is shown below:

- -```python -import torch -from torch.optim import SGD -from torchvision.models import resnet18 - -import colossalai -from colossalai.booster import Booster -from colossalai.booster.plugin import TorchDDPPlugin - -def train(): - colossalai.launch(config=dict(), rank=rank, world_size=world_size, port=port, host='localhost') - plugin = TorchDDPPlugin() - booster = Booster(plugin=plugin) - model = resnet18() - criterion = lambda x: x.mean() - optimizer = SGD((model.parameters()), lr=0.001) - scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1) - model, optimizer, criterion, _, scheduler = booster.boost(model, optimizer, criterion, lr_scheduler=scheduler) - - x = torch.randn(4, 3, 224, 224) - x = x.to('cuda') - output = model(x) - loss = criterion(output) - booster.backward(loss, optimizer) - optimizer.clip_grad_by_norm(1.0) - optimizer.step() - scheduler.step() - - save_path = "./model" - booster.save_model(model, save_path, True, True, "", 10, use_safetensors=True) - - new_model = resnet18() - booster.load_model(new_model, save_path) -``` - -If you want to run an example, [click here](../../../../examples/tutorial/new_api/cifar_resnet/README.md) - -[more design details](https://github.com/hpcaitech/ColossalAI/discussions/3046) diff --git a/docs/source/zh-Hans/basics/colossalai_booster.md b/docs/source/zh-Hans/basics/colossalai_booster.md deleted file mode 100644 index 703fb484e3be..000000000000 --- a/docs/source/zh-Hans/basics/colossalai_booster.md +++ /dev/null @@ -1,125 +0,0 @@ -# booster 使用 - -**预备知识:** -- [分布式训练](../concepts/distributed_training.md) -- [Colossal-AI 总览](../concepts/colossalai_overview.md) - -## 简介 -在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好的将我们的并行策略整合到模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。 -在下面的章节中,我们将介绍 `colossalai.booster` 是如何工作的以及使用中我们要注意的细节。 - -### Plugin -

Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加速方案)。目前支持的插件如下:

- -***GeminiPlugin:***

GeminiPlugin插件封装了 gemini 加速解决方案,即具有基于块的内存管理的 ZeRO优化方案。

- -***TorchDDPPlugin:***

TorchDDPPlugin插件封装了DDP加速方案,实现了模块级别的数据并行,可以跨多机运行。

- -***LowLevelZeroPlugin:***

LowLevelZeroPlugin插件封装了零冗余优化器的 1/2 阶段。阶段 1:跨数据并行工作器/GPU 的分片优化器状态。阶段 2:分片优化器状态 + 跨数据并行工作者/GPU 的梯度。
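The stage-1 / stage-2 split described above can be pictured with a small sketch (illustrative only, not ColossalAI's implementation): per-parameter optimizer states are partitioned across data-parallel ranks, and stage 2 additionally partitions gradients along the same boundaries.

```python
# Illustrative ZeRO-style partitioning: assign parameters to data-parallel
# ranks round-robin so each rank stores only its slice of optimizer state.
def partition_params(param_names, world_size):
    shards = [[] for _ in range(world_size)]
    for i, name in enumerate(param_names):
        shards[i % world_size].append(name)
    return shards

params = ["layer1.weight", "layer1.bias", "layer2.weight", "layer2.bias"]

# Stage 1: each rank keeps optimizer states only for its own shard.
opt_state_shards = partition_params(params, world_size=2)
print(opt_state_shards)
# [['layer1.weight', 'layer2.weight'], ['layer1.bias', 'layer2.bias']]

# Stage 2 shards gradients the same way: after backward, gradients are
# reduce-scattered so each rank keeps only the slice matching its shard.
```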

- -### API of booster -Booster.__init__(...): -* 参数: - * device (str or torch.device): 行训练的设备。默认值:'cuda'。 - * mixed_precision (str or MixedPrecision): 运行训练的混合精度。默认值:None。如果参数是字符串,则它可以是“fp16”、“fp16_apex”、“bf16”或“fp8”。“fp16”将使用 PyTorch AMP,而“fp16_apex”将使用 Nvidia Apex。 - * plugin (Plugin): 运行训练的插件。默认值:None。 - * booster (Booster) - - -booster.boost(...): 调用此函数来注入特性到对象中。 (例如模型、优化器、标准) -* 参数: - * model (nn.Module): 被注入的模型对象。 - * optimizer (Optimizer): 被注入的优化器对象。 - * criterion (Callable): 被注入的criterion对象。 - * dataloader (DataLoader): 被注入的dataloader对象. - * lr_scheduler (LRScheduler): 被注入的lr_scheduler对象. -* 返回值: - * model, optimizer, criterion, dataloader, lr_scheduler - -booster.backward(loss, optimizer): 调用该函数执行反向传播操作。 -* 参数: - * loss (torch.Tensor) - * optimizer (Optimizer) - -booster.no_sync(model) :返回一个上下文管理器,用于禁用跨进程的梯度同步。 - -booster.save_model(...): 调用此函数以保存模型。 -* 参数: - * model: nn.Module, - * checkpoint: str, - * prefix: str = None, - * shard: bool = False, # if saved as shards - * size_per_shard: int = 1024 # the max length of shard - -booster.load_model(...): 调用该函数加载模型。 -* 参数: - * model: nn.Module, - * checkpoint: str, - * strict: bool = True - -booster.save_optimizer(...): 调用此函数以保存优化器。 -* 参数: - * optimizer: Optimizer, - * checkpoint: str, - * shard: bool = False, # if saved as shards - * size_per_shard: int = 1024 # the max length of shard - -booster.load_optimizer(...): 调用此函数以加载优化器。 -* 参数: - * optimizer: Optimizer, - * checkpoint: str, - -booster.save_lr_scheduler(...): 调用此函数以保存学习率更新器。 -* 参数: - * lr_scheduler: LRScheduler, - * checkpoint: str, - -booster.load_lr_scheduler(...): 调用此函数以加载学习率更新器。 -* 参数: - * lr_scheduler: LRScheduler, - * checkpoint: str, - -## usage - -在使用colossalai训练时,首先需要在训练脚本的开头启动分布式环境,并创建需要使用的模型、优化器、损失函数、数据加载器等对象等。之后,调用`colossalai.booster` 将特征注入到这些对象中,您就可以使用我们的booster API去进行您接下来的训练流程。 - -

以下是一个伪代码示例,将展示如何使用我们的booster API进行模型训练:

- -```python -import torch -from torch.optim import SGD -from torchvision.models import resnet18 - -import colossalai -from colossalai.booster import Booster -from colossalai.booster.plugin import TorchDDPPlugin - -def train(): - colossalai.launch(config=dict(), rank=rank, world_size=world_size, port=port, host='localhost') - plugin = TorchDDPPlugin() - booster = Booster(plugin=plugin) - model = resnet18() - criterion = lambda x: x.mean() - optimizer = SGD((model.parameters()), lr=0.001) - scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1) - model, optimizer, criterion, _, scheduler = booster.boost(model, optimizer, criterion, lr_scheduler=scheduler) - - x = torch.randn(4, 3, 224, 224) - x = x.to('cuda') - output = model(x) - loss = criterion(output) - booster.backward(loss, optimizer) - optimizer.clip_grad_by_norm(1.0) - optimizer.step() - scheduler.step() - - save_path = "./model" - booster.save_model(model, save_path, True, True, "", 10, use_safetensors=True) - - new_model = resnet18() - booster.load_model(new_model, save_path) -``` - -如果您想运行一个可执行的例子, [请点击](../../../../examples/tutorial/new_api/cifar_resnet/README.md) - -[更多的设计细节请参考](https://github.com/hpcaitech/ColossalAI/discussions/3046) From 6052a5d1cd28cf61541592383ef4e82cf5b739a2 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 13:45:37 +0800 Subject: [PATCH 10/30] [booster] update booster tutorials#3717, rename colossalai booster.md --- docs/source/zh-Hans/basics/launch_colossalai.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/zh-Hans/basics/launch_colossalai.md b/docs/source/zh-Hans/basics/launch_colossalai.md index 54fe7221dc7a..39b09deae085 100644 --- a/docs/source/zh-Hans/basics/launch_colossalai.md +++ b/docs/source/zh-Hans/basics/launch_colossalai.md @@ -74,7 +74,7 @@ import colossalai args = colossalai.get_default_parser().parse_args() # launch distributed environment 
-colossalai.launch(config=, +colossalai.launch(config=args.config, rank=args.rank, world_size=args.world_size, host=args.host, From 21d3af1cb7f9596698470d131ce8d38ae22c3d14 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 13:49:06 +0800 Subject: [PATCH 11/30] [booster] update booster tutorials#3717, rename colossalai booster.md --- docs/source/en/basics/booster_api.md | 2 +- docs/source/zh-Hans/basics/booster_api.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md index fa05eb44c812..ea57d197470b 100644 --- a/docs/source/en/basics/booster_api.md +++ b/docs/source/en/basics/booster_api.md @@ -39,7 +39,7 @@ Plugin is an important component that manages parallel configuration (eg: The ge {{ autodoc:colossalai.booster.Booster.load_lr_scheduler }} -## usage +## Usage In a typical workflow, you need to launch distributed environment at the beginning of training script and create objects needed (such as models, optimizers, loss function, data loaders etc.) firstly, then call `colossalai.booster` to inject features into these objects, After that, you can use our booster API and these returned objects to continue the rest of your training processes. 
A pseudo-code example is like below: diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md index 47903426f679..82c06155b68d 100644 --- a/docs/source/zh-Hans/basics/booster_api.md +++ b/docs/source/zh-Hans/basics/booster_api.md @@ -17,7 +17,7 @@ Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加 ***LowLevelZeroPlugin:*** LowLevelZeroPlugin插件封装了零冗余优化器的 1/2 阶段。阶段 1:跨数据并行工作器/GPU 的分片优化器状态。阶段 2:分片优化器状态 + 跨数据并行工作者/GPU 的梯度。 -### API of booster +### Booster接口 {{ autodoc:colossalai.booster.Booster.__init__ }} @@ -39,7 +39,7 @@ Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加 {{ autodoc:colossalai.booster.Booster.load_lr_scheduler }} -## usage +## 使用方法及示例 在使用colossalai训练时,首先需要在训练脚本的开头启动分布式环境,并创建需要使用的模型、优化器、损失函数、数据加载器等对象等。之后,调用`colossalai.booster` 将特征注入到这些对象中,您就可以使用我们的booster API去进行您接下来的训练流程。 From 98709913779dca54b5f10fca87302e1819780529 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 13:52:34 +0800 Subject: [PATCH 12/30] [booster] update booster tutorials#3717, fix --- docs/source/zh-Hans/get_started/installation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/zh-Hans/get_started/installation.md b/docs/source/zh-Hans/get_started/installation.md index 8858ae0fa262..bcf473c3c1bf 100755 --- a/docs/source/zh-Hans/get_started/installation.md +++ b/docs/source/zh-Hans/get_started/installation.md @@ -41,7 +41,7 @@ pip install -r requirements/requirements.txt CUDA_EXT=1 pip install . ``` -如果您不想安装和启用 CUDA 内核融合(使用融合优化器时强制安装)您可以不添加`CUDA_EXT=1`: +如果您不想安装和启用 CUDA 内核融合(使用融合优化器时强制安装),您可以不添加`CUDA_EXT=1`: ```shell pip install . 
From 6692c110c5d4ef7162142a45acb005c97bb5b771 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 13:55:06 +0800 Subject: [PATCH 13/30] [booster] update booster tutorials#3717, fix --- docs/source/en/get_started/installation.md | 2 +- docs/source/zh-Hans/get_started/installation.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/en/get_started/installation.md b/docs/source/en/get_started/installation.md index 93f9d074ead4..b626edb19e8e 100644 --- a/docs/source/en/get_started/installation.md +++ b/docs/source/en/get_started/installation.md @@ -29,7 +29,7 @@ CUDA_EXT=1 pip install colossalai ## Download From Source -> The version of Colossal-AI will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problem. :) +> The version of Colossal-AI will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problem. ```shell git clone https://github.com/hpcaitech/ColossalAI.git diff --git a/docs/source/zh-Hans/get_started/installation.md b/docs/source/zh-Hans/get_started/installation.md index bcf473c3c1bf..e0d726c74f64 100755 --- a/docs/source/zh-Hans/get_started/installation.md +++ b/docs/source/zh-Hans/get_started/installation.md @@ -28,7 +28,7 @@ CUDA_EXT=1 pip install colossalai ## 从源安装 -> 此文档将与版本库的主分支保持一致。如果您遇到任何问题,欢迎给我们提 issue :) +> 此文档将与版本库的主分支保持一致。如果您遇到任何问题,欢迎给我们提 issue。 ```shell git clone https://github.com/hpcaitech/ColossalAI.git From 9cc14e30b4466e418e9e6450e9ec09910ecda020 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 14:21:20 +0800 Subject: [PATCH 14/30] [booster] update tutorials#3717, update booster api doc --- docs/source/en/basics/booster_api.md | 6 ++++-- docs/source/zh-Hans/basics/booster_api.md | 5 ++++- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md index 
ea57d197470b..4c1ee2bab058 100644 --- a/docs/source/en/basics/booster_api.md +++ b/docs/source/en/basics/booster_api.md @@ -1,9 +1,13 @@ # Booster API +author: Mingyan Jiang **Prerequisite:** - [Distributed Training](../concepts/distributed_training.md) - [Colossal-AI Overview](../concepts/colossalai_overview.md) +**Example Code** +- [Train with Booster](../../../../examples/tutorial/new_api/cifar_resnet/README.md) + ## Introduction In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to inject features into your training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, user can integrate their model with our parallelism features more friendly. Also calling `colossalai.booster` is the standard procedure before you run into your training loops. In the sections below, I will cover how `colossalai.booster` works and what we should take note of. @@ -79,8 +83,6 @@ def train(): booster.load_model(new_model, save_path) ``` -if you want to run a example, [click here](../../../../examples/tutorial/new_api/cifar_resnet/README.md) - [more design detailers](https://github.com/hpcaitech/ColossalAI/discussions/3046) diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md index 82c06155b68d..53ea1db310e5 100644 --- a/docs/source/zh-Hans/basics/booster_api.md +++ b/docs/source/zh-Hans/basics/booster_api.md @@ -1,9 +1,12 @@ # booster 使用 - +作者: Mingyan Jiang **预备知识:** - [分布式训练](../concepts/distributed_training.md) - [Colossal-AI 总览](../concepts/colossalai_overview.md) +**示例代码** +- [使用booster训练](../../../../examples/tutorial/new_api/cifar_resnet/README.md) + ## 简介 在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好的将我们的并行策略整合到模型中. 
调用 `colossalai.booster` 是您进入训练循环前的基本操作。 在下面的章节中,我们将介绍 `colossalai.booster` 是如何工作的以及使用中我们要注意的细节。 From 602c3aeb8616c43b4fae3af83fb9f75c8c883de5 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 14:31:17 +0800 Subject: [PATCH 15/30] [booster] update tutorials#3717, modify file --- docs/source/en/basics/booster_api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md index 4c1ee2bab058..6f08686756c2 100644 --- a/docs/source/en/basics/booster_api.md +++ b/docs/source/en/basics/booster_api.md @@ -9,7 +9,7 @@ author: Mingyan Jiang - [Train with Booster](../../../../examples/tutorial/new_api/cifar_resnet/README.md) ## Introduction -In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to inject features into your training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, user can integrate their model with our parallelism features more friendly. Also calling `colossalai.booster` is the standard procedure before you run into your training loops. In the sections below, I will cover how `colossalai.booster` works and what we should take note of. +In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to inject features into your training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, you can integrate their model with our parallelism features more friendly. Also calling `colossalai.booster` is the standard procedure before you run into your training loops. In the sections below, I will cover how `colossalai.booster` works and what we should take note of. ### Plugin Plugin is an important component that manages parallel configuration (eg: The gemini plugin encapsulates the gemini acceleration solution). 
Currently supported plugins are as follows: From f997d87792eea28926beea8002a4d769207ed3fc Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 14:32:03 +0800 Subject: [PATCH 16/30] [booster] update tutorials#3717, modify file --- docs/source/en/basics/booster_api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md index 6f08686756c2..85fbd041deca 100644 --- a/docs/source/en/basics/booster_api.md +++ b/docs/source/en/basics/booster_api.md @@ -9,7 +9,7 @@ author: Mingyan Jiang - [Train with Booster](../../../../examples/tutorial/new_api/cifar_resnet/README.md) ## Introduction -In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to inject features into your training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, you can integrate their model with our parallelism features more friendly. Also calling `colossalai.booster` is the standard procedure before you run into your training loops. In the sections below, I will cover how `colossalai.booster` works and what we should take note of. +In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to inject features into your training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, you can integrate your model with our parallelism features more friendly. Also calling `colossalai.booster` is the standard procedure before you run into your training loops. In the sections below, I will cover how `colossalai.booster` works and what we should take note of. ### Plugin Plugin is an important component that manages parallel configuration (eg: The gemini plugin encapsulates the gemini acceleration solution). 
Currently supported plugins are as follows: From 607222438dff73795836cc9cfda7df43b3cbc493 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 14:33:38 +0800 Subject: [PATCH 17/30] [booster] update tutorials#3717, modify file --- docs/source/zh-Hans/basics/booster_api.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md index 53ea1db310e5..ea86168f4214 100644 --- a/docs/source/zh-Hans/basics/booster_api.md +++ b/docs/source/zh-Hans/basics/booster_api.md @@ -83,8 +83,6 @@ def train(): booster.load_model(new_model, save_path) ``` -如果您想运行一个可执行的例子, [请点击](../../../../examples/tutorial/new_api/cifar_resnet/README.md) - [更多的设计细节请参考](https://github.com/hpcaitech/ColossalAI/discussions/3046) From 138d29242a78bf975e7053808783e15ce36f462b Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 14:38:55 +0800 Subject: [PATCH 18/30] [booster] update tutorials#3717, modify file --- docs/source/en/basics/booster_api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md index 85fbd041deca..3df8a6ad16d1 100644 --- a/docs/source/en/basics/booster_api.md +++ b/docs/source/en/basics/booster_api.md @@ -44,7 +44,7 @@ Plugin is an important component that manages parallel configuration (eg: The ge {{ autodoc:colossalai.booster.Booster.load_lr_scheduler }} ## Usage -In a typical workflow, you need to launch distributed environment at the beginning of training script and create objects needed (such as models, optimizers, loss function, data loaders etc.) firstly, then call `colossalai.booster` to inject features into these objects, After that, you can use our booster API and these returned objects to continue the rest of your training processes. 
+In a typical workflow, you should launch distributed environment at the beginning of training script and create objects needed (such as models, optimizers, loss function, data loaders etc.) firstly, then call `colossalai.booster` to inject features into these objects, After that, you can use our booster APIs and these returned objects to continue the rest of your training processes. A pseudo-code example is like below: From 5a2ef21fc0713e678f79c8ae8945ca1e13aa2ed1 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 14:47:06 +0800 Subject: [PATCH 19/30] [booster] update tutorials#3717, modify file --- docs/source/zh-Hans/basics/booster_api.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md index ea86168f4214..3e7f275188f7 100644 --- a/docs/source/zh-Hans/basics/booster_api.md +++ b/docs/source/zh-Hans/basics/booster_api.md @@ -8,17 +8,19 @@ - [使用booster训练](../../../../examples/tutorial/new_api/cifar_resnet/README.md) ## 简介 -在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好的将我们的并行策略整合到模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。 +在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好地将我们的并行策略整合到模型中. 
调用 `colossalai.booster` 是您进入训练循环前的基本操作。 在下面的章节中,我们将介绍 `colossalai.booster` 是如何工作的以及使用中我们要注意的细节。 ### Plugin Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加速方案)。目前支持的插件如下: -***GeminiPlugin:*** GeminiPlugin插件封装了 gemini 加速解决方案,即具有基于块的内存管理的 ZeRO优化方案。 +***GeminiPlugin:*** GeminiPlugin插件封装了 gemini 加速解决方案,即具有基于块内存管理的 ZeRO优化方案。 -***TorchDDPPlugin:*** TorchDDPPlugin插件封装了DDP加速方案,实现了模块级别的数据并行,可以跨多机运行。 +***TorchDDPPlugin:*** TorchDDPPlugin插件封装了DDP加速方案,实现了模型级别的数据并行,可以跨多机运行。 -***LowLevelZeroPlugin:*** LowLevelZeroPlugin插件封装了零冗余优化器的 1/2 阶段。阶段 1:跨数据并行工作器/GPU 的分片优化器状态。阶段 2:分片优化器状态 + 跨数据并行工作者/GPU 的梯度。 +***LowLevelZeroPlugin:*** LowLevelZeroPlugin插件封装了零冗余优化器的 1/2 阶段。阶段 1:切分优化器参数,分发到各并发进程或并发GPU上。阶段 2:切分优化器参数及梯度到各并发进程或并发GPU上。 + +***LowLevelZeroPlugin:*** This plugin wraps the 1/2 stage of Zero Redundancy Optimizer. Stage 1 : Shards optimizer states across data parallel workers/GPUs. Stage 2 : Shards optimizer states + gradients across data parallel workers/GPUs. ### Booster接口 @@ -44,7 +46,7 @@ Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加 ## 使用方法及示例 -在使用colossalai训练时,首先需要在训练脚本的开头启动分布式环境,并创建需要使用的模型、优化器、损失函数、数据加载器等对象等。之后,调用`colossalai.booster` 将特征注入到这些对象中,您就可以使用我们的booster API去进行您接下来的训练流程。 +在使用colossalai训练时,首先需要在训练脚本的开头启动分布式环境,并创建需要使用的模型、优化器、损失函数、数据加载器等对象。之后,调用`colossalai.booster` 将特征注入到这些对象中,您就可以使用我们的booster API去进行您接下来的训练流程。 以下是一个伪代码示例,将展示如何使用我们的booster API进行模型训练: From 9c20d0ac7936cff346da0e27f758691a4642e1c4 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 14:49:46 +0800 Subject: [PATCH 20/30] [booster] update tutorials#3717, modify file --- docs/source/zh-Hans/basics/booster_api.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md index 3e7f275188f7..1d741550356e 100644 --- a/docs/source/zh-Hans/basics/booster_api.md +++ b/docs/source/zh-Hans/basics/booster_api.md @@ -8,13 +8,13 @@ - 
[使用booster训练](../../../../examples/tutorial/new_api/cifar_resnet/README.md) ## 简介 -在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好地将我们的并行策略整合到模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。 -在下面的章节中,我们将介绍 `colossalai.booster` 是如何工作的以及使用中我们要注意的细节。 +在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好地将我们的并行策略整合到待训练模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。 +在下面的章节中,我们将介绍 `colossalai.booster` 是如何工作的以及使用时我们要注意的细节。 ### Plugin Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加速方案)。目前支持的插件如下: -***GeminiPlugin:*** GeminiPlugin插件封装了 gemini 加速解决方案,即具有基于块内存管理的 ZeRO优化方案。 +***GeminiPlugin:*** GeminiPlugin插件封装了 gemini 加速解决方案,即基于块内存管理的 ZeRO优化方案。 ***TorchDDPPlugin:*** TorchDDPPlugin插件封装了DDP加速方案,实现了模型级别的数据并行,可以跨多机运行。 From 08101d05b1e7bf89eacad21fb94c804d3cbb36b2 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 14:50:25 +0800 Subject: [PATCH 21/30] [booster] update tutorials#3717, modify file --- docs/source/zh-Hans/basics/booster_api.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md index 1d741550356e..38517a676e9f 100644 --- a/docs/source/zh-Hans/basics/booster_api.md +++ b/docs/source/zh-Hans/basics/booster_api.md @@ -20,8 +20,6 @@ Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加 ***LowLevelZeroPlugin:*** LowLevelZeroPlugin插件封装了零冗余优化器的 1/2 阶段。阶段 1:切分优化器参数,分发到各并发进程或并发GPU上。阶段 2:切分优化器参数及梯度到各并发进程或并发GPU上。 -***LowLevelZeroPlugin:*** This plugin wraps the 1/2 stage of Zero Redundancy Optimizer. Stage 1 : Shards optimizer states across data parallel workers/GPUs. Stage 2 : Shards optimizer states + gradients across data parallel workers/GPUs. 
- ### Booster接口 {{ autodoc:colossalai.booster.Booster.__init__ }} From ba4d77a5c3fb138a179ce289573b56f5f09c16c2 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 15:02:25 +0800 Subject: [PATCH 22/30] [booster] update tutorials#3717, fix reference link --- docs/source/en/basics/booster_api.md | 2 +- docs/source/zh-Hans/basics/booster_api.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md index 3df8a6ad16d1..54df1215eff2 100644 --- a/docs/source/en/basics/booster_api.md +++ b/docs/source/en/basics/booster_api.md @@ -6,7 +6,7 @@ author: Mingyan Jiang - [Colossal-AI Overview](../concepts/colossalai_overview.md) **Example Code** -- [Train with Booster](../../../../examples/tutorial/new_api/cifar_resnet/README.md) +- [Train with Booster](ColossalAI/examples/tutorial/new_api/README.md) ## Introduction In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to inject features into your training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, you can integrate your model with our parallelism features more friendly. Also calling `colossalai.booster` is the standard procedure before you run into your training loops. In the sections below, I will cover how `colossalai.booster` works and what we should take note of. 
diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md index 38517a676e9f..366c34f85225 100644 --- a/docs/source/zh-Hans/basics/booster_api.md +++ b/docs/source/zh-Hans/basics/booster_api.md @@ -5,7 +5,7 @@ - [Colossal-AI 总览](../concepts/colossalai_overview.md) **示例代码** -- [使用booster训练](../../../../examples/tutorial/new_api/cifar_resnet/README.md) +- [使用booster训练](ColossalAI/examples/tutorial/new_api/README.md) ## 简介 在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好地将我们的并行策略整合到待训练模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。 From 8a4feb1c5e5479414090de3f73dcebbab77f1047 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 15:09:41 +0800 Subject: [PATCH 23/30] [booster] update tutorials#3717, fix reference link --- docs/source/en/basics/booster_api.md | 2 +- docs/source/zh-Hans/basics/booster_api.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md index 54df1215eff2..3df8a6ad16d1 100644 --- a/docs/source/en/basics/booster_api.md +++ b/docs/source/en/basics/booster_api.md @@ -6,7 +6,7 @@ author: Mingyan Jiang - [Colossal-AI Overview](../concepts/colossalai_overview.md) **Example Code** -- [Train with Booster](ColossalAI/examples/tutorial/new_api/README.md) +- [Train with Booster](../../../../examples/tutorial/new_api/cifar_resnet/README.md) ## Introduction In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to inject features into your training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, you can integrate your model with our parallelism features more friendly. Also calling `colossalai.booster` is the standard procedure before you run into your training loops. In the sections below, I will cover how `colossalai.booster` works and what we should take note of. 
diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md index 366c34f85225..38517a676e9f 100644 --- a/docs/source/zh-Hans/basics/booster_api.md +++ b/docs/source/zh-Hans/basics/booster_api.md @@ -5,7 +5,7 @@ - [Colossal-AI 总览](../concepts/colossalai_overview.md) **示例代码** -- [使用booster训练](ColossalAI/examples/tutorial/new_api/README.md) +- [使用booster训练](../../../../examples/tutorial/new_api/cifar_resnet/README.md) ## 简介 在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好地将我们的并行策略整合到待训练模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。 From e045350acf889f6c17a86af4a62a784eeff24eac Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 15:16:17 +0800 Subject: [PATCH 24/30] [booster] update tutorials#3717, fix reference link --- docs/source/en/basics/booster_api.md | 2 +- docs/source/zh-Hans/basics/booster_api.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md index 3df8a6ad16d1..a7a31446a348 100644 --- a/docs/source/en/basics/booster_api.md +++ b/docs/source/en/basics/booster_api.md @@ -6,7 +6,7 @@ author: Mingyan Jiang - [Colossal-AI Overview](../concepts/colossalai_overview.md) **Example Code** -- [Train with Booster](../../../../examples/tutorial/new_api/cifar_resnet/README.md) +- [Train with Booster](/examples/tutorial/new_api/cifar_resnet/README.md) ## Introduction In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to inject features into your training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, you can integrate your model with our parallelism features more friendly. Also calling `colossalai.booster` is the standard procedure before you run into your training loops. In the sections below, I will cover how `colossalai.booster` works and what we should take note of. 
diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md index 38517a676e9f..ea8d677f9230 100644 --- a/docs/source/zh-Hans/basics/booster_api.md +++ b/docs/source/zh-Hans/basics/booster_api.md @@ -5,7 +5,7 @@ - [Colossal-AI 总览](../concepts/colossalai_overview.md) **示例代码** -- [使用booster训练](../../../../examples/tutorial/new_api/cifar_resnet/README.md) +- [使用booster训练](/examples/tutorial/new_api/cifar_resnet/README.md) ## 简介 在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好地将我们的并行策略整合到待训练模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。 From f4a0bcf06c0fdf7ec71cce163082aa615ddcc080 Mon Sep 17 00:00:00 2001 From: Mingyan Jiang <1829166702@qq.com> Date: Wed, 17 May 2023 15:25:10 +0800 Subject: [PATCH 25/30] [booster] update tutorials#3717, fix reference link --- docs/source/en/basics/booster_api.md | 2 +- docs/source/zh-Hans/basics/booster_api.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md index a7a31446a348..3df8a6ad16d1 100644 --- a/docs/source/en/basics/booster_api.md +++ b/docs/source/en/basics/booster_api.md @@ -6,7 +6,7 @@ author: Mingyan Jiang - [Colossal-AI Overview](../concepts/colossalai_overview.md) **Example Code** -- [Train with Booster](/examples/tutorial/new_api/cifar_resnet/README.md) +- [Train with Booster](../../../../examples/tutorial/new_api/cifar_resnet/README.md) ## Introduction In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to inject features into your training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, you can integrate your model with our parallelism features more friendly. Also calling `colossalai.booster` is the standard procedure before you run into your training loops. In the sections below, I will cover how `colossalai.booster` works and what we should take note of. 
diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md
index ea8d677f9230..38517a676e9f 100644
--- a/docs/source/zh-Hans/basics/booster_api.md
+++ b/docs/source/zh-Hans/basics/booster_api.md
@@ -5,7 +5,7 @@
 - [Colossal-AI 总览](../concepts/colossalai_overview.md)
 
 **示例代码**
-- [使用booster训练](/examples/tutorial/new_api/cifar_resnet/README.md)
+- [使用booster训练](../../../../examples/tutorial/new_api/cifar_resnet/README.md)
 
 ## 简介
 在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好地将我们的并行策略整合到待训练模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。

From e9cfb5cd77bfb46a2694eea8a0164441626cd53f Mon Sep 17 00:00:00 2001
From: Mingyan Jiang <1829166702@qq.com>
Date: Wed, 17 May 2023 15:35:17 +0800
Subject: [PATCH 26/30] [booster] update tutorials#3717, fix reference link

---
 docs/source/en/basics/booster_api.md      | 4 ++--
 docs/source/zh-Hans/basics/booster_api.md | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md
index 3df8a6ad16d1..14dde65e43d6 100644
--- a/docs/source/en/basics/booster_api.md
+++ b/docs/source/en/basics/booster_api.md
@@ -6,7 +6,7 @@ author: Mingyan Jiang
 - [Colossal-AI Overview](../concepts/colossalai_overview.md)
 
 **Example Code**
-- [Train with Booster](../../../../examples/tutorial/new_api/cifar_resnet/README.md)
+- [Train with Booster](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/new_api/cifar_resnet/README.md)
 
 ## Introduction
 In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to inject features into your training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, you can integrate your model with our parallelism features more friendly. Also calling `colossalai.booster` is the standard procedure before you run into your training loops. In the sections below, I will cover how `colossalai.booster` works and what we should take note of.
@@ -83,7 +83,7 @@ def train():
 
     booster.load_model(new_model, save_path)
 ```
 
-[more design detailers](https://github.com/hpcaitech/ColossalAI/discussions/3046)
+[more design details](https://github.com/hpcaitech/ColossalAI/discussions/3046)
 
 
diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md
index 38517a676e9f..83d50d90fb17 100644
--- a/docs/source/zh-Hans/basics/booster_api.md
+++ b/docs/source/zh-Hans/basics/booster_api.md
@@ -5,7 +5,7 @@
 - [Colossal-AI 总览](../concepts/colossalai_overview.md)
 
 **示例代码**
-- [使用booster训练](../../../../examples/tutorial/new_api/cifar_resnet/README.md)
+- [使用booster训练](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/new_api/cifar_resnet/README.md)
 
 ## 简介
 在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好地将我们的并行策略整合到待训练模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。

From bf51a6cad0eba4d534c2db0b12a24e8f54b96902 Mon Sep 17 00:00:00 2001
From: Mingyan Jiang <1829166702@qq.com>
Date: Wed, 17 May 2023 15:47:18 +0800
Subject: [PATCH 27/30] [booster] update tutorials#3717, fix reference link

---
 docs/source/en/basics/booster_api.md      | 2 +-
 docs/source/zh-Hans/basics/booster_api.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md
index 14dde65e43d6..872c5021317d 100644
--- a/docs/source/en/basics/booster_api.md
+++ b/docs/source/en/basics/booster_api.md
@@ -23,7 +23,7 @@ Plugin is an important component that manages parallel configuration (eg: The ge
 
 ### API of booster
 
-{{ autodoc:colossalai.booster.Booster.__init__ }}
+{{ autodoc:colossalai.booster.Booster }}
 
 {{ autodoc:colossalai.booster.Booster.boost }}
 
diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md
index 83d50d90fb17..2f4bd07710a2 100644
--- a/docs/source/zh-Hans/basics/booster_api.md
+++ b/docs/source/zh-Hans/basics/booster_api.md
@@ -22,7 +22,7 @@ Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加
 
 ### Booster接口
 
-{{ autodoc:colossalai.booster.Booster.__init__ }}
+{{ autodoc:colossalai.booster.Booster }}
 
 {{ autodoc:colossalai.booster.Booster.boost }}
 

From 591fa122ee0d3cd9bbfba346c69ba4e1cee0dfd1 Mon Sep 17 00:00:00 2001
From: Mingyan Jiang <1829166702@qq.com>
Date: Wed, 17 May 2023 15:53:50 +0800
Subject: [PATCH 28/30] [booster] update tutorials#3717, fix reference link

---
 docs/source/en/basics/booster_api.md      | 2 +-
 docs/source/zh-Hans/basics/booster_api.md | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md
index 872c5021317d..18dec4500f76 100644
--- a/docs/source/en/basics/booster_api.md
+++ b/docs/source/en/basics/booster_api.md
@@ -1,5 +1,5 @@
 # Booster API
-author: Mingyan Jiang
+Author: [Mingyan Jiang](https://github.com/jiangmingyan)
 **Prerequisite:**
 - [Distributed Training](../concepts/distributed_training.md)
 - [Colossal-AI Overview](../concepts/colossalai_overview.md)
diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md
index 2f4bd07710a2..5ed1b2c37f39 100644
--- a/docs/source/zh-Hans/basics/booster_api.md
+++ b/docs/source/zh-Hans/basics/booster_api.md
@@ -1,5 +1,6 @@
 # booster 使用
-作者: Mingyan Jiang
+作者: [Mingyan Jiang](https://github.com/jiangmingyan)
+
 **预备知识:**
 - [分布式训练](../concepts/distributed_training.md)
 - [Colossal-AI 总览](../concepts/colossalai_overview.md)

From 0f5703c2f1c84d32a07411e5c3275ad81b52a2a7 Mon Sep 17 00:00:00 2001
From: Mingyan Jiang <1829166702@qq.com>
Date: Wed, 17 May 2023 16:23:01 +0800
Subject: [PATCH 29/30] [booster] update tutorials#3713

---
 docs/source/zh-Hans/basics/booster_api.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md
index 5ed1b2c37f39..5f6813d5c239 100644
--- a/docs/source/zh-Hans/basics/booster_api.md
+++ b/docs/source/zh-Hans/basics/booster_api.md
@@ -19,7 +19,7 @@ Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加
 
 ***TorchDDPPlugin:*** TorchDDPPlugin插件封装了DDP加速方案,实现了模型级别的数据并行,可以跨多机运行。
 
-***LowLevelZeroPlugin:*** LowLevelZeroPlugin插件封装了零冗余优化器的 1/2 阶段。阶段 1:切分优化器参数,分发到各并发进程或并发GPU上。阶段 2:切分优化器参数及梯度到各并发进程或并发GPU上。
+***LowLevelZeroPlugin:*** LowLevelZeroPlugin插件封装了零冗余优化器的 1/2 阶段。阶段 1:切分优化器参数,分发到各并发进程或并发GPU上。阶段 2:切分优化器参数及梯度,分发到各并发进程或并发GPU上。
 
 ### Booster接口

From 274fc1a5be02c652705cb35f22f0ade11e5a1698 Mon Sep 17 00:00:00 2001
From: Mingyan Jiang <1829166702@qq.com>
Date: Wed, 17 May 2023 19:08:15 +0800
Subject: [PATCH 30/30] [booster] update tutorials#3713, modify file

---
 docs/source/zh-Hans/basics/booster_api.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md
index 5f6813d5c239..5410cc213fd2 100644
--- a/docs/source/zh-Hans/basics/booster_api.md
+++ b/docs/source/zh-Hans/basics/booster_api.md
@@ -12,8 +12,8 @@
 在我们的新设计中, `colossalai.booster` 代替 `colossalai.initialize` 将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 使用booster API, 您可以更友好地将我们的并行策略整合到待训练模型中. 调用 `colossalai.booster` 是您进入训练循环前的基本操作。
 在下面的章节中,我们将介绍 `colossalai.booster` 是如何工作的以及使用时我们要注意的细节。
 
-### Plugin
-Plugin是管理并行配置的重要组件(eg:gemini插件封装了gemini加速方案)。目前支持的插件如下:
+### Booster插件
+Booster插件是管理并行配置的重要组件(eg:gemini插件封装了gemini加速方案)。目前支持的插件如下:
 
 ***GeminiPlugin:*** GeminiPlugin插件封装了 gemini 加速解决方案,即基于块内存管理的 ZeRO优化方案。
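The patch series above keeps referring to the Booster/plugin call pattern — construct a plugin, hand it to a `Booster`, and let `boost` return the wrapped training components — without a runnable illustration in this excerpt. The sketch below mimics that shape in plain Python. It is NOT Colossal-AI code: `Plugin`, `MockDDPPlugin`, and the dict wrapper are invented stand-ins, written only to show the injection pattern the docs describe.

```python
class Plugin:
    """Base stand-in: a plugin decides how each training component is wrapped."""

    def configure(self, model, optimizer=None, dataloader=None):
        # Default plugin: return the components unchanged.
        return model, optimizer, dataloader


class MockDDPPlugin(Plugin):
    """Invented stand-in for a DDP-style plugin: tags the model as wrapped."""

    def configure(self, model, optimizer=None, dataloader=None):
        # A real plugin would wrap the model for data parallelism;
        # here we just record the wrapping in a dict.
        wrapped_model = {"wrapped_by": "ddp", "module": model}
        return wrapped_model, optimizer, dataloader


class Booster:
    """Mimics the call shape of the booster API: boost() injects the
    plugin's parallel features into every training component."""

    def __init__(self, plugin=None):
        self.plugin = plugin if plugin is not None else Plugin()

    def boost(self, model, optimizer=None, dataloader=None):
        return self.plugin.configure(model, optimizer, dataloader)


# Usage: pick a plugin, build a booster, boost the components before training.
booster = Booster(plugin=MockDDPPlugin())
model, optimizer, dataloader = booster.boost({"layers": 2}, optimizer="sgd")
print(model["wrapped_by"])  # ddp
```

The point of the pattern is that the training loop only ever sees the returned (already wrapped) components, so swapping `MockDDPPlugin` for a Gemini- or ZeRO-style plugin changes the parallelism strategy without touching the loop itself.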