From d814f6d795e1790f7f42cad12619c79caaa88e31 Mon Sep 17 00:00:00 2001 From: Baizhou Zhang Date: Tue, 12 Sep 2023 10:55:01 +0800 Subject: [PATCH 01/10] create shardformer doc files --- docs/source/en/features/shardformer.md | 159 ++++++++++++++++++++ docs/source/zh-Hans/features/shardformer.md | 0 2 files changed, 159 insertions(+) create mode 100644 docs/source/en/features/shardformer.md create mode 100644 docs/source/zh-Hans/features/shardformer.md diff --git a/docs/source/en/features/shardformer.md b/docs/source/en/features/shardformer.md new file mode 100644 index 000000000000..7faf977c60dc --- /dev/null +++ b/docs/source/en/features/shardformer.md @@ -0,0 +1,159 @@ +# Shardformer + +Author: [Baizhou Zhang](https://github.com/Fridge003) + +**Prerequisite** +- [Paradigms of Parallelism](../concepts/paradigms_of_parallelism.md) + +**Example Code** +- [ColossalAI-Examples ResNet with pipeline](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/pipeline_parallel) + +**Related Paper** +- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) +- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965) +- [Flashattention: Fast and memory-efficient exact attention with io-awareness](https://arxiv.org/abs/2205.14135) +- sequence parallel + +## Quick introduction + +In this tutorial, you will learn how to use pipeline parallel. In Colossal-AI, we use 1F1B pipeline, introduced by Nvidia. In this case, ViT and Imagenet are too large to use. Therefore, here we use ResNet and Cifar as example. + +## Table Of Content + +In this tutorial we will cover: + +1. Introduction of 1F1B pipeline. +2. Usage of non-interleaved and interleaved schedule. +3. Training ResNet with pipeline. + +## Introduction of 1F1B pipeline + +First of all, we will introduce you GPipe for your better understanding. + +
+ +
Figure1: GPipe. This figure is from Megatron-LM paper.
+
+ + +As you can see, for GPipe, only when the forward passes of all microbatches in a batch finish, the backward passes would be executed. + +In general, 1F1B(one forward pass followed by one backward pass) is more efficient than GPipe(in memory or both memory and time). There are two schedules of 1F1B pipeline, the non-interleaved and the interleaved. The figures are shown below. + +
+ +
Figure2: This figure is from Megatron-LM paper. The top part shows the default non-interleaved schedule. And the bottom part shows the interleaved schedule.
+
+ +### Non-interleaved Schedule + +The non-interleaved schedule can be divided into three stages. The first stage is the warm-up stage, where workers perform differing numbers of forward passes. At the following stage, workers perform one forward pass followed by one backward pass. Workers will finish backward passes at the last stage. + +This mode is more memory-efficient than GPipe. However, it would take the same time to finish a turn of passes as GPipe. + +### Interleaved Schedule + +This schedule requires **the number of microbatches to be an integer multiple of the stage of pipeline**. + +In this schedule, each device can perform computation for multiple subsets of layers(called a model chunk) instead of a single contiguous set of layers. i.e. Before device 1 had layer 1-4; device 2 had layer 5-8; and so on. But now device 1 has layer 1,2,9,10; device 2 has layer 3,4,11,12; and so on. With this scheme, each device in the pipeline is assigned multiple pipeline stages and each pipeline stage has less computation. + +This mode is both memory-efficient and time-efficient. + +## Usage of non-interleaved and interleaved schedule + +In Colossal-AI, we provided both non-interleaved(as `PipelineSchedule`) and interleaved schedule(as `InterleavedPipelineSchedule`). + +You just need to set `NUM_MICRO_BATCHES` in config file and set `NUM_CHUNKS` in config file if you want to use Interleaved Pipeline Schedule. If you certainly know the shape of each pipeline stage's output tensor and the shapes are all the same, you can set `TENSOR_SHAPE` in config file to further reduce communication. Otherwise, you can just ignore `tensor_shape`, and the shape will be exchanged over pipeline stages automatically. Then we will generate an appropriate schedule for you. + +## Training ResNet with pipeline + +Let's build the `ResNet` model first with Colossal PipelinableContext: +```python +import os +from typing import Callable, List, Optional, Type, Union +import torch +import torch.nn as nn +import colossalai +import colossalai.nn as col_nn + +from colossalai.core import global_context as gpc +from colossalai.logging import disable_existing_loggers, get_dist_logger +from colossalai.legacy.trainer import Trainer, hooks +from colossalai.utils import MultiTimer, get_dataloader +from colossalai.context import ParallelMode +from colossalai.pipeline.pipelinable import PipelinableContext + +from titans.dataloader.cifar10 import build_cifar +from torchvision.models import resnet50 +from torchvision.models.resnet import BasicBlock, Bottleneck, conv1x1 + +# Define some config +BATCH_SIZE = 64 +NUM_EPOCHS = 2 +NUM_CHUNKS = 1 +CONFIG = dict(NUM_MICRO_BATCHES=4, parallel=dict(pipeline=2)) + +# Train +disable_existing_loggers() +parser = colossalai.get_default_parser() +args = parser.parse_args() +colossalai.launch_from_torch(backend=args.backend, config=CONFIG) +logger = get_dist_logger() +pipelinable = PipelinableContext() + +# build model +with pipelinable: + model = resnet50() +``` + +Define an execution sequence. +```python +exec_seq = [ + 'conv1', 'bn1', 'relu', 'maxpool', 'layer1', 'layer2', 'layer3', 'layer4', 'avgpool', + (lambda x: torch.flatten(x, 1), "behind"), 'fc' +] +pipelinable.to_layer_list(exec_seq) +``` + +Partition the model into pipeline. +```python +model = pipelinable.partition(NUM_CHUNKS, gpc.pipeline_parallel_size, gpc.get_local_rank(ParallelMode.PIPELINE)) +``` + +In this tutorial, we use `Trainer` to train `ResNet`: +```python +# build criterion +criterion = nn.CrossEntropyLoss() + +# optimizer +optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) + +# build dataloader +root = os.environ.get('DATA', './data') +train_dataloader, test_dataloader = build_cifar(BATCH_SIZE, root, padding=4, crop=32, resize=32) + +lr_scheduler = col_nn.lr_scheduler.LinearWarmupLR(optimizer, NUM_EPOCHS, warmup_steps=1) +engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model, optimizer, criterion, + train_dataloader, test_dataloader, + lr_scheduler) +timer = MultiTimer() + +trainer = Trainer(engine=engine, timer=timer, logger=logger) + +hook_list = [ + hooks.LossHook(), + hooks.AccuracyHook(col_nn.metric.Accuracy()), + hooks.LogMetricByEpochHook(logger), + hooks.LRSchedulerHook(lr_scheduler, by_epoch=True) +] + +trainer.fit(train_dataloader=train_dataloader, + epochs=NUM_EPOCHS, + test_dataloader=test_dataloader, + test_interval=1, + hooks=hook_list, + display_progress=True) +``` + +We use `2` pipeline stages and the batch will be split into `4` micro batches. + diff --git a/docs/source/zh-Hans/features/shardformer.md b/docs/source/zh-Hans/features/shardformer.md new file mode 100644 index 000000000000..e69de29bb2d1 From 5428e83ffaef0e4e9095fa2d388d99a891af295e Mon Sep 17 00:00:00 2001 From: Baizhou Zhang Date: Tue, 12 Sep 2023 10:59:28 +0800 Subject: [PATCH 02/10] add docstring for seq-parallel --- colossalai/booster/plugin/hybrid_parallel_plugin.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/colossalai/booster/plugin/hybrid_parallel_plugin.py b/colossalai/booster/plugin/hybrid_parallel_plugin.py index fc04f3ecd8e7..415fc50aa1f1 100644 --- a/colossalai/booster/plugin/hybrid_parallel_plugin.py +++ b/colossalai/booster/plugin/hybrid_parallel_plugin.py @@ -245,7 +245,9 @@ class HybridParallelPlugin(PipelinePluginBase): Defaults to False. enable_fused_normalization (bool, optional): Whether to switch on fused normalization. Defaults to False. enable_flash_attention (bool, optional): Whether to switch on flash attention. Defaults to False. - enable_jit_fused (bool, optional): Whether to switch on JIT. Default to Falase. + enable_jit_fused (bool, optional): Whether to switch on JIT. Default to False. + enable_sequence_parallelism (bool): Whether to turn on sequence parallelism. Defaults to False. + enable_sequence_overlap (bool): Whether to turn on sequence overlap. Defaults to False. num_microbatches (int, optional): Number of microbatches when using pipeline parallelism. Defaults to None. microbatch_size (int, optional): Microbatch size when using pipeline parallelism. Either ``num_microbatches`` or ``microbatch_size`` should be provided if using pipeline. From 9c7ebb96525321f2c32ba6877822f12ce86848fc Mon Sep 17 00:00:00 2001 From: Baizhou Zhang Date: Tue, 12 Sep 2023 11:09:50 +0800 Subject: [PATCH 03/10] update ShardConfig docstring --- colossalai/booster/plugin/hybrid_parallel_plugin.py | 10 +++++----- colossalai/shardformer/shard/shard_config.py | 10 +++------- 2 files changed, 8 insertions(+), 12 deletions(-) diff --git a/colossalai/booster/plugin/hybrid_parallel_plugin.py b/colossalai/booster/plugin/hybrid_parallel_plugin.py index 415fc50aa1f1..3fbeebcc4110 100644 --- a/colossalai/booster/plugin/hybrid_parallel_plugin.py +++ b/colossalai/booster/plugin/hybrid_parallel_plugin.py @@ -243,11 +243,11 @@ class HybridParallelPlugin(PipelinePluginBase): enable_all_optimization (bool, optional): Whether to switch on all the optimizations supported by Shardformer. Currently all the optimization methods include fused normalization, flash attention and JIT. Defaults to False. - enable_fused_normalization (bool, optional): Whether to switch on fused normalization. Defaults to False. - enable_flash_attention (bool, optional): Whether to switch on flash attention. Defaults to False. - enable_jit_fused (bool, optional): Whether to switch on JIT. Default to False. - enable_sequence_parallelism (bool): Whether to turn on sequence parallelism. Defaults to False. - enable_sequence_overlap (bool): Whether to turn on sequence overlap. Defaults to False. + enable_fused_normalization (bool, optional): Whether to switch on fused normalization in Shardformer. Defaults to False. + enable_flash_attention (bool, optional): Whether to switch on flash attention in Shardformer. Defaults to False. + enable_jit_fused (bool, optional): Whether to switch on JIT in Shardformer. Default to False. + enable_sequence_parallelism (bool): Whether to turn on sequence parallelism in Shardformer. Defaults to False. + enable_sequence_overlap (bool): Whether to turn on sequence overlap in Shardformer. Defaults to False. num_microbatches (int, optional): Number of microbatches when using pipeline parallelism. Defaults to None. microbatch_size (int, optional): Microbatch size when using pipeline parallelism. Either ``num_microbatches`` or ``microbatch_size`` should be provided if using pipeline. diff --git a/colossalai/shardformer/shard/shard_config.py b/colossalai/shardformer/shard/shard_config.py index 4380ac30814d..62eebc26891b 100644 --- a/colossalai/shardformer/shard/shard_config.py +++ b/colossalai/shardformer/shard/shard_config.py @@ -19,9 +19,12 @@ class ShardConfig: pipeline_stage_manager (Optional[PipelineStageManager]): The pipeline stage manager, defaults to None, which means no pipeline. enable_tensor_parallelism (bool): Whether to turn on tensor parallelism, default is True. enable_fused_normalization (bool): Whether to use fused layernorm, default is False. + enable_flash_attention (bool, optional): Whether to switch on flash attention, default is False. + enable_jit_fused (bool, optional): Whether to switch on JIT, default is False. enable_all_optimization (bool): Whether to turn on all optimization, default is False. enable_sequence_parallelism (bool): Whether to turn on sequence parallelism, default is False. enable_sequence_overlap (bool): Whether to turn on sequence overlap, default is False. + inference_only (bool): Only doing forward passing if True, default is False. """ tensor_parallel_process_group: Optional[ProcessGroup] = None pipeline_stage_manager: Optional[PipelineStageManager] = None @@ -33,14 +36,7 @@ class ShardConfig: enable_sequence_parallelism: bool = False enable_sequence_overlap: bool = False inference_only: bool = False - enable_sequence_parallelism: bool = False - enable_sequence_overlap: bool = False - - # pipeline_parallel_size: int - # data_parallel_size: int # tensor_parallel_mode: Literal['1d', '2d', '2.5d', '3d'] - # inference_only: bool = True - # gather_output: bool = True @property def tensor_parallel_size(self): From 4c7088b04c91d564244f6d740d26614246f8644c Mon Sep 17 00:00:00 2001 From: Baizhou Zhang Date: Tue, 12 Sep 2023 12:15:55 +0800 Subject: [PATCH 04/10] add links to llama example --- docs/source/en/basics/booster_api.md | 3 ++- docs/source/en/features/shardformer.md | 7 ++++--- docs/source/zh-Hans/basics/booster_api.md | 3 ++- 3 files changed, 8 insertions(+), 5 deletions(-) diff --git a/docs/source/en/basics/booster_api.md b/docs/source/en/basics/booster_api.md index 7962707514de..392251ef06b2 100644 --- a/docs/source/en/basics/booster_api.md +++ b/docs/source/en/basics/booster_api.md @@ -9,7 +9,8 @@ Author: [Mingyan Jiang](https://github.com/jiangmingyan), [Jianghai Chen](https: **Example Code** -- [Train with Booster](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/new_api/cifar_resnet) +- [Train ResNet on CIFAR-10 with Booster](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/new_api/cifar_resnet) +- [Train LLaMA-1/2 on RedPajama with Booster](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/llama2) ## Introduction diff --git a/docs/source/en/features/shardformer.md b/docs/source/en/features/shardformer.md index 7faf977c60dc..3953434d7ca1 100644 --- a/docs/source/en/features/shardformer.md +++ b/docs/source/en/features/shardformer.md @@ -6,13 +6,14 @@ Author: [Baizhou Zhang](https://github.com/Fridge003) - [Paradigms of Parallelism](../concepts/paradigms_of_parallelism.md) **Example Code** -- [ColossalAI-Examples ResNet with pipeline](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/pipeline_parallel) + **Related Paper** - [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) - [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965) -- [Flashattention: Fast and memory-efficient exact attention with io-awareness](https://arxiv.org/abs/2205.14135) -- sequence parallel +- [FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning](https://arxiv.org/abs/2307.08691) +- [Sequence Parallelism: Long Sequence Training from System Perspective](https://arxiv.org/abs/2105.13120) + ## Quick introduction diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md index 573aab1c8a07..5d0f19d52139 100644 --- a/docs/source/zh-Hans/basics/booster_api.md +++ b/docs/source/zh-Hans/basics/booster_api.md @@ -11,7 +11,8 @@ -- [使用 booster 训练](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/new_api/cifar_resnet) +- [使用Booster在CIFAR-10数据集上训练ResNet](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/new_api/cifar_resnet) +- [使用Booster在RedPajama数据集上训练Llama-1/2](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/llama2) ## 简介 From e6d25783c20fe62951a3de89595e12adbfba89d2 Mon Sep 17 00:00:00 2001 From: Baizhou Zhang Date: Tue, 12 Sep 2023 12:26:32 +0800 Subject: [PATCH 05/10] add outdated massage --- docs/source/en/features/1D_tensor_parallel.md | 2 ++ docs/source/en/features/shardformer.md | 2 +- docs/source/zh-Hans/features/1D_tensor_parallel.md | 2 ++ 3 files changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/source/en/features/1D_tensor_parallel.md b/docs/source/en/features/1D_tensor_parallel.md index 7157af210bc5..28ec9196543d 100644 --- a/docs/source/en/features/1D_tensor_parallel.md +++ b/docs/source/en/features/1D_tensor_parallel.md @@ -2,6 +2,8 @@ Author: Zhengda Bian, Yongbin Li +> ⚠️ The information on this page is outdated and will be deprecated. Please check [Shardformer](./shardformer.md) for more information. + **Prerequisite** - [Define Your Configuration](../basics/define_your_config.md) - [Configure Parallelization](../basics/configure_parallelization.md) diff --git a/docs/source/en/features/shardformer.md b/docs/source/en/features/shardformer.md index 3953434d7ca1..4432fbea1e86 100644 --- a/docs/source/en/features/shardformer.md +++ b/docs/source/en/features/shardformer.md @@ -6,7 +6,7 @@ Author: [Baizhou Zhang](https://github.com/Fridge003) - [Paradigms of Parallelism](../concepts/paradigms_of_parallelism.md) **Example Code** - +- [Training with Shardformer](https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/shardformer/examples) **Related Paper** - [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) diff --git a/docs/source/zh-Hans/features/1D_tensor_parallel.md b/docs/source/zh-Hans/features/1D_tensor_parallel.md index 4dd45e8783c3..7246d9d500b2 100644 --- a/docs/source/zh-Hans/features/1D_tensor_parallel.md +++ b/docs/source/zh-Hans/features/1D_tensor_parallel.md @@ -2,6 +2,8 @@ 作者: Zhengda Bian, Yongbin Li +> ⚠️ 此页面上的信息已经过时并将被废弃。请在[Shardformer](./shardformer.md)页面查阅更新。 + **前置教程** - [定义配置文件](../basics/define_your_config.md) - [并行配置](../basics/configure_parallelization.md) From 074a09bf0ffeedd275169579895ee1cfd299ecd9 Mon Sep 17 00:00:00 2001 From: Baizhou Zhang Date: Tue, 12 Sep 2023 16:36:21 +0800 Subject: [PATCH 06/10] finish introduction & supporting information --- docs/source/en/features/shardformer.md | 163 +++++++------------------ 1 file changed, 47 insertions(+), 116 deletions(-) diff --git a/docs/source/en/features/shardformer.md b/docs/source/en/features/shardformer.md index 4432fbea1e86..1ad388391c47 100644 --- a/docs/source/en/features/shardformer.md +++ b/docs/source/en/features/shardformer.md @@ -4,9 +4,12 @@ Author: [Baizhou Zhang](https://github.com/Fridge003) **Prerequisite** - [Paradigms of Parallelism](../concepts/paradigms_of_parallelism.md) +- [Booster API](../basics/booster_api.md) +- [Booster Plugins](../basics/booster_plugins.md) **Example Code** -- [Training with Shardformer](https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/shardformer/examples) +- [Tensor Parallelism with Shardformer](https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/shardformer/examples) +- [Enabling Shardformer using HybridPrallelPlugin](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/bert) **Related Paper** - [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) @@ -15,146 +18,74 @@ Author: [Baizhou Zhang](https://github.com/Fridge003) - [Sequence Parallelism: Long Sequence Training from System Perspective](https://arxiv.org/abs/2105.13120) -## Quick introduction +## Introduction -In this tutorial, you will learn how to use pipeline parallel. In Colossal-AI, we use 1F1B pipeline, introduced by Nvidia. In this case, ViT and Imagenet are too large to use. Therefore, here we use ResNet and Cifar as example. +When training large transformer models such as LLaMa-2 70B or OPT 175B, model parallelism methods that divide a huge model into smaller shards, including tensor parallelism or pipeline parallism, are essential so as to meet the limitation of GPU memory. However, manually cutting model and rewriting its forward/backword logic could be difficult for users who are not familiar with distributed training. Meanwhile, the Huggingface transformers has gradually become users' first choice of model source, and most mainstream large models have been open-sourced in Huggingface model library. -## Table Of Content +Out of this motivation, the ColossalAI team develops **Shardformer**, a feature that automatically does preparation of model parallelism (tensor parallelism/pipeline parallelism) for popular transformer models in HuggingFace. This module aims to make parallelization hassle-free for users who are not from the system background. Within a few lines of codes, users can turn a large pretrained Huggingface model into a state ready for distributed training. Also, Shardformer can be configured to adopt some optimization tools for acceleration and memory saving during forward/backward pass. -In this tutorial we will cover: -1. Introduction of 1F1B pipeline. -2. Usage of non-interleaved and interleaved schedule. -3. Training ResNet with pipeline. +## How Shardformer works -## Introduction of 1F1B pipeline -First of all, we will introduce you GPipe for your better understanding. -
- -
Figure1: GPipe. This figure is from Megatron-LM paper.
-
+For more implementation details, please refer to our [develop document](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/README.md). -As you can see, for GPipe, only when the forward passes of all microbatches in a batch finish, the backward passes would be executed. -In general, 1F1B(one forward pass followed by one backward pass) is more efficient than GPipe(in memory or both memory and time). There are two schedules of 1F1B pipeline, the non-interleaved and the interleaved. The figures are shown below. +## Usage -
- -
Figure2: This figure is from Megatron-LM paper. The top part shows the default non-interleaved schedule. And the bottom part shows the interleaved schedule.
-
-### Non-interleaved Schedule -The non-interleaved schedule can be divided into three stages. The first stage is the warm-up stage, where workers perform differing numbers of forward passes. At the following stage, workers perform one forward pass followed by one backward pass. Workers will finish backward passes at the last stage. -This mode is more memory-efficient than GPipe. However, it would take the same time to finish a turn of passes as GPipe. +The case of training ChatGLM-2 6B is a little special: since Huggingface transformers doesn't officially support ChatGLM at present, please import the configuration/model classes through +```python +from colossalai.shardformer.modeling.chatglm2_6b.configuration_chatglm import ChatGLMConfig +from colossalai.shardformer.modeling.chatglm2_6b.modeling_chatglm import ChatGLMForConditionalGeneration, ChatGLMModel +``` +when training ChatGLM-2 with Shardformer, and initialize your model with these imported classes. -### Interleaved Schedule -This schedule requires **the number of microbatches to be an integer multiple of the stage of pipeline**. +## Supporting Information -In this schedule, each device can perform computation for multiple subsets of layers(called a model chunk) instead of a single contiguous set of layers. i.e. Before device 1 had layer 1-4; device 2 had layer 5-8; and so on. But now device 1 has layer 1,2,9,10; device 2 has layer 3,4,11,12; and so on. With this scheme, each device in the pipeline is assigned multiple pipeline stages and each pipeline stage has less computation. +List of Huggingface transformers model families currently supported by Shardformer: +- LlaMa-1/LlaMa-2 +- GPT2 +- BERT +- OPT +- BLOOM +- T5 +- ViT +- ChatGLM-2 6B +- Whisper -This mode is both memory-efficient and time-efficient. +List of optimization tools currently supported by Shardformer: +- Flash Attention 2 +- JIT Fused Operator +- xFormers +- Fused Layer Normalization +- Sequence Parallel +- Sequence Overlap -## Usage of non-interleaved and interleaved schedule +List of model families we plan to support in the near future: +- SAM +- Blip2 +- RoBERTa +- ALBERT +- ERNIE +- GPT Neo +- GPT-J +- BEiT +- SwinTransformer V1/V2 +- qwen -In Colossal-AI, we provided both non-interleaved(as `PipelineSchedule`) and interleaved schedule(as `InterleavedPipelineSchedule`). +These lists will grow longer as more models and optimization tools emerge in the future. If you have any suggestions on the models/optimization we should support, please mention it in [Issues](https://github.com/hpcaitech/ColossalAI/issues) section of our project. -You just need to set `NUM_MICRO_BATCHES` in config file and set `NUM_CHUNKS` in config file if you want to use Interleaved Pipeline Schedule. If you certainly know the shape of each pipeline stage's output tensor and the shapes are all the same, you can set `TENSOR_SHAPE` in config file to further reduce communication. Otherwise, you can just ignore `tensor_shape`, and the shape will be exchanged over pipeline stages automatically. Then we will generate an appropriate schedule for you. +For more details about compatibility between each optimization tool and each supported model, please refer to chapter Roadmap in our [develop document](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/README.md). -## Training ResNet with pipeline -Let's build the `ResNet` model first with Colossal PipelinableContext: -```python -import os -from typing import Callable, List, Optional, Type, Union -import torch -import torch.nn as nn -import colossalai -import colossalai.nn as col_nn - -from colossalai.core import global_context as gpc -from colossalai.logging import disable_existing_loggers, get_dist_logger -from colossalai.legacy.trainer import Trainer, hooks -from colossalai.utils import MultiTimer, get_dataloader -from colossalai.context import ParallelMode -from colossalai.pipeline.pipelinable import PipelinableContext - -from titans.dataloader.cifar10 import build_cifar -from torchvision.models import resnet50 -from torchvision.models.resnet import BasicBlock, Bottleneck, conv1x1 - -# Define some config -BATCH_SIZE = 64 -NUM_EPOCHS = 2 -NUM_CHUNKS = 1 -CONFIG = dict(NUM_MICRO_BATCHES=4, parallel=dict(pipeline=2)) - -# Train -disable_existing_loggers() -parser = colossalai.get_default_parser() -args = parser.parse_args() -colossalai.launch_from_torch(backend=args.backend, config=CONFIG) -logger = get_dist_logger() -pipelinable = PipelinableContext() - -# build model -with pipelinable: - model = resnet50() -``` -Define an execution sequence. -```python -exec_seq = [ - 'conv1', 'bn1', 'relu', 'maxpool', 'layer1', 'layer2', 'layer3', 'layer4', 'avgpool', - (lambda x: torch.flatten(x, 1), "behind"), 'fc' -] -pipelinable.to_layer_list(exec_seq) -``` -Partition the model into pipeline. -```python -model = pipelinable.partition(NUM_CHUNKS, gpc.pipeline_parallel_size, gpc.get_local_rank(ParallelMode.PIPELINE)) -``` -In this tutorial, we use `Trainer` to train `ResNet`: -```python -# build criterion -criterion = nn.CrossEntropyLoss() - -# optimizer -optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) - -# build dataloader -root = os.environ.get('DATA', './data') -train_dataloader, test_dataloader = build_cifar(BATCH_SIZE, root, padding=4, crop=32, resize=32) - -lr_scheduler = col_nn.lr_scheduler.LinearWarmupLR(optimizer, NUM_EPOCHS, warmup_steps=1) -engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model, optimizer, criterion, - train_dataloader, test_dataloader, - lr_scheduler) -timer = MultiTimer() - -trainer = Trainer(engine=engine, timer=timer, logger=logger) - -hook_list = [ - hooks.LossHook(), - hooks.AccuracyHook(col_nn.metric.Accuracy()), - hooks.LogMetricByEpochHook(logger), - hooks.LRSchedulerHook(lr_scheduler, by_epoch=True) -] - -trainer.fit(train_dataloader=train_dataloader, - epochs=NUM_EPOCHS, - test_dataloader=test_dataloader, - test_interval=1, - hooks=hook_list, - display_progress=True) -``` -We use `2` pipeline stages and the batch will be split into `4` micro batches. From db11b2f6ea55bbac80687f508f86610e05a6bce9 Mon Sep 17 00:00:00 2001 From: Baizhou Zhang Date: Wed, 13 Sep 2023 00:10:01 +0800 Subject: [PATCH 07/10] finish 'how shardformer works' --- docs/source/en/features/shardformer.md | 84 +++++++++++++++++++++++--- 1 file changed, 76 insertions(+), 8 deletions(-) diff --git a/docs/source/en/features/shardformer.md b/docs/source/en/features/shardformer.md index 1ad388391c47..1e5b681d81ca 100644 --- a/docs/source/en/features/shardformer.md +++ b/docs/source/en/features/shardformer.md @@ -20,23 +20,95 @@ Author: [Baizhou Zhang](https://github.com/Fridge003) ## Introduction -When training large transformer models such as LLaMa-2 70B or OPT 175B, model parallelism methods that divide a huge model into smaller shards, including tensor parallelism or pipeline parallism, are essential so as to meet the limitation of GPU memory. However, manually cutting model and rewriting its forward/backword logic could be difficult for users who are not familiar with distributed training. Meanwhile, the Huggingface transformers has gradually become users' first choice of model source, and most mainstream large models have been open-sourced in Huggingface model library. +When training large transformer models such as LLaMa-2 70B or OPT 175B, model parallelism methods that divide a huge model into smaller shards, including tensor parallelism or pipeline parallism, are essential so as to meet the limitation of GPU memory. +However, manually cutting model and rewriting its forward/backword logic could be difficult for users who are not familiar with distributed training. +Meanwhile, the Huggingface transformers has gradually become users' first choice of model source, and most mainstream large models have been open-sourced in Huggingface model library. -Out of this motivation, the ColossalAI team develops **Shardformer**, a feature that automatically does preparation of model parallelism (tensor parallelism/pipeline parallelism) for popular transformer models in HuggingFace. This module aims to make parallelization hassle-free for users who are not from the system background. Within a few lines of codes, users can turn a large pretrained Huggingface model into a state ready for distributed training. Also, Shardformer can be configured to adopt some optimization tools for acceleration and memory saving during forward/backward pass. +Out of this motivation, the ColossalAI team develops **Shardformer**, a feature that automatically does preparation of model parallelism (tensor parallelism/pipeline parallelism) for popular transformer models in HuggingFace. +This module aims to make parallelization hassle-free for users who are not from the system background. +Within a few lines of codes, users can turn a large pretrained Huggingface model into a state ready for distributed training. +Also, Shardformer can be configured to adopt some optimization tools for acceleration and memory saving during forward/backward pass. -## How Shardformer works +## How Shardformer Works +Generally, Shardformer works through the following four kinds of *replacements*: +1. Replacing original PyTorch module (e.g. `nn.Linear`, `nn.Embedding`) with a crafted distributed module. The distributed module keeps the same attributes as the original module but replaces the original parameters with distributed parameters. Also, new `forward` methods will replace original ones so as to execute distributed computation, such as linear layers' split /gather operations executed under tensor parallelism. Each distributed module implements its `from_native_module` static method to convert the PyTorch module to its corresponding distributed module. +2. Replacing attributes of original Huggingface Transformers layers with appropriate attributes for distributed training. For example, when training LlaMa-2 with tensor parallel size as 2, the attribute `num_heads` of `LlamaDecoderLayer` (the number of attention heads in each layer) should be replaced with `model.config.num_attention_heads // 2`. -For more implementation details, please refer to our [develop document](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/README.md). +3. Replacing the `forward` methods implemented by original Huggingface +Transformers libraries with our customized `forward` methods. This replacement is essential for pipeline paralellism, where a customiozed function is needed to pass intermediate hidden states between different pipeline stages. Also, optimization methods such as flash attention or sequence parallel can be injected into the `forward` process through our customized `forward` method. +4. Replacing the whole copy of model parameters and optimizer states with incomplete ones controlled by current device (this is why it's called Shardformer). +By executing `ModelSharder.shard` method, current device will only keep the part of model parameters it's supposed to take care of. +To be specific, they should be the assigned parameter shards when using tensor parallelism, or the parameters belonging to current pipeline stage when using pipeline parallelism, or both of them. +All other parameters are released so as to liberate memory usage. +As a result, the optimizer will only compute the states corresponding to these part of parameters, causing the usage of memory to be further saved. + +All of these replacements are implemented with manually written policies and forward functions. +If you want to delve deeper into the design of Shardformer or customize your own Shardformer policies, please refer to our [Shardformer development document](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/README.md) and [pipeline parallelism design](https://github.com/hpcaitech/ColossalAI/discussions/4050) for more details. ## Usage +### Usage 1: Training Through Shardformer APIs + +The sample API usages are given below. If you want to enable the usage of flash attention, please install `flash_attn` through pip, and xformers's `cutlass_op` provides a supplementary optimization: + +```python +from colossalai.shardformer import ShardConfig, ShardFormer +from transformers import BertForMaskedLM +import colossalai + +# launch colossalai +colossalai.launch_from_torch(config={}) + +# create model +config = BertConfig.from_pretrained('bert-base-uncased') +model = BertForMaskedLM.from_pretrained('bert-base-uncased', config=config) + +# create huggingface model as normal +shard_config = ShardConfig(tensor_parallel_process_group=tp_group, + pipeline_stage_manager=stage_manager, + enable_tensor_parallelism=True, + enable_fused_normalization=True, + enable_flash_attention=True, + enable_jit_fused=True, + enable_sequence_parallelism=True, + enable_sequence_overlap=True) + +shard_former = ShardFormer(shard_config=shard_config) +sharded_model, shared_params = shard_former.optimize(model).to('cuda') + +# do everything like normal +... +``` +shardformer configuration + +`tensor_parallel_process_group`: the process group of tensor parallelism, it's necessary when using tensor parallel. +`pipeline_stage_manager`: If using pipeline parallelism, it's necessary to specify a pipeline stage manager for inter-process communication in pipeline parallelism. +{{ autodoc:colossalai.pipeline.stage_manager.PipelineStageManager }} +`enable_tensor_parallelism`: using tensor parallel, partition the model along the columns or along the rows +`enable_fused_normalization`: using apex fused layernorm +`enable_flash_attention`: using flash attention +`enable_jit_fused`: using jit fused operators +`enable_sequence_parallelism`: using sequence parallelism, partition these non-tensor parallel regions along the sequence dimension. +`enable_sequence_overlap`: overlap the computation and communication in the sequence parallelism, it's used with `enable_sequence_parallelism`. + +example: +- [Tensor Parallelism with Shardformer](https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/shardformer/examples) + +### Usage 2: Training Through HybridParallelPlugin +Need a link to booster_plugin.md here + +example: +- [Enabling Shardformer using HybridPrallelPlugin](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/bert) + + +Awakening Shardformer using Hybrid Parallel Plugin (Here should be a link) is more recommended. The case of training ChatGLM-2 6B is a little special: since Huggingface transformers doesn't officially support ChatGLM at present, please import the configuration/model classes through ```python @@ -84,8 +156,4 @@ These lists will grow longer as more models and optimization tools emerge in the For more details about compatibility between each optimization tool and each supported model, please refer to chapter Roadmap in our [develop document](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/README.md). - - - - From fdae1d962d1c1de405990a0e967bf2732bbe9b18 Mon Sep 17 00:00:00 2001 From: Baizhou Zhang Date: Wed, 13 Sep 2023 12:14:06 +0800 Subject: [PATCH 08/10] finish shardformer.md English doc --- colossalai/shardformer/README.md | 32 +++++--- colossalai/shardformer/shard/shard_config.py | 22 +++--- docs/source/en/basics/booster_plugins.md | 2 +- docs/source/en/features/shardformer.md | 73 ++++++------------- docs/source/zh-Hans/basics/booster_plugins.md | 2 +- 5 files changed, 58 insertions(+), 73 deletions(-) diff --git a/colossalai/shardformer/README.md b/colossalai/shardformer/README.md index 559f9a56f61e..09a45929201a 100644 --- a/colossalai/shardformer/README.md +++ b/colossalai/shardformer/README.md @@ -60,18 +60,28 @@ sharded_model, shared_params = shard_former.optimize(model).to('cuda') # do everything like normal ... ``` -shardformer configuration - -`tensor_parallel_process_group`: the process group of tensor parallelism, it's necessary when using tensor parallel. -`pipeline_stage_manager`: If using pipeline parallelism, it's necessary to specify a pipeline stage manager for inter-process communication in pipeline parallelism. -{{ autodoc:colossalai.pipeline.stage_manager.PipelineStageManager }} -`enable_tensor_parallelism`: using tensor parallel, partition the model along the columns or along the rows -`enable_fused_normalization`: using apex fused layernorm -`enable_flash_attention`: using flash attention -`enable_jit_fused`: using jit fused operators -`enable_sequence_parallelism`: using sequence parallelism, partition these non-tensor parallel regions along the sequence dimension. -`enable_sequence_overlap`: overlap the computation and communication in the sequence parallelism, it's used with `enable_sequence_parallelism`. +Following are the description `ShardConfig`'s arguments: + +- `tensor_parallel_process_group`: The process group of tensor parallelism, it's necessary when using tensor parallel. Defaults to None, which is the global process group. + +- `pipeline_stage_manager`: If using pipeline parallelism, it's necessary to specify a pipeline stage manager for inter-process communication in pipeline parallelism. Defaults to None, which means not using pipeline parallelism. + +- `enable_tensor_parallelism`: Whether to use tensor parallelism. Defaults to True. + +- `enable_fused_normalization`: Whether to use fused layernorm. Defaults to False. + +- `enable_flash_attention`: Whether to switch on flash attention. Defaults to False. + +- `enable_jit_fused`: Whether to switch on JIT fused operators. Defaults to False. + +- `enable_sequence_parallelism`: Whether to turn on sequence parallelism, which partitions non-tensor-parallel regions along the sequence dimension. Defaults to False. + +- `enable_sequence_overlap`: Whether to turn on sequence overlap, wheich overlap the computation and communication in sequence parallelism. It can only be used when `enable_sequence_parallelism` is True. Defaults to False. + +- `enable_all_optimization`: Whether to turn on all optimization tools including `fused normalizaion`, `flash attention`, `JIT fused operators`, `sequence parallelism` and `sequence overlap`. Defaults to False. + +- `inference_only`: Whether only doing forward passing. Defaults to False. ### Write your own policy diff --git a/colossalai/shardformer/shard/shard_config.py b/colossalai/shardformer/shard/shard_config.py index 62eebc26891b..0b6e1640952b 100644 --- a/colossalai/shardformer/shard/shard_config.py +++ b/colossalai/shardformer/shard/shard_config.py @@ -15,26 +15,26 @@ class ShardConfig: The config for sharding the huggingface model Args: - tensor_parallel_process_group (Optional[ProcessGroup]): The process group for tensor parallelism, defaults to None, which is the global process group. - pipeline_stage_manager (Optional[PipelineStageManager]): The pipeline stage manager, defaults to None, which means no pipeline. - enable_tensor_parallelism (bool): Whether to turn on tensor parallelism, default is True. - enable_fused_normalization (bool): Whether to use fused layernorm, default is False. - enable_flash_attention (bool, optional): Whether to switch on flash attention, default is False. - enable_jit_fused (bool, optional): Whether to switch on JIT, default is False. - enable_all_optimization (bool): Whether to turn on all optimization, default is False. - enable_sequence_parallelism (bool): Whether to turn on sequence parallelism, default is False. - enable_sequence_overlap (bool): Whether to turn on sequence overlap, default is False. - inference_only (bool): Only doing forward passing if True, default is False. + tensor_parallel_process_group (Optional[ProcessGroup]): The process group of tensor parallelism, it's necessary when using tensor parallel. Defaults to None, which is the global process group. + pipeline_stage_manager (Optional[PipelineStageManager]): If using pipeline parallelism, it's necessary to specify a pipeline stage manager for inter-process communication in pipeline parallelism. Defaults to None, which means not using pipeline parallelism. + enable_tensor_parallelism (bool): Whether to use tensor parallelism. Defaults to True. + enable_fused_normalization (bool): Whether to use fused layernorm. Defaults to False. + enable_flash_attention (bool, optional): Whether to switch on flash attention. Defaults to False. + enable_jit_fused (bool, optional): Whether to switch on JIT fused operators. Defaults to False. + enable_sequence_parallelism (bool): Whether to turn on sequence parallelism, which partitions non-tensor-parallel regions along the sequence dimension. Defaults to False. + enable_sequence_overlap (bool): Whether to turn on sequence overlap, wheich overlap the computation and communication in sequence parallelism. It can only be used when enable_sequence_parallelism is True. Defaults to False. + enable_all_optimization (bool): Whether to turn on all optimization tools including 'fused normalizaion', 'flash attention', 'JIT fused operators', 'sequence parallelism' and 'sequence overlap'. Defaults to False. + inference_only (bool): Whether only doing forward passing. Defaults to False. """ tensor_parallel_process_group: Optional[ProcessGroup] = None pipeline_stage_manager: Optional[PipelineStageManager] = None enable_tensor_parallelism: bool = True enable_fused_normalization: bool = False - enable_all_optimization: bool = False enable_flash_attention: bool = False enable_jit_fused: bool = False enable_sequence_parallelism: bool = False enable_sequence_overlap: bool = False + enable_all_optimization: bool = False inference_only: bool = False # tensor_parallel_mode: Literal['1d', '2d', '2.5d', '3d'] diff --git a/docs/source/en/basics/booster_plugins.md b/docs/source/en/basics/booster_plugins.md index 7a88dc1701ba..d7532b0ce39b 100644 --- a/docs/source/en/basics/booster_plugins.md +++ b/docs/source/en/basics/booster_plugins.md @@ -73,7 +73,7 @@ More details can be found in [Pytorch Docs](https://pytorch.org/docs/main/fsdp.h This plugin implements the combination of various parallel training strategies and optimization tools. The features of HybridParallelPlugin can be generally divided into four parts: -1. Shardformer: This plugin provides an entrance to Shardformer, which controls model sharding under tensor parallel and pipeline parallel setting. Shardformer also overloads the logic of model's forward/backward process to ensure the smooth working of tp/pp. Also, optimization tools including fused normalization, flash attention (xformers), JIT and sequence parallel are injected into the overloaded forward/backward method by Shardformer. +1. Shardformer: This plugin provides an entrance to Shardformer, which controls model sharding under tensor parallel and pipeline parallel setting. Shardformer also overloads the logic of model's forward/backward process to ensure the smooth working of tp/pp. Also, optimization tools including fused normalization, flash attention (xformers), JIT and sequence parallel are injected into the overloaded forward/backward method by Shardformer. More details can be found in chapter [Shardformer Doc](../features/shardformer.md). 2. Mixed Precision Training: Support for fp16/bf16 mixed precision training. More details about its arguments configuration can be found in [Mixed Precision Training Doc](../features/mixed_precision_training_with_booster.md). diff --git a/docs/source/en/features/shardformer.md b/docs/source/en/features/shardformer.md index 1e5b681d81ca..641b7d8f9c01 100644 --- a/docs/source/en/features/shardformer.md +++ b/docs/source/en/features/shardformer.md @@ -52,63 +52,38 @@ If you want to delve deeper into the design of Shardformer or customize your own ## Usage -### Usage 1: Training Through Shardformer APIs +### Shardformer Configuration -The sample API usages are given below. If you want to enable the usage of flash attention, please install `flash_attn` through pip, and xformers's `cutlass_op` provides a supplementary optimization: +The configuration of Shardformer is controlled by class `ShardConfig`: +{{ autodoc:colossalai.shardformer.ShardConfig }} -```python -from colossalai.shardformer import ShardConfig, ShardFormer -from transformers import BertForMaskedLM -import colossalai - -# launch colossalai -colossalai.launch_from_torch(config={}) - -# create model -config = BertConfig.from_pretrained('bert-base-uncased') -model = BertForMaskedLM.from_pretrained('bert-base-uncased', config=config) - -# create huggingface model as normal -shard_config = ShardConfig(tensor_parallel_process_group=tp_group, - pipeline_stage_manager=stage_manager, - enable_tensor_parallelism=True, - enable_fused_normalization=True, - enable_flash_attention=True, - enable_jit_fused=True, - enable_sequence_parallelism=True, - enable_sequence_overlap=True) - -shard_former = ShardFormer(shard_config=shard_config) -sharded_model, shared_params = shard_former.optimize(model).to('cuda') - -# do everything like normal -... -``` -shardformer configuration - -`tensor_parallel_process_group`: the process group of tensor parallelism, it's necessary when using tensor parallel. -`pipeline_stage_manager`: If using pipeline parallelism, it's necessary to specify a pipeline stage manager for inter-process communication in pipeline parallelism. -{{ autodoc:colossalai.pipeline.stage_manager.PipelineStageManager }} -`enable_tensor_parallelism`: using tensor parallel, partition the model along the columns or along the rows -`enable_fused_normalization`: using apex fused layernorm -`enable_flash_attention`: using flash attention -`enable_jit_fused`: using jit fused operators -`enable_sequence_parallelism`: using sequence parallelism, partition these non-tensor parallel regions along the sequence dimension. -`enable_sequence_overlap`: overlap the computation and communication in the sequence parallelism, it's used with `enable_sequence_parallelism`. - -example: -- [Tensor Parallelism with Shardformer](https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/shardformer/examples) +If you want to enable Apex fused layernorm, please install `apex`. +If you want to enable the usage of flash attention, please install `flash_attn`. +In addition, xFormers's `cutlass_op` provides a supplementary optimization. +### Enabling Shardformer -### Usage 2: Training Through HybridParallelPlugin +#### 1. Enabling Shardformer Through Booster (Recommended) -Need a link to booster_plugin.md here +Enabling `Shardformer` through `Booster` initialized with `HybridParallelPlugin` is the recommended way to awaken the power of Shardformer. +The main reason is that pipeline parallelism cannot successfully work without the calling of `execute_pipeline` method of `Booster`. Besides, `HybridParallelPlugin` provides the capacity to combine the features of `Shardformer` with other useful features, such as mixed precision training or Zero. + +More details about this usage can be found in chapter [Booster API](../basics/booster_api.md) and [Booster Plugins](../basics/booster_plugins.md). + +[Here](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/bert) is an example on how to trigger `Shardformer` through `HybridParallelPlugin`. Please be aware that there's a difference in the way of doing forward and backward between the situation of using pipeline and not using pipeline. + + +#### 2. Enabling Shardformer Through Shardformer APIs (Not Recommended) + +You can also use Shardformer through manually calling Shardformer APIs. The sample API usages are given below. This usage is not recommended since pipeline parallelism can't run without `Booster`. + +[Here](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/examples/convergence_benchmark.py) +is an example on how to trigger `Shardformer` through calling Shardformer APIs. -example: -- [Enabling Shardformer using HybridPrallelPlugin](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/bert) +### Precautions -Awakening Shardformer using Hybrid Parallel Plugin (Here should be a link) is more recommended. +When you use Shardformer to process classification models such as `GPT2ForSequenceClassification`, `ViTForImageClassification`, please ensure that the total number of labels should be integer multiple of tensor parallel size, otherwise Shardformer can't process the classifier layer correctly. A simple fix could be appending dummy labels in transformers config. This bug will be fixed in future version of Shardformer. The case of training ChatGLM-2 6B is a little special: since Huggingface transformers doesn't officially support ChatGLM at present, please import the configuration/model classes through ```python diff --git a/docs/source/zh-Hans/basics/booster_plugins.md b/docs/source/zh-Hans/basics/booster_plugins.md index 6f731bfac1fc..0ad1cacab151 100644 --- a/docs/source/zh-Hans/basics/booster_plugins.md +++ b/docs/source/zh-Hans/basics/booster_plugins.md @@ -74,7 +74,7 @@ Zero-2 不支持局部梯度累积。如果您坚持使用,虽然可以积累 这个插件实现了多种并行训练策略和优化工具的组合。Hybrid Parallel插件支持的功能大致可以被分为以下四个部分: -1. Shardformer: Shardformer负责在张量并行以及流水线并行下切分模型的逻辑,以及前向/后向方法的重载,这个插件为Shardformer功能提供了一个简单易用的接口。与此同时,Shardformer还负责将包括fused normalization, flash attention (xformers), JIT和序列并行在内的各类优化工具融入重载后的前向/后向方法。 +1. Shardformer: Shardformer负责在张量并行以及流水线并行下切分模型的逻辑,以及前向/后向方法的重载,这个插件为Shardformer功能提供了一个简单易用的接口。与此同时,Shardformer还负责将包括fused normalization, flash attention (xformers), JIT和序列并行在内的各类优化工具融入重载后的前向/后向方法。更多关于Shardformer的信息请参考 [Shardformer文档](../features/shardformer.md)。 2. 混合精度训练:插件支持fp16/bf16的混合精度训练。更多关于混合精度训练的参数配置的详细信息请参考 [混合精度训练文档](../features/mixed_precision_training_with_booster.md)。 From 502c7c9aaad7ab6c540c0171727c5d16cd704b07 Mon Sep 17 00:00:00 2001 From: Baizhou Zhang Date: Wed, 13 Sep 2023 13:59:44 +0800 Subject: [PATCH 09/10] fix doctest fail --- docs/source/en/features/1D_tensor_parallel.md | 2 ++ docs/source/zh-Hans/features/1D_tensor_parallel.md | 2 ++ docs/source/zh-Hans/features/shardformer.md | 1 + 3 files changed, 5 insertions(+) diff --git a/docs/source/en/features/1D_tensor_parallel.md b/docs/source/en/features/1D_tensor_parallel.md index 28ec9196543d..79fe5ddea221 100644 --- a/docs/source/en/features/1D_tensor_parallel.md +++ b/docs/source/en/features/1D_tensor_parallel.md @@ -118,3 +118,5 @@ Output of the first linear layer: torch.Size([16, 512]) Output of the second linear layer: torch.Size([16, 256]) ``` The output of the first linear layer is split into 2 partitions (each has the shape `[16, 512]`), while the second layer has identical outputs across the GPUs. + + diff --git a/docs/source/zh-Hans/features/1D_tensor_parallel.md b/docs/source/zh-Hans/features/1D_tensor_parallel.md index 7246d9d500b2..61982cbb8be9 100644 --- a/docs/source/zh-Hans/features/1D_tensor_parallel.md +++ b/docs/source/zh-Hans/features/1D_tensor_parallel.md @@ -120,3 +120,5 @@ Output of the first linear layer: torch.Size([16, 512]) Output of the second linear layer: torch.Size([16, 256]) ``` 第一个线性层的输出被划分成2块 (每个形状为 `[16, 512]`), 而第二层在整个 GPU 上的输出是相同的。 + + diff --git a/docs/source/zh-Hans/features/shardformer.md b/docs/source/zh-Hans/features/shardformer.md index e69de29bb2d1..5029c12f8991 100644 --- a/docs/source/zh-Hans/features/shardformer.md +++ b/docs/source/zh-Hans/features/shardformer.md @@ -0,0 +1 @@ + From c68e4ff67ed16ad8b7488e2018bf2425df7876c6 Mon Sep 17 00:00:00 2001 From: Baizhou Zhang Date: Wed, 13 Sep 2023 17:26:20 +0800 Subject: [PATCH 10/10] add Chinese document --- docs/source/en/features/shardformer.md | 43 ++++--- docs/source/zh-Hans/basics/booster_api.md | 2 +- docs/source/zh-Hans/features/shardformer.md | 120 ++++++++++++++++++++ 3 files changed, 147 insertions(+), 18 deletions(-) diff --git a/docs/source/en/features/shardformer.md b/docs/source/en/features/shardformer.md index 641b7d8f9c01..872d00e4a073 100644 --- a/docs/source/en/features/shardformer.md +++ b/docs/source/en/features/shardformer.md @@ -22,24 +22,30 @@ Author: [Baizhou Zhang](https://github.com/Fridge003) When training large transformer models such as LLaMa-2 70B or OPT 175B, model parallelism methods that divide a huge model into smaller shards, including tensor parallelism or pipeline parallism, are essential so as to meet the limitation of GPU memory. However, manually cutting model and rewriting its forward/backword logic could be difficult for users who are not familiar with distributed training. -Meanwhile, the Huggingface transformers has gradually become users' first choice of model source, and most mainstream large models have been open-sourced in Huggingface model library. +Meanwhile, the Huggingface transformers library has gradually become users' first choice of model source, and most mainstream large models have been open-sourced in Huggingface transformers model library. Out of this motivation, the ColossalAI team develops **Shardformer**, a feature that automatically does preparation of model parallelism (tensor parallelism/pipeline parallelism) for popular transformer models in HuggingFace. This module aims to make parallelization hassle-free for users who are not from the system background. -Within a few lines of codes, users can turn a large pretrained Huggingface model into a state ready for distributed training. -Also, Shardformer can be configured to adopt some optimization tools for acceleration and memory saving during forward/backward pass. +Within a few lines of codes, users can turn a model into a state ready for distributed training. +Also, Shardformer contains various optimization tools for acceleration and memory saving during forward/backward pass. ## How Shardformer Works Generally, Shardformer works through the following four kinds of *replacements*: -1. Replacing original PyTorch module (e.g. `nn.Linear`, `nn.Embedding`) with a crafted distributed module. The distributed module keeps the same attributes as the original module but replaces the original parameters with distributed parameters. Also, new `forward` methods will replace original ones so as to execute distributed computation, such as linear layers' split /gather operations executed under tensor parallelism. Each distributed module implements its `from_native_module` static method to convert the PyTorch module to its corresponding distributed module. +1. Replacing original PyTorch module (e.g. `nn.Linear`, `nn.Embedding`) with a crafted distributed module. +The distributed module keeps the same attributes as the original module but replaces the original parameters with distributed parameters. +Also, new `forward` methods will replace original ones so as to execute distributed computation, such as linear layers' split /gather operations executed under tensor parallelism. +Each distributed module implements its `from_native_module` static method to convert the PyTorch module to its corresponding distributed module. -2. Replacing attributes of original Huggingface Transformers layers with appropriate attributes for distributed training. For example, when training LlaMa-2 with tensor parallel size as 2, the attribute `num_heads` of `LlamaDecoderLayer` (the number of attention heads in each layer) should be replaced with `model.config.num_attention_heads // 2`. +2. Replacing attributes of original Huggingface Transformers layers with appropriate attributes for distributed training. +For example, when training LlaMa-2 with tensor parallel size as 2, the attribute `num_heads` of `LlamaDecoderLayer` (the number of attention heads in each layer) should be replaced with `model.config.num_attention_heads // 2`. 3. Replacing the `forward` methods implemented by original Huggingface -Transformers libraries with our customized `forward` methods. This replacement is essential for pipeline paralellism, where a customiozed function is needed to pass intermediate hidden states between different pipeline stages. Also, optimization methods such as flash attention or sequence parallel can be injected into the `forward` process through our customized `forward` method. +Transformers libraries with our customized `forward` methods. +This replacement is essential for pipeline paralellism, where a customiozed function is needed to pass intermediate hidden states between different pipeline stages. +Also, optimization methods such as flash attention or sequence parallel can be injected into the `forward` process through our customized `forward` method. 4. Replacing the whole copy of model parameters and optimizer states with incomplete ones controlled by current device (this is why it's called Shardformer). By executing `ModelSharder.shard` method, current device will only keep the part of model parameters it's supposed to take care of. @@ -55,11 +61,12 @@ If you want to delve deeper into the design of Shardformer or customize your own ### Shardformer Configuration The configuration of Shardformer is controlled by class `ShardConfig`: + {{ autodoc:colossalai.shardformer.ShardConfig }} -If you want to enable Apex fused layernorm, please install `apex`. +If you want to enable Apex Fused Layernorm, please install `apex`. If you want to enable the usage of flash attention, please install `flash_attn`. -In addition, xFormers's `cutlass_op` provides a supplementary optimization. +In addition, xFormers's `cutlass_op` can serve as a backup for flash attention. ### Enabling Shardformer @@ -75,7 +82,7 @@ More details about this usage can be found in chapter [Booster API](../basics/bo #### 2. Enabling Shardformer Through Shardformer APIs (Not Recommended) -You can also use Shardformer through manually calling Shardformer APIs. The sample API usages are given below. This usage is not recommended since pipeline parallelism can't run without `Booster`. +You can also use Shardformer through manually calling Shardformer APIs. However, this usage is not recommended since pipeline parallelism can't run without `Booster`. [Here](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/examples/convergence_benchmark.py) is an example on how to trigger `Shardformer` through calling Shardformer APIs. @@ -83,14 +90,16 @@ is an example on how to trigger `Shardformer` through calling Shardformer APIs. ### Precautions -When you use Shardformer to process classification models such as `GPT2ForSequenceClassification`, `ViTForImageClassification`, please ensure that the total number of labels should be integer multiple of tensor parallel size, otherwise Shardformer can't process the classifier layer correctly. A simple fix could be appending dummy labels in transformers config. This bug will be fixed in future version of Shardformer. +1. When enabling pipeline parallel, please don't do the forward/backward pass in the conventional way (`model(input)`, `loss.backward()`), which will cause unexpected errors. Rather, please do forward/backward pass through calling `booster.execute_pipeline` method. + +2. When you use Shardformer to process classification models such as `GPT2ForSequenceClassification`, `ViTForImageClassification`, please ensure that the total number of labels should be integer multiple of tensor parallel size, otherwise Shardformer can't process the classifier layer correctly. A simple fix could be appending dummy labels in transformers config. This bug will be fixed in future version of Shardformer. -The case of training ChatGLM-2 6B is a little special: since Huggingface transformers doesn't officially support ChatGLM at present, please import the configuration/model classes through -```python -from colossalai.shardformer.modeling.chatglm2_6b.configuration_chatglm import ChatGLMConfig -from colossalai.shardformer.modeling.chatglm2_6b.modeling_chatglm import ChatGLMForConditionalGeneration, ChatGLMModel -``` -when training ChatGLM-2 with Shardformer, and initialize your model with these imported classes. +3. The case of training ChatGLM-2 6B is a little special: since Huggingface transformers doesn't officially support ChatGLM at present, please import the configuration/model classes through + ```python + from colossalai.shardformer.modeling.chatglm2_6b.configuration_chatglm import ChatGLMConfig + from colossalai.shardformer.modeling.chatglm2_6b.modeling_chatglm import ChatGLMForConditionalGeneration, ChatGLMModel + ``` + when training ChatGLM-2 with Shardformer, and initialize your model with these imported classes. ## Supporting Information @@ -126,7 +135,7 @@ List of model families we plan to support in the near future: - SwinTransformer V1/V2 - qwen -These lists will grow longer as more models and optimization tools emerge in the future. If you have any suggestions on the models/optimization we should support, please mention it in [Issues](https://github.com/hpcaitech/ColossalAI/issues) section of our project. +These lists will grow longer as more models and optimization tools emerge in the future. If you have any suggestions on the models/optimization we should support, please feel free to mention it in [Issues](https://github.com/hpcaitech/ColossalAI/issues) section of our project. For more details about compatibility between each optimization tool and each supported model, please refer to chapter Roadmap in our [develop document](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/README.md). diff --git a/docs/source/zh-Hans/basics/booster_api.md b/docs/source/zh-Hans/basics/booster_api.md index 5d0f19d52139..c59d75d321c0 100644 --- a/docs/source/zh-Hans/basics/booster_api.md +++ b/docs/source/zh-Hans/basics/booster_api.md @@ -1,4 +1,4 @@ -# booster 使用 +# Booster API 作者: [Mingyan Jiang](https://github.com/jiangmingyan), [Jianghai Chen](https://github.com/CjhHa1), [Baizhou Zhang](https://github.com/Fridge003) diff --git a/docs/source/zh-Hans/features/shardformer.md b/docs/source/zh-Hans/features/shardformer.md index 5029c12f8991..49aa23e2d06b 100644 --- a/docs/source/zh-Hans/features/shardformer.md +++ b/docs/source/zh-Hans/features/shardformer.md @@ -1 +1,121 @@ +# Shardformer + +Author: [Baizhou Zhang](https://github.com/Fridge003) + +**预备知识** +- [并行技术](../concepts/paradigms_of_parallelism.md) +- [Booster API](../basics/booster_api.md) +- [Booster 插件](../basics/booster_plugins.md) + +**示例代码** +- [使用Shardformer进行张量并行训练](https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/shardformer/examples) +- [通过HybridParallelPlugin使用Shardformer](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/bert) + +**相关论文** +- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) +- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965) +- [FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning](https://arxiv.org/abs/2307.08691) +- [Sequence Parallelism: Long Sequence Training from System Perspective](https://arxiv.org/abs/2105.13120) + + +## 简介 + +在训练LLaMa-2 70B或OPT 175B等大型Transformer模型时,为了满足GPU内存的限制,将大型模型划分为更小的分片的模型并行方法(包括张量并行以及流水线并行)是必不可少的。然而,对于不熟悉分布式训练的用户来说,手动剪切模型并重写其前向/反向逻辑可能很困难。与此同时,Huggingface transformers开源库正在逐渐成为用户模型来源的首选,大部分主流大型模型都已在Huggingface transformers模型库中开源。 + +出于这种动机,ColossalAI团队开发了**Shardformer**,该功能可以自动为HuggingFace中主流的Transformer模型进行封装,用于张量并行以及流水线并行的训练策略。如此一来,对系统了解不多的用户也可以轻松地在transformers模型上进行并行训练:只需几行代码,用户就可以将模型转变为并行训练的状态。此外,Shardformer也包括了多种优化工具,用于在前向/后向的传递过程中实现加速和节省内存。 + + +## Shardformer的工作原理 + +通常来说,Shardformer通过以下四种“替换”进行工作: + +1. 用我们设计的分布式模块替换原始的PyTorch模块(例如`nn.Linear`、`nn.Embedding`)。 +分布式模块保持与原始模块相同的属性,但分布式模块会用新的参数替换原始模块的参数。新的前向函数将取代原来的前向函数,用于执行分布式计算,例如在张量并行下执行线性层的split/gather操作。每个分布式模块都应当实现其`from_native_module`静态方法,以将PyTorch模块转换为其相应的分布式模块。 + +2. 将原始Huggingface Transformers中间层的属性为适用于并行训练的属性。例如,当使用并行度为2的张量并行训练LlaMa-2时,`LlamaDecoderLayer` 的属性`num_heads`(每一层注意力头的数量)应替换为`model.config.num_attention_heads // 2`。 + +3. 将原来Huggingface transformers库实现的前向函数替换为我们定制的前向函数。前向函数的替换对于流水线并行性至关重要,因为流水线并行需要特殊的前向函数去在不同的流水线阶段之间传递中间的隐藏状态。此外,可以通过我们定制的前向函数将例如`flash attention`或序列并行的优化方法注入到前向的过程中。 + +4. 将完整的模型参数和优化器状态替换为只由当前设备控制的部分模型参数和优化器状态。通过执行`ModelSharder.shard`方法,当前设备仅会保留它应该处理的那部分模型参数。具体来说,这部分参数可以是使用张量并行时分配到当前机器的参数分片,或者使用流水线并行时当前流水线阶段的模型参数,或者兼而有之。除此之外的所有其他参数都被释放,用于节省内存的空间。 +如此一来,优化器只会计算保留的部分参数对应的状态,从而进一步节省内存的使用。 + +所有这些替换都是通过手动编写的策略和前向函数来实现的。如果您想更深入地研究Shardformer的设计方案,或者定制您自己的Shardformer策略,请参考[Shardformer 开发者文档](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/README.md)和[流水并行设计方案](https://github.com/hpcaitech/ColossalAI/discussions/4050)以获得更多细节。 + +## 用法 + +### Shardformer的参数配置 + +Shardformer的配置由类`ShardConfig`的参数控制: + +{{ autodoc:colossalai.shardformer.ShardConfig }} + +如果您想启用 Apex Fused Layernorm,请安装 `apex`。如果您想启用 flash attention,请安装 `flash_attn`。此外,xFormers 的 `cutlass_op` 可以作为Flash Attention的补充优化方式。 + +### 启动Shardformer + +#### 1. 通过Booster启动Shardformer (推荐) + +通过用`HybridParallelPlugin`初始化的`Booster`来启动`Shardformer`是最推荐的用法。其主要原因是如果不调用`Booster`的`execute_pipeline`方法,流水线并行就无法正常工作。此外,`HybridParallelPlugin`提供了将`Shardformer`的功能与其他功能(例如混合精度训练或Zero)相结合的能力。 + +更多关于这一用法的细节可以参考 [Booster API 文档](../basics/booster_api.md)以及[Booster 插件文档](../basics/booster_plugins.md)。[这里](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/bert)是一个通过`HybridParallelPlugin`启动`Shardformer`的示例。 + + +#### 2. 通过Shardformer API启动Shardformer (不推荐) + +您还可以通过手动调用Shardformer API的方式启动Shardformer。然而我们并不推荐这种用法,因为流水线并行在没有`Booster`的情况下无法正常运行。 + +[这里](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/examples/convergence_benchmark.py) +是一个通过调用Shardformer的API启动`Shardformer`的示例。 + + +### 注意事项 + +1. 当启用流水线并行时,请不要用常规方式(`model(input)`、`loss.backward()`)进行前向/后向传递,这样会导致未知的错误。这种情形下请通过调用`booster.execute_pipeline`方法来进行前向/后向传递。 + +2. 当使用Shardformer处理`GPT2ForSequenceClassification`、`ViTForImageClassification`等分类模型时,请确保labels的总数为张量并行度的整数倍,否则Shardformer无法正确地处理classifier层。一个简单的修复方法就是在transformers的config中添加虚拟的标签。这一bug将在 Shardformer的未来版本中修复。 + +3. 训练ChatGLM-2 6B的情况有点特殊:由于Huggingface Transformers 目前尚未正式支持ChatGLM。在使用Shardformer训练ChatGLM-2时,请通过以下方式导入config/model的类: + ```python + from colossalai.shardformer.modeling.chatglm2_6b.configuration_chatglm import ChatGLMConfig + from colossalai.shardformer.modeling.chatglm2_6b.modeling_chatglm import ChatGLMForConditionalGeneration, ChatGLMModel + ``` + 并且使用这些导入的类初始化模型。 + +## 支持信息 + +Shardformer目前支持的Huggingface Transformer模型: +- LlaMa-1/LlaMa-2 +- GPT2 +- BERT +- OPT +- BLOOM +- T5 +- ViT +- ChatGLM-2 6B +- Whisper + +Shardformer目前支持的优化工具: +- Flash Attention 2 +- JIT Fused Operator +- xFormers +- Fused Layer Normalization +- Sequence Parallel +- Sequence Overlap + +我们计划在不久后为Shardformer支持的模型: +- SAM +- Blip2 +- RoBERTa +- ALBERT +- ERNIE +- GPT Neo +- GPT-J +- BEiT +- SwinTransformer V1/V2 +- qwen + +随着未来更多模型和优化工具的出现,这些列表将会变得越来越长。如果您对我们应该支持的模型/优化工具有任何建议,欢迎在项目的[Issues](https://github.com/hpcaitech/ColossalAI/issues)板块参与讨论。 + +更多关于不同优化工具和模型之间兼容性的细节,请参考[Shardformer开发者文档](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/README.md)中的Roadmap一节。 +