[doc] update booster tutorials #3718

Merged

Changes from all commits · 32 commits
All 32 commits are by flybird11111:

- `ed5d525` [booster] update booster tutorials#3717
- `2a2e889` [booster] update booster tutorials#3717, fix
- `9362e15` [booster] update booster tutorials#3717, update setup doc
- `52d7e93` [booster] update booster tutorials#3717, update setup doc
- `111315d` [booster] update booster tutorials#3717, update setup doc
- `24987bd` [booster] update booster tutorials#3717, update setup doc
- `c3d44ad` [booster] update booster tutorials#3717, update setup doc
- `e8d7b94` [booster] update booster tutorials#3717, update setup doc
- `68e84be` [booster] update booster tutorials#3717, rename colossalai booster.md
- `6052a5d` [booster] update booster tutorials#3717, rename colossalai booster.md
- `21d3af1` [booster] update booster tutorials#3717, rename colossalai booster.md
- `9870991` [booster] update booster tutorials#3717, fix
- `6692c11` [booster] update booster tutorials#3717, fix
- `101250a` Merge branch 'hpcaitech:main' into booster-tutorials
- `9cc14e3` [booster] update tutorials#3717, update booster api doc
- `6c93a9f` Merge branch 'booster-tutorials' of https://github.com/jiangmingyan/C…
- `602c3ae` [booster] update tutorials#3717, modify file
- `f997d87` [booster] update tutorials#3717, modify file
- `6072224` [booster] update tutorials#3717, modify file
- `138d292` [booster] update tutorials#3717, modify file
- `5a2ef21` [booster] update tutorials#3717, modify file
- `9c20d0a` [booster] update tutorials#3717, modify file
- `08101d0` [booster] update tutorials#3717, modify file
- `ba4d77a` [booster] update tutorials#3717, fix reference link
- `8a4feb1` [booster] update tutorials#3717, fix reference link
- `e045350` [booster] update tutorials#3717, fix reference link
- `f4a0bcf` [booster] update tutorials#3717, fix reference link
- `e9cfb5c` [booster] update tutorials#3717, fix reference link
- `bf51a6c` [booster] update tutorials#3717, fix reference link
- `591fa12` [booster] update tutorials#3717, fix reference link
- `0f5703c` [booster] update tutorials#3713
- `274fc1a` [booster] update tutorials#3713, modify file
New file (+89 lines):

# Booster API

Author: [Mingyan Jiang](https://github.com/jiangmingyan)

**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)

**Example Code**
- [Train with Booster](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/new_api/cifar_resnet/README.md)
## Introduction

In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to seamlessly inject features into your training components (e.g. model, optimizer, dataloader). With these new APIs, you can integrate your model with our parallelism features more conveniently. Calling `colossalai.booster` is the standard procedure before you enter your training loop. In the sections below, we cover how `colossalai.booster` works and what you should take note of.
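The "inject features, hand back wrapped objects" contract can be sketched in plain Python. This is a conceptual mock only, not the real `colossalai` API: `SimpleBooster` and `LoggedOptimizer` are hypothetical names used for illustration.

```python
# Conceptual sketch of the injection pattern: the booster receives your
# training objects and returns wrapped versions with extra capabilities.
# SimpleBooster and LoggedOptimizer are hypothetical, not ColossalAI code.

class LoggedOptimizer:
    """Wraps an optimizer and counts optimization steps (a stand-in feature)."""

    def __init__(self, inner):
        self.inner = inner
        self.steps = 0

    def step(self):
        self.steps += 1
        self.inner.step()


class SimpleBooster:
    def boost(self, model, optimizer):
        # A real booster would apply the plugin's parallelism here; this mock
        # only wraps the optimizer to illustrate the wrap-and-return contract.
        return model, LoggedOptimizer(optimizer)
```

The key point is that after boosting you keep using the *returned* objects, not the originals, because the wrappers carry the injected behavior.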
### Plugin

A plugin is an important component that manages the parallel configuration (e.g. the Gemini plugin encapsulates the Gemini acceleration solution). The currently supported plugins are as follows:

***GeminiPlugin:*** This plugin wraps the Gemini acceleration solution, i.e. ZeRO with chunk-based memory management.

***TorchDDPPlugin:*** This plugin wraps the DDP acceleration solution. It implements data parallelism at the module level and can run across multiple machines.

***LowLevelZeroPlugin:*** This plugin wraps stage 1/2 of the Zero Redundancy Optimizer. Stage 1: shards optimizer states across data-parallel workers/GPUs. Stage 2: shards optimizer states and gradients across data-parallel workers/GPUs.
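To build intuition for what stage-1 sharding means, here is a minimal, framework-free sketch: each data-parallel rank keeps the optimizer state for only its own slice of the parameters. `shard_round_robin` is a hypothetical helper for illustration; real ZeRO implementations are considerably more involved.

```python
# Minimal illustration of ZeRO stage-1 style sharding: each data-parallel
# rank owns the optimizer state for only a subset of the parameter tensors.
# shard_round_robin is a hypothetical helper, not ColossalAI code.

def shard_round_robin(param_ids, rank, world_size):
    """Return the parameter ids whose optimizer state this rank owns."""
    return [p for i, p in enumerate(param_ids) if i % world_size == rank]

params = ["layer0.weight", "layer0.bias", "layer1.weight", "layer1.bias"]
world_size = 2

shards = {r: shard_round_robin(params, r, world_size) for r in range(world_size)}
# Every parameter's state lives on exactly one rank, so the per-rank
# optimizer memory is roughly 1/world_size of the unsharded cost.
```

Stage 2 extends the same idea to gradients: each rank also keeps only the gradient shards for the parameters it owns.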
### API of booster

{{ autodoc:colossalai.booster.Booster }}

{{ autodoc:colossalai.booster.Booster.boost }}

{{ autodoc:colossalai.booster.Booster.backward }}

{{ autodoc:colossalai.booster.Booster.no_sync }}

{{ autodoc:colossalai.booster.Booster.save_model }}

{{ autodoc:colossalai.booster.Booster.load_model }}

{{ autodoc:colossalai.booster.Booster.save_optimizer }}

{{ autodoc:colossalai.booster.Booster.load_optimizer }}

{{ autodoc:colossalai.booster.Booster.save_lr_scheduler }}

{{ autodoc:colossalai.booster.Booster.load_lr_scheduler }}
## Usage

In a typical workflow, you should launch the distributed environment at the beginning of the training script and first create the objects you need (such as models, optimizers, loss functions, data loaders, etc.). Then, call `colossalai.booster` to inject features into these objects. After that, you can use our booster APIs and the returned objects to continue the rest of your training process.

A pseudocode example is shown below:
```python
import torch
from torch.optim import SGD
from torchvision.models import resnet18

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

def train():
    # rank, world_size and port are assumed to be supplied by the launcher
    colossalai.launch(config=dict(), rank=rank, world_size=world_size, port=port, host='localhost')
    plugin = TorchDDPPlugin()
    booster = Booster(plugin=plugin)
    model = resnet18()
    criterion = lambda x: x.mean()
    optimizer = SGD(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
    model, optimizer, criterion, _, scheduler = booster.boost(model, optimizer, criterion, lr_scheduler=scheduler)

    x = torch.randn(4, 3, 224, 224)
    x = x.to('cuda')
    output = model(x)
    loss = criterion(output)
    booster.backward(loss, optimizer)
    optimizer.clip_grad_by_norm(1.0)
    optimizer.step()
    scheduler.step()

    save_path = "./model"
    booster.save_model(model, save_path, True, True, "", 10, use_safetensors=True)

    new_model = resnet18()
    booster.load_model(new_model, save_path)
```
[More design details](https://github.com/hpcaitech/ColossalAI/discussions/3046)

<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 booster_api.py -->
New file (+89 lines):

# Booster Usage

Author: [Mingyan Jiang](https://github.com/jiangmingyan)

**Prerequisites:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)

**Example Code**
- [Train with Booster](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/new_api/cifar_resnet/README.md)
## Introduction

In our new design, `colossalai.booster` replaces `colossalai.initialize` to seamlessly inject features into your training components (e.g. model, optimizer, dataloader). With the booster API, you can integrate our parallel strategies into the model to be trained more conveniently. Calling `colossalai.booster` is the basic step before you enter the training loop. In the sections below, we introduce how `colossalai.booster` works and the details to pay attention to when using it.

### Booster Plugins

A booster plugin is an important component that manages the parallel configuration (e.g. the Gemini plugin encapsulates the Gemini acceleration solution). The currently supported plugins are as follows:

***GeminiPlugin:*** Wraps the Gemini acceleration solution, i.e. ZeRO with chunk-based memory management.

***TorchDDPPlugin:*** Wraps the DDP acceleration solution. It implements data parallelism at the module level and can run across multiple machines.

***LowLevelZeroPlugin:*** Wraps stage 1/2 of the Zero Redundancy Optimizer. Stage 1: shards optimizer states across data-parallel workers/GPUs. Stage 2: shards optimizer states and gradients across data-parallel workers/GPUs.
### Booster API

{{ autodoc:colossalai.booster.Booster }}

{{ autodoc:colossalai.booster.Booster.boost }}

{{ autodoc:colossalai.booster.Booster.backward }}

{{ autodoc:colossalai.booster.Booster.no_sync }}

{{ autodoc:colossalai.booster.Booster.save_model }}

{{ autodoc:colossalai.booster.Booster.load_model }}

{{ autodoc:colossalai.booster.Booster.save_optimizer }}

{{ autodoc:colossalai.booster.Booster.load_optimizer }}

{{ autodoc:colossalai.booster.Booster.save_lr_scheduler }}

{{ autodoc:colossalai.booster.Booster.load_lr_scheduler }}
## Usage and Examples

When training with colossalai, you first need to launch the distributed environment at the beginning of the training script and create the objects you need, such as models, optimizers, loss functions, and data loaders. After that, call `colossalai.booster` to inject features into these objects, and you can then use our booster APIs for the rest of your training process.

The following pseudocode example shows how to train a model with our booster API:
```python
import torch
from torch.optim import SGD
from torchvision.models import resnet18

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

def train():
    # rank, world_size and port are assumed to be supplied by the launcher
    colossalai.launch(config=dict(), rank=rank, world_size=world_size, port=port, host='localhost')
    plugin = TorchDDPPlugin()
    booster = Booster(plugin=plugin)
    model = resnet18()
    criterion = lambda x: x.mean()
    optimizer = SGD(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
    model, optimizer, criterion, _, scheduler = booster.boost(model, optimizer, criterion, lr_scheduler=scheduler)

    x = torch.randn(4, 3, 224, 224)
    x = x.to('cuda')
    output = model(x)
    loss = criterion(output)
    booster.backward(loss, optimizer)
    optimizer.clip_grad_by_norm(1.0)
    optimizer.step()
    scheduler.step()

    save_path = "./model"
    booster.save_model(model, save_path, True, True, "", 10, use_safetensors=True)

    new_model = resnet18()
    booster.load_model(new_model, save_path)
```
[For more design details, see this discussion](https://github.com/hpcaitech/ColossalAI/discussions/3046)

<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 booster_api.py -->