Merged
95 changes: 28 additions & 67 deletions examples/tutorial/auto_parallel/README.md
@@ -1,73 +1,52 @@
# Auto-Parallelism with ResNet
# Auto-Parallelism

## 🚀 Quick Start
### Auto-Parallel Tutorial
1. Install `pulp` and `coin-or-cbc` for the solver.
```bash
pip install pulp
conda install -c conda-forge coin-or-cbc
```
2. Run the auto-parallel ResNet example on 4 GPUs with a synthetic dataset.
```bash
colossalai run --nproc_per_node 4 auto_parallel_with_resnet.py -s
```
## Table of contents

You should expect a log like the one below. It shows the edge cost on the computation graph as well as the sharding strategy for each operation. For example, `layer1_0_conv1 S01R = S01R X RR` means that the first dimension (batch) of the input and output is sharded while the weight is not (S means sharded, R means replicated), which is simply equivalent to data-parallel training.
![](https://raw.githubusercontent.com/hpcaitech/public_assets/main/examples/tutorial/auto-parallel%20demo.png)
- [Auto-Parallelism](#auto-parallelism)
- [Table of contents](#table-of-contents)
- [📚 Overview](#-overview)
- [🚀 Quick Start](#-quick-start)
- [Setup](#setup)
- [Auto-Parallel Tutorial](#auto-parallel-tutorial)
- [Auto-Checkpoint Tutorial](#auto-checkpoint-tutorial)


### Auto-Checkpoint Tutorial
1. Stay in the `auto_parallel` folder.
2. Install the dependencies.
```bash
pip install matplotlib transformers
```
3. Run a simple resnet50 benchmark to automatically checkpoint the model.
```bash
python auto_ckpt_solver_test.py --model resnet50
```
## 📚 Overview

You should expect a log like this:
![](https://raw.githubusercontent.com/hpcaitech/public_assets/main/examples/tutorial/auto-ckpt%20demo.png)
This tutorial folder contains a simple demo of auto-parallelism with ResNet. This directory also contains demo scripts for automatic activation checkpointing. Both features are still experimental, and there is no guarantee that they will work with your version of Colossal-AI.

This shows that, given different memory budgets, activation checkpoints are automatically injected into the model, along with the time taken per iteration. You can run this benchmark for GPT as well, but it can take much longer since the model is larger.
```bash
python auto_ckpt_solver_test.py --model gpt2
```
## 🚀 Quick Start

4. Run a simple benchmark to find the optimal batch size for the checkpointed model.
```bash
python auto_ckpt_batchsize_test.py
```
### Setup

You can expect a log like this:
![](https://raw.githubusercontent.com/hpcaitech/public_assets/main/examples/tutorial/auto-ckpt%20batchsize.png)


## Prepare Dataset

We use the CIFAR-10 dataset in this example. You should invoke `download_cifar10.py` in the tutorial root directory, or directly run `auto_parallel_with_resnet.py`.
The dataset will be downloaded to `colossalai/examples/tutorials/data` by default.
If you wish to use a customized directory for the dataset, you can set the environment variable `DATA` via the following command.
1. Create a conda environment

```bash
export DATA=/path/to/data
conda create -n auto python=3.8
conda activate auto
```

## Extra requirements to use auto-parallelism
2. Install the dependencies in `requirements.txt` and `coin-or-cbc` for the solver.

```bash
pip install pulp
conda install coin-or-cbc
pip install -r requirements.txt
conda install -c conda-forge coin-or-cbc
```
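Once `pulp` and CBC are installed, you can sanity-check the solver backend by solving a toy integer program (a minimal sketch for verifying the setup, unrelated to the internal problems Colossal-AI's solver builds):

```python
import pulp

# tiny sanity check that pulp can drive the CBC solver
prob = pulp.LpProblem("sanity", pulp.LpMinimize)
x = pulp.LpVariable("x", lowBound=0, cat="Integer")
prob += x            # objective: minimize x
prob += x >= 3       # constraint: x must be at least 3

status = prob.solve(pulp.PULP_CBC_CMD(msg=False))
assert pulp.LpStatus[status] == "Optimal"
assert pulp.value(x) == 3   # CBC finds the minimum feasible integer
```

If the assertions pass, CBC is correctly wired up and the auto-parallel solver should be able to use it.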

## Run on 2*2 device mesh

### Auto-Parallel Tutorial

Run the auto-parallel ResNet example on 4 GPUs with a synthetic dataset.

```bash
colossalai run --nproc_per_node 4 auto_parallel_with_resnet.py
```

## Auto Checkpoint Benchmarking
You should expect a log like the one below. It shows the edge cost on the computation graph as well as the sharding strategy for each operation. For example, `layer1_0_conv1 S01R = S01R X RR` means that the first dimension (batch) of the input and output is sharded while the weight is not (S means sharded, R means replicated), which is simply equivalent to data-parallel training.
![](https://raw.githubusercontent.com/hpcaitech/public_assets/main/examples/tutorial/auto-parallel%20demo.png)
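The notation above can be illustrated with a plain NumPy sketch (a hypothetical illustration, not the Colossal-AI API): `S01R` shards the first dimension across both mesh axes (4 devices on a 2×2 mesh) and replicates the second, so `S01R = S01R X RR` for a matmul means each device multiplies its batch shard by a full copy of the weight.

```python
import numpy as np

# hypothetical illustration of the sharding-spec notation, not Colossal-AI code:
# S01R = S01R X RR  ->  output (batch-sharded) = input (batch-sharded) @ weight (replicated)
mesh_size = 4                # a 2x2 device mesh, both axes used for the batch dim
x = np.random.rand(8, 16)    # input:  batch 8, features 16
w = np.random.rand(16, 32)   # weight: replicated on every device (RR)

# S01R: split the batch dimension across all 4 devices
x_shards = np.split(x, mesh_size, axis=0)

# each "device" multiplies its shard by the full weight
y_shards = [shard @ w for shard in x_shards]

# concatenating the sharded outputs recovers the unsharded result
y = np.concatenate(y_shards, axis=0)
assert np.allclose(y, x @ w)   # identical to the single-device computation
```

This is exactly data parallelism: activations are split along the batch dimension while every device holds an identical weight replica.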


### Auto-Checkpoint Tutorial

We prepare two benchmarks for you to test the performance of auto checkpointing:

@@ -86,21 +65,3 @@ python auto_ckpt_solver_test.py --model resnet50
# run auto_ckpt_batchsize_test.py
python auto_ckpt_batchsize_test.py
```
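What the solver automates can be seen in miniature with PyTorch's built-in activation checkpointing (a hand-written sketch, not the auto-checkpoint solver itself): a checkpointed segment discards its intermediate activations during the forward pass and recomputes them during backward, trading compute for memory while producing identical results.

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
w1 = torch.randn(8, 8, requires_grad=True)
w2 = torch.randn(8, 8, requires_grad=True)
x = torch.randn(4, 8)

def block(inp):
    # the ReLU activation here is what checkpointing avoids storing
    return torch.relu(inp @ w1) @ w2

y_plain = block(x).sum()
y_ckpt = checkpoint(block, x, use_reentrant=False).sum()

# checkpointing changes memory behavior, not numerics
assert torch.allclose(y_plain, y_ckpt)
```

The solver's job is to decide *which* segments to checkpoint so that the model fits a given memory budget with minimal recomputation overhead.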

There are some results for your reference

## Auto Checkpoint Solver Test

### ResNet 50
![](https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/tutorial/resnet50_benchmark.png)

### GPT2 Medium
![](https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/tutorial/gpt2_benchmark.png)

## Auto Checkpoint Batch Size Test
```bash
===============test summary================
batch_size: 512, peak memory: 73314.392 MB, throughput: 254.286 images/s
batch_size: 1024, peak memory: 73316.216 MB, throughput: 397.608 images/s
batch_size: 2048, peak memory: 72927.837 MB, throughput: 277.429 images/s
```
18 changes: 3 additions & 15 deletions examples/tutorial/auto_parallel/auto_parallel_with_resnet.py
@@ -1,11 +1,4 @@
import argparse
import os
from pathlib import Path

import torch
from titans.utils import barrier_context
from torchvision import transforms
from torchvision.datasets import CIFAR10
from torchvision.models import resnet50
from tqdm import tqdm

@@ -14,9 +7,6 @@
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.nn.lr_scheduler import CosineAnnealingLR
from colossalai.utils import get_dataloader

DATA_ROOT = Path(os.environ.get('DATA', '../data')).absolute()


def synthesize_data():
@@ -48,9 +38,8 @@ def main():
model.train()

# if we use synthetic data
# we assume it only has 30 steps per epoch
num_steps = range(30)

# we assume it only has 10 steps per epoch
num_steps = range(10)
progress = tqdm(num_steps)

for _ in progress:
@@ -73,8 +62,7 @@

# if we use synthetic data
# we assume it only has 10 steps for evaluation
num_steps = range(30)

num_steps = range(10)
progress = tqdm(num_steps)

for _ in progress:
4 changes: 2 additions & 2 deletions examples/tutorial/auto_parallel/config.py
@@ -1,2 +1,2 @@
BATCH_SIZE = 128
NUM_EPOCHS = 10
BATCH_SIZE = 32
NUM_EPOCHS = 2
32 changes: 0 additions & 32 deletions examples/tutorial/auto_parallel/environment.yaml

This file was deleted.

9 changes: 7 additions & 2 deletions examples/tutorial/auto_parallel/requirements.txt
@@ -1,2 +1,7 @@
colossalai >= 0.1.12
torch >= 1.8.1
torch
colossalai
titans
pulp
datasets
matplotlib
transformers
11 changes: 3 additions & 8 deletions examples/tutorial/auto_parallel/test_ci.sh
@@ -1,11 +1,6 @@
#!/bin/bash
set -euxo pipefail

conda init bash
conda env create -f environment.yaml
conda activate auto
cd ../../..
pip uninstall colossalai
pip install -v .
cd ./examples/tutorial/auto_parallel
colossalai run --nproc_per_node 4 auto_parallel_with_resnet.py -s
pip install -r requirements.txt
conda install -c conda-forge coin-or-cbc
colossalai run --nproc_per_node 4 auto_parallel_with_resnet.py