14 changes: 12 additions & 2 deletions examples/README.md
@@ -10,9 +10,12 @@

## Overview

This folder provides several examples accelerated by Colossal-AI. The `tutorial` folder is for everyone to quickly try out the different features in Colossal-AI. Other folders such as `images` and `language` include a wide range of deep learning tasks and applications.
This folder provides several examples accelerated by Colossal-AI.
Folders such as `images` and `language` include a wide range of deep learning tasks and applications.
The `community` folder aims to create a collaborative platform for developers to contribute exotic features built on top of Colossal-AI.
The `tutorial` folder is for everyone to quickly try out the different features in Colossal-AI.

You can find applications such as Chatbot, Stable Diffusion and Biomedicine in the [Applications](https://github.com/hpcaitech/ColossalAI/tree/main/applications) directory.
You can find applications such as Chatbot, AIGC and Biomedicine in the [Applications](https://github.com/hpcaitech/ColossalAI/tree/main/applications) directory.

## Folder Structure

@@ -52,3 +55,10 @@ Therefore, it is essential for the example contributors to know how to integrate
2. Configure your testing parameters, such as the number of steps and the batch size, in `test_ci.sh`. Keep these parameters small so that each example takes only a few minutes.
3. Export your dataset path with the prefix `/data` and make sure you have a copy of the dataset in the `/data/scratch/examples-data` directory on the CI machine. Community contributors can contact us via Slack to request that the dataset be downloaded to the CI machine.
4. Implement the logic for dependency setup and example execution (a rough sketch follows this list).
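
As a rough illustration, the sketch below shows how an example script might expose the CI knobs that `test_ci.sh` sets. The script shape and flag names are hypothetical, not taken from any existing example.

```python
# Hypothetical CI-friendly entry point (illustrative only).
import argparse
import os


def parse_args():
    parser = argparse.ArgumentParser()
    # Keep these small so a CI run finishes within a few minutes.
    parser.add_argument("--steps", type=int, default=10, help="number of training steps")
    parser.add_argument("--batch-size", type=int, default=8, help="per-device batch size")
    # Datasets on the CI machine live under /data (see step 3 above).
    parser.add_argument("--data-path", type=str, default="/data/scratch/examples-data")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    assert os.path.isdir(args.data_path), "dataset must already exist on the CI machine"
    # ... set up dependencies, build the model, and run `args.steps` small steps ...
```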

## Community Dependencies
We are happy to introduce the following community repositories powered by Colossal-AI:
- [lightning-ColossalAI](https://github.com/Lightning-AI/lightning)
- [HCP-Diffusion](https://github.com/7eu7d7/HCP-Diffusion)
- [KoChatGPT](https://github.com/airobotlab/KoChatGPT)
- [minichatgpt](https://github.com/juncongmoo/minichatgpt)
28 changes: 28 additions & 0 deletions examples/community/README.md
@@ -0,0 +1,28 @@
# Community Examples

Community-driven examples is an initiative that allows users to share their own examples with the Colossal-AI community, fostering a sense of community and making it easy for others to access and benefit from shared work. The primary goal of community-driven examples is to build a community-maintained collection of diverse and exotic functionalities on top of the Colossal-AI package.

If a community example doesn't work as expected, you can [open an issue](https://github.com/hpcaitech/ColossalAI/issues/new/choose) and @ the author to report it.


| Example | Description | Code Example | Colab |Author |
|:------------------|:---------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------|:-----------------------------------------|-----------------------------------------------------:|
| RoBERTa | Adding RoBERTa for SFT and Prompts model training | [RoBERTa](./roberta) | - | [YY Lin](https://github.com/yynil) (Moore Threads) |
| TransformerEngine FP8 | Adding TransformerEngine with FP8 training | [TransformerEngine FP8](./fp8) | - | [Kirthi Shankar Sivamani](https://github.com/ksivaman) (NVIDIA) |
|...|...|...|...|...|

## Looking for Examples
* [Swin-Transformer](https://github.com/microsoft/Swin-Transformer)
* [T5](https://github.com/google-research/text-to-text-transfer-transformer)
* [Segment Anything (SAM)](https://github.com/facebookresearch/segment-anything)
* [ControlNet](https://github.com/lllyasviel/ControlNet)
* [Consistency Models](https://github.com/openai/consistency_models)
* [MAE](https://github.com/facebookresearch/mae)
* [CLIP](https://github.com/openai/CLIP)

You are welcome to [open an issue](https://github.com/hpcaitech/ColossalAI/issues/new/choose) to share your insights and needs.

## How to get involved
To join our community-driven initiative, please visit the [Colossal-AI examples](https://github.com/hpcaitech/ColossalAI/tree/main/examples), review the provided information, and explore the codebase.

To contribute, create a new issue outlining your proposed feature or enhancement, and our team will review it and provide feedback. If you are confident enough, you can also submit a PR directly. We look forward to collaborating with you on this exciting project!
@@ -1,13 +1,13 @@
# Basic MNIST Example with optional FP8 of TransformerEngine
[TransformerEngine](https://github.com/NVIDIA/TransformerEngine) is a library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper GPUs, to provide better performance with lower memory utilization in both training and inference.
Thanks for the contribution to this tutorial from NVIDIA.
```bash
python main.py
python main.py --use-te # Linear layers from TransformerEngine
python main.py --use-fp8 # FP8 + TransformerEngine for Linear layers
```
> We are working to integrate it with Colossal-AI and will finish it soon.
# Basic MNIST Example with optional TransformerEngine FP8

[TransformerEngine](https://github.com/NVIDIA/TransformerEngine) is a library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper GPUs, to provide better performance with lower memory utilization in both training and inference.

Thanks to NVIDIA for contributing this tutorial.

```bash
python main.py
python main.py --use-te # Linear layers from TransformerEngine
python main.py --use-fp8 # FP8 + TransformerEngine for Linear layers
```

> We are working to integrate it with Colossal-AI and will finish it soon.
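
For orientation, here is a minimal standalone sketch of the FP8 autocast pattern used in `main.py`. The layer sizes are arbitrary, and running with `enabled=True` assumes FP8-capable (Hopper-class) hardware.

```python
# Minimal sketch of TransformerEngine's FP8 autocast around a te.Linear layer.
import torch
from transformer_engine import pytorch as te

layer = te.Linear(784, 128, bias=True).cuda()
x = torch.randn(32, 784, device="cuda")

with te.fp8_autocast(enabled=True):  # forward pass runs in FP8 where supported
    y = layer(x)

y.sum().backward()  # backward flows through the FP8-aware kernels
```
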
@@ -3,12 +3,13 @@
# See LICENSE for license information.

import argparse

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR
from torchvision import datasets, transforms

try:
from transformer_engine import pytorch as te
@@ -18,6 +19,7 @@


class Net(nn.Module):

def __init__(self, use_te=False):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(1, 32, 3, 1)
@@ -62,12 +64,10 @@ def train(args, model, device, train_loader, optimizer, epoch, use_fp8):
loss.backward()
optimizer.step()
if batch_idx % args.log_interval == 0:
print(
f"Train Epoch: {epoch} "
f"[{batch_idx * len(data)}/{len(train_loader.dataset)} "
f"({100. * batch_idx / len(train_loader):.0f}%)]\t"
f"Loss: {loss.item():.6f}"
)
print(f"Train Epoch: {epoch} "
f"[{batch_idx * len(data)}/{len(train_loader.dataset)} "
f"({100. * batch_idx / len(train_loader):.0f}%)]\t"
f"Loss: {loss.item():.6f}")
if args.dry_run:
break

@@ -83,6 +83,7 @@ def calibrate(model, device, test_loader):
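# calibrating=True runs the forward pass in higher precision while still
# collecting FP8 scaling statistics (amax), so FP8 can be enabled later.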
with te.fp8_autocast(enabled=False, calibrating=True):
output = model(data)


def test(model, device, test_loader, use_fp8):
"""Testing function."""
model.eval()
@@ -93,21 +94,15 @@ def test(model, device, test_loader, use_fp8):
data, target = data.to(device), target.to(device)
with te.fp8_autocast(enabled=use_fp8):
output = model(data)
test_loss += F.nll_loss(
output, target, reduction="sum"
).item() # sum up batch loss
pred = output.argmax(
dim=1, keepdim=True
) # get the index of the max log-probability
test_loss += F.nll_loss(output, target, reduction="sum").item() # sum up batch loss
pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability
correct += pred.eq(target.view_as(pred)).sum().item()

test_loss /= len(test_loader.dataset)

print(
f"\nTest set: Average loss: {test_loss:.4f}, "
f"Accuracy: {correct}/{len(test_loader.dataset)} "
f"({100. * correct / len(test_loader.dataset):.0f}%)\n"
)
print(f"\nTest set: Average loss: {test_loss:.4f}, "
f"Accuracy: {correct}/{len(test_loader.dataset)} "
f"({100. * correct / len(test_loader.dataset):.0f}%)\n")


def main():
@@ -154,9 +149,7 @@ def main():
default=False,
help="quickly check a single pass",
)
parser.add_argument(
"--seed", type=int, default=1, metavar="S", help="random seed (default: 1)"
)
parser.add_argument("--seed", type=int, default=1, metavar="S", help="random seed (default: 1)")
parser.add_argument(
"--log-interval",
type=int,
@@ -170,15 +163,12 @@ def main():
default=False,
help="For Saving the current Model",
)
parser.add_argument(
"--use-fp8", action="store_true", default=False, help="Use FP8 for inference and training without recalibration"
)
parser.add_argument(
"--use-fp8-infer", action="store_true", default=False, help="Use FP8 inference only"
)
parser.add_argument(
"--use-te", action="store_true", default=False, help="Use Transformer Engine"
)
parser.add_argument("--use-fp8",
action="store_true",
default=False,
help="Use FP8 for inference and training without recalibration")
parser.add_argument("--use-fp8-infer", action="store_true", default=False, help="Use FP8 inference only")
parser.add_argument("--use-te", action="store_true", default=False, help="Use Transformer Engine")
args = parser.parse_args()
use_cuda = torch.cuda.is_available()

@@ -205,9 +195,7 @@ def main():
train_kwargs.update(cuda_kwargs)
test_kwargs.update(cuda_kwargs)

transform = transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
)
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
dataset1 = datasets.MNIST("../data", train=True, download=True, transform=transform)
dataset2 = datasets.MNIST("../data", train=False, transform=transform)
train_loader = torch.utils.data.DataLoader(dataset1, **train_kwargs)
@@ -227,7 +215,7 @@

if args.save_model or args.use_fp8_infer:
torch.save(model.state_dict(), "mnist_cnn.pt")
print('Eval with reloaded checkpoint : fp8='+str(args.use_fp8_infer))
print('Eval with reloaded checkpoint : fp8=' + str(args.use_fp8_infer))
weights = torch.load("mnist_cnn.pt")
model.load_state_dict(weights)
test(model, device, test_loader, args.use_fp8_infer)
@@ -11,7 +11,7 @@
```bash
ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub ip_destination
```

- On all hosts, edit /etc/hosts to record every host's name and IP. An example is shown below.

```bash
192.168.2.1 GPU001
```
@@ -29,7 +29,7 @@
```bash
service ssh restart
```

## 1. Corpus Preprocessing
```bash
cd preprocessing
```
@@ -21,7 +21,7 @@ This folder is used to preprocess the Chinese corpus with Whole Word Masking. You can
<span id='Split Sentence'/>

### 2.1. Split Sentence & Split Data into Multiple Shards:
First, each file contains multiple documents, and each document contains multiple sentences. Sentences are split on punctuation such as `。!`. **Second, split the data into multiple shards based on the server hardware (CPU, CPU memory, hard disk) and the corpus size.** Each shard contains part of the corpus, and the model must train on all shards to complete one epoch.
In this example, the 200GB corpus is split into 100 shards of roughly 2GB each. The shard size is memory-dependent, taking into account the number of servers, the memory used by the tokenizer, and the memory used by multi-process training to read the shards (n-way data parallelism requires n\*shard_size memory). **To sum up, data preprocessing and model pretraining require fighting with hardware, not just the GPU.** A simplified sketch of the split-and-shard step follows the command below.

@@ -49,7 +49,7 @@
```python
python sentence_split.py --input_path /orginal_corpus --output_path /shard --sha
```
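
For intuition, the following is a simplified Python sketch of the split-and-shard idea, not the repository's `sentence_split.py`; the round-robin shard assignment is an illustrative assumption.

```python
# Simplified sketch of sentence splitting and sharding (illustrative only).
import re


def split_sentences(document: str):
    # Split on Chinese sentence-ending punctuation, keeping the delimiter.
    return [s for s in re.split(r"(?<=[。!?])", document) if s.strip()]


def write_shards(documents, shard_dir, num_shards=100):
    shards = [open(f"{shard_dir}/{i}.txt", "w", encoding="utf-8") for i in range(num_shards)]
    for idx, doc in enumerate(documents):
        out = shards[idx % num_shards]  # round-robin keeps shard sizes roughly even
        out.write("\n".join(split_sentences(doc)) + "\n\n")  # blank line between documents
    for f in shards:
        f.close()
```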

<summary><b>Output txt:</b></summary>

```
我今天去打篮球。
```
@@ -76,7 +76,7 @@

* `--input_path`: location of all shards with split sentences, e.g., /shard/0.txt, /shard/1.txt ...
* `--output_path`: location of all h5 files with token_id, input_mask, segment_ids and masked_lm_positions, e.g., /h5/0.h5, /h5/1.h5 ...
* `--tokenizer_path`: path to the directory containing the Hugging Face tokenizer.json. Download config.json, special_tokens_map.json, vocab.txt and tokenizer.json from [hfl/chinese-roberta-wwm-ext-large](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large/tree/main) (see the sketch after this list)
* `--backend`: python or c++; **specifying c++ gives faster preprocessing**
* `--dupe_factor`: specifies how many times the preprocessor repeats to create the input from the same article/document
* `--worker`: number of processes
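
As a quick sanity check of the tokenizer files listed above, here is a minimal sketch; it assumes the four files were downloaded into a local `./tokenizer/` directory.

```python
# Minimal sketch: load the downloaded tokenizer files (./tokenizer/ is an
# assumed local path) and inspect the encoded output.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("./tokenizer")
enc = tokenizer("我今天去打篮球。", padding="max_length", max_length=16, truncation=True)
print(enc["input_ids"])       # token ids, zero-padded to max_length
print(enc["attention_mask"])  # 1 for real tokens, 0 for padding
```
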
@@ -91,7 +91,7 @@
```
下周请假。
```

<summary><b>Output h5+numpy:</b></summary>

```
'input_ids': [[id0,id1,id2,id3,id4,id5,id6,0,0..],
@@ -102,4 +102,4 @@
...]
'masked_lm_positions': [[label1,-1,-1,label2,-1...],
...]
```
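
To make the layout above concrete, here is a small sketch for inspecting one output shard with `h5py`; the path and dataset keys are assumed to match the names shown.

```python
# Sketch: inspect one preprocessed shard (path and key names are assumptions).
import h5py

with h5py.File("/h5/0.h5", "r") as f:
    input_ids = f["input_ids"][:]            # (num_samples, seq_len), zero-padded
    input_mask = f["input_mask"][:]          # 1 for real tokens, 0 for padding
    positions = f["masked_lm_positions"][:]  # masked-position labels per sample
    print(input_ids.shape, input_mask.shape, positions.shape)
```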