Merged
35 commits
fea2565
[shardformer] update shardformer readme
flybird11111 Sep 5, 2023
b2a2d13
[shardformer] update llama2/opt finetune example and shardformer upda…
flybird11111 Sep 6, 2023
a89e948
Merge branch 'feature/shardformer' of https://github.com/flybird11111…
flybird11111 Sep 6, 2023
0d5d5b2
[shardformer] update llama2/opt finetune example and shardformer upda…
flybird11111 Sep 6, 2023
82d76a8
[shardformer] update llama2/opt finetune example and shardformer upda…
flybird11111 Sep 6, 2023
6b71f75
Merge branch 'main' into feature/shardformer
flybird11111 Sep 7, 2023
05097f0
[shardformer] change dataset
flybird11111 Sep 7, 2023
fbac97d
Merge branch 'feature/shardformer' of https://github.com/flybird11111…
flybird11111 Sep 7, 2023
f06e22a
[shardformer] change dataset
flybird11111 Sep 7, 2023
abfe7a1
[shardformer] fix CI
flybird11111 Sep 7, 2023
b3e2869
[shardformer] fix
flybird11111 Sep 7, 2023
d641035
[shardformer] fix
flybird11111 Sep 7, 2023
f12bd7e
[shardformer] fix
flybird11111 Sep 7, 2023
d25fbde
[shardformer] fix
flybird11111 Sep 7, 2023
e84b267
[shardformer] fix
flybird11111 Sep 7, 2023
3f35976
[example] llama2 add finetune example
flybird11111 Sep 8, 2023
8033bfe
[example] llama2 add finetune example
flybird11111 Sep 8, 2023
d1c5f58
[example] llama2 add finetune example
flybird11111 Sep 8, 2023
8f1b6aa
[example] llama2 add finetune example
flybird11111 Sep 8, 2023
0617fe1
Merge branch 'main' into llama2
flybird11111 Sep 9, 2023
3142897
fix
flybird11111 Sep 11, 2023
1df5dc4
Merge branch 'llama2' of https://github.com/flybird11111/ColossalAI i…
flybird11111 Sep 11, 2023
5240696
update llama2 example
flybird11111 Sep 11, 2023
1fa07af
update llama2 example
flybird11111 Sep 11, 2023
4e7e5fd
fix
flybird11111 Sep 11, 2023
f258e90
update llama2 example
flybird11111 Sep 13, 2023
9ca9113
update llama2 example
flybird11111 Sep 13, 2023
ae03409
update llama2 example
flybird11111 Sep 13, 2023
bb355d7
update llama2 example
flybird11111 Sep 14, 2023
43cc09b
update llama2 example
flybird11111 Sep 14, 2023
fb16ca5
update llama2 example
flybird11111 Sep 14, 2023
591042c
Update requirements.txt
flybird11111 Sep 15, 2023
74f19a7
update llama2 example
flybird11111 Sep 15, 2023
3589248
update llama2 example
flybird11111 Sep 15, 2023
43b09df
update llama2 example
flybird11111 Sep 15, 2023
4 changes: 3 additions & 1 deletion colossalai/checkpoint_io/hybrid_parallel_checkpoint_io.py
@@ -13,6 +13,7 @@
from torch.optim import Optimizer
from torch.optim.lr_scheduler import _LRScheduler as LRScheduler

from colossalai.cluster import DistCoordinator
from colossalai.interface import OptimizerWrapper

from .general_checkpoint_io import GeneralCheckpointIO
@@ -71,6 +72,7 @@ def __init__(self,
self.verbose = verbose
self.working_to_master_map = None
self.master_to_working_map = None
self.coordinator = DistCoordinator()

@staticmethod
def _model_sharder(model: nn.Module,
@@ -655,7 +657,7 @@ def gather_from_sharded_optimizer_state(state: OrderedDict, param: torch.Tensor,
dist.all_gather(gather_tensor, v, group=tp_group)
v = torch.cat(gather_tensor, dim=partition_dim)

state_[k] = v.detach().clone().cpu()
state_[k] = v.detach().clone().cpu()

return state_

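For context on the optimizer-state gathering path touched in this diff, here is a minimal, illustrative sketch of the all-gather pattern that `gather_from_sharded_optimizer_state` builds on. The standalone function name and the explicit `tp_group` argument are assumptions made for the sketch, not the library's actual API.

```python
import torch
import torch.distributed as dist

def gather_sharded_state_value(v: torch.Tensor, partition_dim: int, tp_group) -> torch.Tensor:
    """Gather one tensor-parallel-sharded optimizer state tensor onto every rank.

    Assumes torch.distributed is already initialized and ``v`` is this rank's shard.
    """
    tp_size = dist.get_world_size(group=tp_group)
    # One receive buffer per tensor-parallel rank, then all-gather the local shard.
    gather_tensor = [torch.empty_like(v) for _ in range(tp_size)]
    dist.all_gather(gather_tensor, v, group=tp_group)
    # Stitch the shards back together along the partitioned dimension.
    full = torch.cat(gather_tensor, dim=partition_dim)
    # Detach and offload to CPU so the gathered checkpoint state frees GPU memory.
    return full.detach().clone().cpu()
```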
7 changes: 3 additions & 4 deletions examples/language/bert/finetune.py
@@ -129,14 +129,13 @@ def train_epoch(epoch: int, model: nn.Module, optimizer: Optimizer, _criterion:

use_pipeline = isinstance(booster.plugin, HybridParallelPlugin) and booster.plugin.pp_size > 1
is_pp_last_stage = use_pipeline and booster.plugin.stage_manager.is_last_stage()
print_flag = (not use_pipeline and coordinator.is_master()) or (use_pipeline and is_pp_last_stage)
total_step = len(train_dataloader)

model.train()
optimizer.zero_grad()
train_dataloader_iter = iter(train_dataloader)
with tqdm(range(total_step),
desc=f'Epoch [{epoch + 1}/{NUM_EPOCHS}]',
disable=not (coordinator.is_master() or is_pp_last_stage)) as pbar:
with tqdm(range(total_step), desc=f'Epoch [{epoch + 1}/{NUM_EPOCHS}]', disable=not print_flag) as pbar:
# Forward pass
for _ in pbar:
if use_pipeline:
@@ -192,13 +191,13 @@ def main():
model_name = "albert-xxlarge-v2"
else:
raise RuntimeError

# ==============================
# Launch Distributed Environment
# ==============================
colossalai.launch_from_torch(config={}, seed=42)
coordinator = DistCoordinator()

# local_batch_size = BATCH_SIZE // coordinator.world_size
lr = LEARNING_RATE * coordinator.world_size

# ==============================
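The `print_flag` introduced above exists because, under pipeline parallelism, the loss is only available on the last pipeline stage, so the progress bar should be driven by that stage rather than the global master rank. A minimal illustration of the gating logic (the helper name is ours, not the example's):

```python
def should_show_progress(use_pipeline: bool, is_master: bool, is_pp_last_stage: bool) -> bool:
    # Without pipeline parallelism, only the global master rank reports progress;
    # with pipeline parallelism, the last stage holds the loss, so it reports instead.
    return (not use_pipeline and is_master) or (use_pipeline and is_pp_last_stage)
```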
41 changes: 39 additions & 2 deletions examples/language/llama2/README.md
@@ -6,7 +6,7 @@
</p>

- 70 billion parameter LLaMA2 model training accelerated by 195%
[[code]](https://github.com/hpcaitech/ColossalAI/tree/example/llama/examples/language/llama)
[[code]](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/llama2)
[[blog]](https://www.hpc-ai.tech/blog/70b-llama2-training)

### LLaMA1
@@ -92,7 +92,7 @@ Make sure master node can access all nodes (including itself) by ssh without pas
Here are the details about the CLI arguments:

- Model configuration: `-c`, `--config`. `7b`, `13b`, `30b` and `65b` are supported for LLaMA-1, `7b`, `13b`, and `70b` are supported for LLaMA-2.
- Booster plugin: `-p`, `--plugin`. `gemini`, `gemini_auto`, `zero2` and `zero2_cpu` are supported. For more details, please refer to [Booster plugins](https://colossalai.org/docs/basics/booster_plugins).
- Booster plugin: `-p`, `--plugin`. `gemini`, `gemini_auto`, `zero2`, `hybrid_parallel` and `zero2_cpu` are supported. For more details, please refer to [Booster plugins](https://colossalai.org/docs/basics/booster_plugins).
- Dataset path: `-d`, `--dataset`. The default dataset is `togethercomputer/RedPajama-Data-1T-Sample`. It supports any dataset from `datasets` with the same data format as RedPajama.
- Number of epochs: `-e`, `--num_epochs`. The default value is 1.
- Local batch size: `-b`, `--batch_size`. Batch size per GPU. The default value is 2.
@@ -192,3 +192,40 @@ If you run the above command successfully, you will get the following results:
year={2023}
}
```


# Fine-tune Llama2

We also provide an example of fine-tuning Llama2 in `finetune.py`.

Make sure the master node can access all nodes (including itself) via SSH without a password.

Here are the details about the CLI arguments:

- Pretrained checkpoint path: `--model_path`. The path to your model checkpoint; it can be a local directory or a Hugging Face model tag.
- Booster plugin: `-p`, `--plugin`. `gemini`, `gemini_auto`, `zero2`, `hybrid_parallel` and `zero2_cpu` are supported. For more details, please refer to [Booster plugins](https://colossalai.org/docs/basics/booster_plugins).
- Dataset path: `-d`, `--dataset`. The default dataset is `yizhongw/self_instruct`. It supports any dataset from `datasets` with the same data format as `yizhongw/self_instruct`.
- Task name: `--task_name`. The task to fine-tune on; it also determines which part of the dataset is loaded. The default value is `super_natural_instructions`.
- Number of epochs: `-e`, `--num_epochs`. The default value is 1.
- Local batch size: `-b`, `--batch_size`. Batch size per GPU. The default value is 2.
- Learning rate: `--lr`. The default value is 3e-4.
- Weight decay: `-w`, `--weight_decay`. The default value is 0.1.
- Gradient checkpointing: `-g`, `--gradient_checkpoint`. The default value is `False`. This saves memory at the cost of speed, so it is worth enabling when training with a large batch size.
- Max length: `-l`, `--max_length`. The default value is 4096.
- Mixed precision: `-x`, `--mixed_precision`. The default value is "fp16". "fp16" and "bf16" are supported.
- Save interval: `-i`, `--save_interval`. The interval (steps) of saving checkpoints. The default value is 1000.
- Checkpoint directory: `-o`, `--save_dir`. The directory path for saving checkpoints. The default value is `checkpoint`.
- Checkpoint to load: `-f`, `--load`. The checkpoint path to load. The default value is `None`.
- Gradient clipping: `--gradient_clipping`. The default value is 1.0.
- TensorBoard log directory: `-t`, `--tensorboard_dir`. The directory path for saving TensorBoard logs. The default value is `tb_logs`.
- Flash attention: `-a`, `--flash_attention`. To use flash attention, you must install `flash-attn`. The default value is `False`. It accelerates training while saving memory, so we recommend always enabling it.


```shell
torchrun --standalone --nproc_per_node 8 finetune.py \
--plugin "hybrid_parallel" \
--dataset "yizhongw/self_instruct" \
--model_path "/path/llama" \
--task_name "super_natural_instructions" \
--save_dir "/path/output"
```
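For readers who want to see roughly where the `--plugin "hybrid_parallel"` flag ends up, the snippet below is a minimal sketch of wiring ColossalAI's `HybridParallelPlugin` into a training script via the `Booster` API. The model configuration, parallel degrees, and optimizer settings are placeholders chosen for illustration; they are not the exact values or structure of `finetune.py`, and constructor arguments may differ across ColossalAI versions.

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin
from transformers import LlamaConfig, LlamaForCausalLM

# Launch the distributed environment (run with torchrun so that
# world_size == tp_size * pp_size * dp_size).
colossalai.launch_from_torch(config={}, seed=42)

# A tiny randomly initialized LLaMA model, used here only so the sketch runs
# without downloading a checkpoint; finetune.py instead loads weights from --model_path.
config = LlamaConfig(hidden_size=128, intermediate_size=256,
                     num_hidden_layers=2, num_attention_heads=4)
model = LlamaForCausalLM(config)

# Parallel degrees below are placeholders; pick them to match your launch size.
plugin = HybridParallelPlugin(tp_size=2, pp_size=1, precision="fp16")
booster = Booster(plugin=plugin)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

# boost() shards the model, wraps the optimizer, and applies mixed precision.
model, optimizer, _, _, _ = booster.boost(model, optimizer)
```

A sketch like this would be launched with, for example, `torchrun --standalone --nproc_per_node 2 sketch.py` so that the two tensor-parallel ranks exist.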