
#161

Merged
jamesthesnake merged 45 commits into jamesthesnake:l from hpcaitech:main on Sep 8, 2023

Conversation

@jamesthesnake
Owner

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

FoolPlayer and others added 30 commits August 16, 2023 15:41
* [shardformer/sequence parallel] Support sequence parallel for gpt2 (#4384)

* [sequence parallel] add sequence parallel linear col/row support (#4336)

* add sequence parallel linear col/row support

* add annotation

* add annotation

* add support for gpt2 fused qkv linear layer

* support sequence parallel in GPT2

* add docstring and note

* add requirements

* remove unused flash-attn

* modify flash attn test

* modify flash attn setting

* modify flash attn code

* add assert before divide, rename forward function

* [shardformer/test] fix gpt2 test with seq-parallel

* [shardformer/sequence parallel] Overlap input gather and grad computation during col backward (#4401)

* overlap gather input / grad computing during col backward

* modify test for overlap

* simplify code

* fix code and modify cuda stream synchronize

* [shardformer/sequence parallel] polish code
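The sequence-parallel commits above split activations along the sequence dimension and gather them back before the column-parallel matmul. A minimal sketch of that core idea, assuming a generic PyTorch/`torch.distributed` setup (class and variable names here are illustrative, not the PR's actual layer implementation):

```python
# Illustrative sketch only: a column-parallel linear that accepts input sharded
# along the sequence dimension, all-gathers it to the full sequence, then applies
# this rank's column shard of the weight.
import torch
import torch.nn as nn
import torch.distributed as dist

class SeqParallelLinearCol(nn.Module):
    def __init__(self, in_features: int, out_features: int, tp_group=None):
        super().__init__()
        self.tp_group = tp_group
        tp_size = dist.get_world_size(tp_group)
        assert out_features % tp_size == 0, "out_features must divide evenly over TP ranks"
        # each rank holds a column shard of the full weight matrix
        self.weight = nn.Parameter(torch.empty(out_features // tp_size, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len / tp_size, hidden), sharded along the sequence dim
        tp_size = dist.get_world_size(self.tp_group)
        gathered = [torch.empty_like(x) for _ in range(tp_size)]
        dist.all_gather(gathered, x.contiguous(), group=self.tp_group)
        full_seq = torch.cat(gathered, dim=1)                 # (batch, seq_len, hidden)
        return nn.functional.linear(full_seq, self.weight)    # column-sharded output
```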
* support DDP for HybridPlugin/add tp+dp tests

* add docstring for HybridParallelPlugin
* [test] remove cpu marker

* [test] remove gpu marker

* [test] update pytest markers

* [ci] update unit test ci
* support interleaved pipeline

* fix unit test

* remove virtual stage test in stage mgr

* add dropped type hint and updated bwd
…tp (#4460)

* support gpt2 seq parallel with pp/dp/tp

* fix a bug when waiting for stream done

* delete unused gpt2_seq file
[shardformer] bloom support sequence parallel
* [shardformer] bert support sequence parallel
* add some base tests and policies

* finish whisper base model

* add conditional generation

* finish basic tests

* whisper

* finish whisper

* finish whisper

* del useless whisper test

* fix

* add argmin to replace

* finish revision
* support tp+zero/input type cast for hybridplugin

* add tp+zero tests

* fix bucket arguments
* [shardformer] chatglm support sequence parallel

* fix
…ome fix. (#4498)

* [shardformer] chatglm support sequence parallel

* fix

* [shardformer] jit fused fix

* activate checks
* [shardformer] chatglm support sequence parallel

* fix

* [shardformer] jit fused fix

* activate checks

* [Test] test ci

* fix
…lelPlugin (#4506)

* add APIs

* implement save_sharded_model

* add test for hybrid checkpointio

* implement naive loading for sharded model

* implement efficient sharded model loading

* open a new file for hybrid checkpoint_io

* small fix

* fix circular importing

* fix docstring

* arrange arguments and apis

* small fix
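Behind "save_sharded_model" and "efficient sharded model loading" is the usual pattern of streaming a state dict into size-capped shard files plus an index that maps each tensor name to its file; loading then only reads the shards it needs. A simplified, self-contained sketch (the file names and index layout are assumptions, not the plugin's actual checkpoint format):

```python
# Simplified sketch of size-capped sharded saving of a model state dict.
import json
import os
import torch

def save_sharded(state_dict: dict, save_dir: str, max_shard_bytes: int = 1 << 30) -> None:
    os.makedirs(save_dir, exist_ok=True)
    weight_map, shard, shard_bytes, shard_id = {}, {}, 0, 0

    def flush(current_shard: dict, current_id: int) -> str:
        fname = f"model-shard-{current_id:05d}.bin"
        torch.save(current_shard, os.path.join(save_dir, fname))
        return fname

    for name, tensor in state_dict.items():
        nbytes = tensor.numel() * tensor.element_size()
        if shard and shard_bytes + nbytes > max_shard_bytes:
            fname = flush(shard, shard_id)
            weight_map.update({key: fname for key in shard})
            shard, shard_bytes, shard_id = {}, 0, shard_id + 1
        shard[name] = tensor.cpu()
        shard_bytes += nbytes

    if shard:
        fname = flush(shard, shard_id)
        weight_map.update({key: fname for key in shard})

    # the index lets a loader open only the shards holding the tensors it needs
    with open(os.path.join(save_dir, "model.index.json"), "w") as f:
        json.dump({"weight_map": weight_map}, f, indent=2)
```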
* pause

* finish pp+zero1

* Update test_shard_vit.py
…on in shardco… (#4516)

* fix overlap bug and support bert, add overlap as an option in shardconfig

* support overlap for chatglm and bloom
* add overlap support for gpt2

* remove unused code

* remove unused code
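The "overlap" option mentioned in these commits hides the sequence-dimension all-gather of the saved input behind the input-gradient computation in the column-parallel backward pass. A conceptual sketch, assuming plain `torch.distributed` async collectives (the actual implementation also manages CUDA streams and scatters the input gradient back along the sequence dimension):

```python
# Conceptual only: overlap the input all-gather with the grad-input matmul.
import torch
import torch.distributed as dist

def col_backward_with_overlap(grad_output, x_shard, weight, tp_group):
    # grad_output: (batch, seq, out_shard); x_shard: (batch, seq / tp_size, hidden)
    tp_size = dist.get_world_size(tp_group)
    gathered = [torch.empty_like(x_shard) for _ in range(tp_size)]

    # launch the sequence-dim all-gather asynchronously
    handle = dist.all_gather(gathered, x_shard.contiguous(),
                             group=tp_group, async_op=True)

    # while the gather is in flight, compute the gradient w.r.t. the input
    grad_input = grad_output.matmul(weight)                   # (batch, seq, hidden)

    # the weight gradient needs the full-sequence input, so wait here
    handle.wait()
    full_input = torch.cat(gathered, dim=1)                   # (batch, seq, hidden)
    grad_weight = grad_output.flatten(0, 1).t().matmul(full_input.flatten(0, 1))
    return grad_input, grad_weight
```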
* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix
* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] pp+tp+zero1
…lPlugin (#4540)

* implement sharded optimizer saving

* add more param info

* finish implementation of sharded optimizer saving

* fix bugs in optimizer sharded saving

* add pp+zero test

* param group loading

* greedy loading of optimizer

* fix bug when loading

* implement optimizer sharded saving

* add optimizer test & arrange checkpointIO utils

* fix gemini sharding state_dict

* add verbose option

* add loading of master params

* fix typehint

* fix master/working mapping in fp16 amp
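"Master/working mapping in fp16 amp" refers to the mixed-precision optimizer keeping fp32 master copies of the fp16 working parameters; checkpoints save the masters, and the mapping ties them back to the model's fp16 weights. A toy illustration of that relationship, assuming nothing about the plugin's actual optimizer wrapper:

```python
# Toy illustration of the fp16 "working" vs fp32 "master" parameter mapping.
import torch

class ToyMixedPrecisionOptimizer:
    def __init__(self, optimizer: torch.optim.Optimizer):
        self.optim = optimizer
        self.master_to_working = {}
        for group in self.optim.param_groups:
            masters = []
            for p in group["params"]:                  # fp16 "working" param
                master = p.detach().clone().float()    # fp32 "master" copy
                self.master_to_working[master] = p
                masters.append(master)
            group["params"] = masters                  # the optimizer steps on masters

    def step(self):
        # gradients live on the working params; move them onto the masters
        for master, working in self.master_to_working.items():
            master.grad = None if working.grad is None else working.grad.float()
        self.optim.step()
        # write the updated fp32 values back into the fp16 working params
        for master, working in self.master_to_working.items():
            working.data.copy_(master.data)
```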
…arallelPlugin (#4575)

* hybrid plugin support huggingface from_pretrained

* add huggingface compatibility tests

* add folder cleaning

* fix bugs
* pytree test

* test bert

* revise

* add register
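"Make compatible with pytree" is about letting the pipeline schedule flatten custom model-output containers into plain tensors and rebuild them afterwards. A hedged sketch of the mechanism using PyTorch's pytree registry (`torch.utils._pytree` is a private, version-dependent module, and the container class below is made up for the example):

```python
# Hedged example: register a custom output container with torch's pytree utils.
from dataclasses import dataclass
import torch
from torch.utils._pytree import _register_pytree_node, tree_flatten, tree_unflatten

@dataclass
class ToyModelOutput:
    logits: torch.Tensor
    hidden: torch.Tensor

def _flatten(out: ToyModelOutput):
    return [out.logits, out.hidden], None       # (children, context)

def _unflatten(children, context):
    return ToyModelOutput(*children)

_register_pytree_node(ToyModelOutput, _flatten, _unflatten)

out = ToyModelOutput(torch.zeros(2, 4), torch.ones(2, 8))
leaves, spec = tree_flatten(out)                # plain list of tensors
rebuilt = tree_unflatten(leaves, spec)          # ToyModelOutput again
```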
…4584)

* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] add bert finetune example

* [shardformer] fix epoch change

* [shardformer] broadcast add pp group

* [shardformer] fix opt test hanging

* fix

* test

* test

* [shardformer] zero1+pp and the corresponding tests (#4517)

* pause

* finish pp+zero1

* Update test_shard_vit.py

* [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516)

* fix overlap bug and support bert, add overlap as an option in shardconfig

* support overlap for chatglm and bloom

* [shardformer] fix emerged bugs after updating transformers (#4526)

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] Add overlap support for gpt2 (#4535)

* add overlap support for gpt2

* remove unused code

* remove unused code

* [shardformer] support pp+tp+zero1 tests (#4531)

* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] pp+tp+zero1

* [shardformer] fix submodule replacement bug when enabling pp (#4544)

* [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540)

* implement sharded optimizer saving

* add more param info

* finish implementation of sharded optimizer saving

* fix bugs in optimizer sharded saving

* add pp+zero test

* param group loading

* greedy loading of optimizer

* fix bug when loading

* implement optimizer sharded saving

* add optimizer test & arrange checkpointIO utils

* fix gemini sharding state_dict

* add verbose option

* add loading of master params

* fix typehint

* fix master/working mapping in fp16 amp

* [shardformer] add bert finetune example

* [shardformer] fix epoch change

* [shardformer] broadcast add pp group

* rebase feature/shardformer

* update pipeline

* [shardformer] fix

* [shardformer] fix

* [shardformer] bert finetune fix

* [shardformer] add all_reduce operation to loss

* [shardformer] make compatible with pytree

* [shardformer] disable tp

* [shardformer] add 3d plugin to ci test

* [shardformer] update num_microbatches to None

* [shardformer] update microbatchsize

* [shardformer] update assert

* update scheduler

* update scheduler

---------

Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
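The bert finetune example and the 3d-plugin CI test referenced throughout this squashed commit both drive training through the Booster API with the hybrid parallel plugin. As a rough, hedged sketch of that flow (toy model, toy data, and placeholder parallel sizes; argument names follow the public Booster/plugin API of this period, but the real example script should be treated as the reference):

```python
# Hedged sketch of a Booster + HybridParallelPlugin training loop; run under torchrun.
import colossalai
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

colossalai.launch_from_torch(config={})
plugin = HybridParallelPlugin(tp_size=1, pp_size=1, zero_stage=1,
                              precision="fp16")        # sizes here are placeholders
booster = Booster(plugin=plugin)

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
dataloader = plugin.prepare_dataloader(dataset, batch_size=16, shuffle=True)

model, optimizer, criterion, dataloader, _ = booster.boost(
    model, optimizer, criterion=criterion, dataloader=dataloader)

for inputs, labels in dataloader:
    logits = model(inputs.cuda())
    loss = criterion(logits, labels.cuda())
    booster.backward(loss, optimizer)                   # plugin-aware backward
    optimizer.step()
    optimizer.zero_grad()
```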
FoolPlayer and others added 15 commits September 5, 2023 11:52
* add optional overlap for plugin

* remove fixed todo
[shardformer] update shardformer readme
* [zero] add method to update master params

* [zero] update zero plugin

* [plugin] update low level zero plugin
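"Add method to update master params" addresses the case where the working (fp16) weights change outside the optimizer, for example after loading a checkpoint, and the ZeRO optimizer's fp32 master shards have to be refreshed to match. A rough sketch under the assumption of a flat, evenly sliced ZeRO-1 style sharding (not the plugin's actual partitioning):

```python
# Rough sketch: refresh fp32 master shards from updated fp16 working params.
import torch
import torch.distributed as dist

def update_master_params(working_params, master_shards, dp_group) -> None:
    rank = dist.get_rank(dp_group)
    world_size = dist.get_world_size(dp_group)
    for working, master_shard in zip(working_params, master_shards):
        flat = working.detach().float().flatten()
        shard_numel = flat.numel() // world_size        # assumes even divisibility
        start = rank * shard_numel
        master_shard.data.copy_(flat[start:start + shard_numel])
```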
* [legacy] move trainer to legacy

* [doc] update docs related to trainer

* [test] ignore legacy test
* [legacy] move engine to legacy

* [example] fix seq parallel example

* [example] fix seq parallel example

* [test] test gemini plugin hang

* [example] update seq parallel requirements
[shardformer] update hybrid parallel plugin and fix bugs
…HybridParallelPlugin (#4624)

* Enable policy assignment in HybridPlugin and enable llama policy for llamav2

* Remove Policy from Plugin

* revert changes of plugin HybridParallelModule

* revert changes in plugin

* upgrade transformers

* revert transformers version

---------

Co-authored-by: flybird11111 <1829166702@qq.com>
* set optimizer to optional in execute_pipeline

* arrange device and mixed precision in booster init

* fix execute_pipeline in booster.py
* update vit example for hybrid plugin

* reset tp/pp size

* fix dataloader iteration bug

* update optimizer passing in evaluation/add grad_accum

* change criterion

* wrap tqdm

* change grad_accum to grad_checkpoint

* fix pbar
* [devops] fix concurrency group

* [devops] fix compatibility test

* [devops] fix tensornvme install

* [devops] fix tensornvme install

* [devops] fix colossalai install
jamesthesnake merged commit 973f7e4 into jamesthesnake:l on Sep 8, 2023