tra by jamesthesnake · Pull Request #183 · jamesthesnake/ColossalAI

jamesthesnake · 2023-10-15T00:55:51Z

📌 Checklist before creating the PR

I have created an issue for this PR for traceability
The title follows the standard format: [doc/gemini/tensor/...]: A concise description
I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

💥 Checklist before requesting a review

I have linked my PR to an issue (instruction)
My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
I have performed a self-review of my code
I have added thorough tests.
I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

🌝 Yes, I do.
🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

…case) (#4771)
* add Colossal-Inference serving example w/ TorchServe
* add dockerfile
* fix dockerfile
* fix dockerfile: fix commit hash, install curl
* refactor file structure
* revise readme
* trivial
* trivial: dockerfile format
* clean dir; revise readme
* fix comments: fix imports and configs
* fix formats
* remove unused requirements

* fix imports
* add ray-serve with Colossal-Infer tp
* trivial: send requests script
* add README
* fix worker port
* fix readme
* use app builder and autoscaling
* trivial: input args
* clean code; revise readme
* testci (skip example test)
* use auto model/tokenizer
* revert imports fix (fixed in other PRs)

…#4866)
* [doc]update advanced tutorials, training gpt with hybrid parallelism
* [doc]update advanced tutorials, training gpt with hybrid parallelism
* update vit tutorials
* update vit tutorials
* update vit tutorials
* update vit tutorials
* update en/train_vit_with_hybrid_parallel.py
* fix
* resolve comments
* fix

* [pipeline inference] pipeline inference (#4492)
* add pp stage manager as circle stage
* fix a bug when create process group
* add ppinfer basic framework
* add micro batch manager and support kvcache-pp gpt2 fwd
* add generate schedule
* use mb size to control mb number
* support generate with kv cache
* add output, remove unused code
* add test
* reuse shardformer to build model
* refactor some code and use the same attribute name of hf
* fix review and add test for generation
* remove unused file
* fix CI
* add cache clear
* fix code error
* fix typo
* [Pipeline inference] Modify to tieweight (#4599)
* add pp stage manager as circle stage
* fix a bug when create process group
* add ppinfer basic framework
* add micro batch manager and support kvcache-pp gpt2 fwd
* add generate schedule
* use mb size to control mb number
* support generate with kv cache
* add output, remove unused code
* add test
* reuse shardformer to build model
* refactor some code and use the same attribute name of hf
* fix review and add test for generation
* remove unused file
* modify the way of saving newtokens
* modify to tieweight
* modify test
* remove unused file
* solve review
* add docstring
* [Pipeline inference] support llama pipeline inference (#4647)
* support llama pipeline inference
* remove tie weight operation
* [pipeline inference] Fix the blocking of communication when ppsize is 2 (#4708)
* add benchmark verbose
* fix export tokens
* fix benchmark verbose
* add P2POp style to do p2p communication
* modify schedule as p2p type when ppsize is 2
* remove unused code and add docstring
* [Pipeline inference] Refactor code, add docsting, fix bug (#4790)
* add benchmark script
* update argparse
* fix fp16 load
* refactor code style
* add docstring
* polish code
* fix test bug
* [Pipeline inference] Add pipeline inference docs (#4817)
* add readme doc
* add a ico
* Add performance
* update table of contents
* refactor code (#4873)

* [gemini] support no reuse fp16 chunk
* [gemini] support no master weight for optim
* [gemini] support no master weight for gemini ddp
* [test] update gemini tests
* [test] update gemini tests
* [plugin] update gemini plugin
* [test] fix gemini checkpointio test
* [test] fix gemini checkpoint io

…4816)
* [feature] support no master weights for low level zero plugin
* [feature] support no master weights for low level zero plugin, remove data copy when no master weights
* remove data copy and typecasting when no master weights
* not load weights to cpu when using no master weights
* fix grad: use fp16 grad when no master weights
* only do not update working param when no master weights
* fix: only do not update working param when no master weights
* fix: passing params in dict format in hybrid plugin
* fix: remove extra params (tp_process_group) in hybrid_parallel_plugin

Xu-Kai and others added 30 commits September 28, 2023 13:47

add autotune (#4822)

c3bef20

update Colossal (#4832)

ed06731

[inference]fix import bug and delete down useless init (#4830)

013a4be

* fix import bug and release useless init
* fix
* fix
* fix

[infer] fix test bug (#4838)

d1fcc0f

* fix test bug
* delete useless code
* fix typo

[test] modify model supporting part of low_level_zero plugin (includi…

db40e08

…ng correspoding docs)

fix: typo in comment of low_level_zero plugin

c97a352

Merge pull request #4858 from Shawlleyw/main

81ee91f

[doc]: typo in document of booster low_level_zero plugin

Merge pull request #4856 from KKZ20/test/model_support_for_low_level_…

ad23460

…zero [test] remove the redundant code of model output transformation in torchrec

[checkpointio] hotfix torch 2.0 compatibility (#4824)

cb3a25a

polish code for gptq (#4793)

eef96e0

[NFC] polish colossalai/inference/quant/gptq/cai_gptq/__init__.py cod…

07ed155

…e style (#4792)

[NFC] polish code style (#4799)

cd6a962

[nfc] fix minor typo in README (#4846)

8aed02b

Update modelscope link in README.md

3043d5d

add modelscope link

Update main README.md

d6c4b9b

add modelscope model link

Update README.md

afe10a8

Update README.md

652adc2

fix test llama (#4884)

fdec650

[doc] add reminder for issue encountered with hybrid adam

1dcaf24

[hotfix] fix bug in sequence parallel test (#4887)

ffd9a3c

Merge pull request #4889 from ppt0011/main

c1fab95

[doc] add reminder for issue encountered with hybrid adam

[feature] Add clip_grad_norm for hybrid_parallel_plugin (#4837)

83b52c5

* Add clip_grad_norm for hibrid_parallel_plugin
* polish code
* add unittests
* Move tp to a higher-level optimizer interface.
* bug fix
* polish code

[hotfix] fix lr scheduler bug in torch 2.0 (#4864)

39f2582

[inference] add llama2 support (#4898)

77a9328

* add llama2 support
* fix multi group bug

jamesthesnake merged commit 786d11e into jamesthesnake:co Oct 15, 2023

jamesthesnake merged 30 commits into jamesthesnake:co from hpcaitech:main
