Merged

L #65

69 commits
b68f7f9
Merge pull request #38 from jamesthesnake/ra
jamesthesnake May 8, 2023
20873a5
Merge pull request #41 from hpcaitech/main
jamesthesnake May 14, 2023
7c9f2ed
[dtensor] polish sharding spec docstring (#3838)
ver217 May 25, 2023
46503c3
Modify torch version requirement to adapt torch 2.0
MaruyamaAya Jun 1, 2023
fb06bd0
Merge pull request #50 from hpcaitech/main
jamesthesnake Jun 1, 2023
60ec33b
Add a new example of Dreambooth training using the booster API
MaruyamaAya Jun 2, 2023
42e3232
roll back
MaruyamaAya Jun 2, 2023
5fc120c
Merge pull request #55 from jamesthesnake/ra
jamesthesnake Jun 2, 2023
25447d4
modify path
MaruyamaAya Jun 5, 2023
3898942
Merge pull request #56 from hpcaitech/main
jameshennessytempus Jun 5, 2023
be6afda
Merge pull request #58 from jamesthesnake/ra
jamesthesnake Jun 5, 2023
ec9bbc0
[devops] improving testmon cache (#3902)
ver217 Jun 6, 2023
c1535cc
[doc] fix docs about booster api usage (#3898)
Fridge003 Jun 6, 2023
0e484e6
[nfc]fix typo colossalai/pipeline tensor nn (#3899)
digger-yu Jun 6, 2023
176010f
update performance evaluation
MaruyamaAya Jun 6, 2023
b56c7f4
update shell file
MaruyamaAya Jun 6, 2023
1c1f71c
fixing insecure hash function
MaruyamaAya Jun 6, 2023
b29e1f0
change directory
MaruyamaAya Jun 6, 2023
d3379f0
fixed model saving bugs
MaruyamaAya Jun 6, 2023
79c9f77
fixed port
MaruyamaAya Jun 6, 2023
b4437e8
fixed port
MaruyamaAya Jun 6, 2023
41fb723
[devops] hotfix CI about testmon cache (#3910)
ver217 Jun 6, 2023
b5f0566
[chat] add distributed PPO trainer (#3740)
ver217 Jun 7, 2023
4fc8bc6
modify file path
MaruyamaAya Jun 7, 2023
9c88b6c
[lazy] fix compatibility problem on torch 1.13 (#3911)
ver217 Jun 7, 2023
c622bb3
Merge pull request #3915 from FrankLeeeee/update/develop
FrankLeeeee Jun 7, 2023
d51e83d
Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop
FrankLeeeee Jun 7, 2023
c25d421
[devops] hotfix testmon cache clean logic (#3917)
ver217 Jun 7, 2023
5e2132d
[workflow] added docker latest tag for release (#3920)
FrankLeeeee Jun 7, 2023
a55fb00
[booster] update bert example, using booster api (#3885)
wukong1992 Jun 7, 2023
b306cec
[example] Modify palm example with the new booster API (#3913)
MaruyamaAya Jun 7, 2023
a9d1cad
fix typo with colossalai/trainer utils zero (#3908)
digger-yu Jun 7, 2023
c94a335
modify shell for check
MaruyamaAya Jun 7, 2023
12c90db
[doc] add lazy init tutorial (#3922)
ver217 Jun 7, 2023
ea79888
Merge pull request #60 from hpcaitech/main
jamesthesnake Jun 7, 2023
eb41632
Merge branch 'l' into co
jamesthesnake Jun 7, 2023
f7121c5
Merge pull request #61 from jamesthesnake/co
jamesthesnake Jun 7, 2023
de0d7df
[nfc] fix typo colossalai/zero (#3923)
digger-yu Jun 7, 2023
9166988
[devops] update torch version in compability test (#3919)
ver217 Jun 8, 2023
eb39154
[dtensor] updated api and doc (#3845)
FrankLeeeee Jun 8, 2023
cf4792c
modify shell for check
MaruyamaAya Jun 8, 2023
e417dd0
[example] update opt example using booster api (#3918)
Fridge003 Jun 8, 2023
039854b
modify shell for check
MaruyamaAya Jun 8, 2023
49567d5
modify shell for check
MaruyamaAya Jun 8, 2023
730a092
modify shell for check
MaruyamaAya Jun 8, 2023
407aa48
fix typo examples/community/roberta (#3925)
digger-yu Jun 8, 2023
a98e16e
Merge pull request #3926 from hpcaitech/feature/dtensor
FrankLeeeee Jun 8, 2023
9b5e7ce
modify shell for check
MaruyamaAya Jun 8, 2023
6a69b44
[shardformer] init shardformer code structure (#3731)
FoolPlayer May 22, 2023
58f6432
[shardformer]: Feature/shardformer, add some docstring and readme (#3…
FoolPlayer May 24, 2023
bc19024
[shardformer] updated readme (#3827)
FrankLeeeee May 24, 2023
537a52b
[shardformer] refactored the user api (#3828)
FrankLeeeee May 24, 2023
997544c
[shardformer] update readme with modules implement doc (#3834)
FoolPlayer May 24, 2023
21a3915
[shardformer] add Dropout layer support different dropout pattern (#3…
FoolPlayer Jun 1, 2023
6370a93
update README (#3909)
FoolPlayer Jun 6, 2023
ef15377
[shardformer] add gpt2 policy and modify shard and slicer to support …
FoolPlayer Jun 7, 2023
33eef71
fix typo examples and docs (#3932)
digger-yu Jun 8, 2023
21c4c0b
support UniEval and add CHRF metric (#3924)
chengeharrison Jun 8, 2023
e277534
Merge pull request #3905 from MaruyamaAya/dreambooth
MaruyamaAya Jun 9, 2023
24651fd
Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer
FoolPlayer Jun 9, 2023
ddcf58c
Revert "[sync] sync feature/shardformer with develop"
FrankLeeeee Jun 9, 2023
bd2c7c3
Merge pull request #3942 from hpcaitech/revert-3931-sync/develop-to-s…
FoolPlayer Jun 9, 2023
e61ffc7
fix typo tests/ (#3936)
digger-yu Jun 9, 2023
1aadeed
fix typo .github/workflows/scripts/ (#3946)
digger-yu Jun 9, 2023
b3ab7fb
[example] update ViT example using booster api (#3940)
Jun 12, 2023
eabae7a
Merge pull request #62 from hpcaitech/main
jamesthesnake Jun 13, 2023
9d02590
[chat] refactor actor class (#3968)
cwher Jun 13, 2023
2925f47
[evaluate] support gpt evaluation with reference (#3972)
chengeharrison Jun 13, 2023
49246fb
Merge pull request #64 from hpcaitech/main
jamesthesnake Jun 14, 2023
4 changes: 2 additions & 2 deletions .compatibility
@@ -1,3 +1,3 @@
1.12.0-11.3.0
1.11.0-11.3.0
1.10.1-11.3.0
1.13.0-11.6.0
2.0.0-11.7.0
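
Each `.compatibility` entry appears to pair a PyTorch version with a CUDA version, in the same form as the `hpcaitech/pytorch-cuda` image tags used by the workflows in this PR. A minimal sketch of splitting such entries and forming an image reference follows. The helper names `parse_entry` and `image_for` are hypothetical, and the entry list assumes the first line of the hunk is unchanged context.

```python
from typing import List, Tuple

# Assumed file content after this change: one context line plus the two added lines.
COMPATIBILITY: List[str] = ["1.12.0-11.3.0", "1.13.0-11.6.0", "2.0.0-11.7.0"]


def parse_entry(entry: str) -> Tuple[str, str]:
    """Split '<torch>-<cuda>' into its two version strings (assumed format)."""
    torch_version, cuda_version = entry.strip().split("-", 1)
    return torch_version, cuda_version


def image_for(entry: str) -> str:
    """Form an image reference in the style seen in the workflows of this PR."""
    return f"hpcaitech/pytorch-cuda:{entry.strip()}"


if __name__ == "__main__":
    for line in COMPATIBILITY:
        print(parse_entry(line), image_for(line))
```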
10 changes: 9 additions & 1 deletion .github/workflows/README.md
@@ -43,10 +43,18 @@ I will provide the details of each workflow below.

| Workflow Name | File name | Description |
| ---------------------- | -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Build on PR` | `build_on_pr.yml` | This workflow is triggered when a PR changes essential files. It will run all the unit tests in the repository with 4 GPUs. |
| `Build on PR` | `build_on_pr.yml` | This workflow is triggered when a PR changes essential files and a branch is created/deleted. It will run all the unit tests in the repository with 4 GPUs. |
| `Build on Schedule` | `build_on_schedule.yml` | This workflow will run the unit tests every day with 8 GPUs. The result is sent to Lark. |
| `Report test coverage` | `report_test_coverage.yml` | This workflow will put up a comment to report the test coverage results when `Build` is done. |

To reduce the average time of the unit test on PR, `Build on PR` workflow manages testmon cache.

1. When creating a new branch, it copies `cache/main/.testmondata*` to `cache/<branch>/`.
2. When creating a new PR or changing the base branch of a PR, it copies `cache/<base_ref>/.testmondata*` to `cache/_pull/<pr_number>/`.
3. When running unit tests for each PR, it restores testmon cache from `cache/_pull/<pr_number>/`. After the test, it stores the cache back to `cache/_pull/<pr_number>/`.
4. When a PR is closed, if it's merged, it copies `cache/_pull/<pr_number>/.testmondata*` to `cache/<base_ref>/`. Otherwise, it just removes `cache/_pull/<pr_number>`.
5. When a branch is deleted, it removes `cache/<ref>`.
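
The five steps above can be read as a mapping from repository events to cache operations. The sketch below models that mapping in Python as a reading aid. The event names, the `plan` function, and the relative `cache` root are all hypothetical, and the real workflow performs these steps with shell commands on the runner.

```python
from pathlib import PurePosixPath
from typing import List, Tuple

CACHE_ROOT = PurePosixPath("cache")  # placeholder for the runner's cache directory

Step = Tuple[str, str, str]  # (operation, source, destination); "" when unused


def branch_dir(ref: str) -> str:
    return str(CACHE_ROOT / ref)


def pr_dir(pr_number: int) -> str:
    return str(CACHE_ROOT / "_pull" / str(pr_number))


def plan(event: str, ref: str = "", base_ref: str = "",
         pr_number: int = 0, merged: bool = False) -> List[Step]:
    """Return the cache operations for one event (event names are made up)."""
    if event == "branch_created":
        return [("copy", branch_dir("main"), branch_dir(ref))]
    if event in ("pr_opened", "pr_base_changed"):
        return [("copy", branch_dir(base_ref), pr_dir(pr_number))]
    if event == "pr_tested":
        return [("restore", pr_dir(pr_number), "workspace"),
                ("store", "workspace", pr_dir(pr_number))]
    if event == "pr_closed":
        steps: List[Step] = []
        if merged:
            steps.append(("copy", pr_dir(pr_number), branch_dir(base_ref)))
        # The workflow's "Remove testmon cache" step runs whether or not the PR merged.
        steps.append(("remove", pr_dir(pr_number), ""))
        return steps
    if event == "branch_deleted":
        return [("remove", branch_dir(ref), "")]
    raise ValueError(f"unknown event: {event}")
```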

### Example Test

| Workflow Name | File name | Description |
117 changes: 112 additions & 5 deletions .github/workflows/build_on_pr.yml
@@ -2,7 +2,7 @@ name: Build on PR

on:
pull_request:
types: [synchronize, opened, reopened]
types: [synchronize, opened, reopened, ready_for_review, closed, edited]
branches:
- "main"
- "develop"
@@ -18,11 +18,63 @@ on:
- "!tests/**.md" # ignore doc change
- "pytest.ini" # test config change
- "setup.py" # install command change
create:
delete:

jobs:
prepare_cache:
name: Prepare testmon cache
if: |
github.event_name == 'create' &&
github.event.ref_type == 'branch' &&
github.event.repository.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --rm
timeout-minutes: 5
defaults:
run:
shell: bash
steps:
- name: Copy testmon cache
run: | # branch name may contain slash, we need to replace it with space
export REF_BRANCH=$(echo ${{ github.event.ref }} | sed "s/\// /")
if [ -d /github/home/testmon_cache/${MAIN_BRANCH} ]; then
[ ! -z "$(ls -A /github/home/testmon_cache/${MAIN_BRANCH})" ] && cp -p -r /github/home/testmon_cache/${MAIN_BRANCH} "/github/home/testmon_cache/${REF_BRANCH}"
fi
env:
MAIN_BRANCH: ${{ github.event.master_branch }}

prepare_cache_for_pr:
name: Prepare testmon cache for PR
if: |
github.event_name == 'pull_request' &&
(github.event.action == 'opened' || github.event.action == 'reopened' || (github.event.action == 'edited' && github.event.changes.base != null)) &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --rm
timeout-minutes: 5
defaults:
run:
shell: bash
steps:
- name: Copy testmon cache
run: | # branch name may contain slash, we need to replace it with space
export BASE=$(echo ${{ github.event.pull_request.base.ref }} | sed "s/\// /")
if [ -d "/github/home/testmon_cache/${BASE}" ]; then
[ ! -z "$(ls -A "/github/home/testmon_cache/${BASE}")" ] && mkdir -p /github/home/testmon_cache/_pull && cp -p -r "/github/home/testmon_cache/${BASE}" /github/home/testmon_cache/_pull/${PR_NUMBER}
fi
env:
PR_NUMBER: ${{ github.event.number }}

detect:
name: Detect file change
if: |
github.event_name == 'pull_request' &&
(github.event.action == 'synchronize' || github.event.action == 'opened' || github.event.action == 'reopened' || github.event.action == 'ready_for_review') &&
github.event.pull_request.draft == false &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
outputs:
@@ -135,9 +187,11 @@ jobs:

- name: Restore Testmon Cache
run: |
if [ -d /github/home/testmon_cache ]; then
[ ! -z "$(ls -A /github/home/testmon_cache)" ] && cp -p -r /github/home/testmon_cache/.testmondata* /__w/ColossalAI/ColossalAI/
if [ -d /github/home/testmon_cache/_pull/${PR_NUMBER} ]; then
[ ! -z "$(ls -A /github/home/testmon_cache/_pull/${PR_NUMBER})" ] && cp -p -r /github/home/testmon_cache/_pull/${PR_NUMBER}/.testmondata* /__w/ColossalAI/ColossalAI/
fi
env:
PR_NUMBER: ${{ github.event.number }}

- name: Execute Unit Testing
run: |
@@ -149,8 +203,10 @@ jobs:

- name: Store Testmon Cache
run: |
[ -d /github/home/testmon_cache ] || mkdir /github/home/testmon_cache
cp -p -r /__w/ColossalAI/ColossalAI/.testmondata* /github/home/testmon_cache/
mkdir -p /github/home/testmon_cache/_pull/${PR_NUMBER}
cp -p -r /__w/ColossalAI/ColossalAI/.testmondata* /github/home/testmon_cache/_pull/${PR_NUMBER}/
env:
PR_NUMBER: ${{ github.event.number }}

- name: Collate artifact
env:
@@ -188,3 +244,54 @@ jobs:
with:
name: report
path: report/

store_cache:
name: Store testmon cache for PR
if: |
github.event_name == 'pull_request' &&
github.event.action == 'closed' &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --rm
timeout-minutes: 5
defaults:
run:
shell: bash
steps:
- name: Store testmon cache if possible
if: github.event.pull_request.merged == true
run: | # branch name may contain slash, we need to replace it with space
export BASE=$(echo ${{ github.event.pull_request.base.ref }} | sed "s/\// /")
if [ -d /github/home/testmon_cache/_pull/${PR_NUMBER} ]; then
[ ! -z "$(ls -A /github/home/testmon_cache/_pull/${PR_NUMBER})" ] && cp -p -r /github/home/testmon_cache/_pull/${PR_NUMBER}/.testmondata* "/github/home/testmon_cache/${BASE}/"
fi
env:
PR_NUMBER: ${{ github.event.pull_request.number }}

- name: Remove testmon cache
run: |
rm -rf /github/home/testmon_cache/_pull/${PR_NUMBER}
env:
PR_NUMBER: ${{ github.event.pull_request.number }}

remove_cache:
name: Remove testmon cache
if: |
github.event_name == 'delete' &&
github.event.ref_type == 'branch' &&
github.event.repository.full_name == 'hpcaitech/ColossalAI'
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --rm
timeout-minutes: 5
defaults:
run:
shell: bash
steps:
- name: Remove testmon cache
run: | # branch name may contain slash, we need to replace it with space
export BASE=$(echo ${{ github.event.ref }} | sed "s/\// /")
rm -rf "/github/home/testmon_cache/${BASE}"
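
The inline comments in this workflow say that slashes in a branch name are replaced with a space. A sed `s` command without the `g` flag substitutes only the first match on each line, so a ref with more than one slash would keep the later ones. The sketch below mimics both behaviours in Python for comparison. The function names are hypothetical, and the global variant is a possible alternative rather than what the workflow does.

```python
def replace_first_slash(ref: str) -> str:
    # Mimics sed "s/\// /": only the first '/' becomes a space.
    return ref.replace("/", " ", 1)


def replace_all_slashes(ref: str) -> str:
    # Mimics sed "s/\// /g": every '/' becomes a space.
    return ref.replace("/", " ")


if __name__ == "__main__":
    ref = "feature/a/b"  # illustrative branch name
    print(repr(replace_first_slash(ref)))
    print(repr(replace_all_slashes(ref)))
```

With the single-substitution form, the leftover slash means the cache path for such a ref would still contain a nested directory.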
4 changes: 4 additions & 0 deletions .github/workflows/release_docker_after_publish.yml
@@ -23,8 +23,11 @@ jobs:
run: |
version=$(cat version.txt)
tag=hpcaitech/colossalai:$version
latest=hpcaitech/colossalai:latest
docker build --build-arg http_proxy=http://172.17.0.1:7890 --build-arg https_proxy=http://172.17.0.1:7890 --build-arg VERSION=v${version} -t $tag ./docker
docker tag $tag $latest
echo "tag=${tag}" >> $GITHUB_OUTPUT
echo "latest=${latest}" >> $GITHUB_OUTPUT

- name: Log in to Docker Hub
uses: docker/login-action@f054a8b539a109f9f41c372932f1ae047eff08c9
@@ -36,6 +39,7 @@ jobs:
id: docker-push
run: |
docker push ${{ steps.build.outputs.tag }}
docker push ${{ steps.build.outputs.latest }}

notify:
name: Notify Lark via webhook
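
The build step in this hunk derives two image references from `version.txt` (a versioned tag and `latest`) and writes both to `$GITHUB_OUTPUT` so the push step can read them. A minimal Python sketch of that derivation follows. The function names are hypothetical, the version string in the example is illustrative only, and the actual workflow does this in shell.

```python
from typing import Dict, List

REPOSITORY = "hpcaitech/colossalai"


def release_tags(version: str) -> Dict[str, str]:
    """Return the versioned and 'latest' image references for one release."""
    version = version.strip()  # file content may end with a newline
    return {
        "tag": f"{REPOSITORY}:{version}",
        "latest": f"{REPOSITORY}:latest",
    }


def github_output_lines(tags: Dict[str, str]) -> List[str]:
    """Format the references as name=value lines, as appended to $GITHUB_OUTPUT."""
    return [f"{name}={value}" for name, value in tags.items()]


if __name__ == "__main__":
    for line in github_output_lines(release_tags("0.0.0\n")):
        print(line)
```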
2 changes: 1 addition & 1 deletion .github/workflows/run_chatgpt_examples.yml
@@ -20,7 +20,7 @@ jobs:
runs-on: [self-hosted, gpu]
container:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --gpus all --rm -v /data/scratch/github_actions/chat:/data/scratch/github_actions/chat
options: --gpus all --rm -v /data/scratch/github_actions/chat:/data/scratch/github_actions/chat --shm-size=10.24gb
timeout-minutes: 30
defaults:
run:
@@ -38,7 +38,7 @@ def plot_bar_chart(x: List[Any], y: List[Any], xlabel: str, ylabel: str, title:

def get_issue_pull_request_comments(github_token: str, since: str) -> Dict[str, int]:
"""
Retrive the issue/PR comments made by our members in the last 7 days.
Retrieve the issue/PR comments made by our members in the last 7 days.

Args:
github_token (str): GitHub access token for API calls
@@ -89,7 +89,7 @@ def get_issue_pull_request_comments(github_token: str, since: str) -> Dict[str,

def get_discussion_comments(github_token, since) -> Dict[str, int]:
"""
Retrive the discussion comments made by our members in the last 7 days.
Retrieve the discussion comments made by our members in the last 7 days.
This is only available via the GitHub GraphQL API.

Args:
@@ -194,7 +194,7 @@ def _call_graphql_api(query):

discussion_updated_at = datetime.strptime(discussion['updatedAt'], "%Y-%m-%dT%H:%M:%SZ")
# check if the updatedAt is within the last 7 days
# if yes, add it to dicussion_numbers
# if yes, add it to discussion_numbers
if discussion_updated_at > since:
if discussion['authorAssociation'] != 'MEMBER':
discussion_numbers.append(discussion['number'])
@@ -207,14 +207,14 @@ def _call_graphql_api(query):
# update cursor
cursor = edges[-1]['cursor']

# get the dicussion comments and replies made by our member
# get the discussion comments and replies made by our member
user_engagement_count = {}
for dicussion_number in discussion_numbers:
for discussion_number in discussion_numbers:
cursor = None
num_per_request = 10

while True:
query = _generate_comment_reply_count_for_discussion(dicussion_number, num_per_request, cursor)
query = _generate_comment_reply_count_for_discussion(discussion_number, num_per_request, cursor)
data = _call_graphql_api(query)

# get the comments
@@ -249,7 +249,7 @@ def _call_graphql_api(query):
reply = reply_edge['node']
if reply['authorAssociation'] == 'MEMBER':
# check if the updatedAt is within the last 7 days
# if yes, add it to dicussion_numbers
# if yes, add it to discussion_numbers
reply_updated_at = datetime.strptime(reply['updatedAt'], "%Y-%m-%dT%H:%M:%SZ")
if reply_updated_at > since:
member_name = reply['author']['login']
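
The hunk above checks each reply's `authorAssociation` and compares its parsed `updatedAt` timestamp with `since`. The rest of the loop is not shown in this diff, so the following is only a sketch of that filtering pattern. The function `count_member_replies`, the per-login counting, and the sample data are assumptions, not the script's actual code.

```python
from datetime import datetime
from typing import Dict, List

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # same format string as in the script


def count_member_replies(replies: List[dict], since: datetime) -> Dict[str, int]:
    """Count replies per member login that were updated after `since`."""
    counts: Dict[str, int] = {}
    for reply in replies:
        if reply["authorAssociation"] != "MEMBER":
            continue  # only members' replies are counted
        updated_at = datetime.strptime(reply["updatedAt"], TIMESTAMP_FORMAT)
        if updated_at > since:
            name = reply["author"]["login"]
            counts[name] = counts.get(name, 0) + 1  # assumed counting scheme
    return counts


if __name__ == "__main__":
    sample = [  # made-up data for illustration
        {"authorAssociation": "MEMBER", "updatedAt": "2023-06-05T10:00:00Z",
         "author": {"login": "alice"}},
        {"authorAssociation": "NONE", "updatedAt": "2023-06-05T11:00:00Z",
         "author": {"login": "bob"}},
    ]
    print(count_member_replies(sample, datetime(2023, 6, 1)))
```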