Merged
Changes from all commits (46 commits)
46503c3
Modify torch version requirement to adapt torch 2.0
MaruyamaAya Jun 1, 2023
60ec33b
Add a new example of Dreambooth training using the booster API
MaruyamaAya Jun 2, 2023
42e3232
roll back
MaruyamaAya Jun 2, 2023
25447d4
modify path
MaruyamaAya Jun 5, 2023
176010f
update performance evaluation
MaruyamaAya Jun 6, 2023
b56c7f4
update shell file
MaruyamaAya Jun 6, 2023
1c1f71c
fixing insecure hash function
MaruyamaAya Jun 6, 2023
b29e1f0
change directory
MaruyamaAya Jun 6, 2023
d3379f0
fixed model saving bugs
MaruyamaAya Jun 6, 2023
79c9f77
fixed port
MaruyamaAya Jun 6, 2023
b4437e8
fixed port
MaruyamaAya Jun 6, 2023
4fc8bc6
modify file path
MaruyamaAya Jun 7, 2023
c25d421
[devops] hotfix testmon cache clean logic (#3917)
ver217 Jun 7, 2023
5e2132d
[workflow] added docker latest tag for release (#3920)
FrankLeeeee Jun 7, 2023
a55fb00
[booster] update bert example, using booster api (#3885)
wukong1992 Jun 7, 2023
b306cec
[example] Modify palm example with the new booster API (#3913)
MaruyamaAya Jun 7, 2023
a9d1cad
fix typo with colossalai/trainer utils zero (#3908)
digger-yu Jun 7, 2023
c94a335
modify shell for check
MaruyamaAya Jun 7, 2023
de0d7df
[nfc] fix typo colossalai/zero (#3923)
digger-yu Jun 7, 2023
9166988
[devops] update torch version in compability test (#3919)
ver217 Jun 8, 2023
cf4792c
modify shell for check
MaruyamaAya Jun 8, 2023
e417dd0
[example] update opt example using booster api (#3918)
Fridge003 Jun 8, 2023
039854b
modify shell for check
MaruyamaAya Jun 8, 2023
49567d5
modify shell for check
MaruyamaAya Jun 8, 2023
730a092
modify shell for check
MaruyamaAya Jun 8, 2023
407aa48
fix typo examples/community/roberta (#3925)
digger-yu Jun 8, 2023
9b5e7ce
modify shell for check
MaruyamaAya Jun 8, 2023
6a69b44
[shardformer] init shardformer code structure (#3731)
FoolPlayer May 22, 2023
58f6432
[shardformer]: Feature/shardformer, add some docstring and readme (#3…
FoolPlayer May 24, 2023
bc19024
[shardformer] updated readme (#3827)
FrankLeeeee May 24, 2023
537a52b
[shardformer] refactored the user api (#3828)
FrankLeeeee May 24, 2023
997544c
[shardformer] update readme with modules implement doc (#3834)
FoolPlayer May 24, 2023
21a3915
[shardformer] add Dropout layer support different dropout pattern (#3…
FoolPlayer Jun 1, 2023
6370a93
update README (#3909)
FoolPlayer Jun 6, 2023
ef15377
[shardformer] add gpt2 policy and modify shard and slicer to support …
FoolPlayer Jun 7, 2023
33eef71
fix typo examples and docs (#3932)
digger-yu Jun 8, 2023
21c4c0b
support UniEval and add CHRF metric (#3924)
chengeharrison Jun 8, 2023
e277534
Merge pull request #3905 from MaruyamaAya/dreambooth
MaruyamaAya Jun 9, 2023
24651fd
Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer
FoolPlayer Jun 9, 2023
ddcf58c
Revert "[sync] sync feature/shardformer with develop"
FrankLeeeee Jun 9, 2023
bd2c7c3
Merge pull request #3942 from hpcaitech/revert-3931-sync/develop-to-s…
FoolPlayer Jun 9, 2023
e61ffc7
fix typo tests/ (#3936)
digger-yu Jun 9, 2023
1aadeed
fix typo .github/workflows/scripts/ (#3946)
digger-yu Jun 9, 2023
b3ab7fb
[example] update ViT example using booster api (#3940)
Jun 12, 2023
71fe527
[gemini] fixed the gemini checkpoint io (#3934)
FrankLeeeee Jun 9, 2023
6718a2f
[workflow] cancel duplicated workflow jobs (#3960)
FrankLeeeee Jun 12, 2023
4 changes: 2 additions & 2 deletions .compatibility
@@ -1,3 +1,3 @@
1.12.0-11.3.0
1.11.0-11.3.0
1.10.1-11.3.0
1.13.0-11.6.0
2.0.0-11.7.0
10 changes: 9 additions & 1 deletion .github/workflows/build_on_pr.yml
@@ -60,6 +60,9 @@ jobs:
defaults:
run:
shell: bash
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
steps:
- name: Copy testmon cache
run: | # branch name may contain slash, we need to replace it with space
@@ -83,6 +86,9 @@
changedLibraryFiles: ${{ steps.find-lib-change.outputs.all_changed_files }}
anyLibraryFileChanged: ${{ steps.find-lib-change.outputs.any_changed }}
runs-on: ubuntu-latest
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
steps:
- uses: actions/checkout@v2
with:
@@ -140,6 +146,9 @@ jobs:
defaults:
run:
shell: bash
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
steps:
- name: Checkout TensorNVMe
uses: actions/checkout@v2
@@ -271,7 +280,6 @@ jobs:
PR_NUMBER: ${{ github.event.pull_request.number }}

- name: Remove testmon cache
if: github.event.pull_request.merged != true
run: |
rm -rf /github/home/testmon_cache/_pull/${PR_NUMBER}
env:
6 changes: 6 additions & 0 deletions .github/workflows/compatiblity_test_on_pr.yml
@@ -12,6 +12,9 @@ jobs:
runs-on: ubuntu-latest
outputs:
matrix: ${{ steps.set-matrix.outputs.matrix }}
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
steps:
- uses: actions/checkout@v3
- id: set-matrix
@@ -40,6 +43,9 @@
image: ${{ matrix.container }}
options: --gpus all --rm -v /data/scratch/cifar-10:/data/scratch/cifar-10
timeout-minutes: 120
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
steps:
- name: Install dependencies
run: |
6 changes: 6 additions & 0 deletions .github/workflows/doc_check_on_pr.yml
@@ -16,6 +16,9 @@ jobs:
github.event.pull_request.draft == false &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
runs-on: ubuntu-latest
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
steps:
- uses: actions/checkout@v2

@@ -31,6 +34,9 @@
github.event.pull_request.draft == false &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
runs-on: ubuntu-latest
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
steps:
- uses: actions/checkout@v2
with:
6 changes: 6 additions & 0 deletions .github/workflows/doc_test_on_pr.yml
@@ -19,6 +19,9 @@ jobs:
outputs:
any_changed: ${{ steps.changed-files.outputs.any_changed }}
changed_files: ${{ steps.changed-files.outputs.all_changed_files }}
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
name: Detect changed example files
steps:
- uses: actions/checkout@v3
@@ -59,6 +62,9 @@ jobs:
defaults:
run:
shell: bash
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
steps:
- name: Checkout ColossalAI-Documentation
uses: actions/checkout@v2
6 changes: 6 additions & 0 deletions .github/workflows/example_check_on_pr.yml
@@ -20,6 +20,9 @@ jobs:
matrix: ${{ steps.setup-matrix.outputs.matrix }}
anyChanged: ${{ steps.setup-matrix.outputs.anyChanged }}
name: Detect changed example files
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
steps:
- uses: actions/checkout@v3
with:
@@ -77,6 +80,9 @@ jobs:
image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
options: --gpus all --rm -v /data/scratch/examples-data:/data/
timeout-minutes: 10
concurrency:
group: ${{ github.head_ref }}
cancel-in-progress: false
steps:
- uses: actions/checkout@v3

4 changes: 4 additions & 0 deletions .github/workflows/release_docker_after_publish.yml
@@ -23,8 +23,11 @@ jobs:
run: |
version=$(cat version.txt)
tag=hpcaitech/colossalai:$version
latest=hpcaitech/colossalai:latest
docker build --build-arg http_proxy=http://172.17.0.1:7890 --build-arg https_proxy=http://172.17.0.1:7890 --build-arg VERSION=v${version} -t $tag ./docker
docker tag $tag $latest
echo "tag=${tag}" >> $GITHUB_OUTPUT
echo "latest=${latest}" >> $GITHUB_OUTPUT

- name: Log in to Docker Hub
uses: docker/login-action@f054a8b539a109f9f41c372932f1ae047eff08c9
@@ -36,6 +39,7 @@
id: docker-push
run: |
docker push ${{ steps.build.outputs.tag }}
docker push ${{ steps.build.outputs.latest }}

notify:
name: Notify Lark via webhook
@@ -38,7 +38,7 @@ def plot_bar_chart(x: List[Any], y: List[Any], xlabel: str, ylabel: str, title:

def get_issue_pull_request_comments(github_token: str, since: str) -> Dict[str, int]:
"""
Retrive the issue/PR comments made by our members in the last 7 days.
Retrieve the issue/PR comments made by our members in the last 7 days.

Args:
github_token (str): GitHub access token for API calls
@@ -89,7 +89,7 @@ def get_issue_pull_request_comments(github_token: str, since: str) -> Dict[str,

def get_discussion_comments(github_token, since) -> Dict[str, int]:
"""
Retrive the discussion comments made by our members in the last 7 days.
Retrieve the discussion comments made by our members in the last 7 days.
This is only available via the GitHub GraphQL API.

Args:
@@ -194,7 +194,7 @@ def _call_graphql_api(query):

discussion_updated_at = datetime.strptime(discussion['updatedAt'], "%Y-%m-%dT%H:%M:%SZ")
# check if the updatedAt is within the last 7 days
# if yes, add it to dicussion_numbers
# if yes, add it to discussion_numbers
if discussion_updated_at > since:
if discussion['authorAssociation'] != 'MEMBER':
discussion_numbers.append(discussion['number'])
@@ -207,14 +207,14 @@ def _call_graphql_api(query):
# update cursor
cursor = edges[-1]['cursor']

# get the dicussion comments and replies made by our member
# get the discussion comments and replies made by our member
user_engagement_count = {}
for dicussion_number in discussion_numbers:
for discussion_number in discussion_numbers:
cursor = None
num_per_request = 10

while True:
query = _generate_comment_reply_count_for_discussion(dicussion_number, num_per_request, cursor)
query = _generate_comment_reply_count_for_discussion(discussion_number, num_per_request, cursor)
data = _call_graphql_api(query)

# get the comments
@@ -249,7 +249,7 @@ def _call_graphql_api(query):
reply = reply_edge['node']
if reply['authorAssociation'] == 'MEMBER':
# check if the updatedAt is within the last 7 days
# if yes, add it to dicussion_numbers
# if yes, add it to discussion_numbers
reply_updated_at = datetime.strptime(reply['updatedAt'], "%Y-%m-%dT%H:%M:%SZ")
if reply_updated_at > since:
member_name = reply['author']['login']
105 changes: 81 additions & 24 deletions applications/Chat/evaluate/README.md
@@ -12,12 +12,13 @@ pip install -r requirements.txt

## Evaluation Pipeline

The whole evaluation pipeline consists of two methods:
The whole evaluation pipeline consists of three methods:

1. `GPT Evaluation`: evaluates model predictions using GPT models.
* Compare the performance of two different models (battle).
* Rate the model according to pre-defined metrics using prompting design.
2. `Automatic Evaluation`: evaluates model predictions using automatic metrics.
3. `UniEval`: evaluates model predictions using UniEval models (English only).

### Evaluation Category

@@ -75,7 +76,9 @@ GPT evaluation uses GPT models to evaluate the prediction of different models an

GPT models evaluate the quality of model predictions based on the given prompt words and give a score between 1 and 5.

> **NOTE:** Even for the same metric, the details of its prompt words and CoT(Chain-of-Thought) can differ based on which category you want to evaluate. For example, prompt words for metric `correctness` showed here is "The answer should be in line with common sense, life experience, etc."(this is for category `brainstorming`), but for category `extraction`, prompt words can be "Answers should extract the required information accurately and should not contain any incorrect or misleading information." You can find all the prompt words and CoT(Chain-of-Thought) in `prompt/evaluation_prompt`.
> **NOTE 1:** Even for the same metric, the details of its prompt words and CoT (Chain-of-Thought) can differ based on which category you want to evaluate. For example, the prompt words for the metric `correctness` shown here are "The answer should be in line with common sense, life experience, etc." (this is for category `brainstorming`), but for category `extraction`, the prompt words can be "Answers should extract the required information accurately and should not contain any incorrect or misleading information." You can find all the prompt words and CoT (Chain-of-Thought) in `prompt/evaluation_prompt`.

> **NOTE 2:** To add customized metrics, you can refer to [FAQ](#faq).
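
As a rough illustration of how a single rating call might look, here is a hedged sketch (not the pipeline's actual code — the real prompts live in `prompt/evaluation_prompt`, and the question, answer, metric and step strings below are made-up placeholders):

```python
# Hypothetical sketch of one GPT rating call (assumes OPENAI_API_KEY is set and openai<1.0).
import openai

prompt = (
    'You are a good assistant. Please rate the given answer to the "brainstorming" question below.\n\n'
    "The question is as follows:\n\n{question}\n\n"
    "The answer is as follows:\n\n{answer}\n\n"
    "The metric for evaluation is as follows:\n\n{metric}\n\n"
    "You should follow the following evaluation steps:\n\n{steps}"
).format(
    question="Suggest three ways to reuse old newspapers.",
    answer="Wrap gifts, line pet cages, or make paper-mache crafts.",
    metric="correctness (1-5): the answer should be in line with common sense, life experience, etc.",
    steps="1. Read the answer. 2. Check it against the metric. 3. Give a score from 1 to 5.",
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])  # expected to contain a 1-5 score
```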

#### Automatic Evaluation

@@ -85,7 +88,7 @@ There are two ways to obtain reference answers:
* For instructions coming from human-designed problems, the reference answers are generated by GPT-3.5 (e.g. for roleplay and chat).
* For instructions related to classic NLP problems, the reference answers are collected from open-source datasets with target answers (e.g. for classification, extraction and summarization).

There are 5 types of automatic evaluation metrics listed in the table below:
There are 6 types of automatic evaluation metrics listed in the table below:

| Automatic Evaluation Metric | Description |
| :---------------------------------: | :----------------------------------------------------------- |
@@ -94,6 +97,25 @@ There are 5 types of automatic evaluation metrics listed in the table below:
| Distinct | Measure the diversity of generation text by counting the unique n-grams. |
| BERTScore | Measure the semantic similarity between tokens of predictions and references with BERT. |
| Precision<br/> Recall<br/> F1 Score | Measure the number of overlaps between prediction and reference (design for classification and extraction categories). |
| CHRF | Measure the similarity of character n-grams between prediction and reference. |
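
For intuition, the sketch below shows two of these metrics in a few lines — Distinct-n computed by hand and CHRF via `sacrebleu`. This is an illustration under our own assumptions; the pipeline's implementations may differ in tokenization and averaging.

```python
# Illustrative only: Distinct-n as a unique-n-gram ratio, CHRF via sacrebleu.
from sacrebleu.metrics import CHRF


def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams over a list of generated texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)


predictions = ["the cat sat on the mat", "a dog ran across the street"]
references = ["a cat is sitting on the mat", "the dog runs across the road"]

print("Distinct-2:", distinct_n(predictions, n=2))
print("CHRF:", CHRF().corpus_score(predictions, [references]).score)
```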

#### UniEval Evaluation

UniEval converts evaluation tasks of different dimensions (metrics) into Boolean QA problems and utilizes the model to answer with "Yes" or "No". Compared with similarity-based metrics such as ROUGE and BLEU, UniEval can achieve a more comprehensive evaluation. In addition, UniEval also demonstrates the ability to transfer to unseen dimensions and tasks.

In our evaluation pipeline, two pre-trained UniEval evaluators are used. One is [unieval-sum](https://huggingface.co/MingZhong/unieval-sum) and the other is [unieval-dialog](https://huggingface.co/MingZhong/unieval-dialog). The two models cover three tasks, `summarization`, `dialogue` and `data2text`, and each task has its own evaluation dimensions.

| UniEval Model | Task | Dimension (Metric) |
| :------------: | :----------------- | :--- |
| unieval-sum | summarization | coherence: whether the summary is coherent<br/>consistency: whether the claim is consistent with the given document<br/>fluency: whether the paragraph is fluent<br/>relevance: whether the summary is relevant to the reference |
| unieval-sum | data2text | naturalness: whether the utterance is fluent<br/>informativeness: whether the utterance is informative according to the reference |
| unieval-dialog | dialogue | naturalness: whether the response is natural in the dialogue<br/>coherence: whether the response is coherent in the dialogue history<br/>understandability: whether the response is understandable in the dialogue |

> **NOTE 1:** Task "data2text" uses the same model as task "summarization".

> **NOTE 2:** In the UniEval paper, the `unieval-sum` model demonstrates the best transfer ability, so you can evaluate your customized metric with this model. Details on adding customized metrics can be found in the [FAQ](#faq).

> **NOTE 3:** We do not include all of the metrics provided in UniEval in our pipeline, because the data structure and content of the instructions we want to evaluate are not suitable for direct use with some UniEval metrics.
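
To make the Boolean QA formulation concrete, here is a minimal sketch of how a single dimension could be scored with the T5-based `unieval-sum` checkpoint. This is our own assumption about the scoring logic, not the evaluator code shipped in this pipeline; the question template follows the `question: ... </s> ...` format shown in the FAQ below.

```python
# Sketch: score one fluency-style Boolean QA query with a UniEval checkpoint.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "MingZhong/unieval-sum"  # a local path from "path_for_UniEval" would also work
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).eval()

text = "question: Is this a fluent paragraph </s> paragraph: The model answers clearly and concisely."
inputs = tokenizer(text, return_tensors="pt")

# Compare the logits of "Yes" and "No" at the first decoder step.
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]

yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
score = torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()  # P("Yes")
print(f"fluency ~ {score:.3f}")
```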

## Evaluation Process

@@ -215,47 +237,60 @@ The following is an example of a Chinese GPT evaluation prompt. In an evaluation

#### Configuration

The following is an example of a Chinese config file. The configuration file can control how the pipeline evaluates the model. You need to specify GPT evaluation metrics and automatic metrics in key `GPT` and `Metrics`. You can find an example Chinese config file in `config`.
The following is an example of an English config file. The configuration file controls how the pipeline evaluates the model. You need to specify GPT evaluation metrics, automatic metrics and UniEval metrics in the keys `GPT`, `Metrics` and `UniEval` (English only). You can find an example English config file in `config`.

```json
{
"language": "cn",
"language": "en",
"path_for_UniEval": {
"summarization": "path to unieval-sum model",
"dialogue": "path to unieval-dialog model",
"data2text": "path to unieval-sum model"
},
"category": {
"brainstorming": {
"GPT": ["relevance", "creativity", "practicality", "correctness"],
"Metrics": ["Distinct"]
"Metrics": ["Distinct"],
"UniEval": ["summarization-fluency", "data2text-naturalness", "data2text-informativeness"]
},
"chat": {
"GPT": [ "relevance", "naturalness", "engagingness", "reasonableness"],
"Metrics": ["Distinct"]
"Metrics": ["Distinct"],
"UniEval": ["dialogue-naturalness", "dialogue-coherence", "dialogue-understandability"]
}
}
}
```

`"language"`: the language used to evaluate the model capability. Both Chinese `"cn"` and English `"en"` are supported.

`"path_for_UniEval"`: the paths to the UniEval models, one for each task.

`"category"`: the category/categories needed to evaluate the model capability.

`"GPT"`: the metrics you want to use for GPT evaluation.

`"Metrics"`: the metrics you want to use for automatic metrics evaluation.

`"UniEval"`: the metrics you want to use for UniEval evaluation. Each metric has to be in the `"{task}-{metric}"` format because different tasks share metrics such as naturalness and coherence.

You can remove a key such as `"Metrics"` to skip evaluating answers with its corresponding evaluation metrics.

You can create your config file based on the available settings listed in the following table.

| "category" | "GPT" | "Metrics" |
| :--------------: | :---------------------: | :---------: |
| "brainstorming" | "language organization" | "BLEU" |
| "chat" | "relevance" | "ROUGE" |
| "classification" | "creativity" | "Distinct" |
| "closed_qa" | "practicality" | "BERTScore" |
| "extraction" | "correctness" | "Precision" |
| "generation" | "naturalness" | "Recall" |
| "open_qa" | "engagingness" | "F1 score" |
| "rewriting" | "reasonableness" | |
| "roleplay" | "diversity" | |
| "summarization" | "fidelity" | |
| | "conciseness" | |
| "category" | "GPT" | "Metrics" | "UniEval" |
| :--------------: | :---------------------: | :---------: | :--------------------------: |
| "brainstorming" | "language organization" | "BLEU" | "dialogue-naturalness" |
| "chat" | "relevance" | "ROUGE" | "dialogue-coherence" |
| "classification" | "creativity" | "Distinct" | "dialogue-understandability" |
| "closed_qa" | "practicality" | "BERTScore" | "data2text-naturalness" |
| "extraction" | "correctness" | "Precision" | "data2text-informativeness" |
| "generation" | "naturalness" | "Recall" | "summarization-coherence" |
| "open_qa" | "engagingness" | "F1 score" | "summarization-consistency" |
| "rewriting" | "reasonableness" | "CHRF" | "summarization-fluency" |
| "roleplay" | "diversity" | | "summarization-relevance" |
| "summarization" | "fidelity" | | |
| | "conciseness" | | |

> **NOTE:** For categories that don't have standard answers, such as `brainstorming`, you should avoid using automatic metrics such as `BLEU` and `ROUGE`, which are based on similarity measures, and use `Distinct` instead in your config file.
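
As an illustration of how such a config might be consumed, here is a hypothetical sketch; the file name and the printed dispatch are placeholders rather than the pipeline's real API.

```python
# Hypothetical sketch: walk the config and list each requested evaluation.
import json

with open("config/config_en.json") as f:  # placeholder path to your config file
    config = json.load(f)

unieval_paths = config.get("path_for_UniEval", {})

for category, settings in config["category"].items():
    for metric in settings.get("GPT", []):       # removed keys are simply skipped
        print(f"[GPT]     {category}: rate '{metric}' with a GPT model")
    for metric in settings.get("Metrics", []):
        print(f"[Auto]    {category}: compute {metric}")
    for entry in settings.get("UniEval", []):
        task, metric = entry.split("-", 1)       # entries use the "{task}-{metric}" format
        print(f"[UniEval] {category}: {metric} via the {task} model at {unieval_paths.get(task, '?')}")
```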

@@ -290,23 +325,36 @@ For example, if you want to add a new metric `persuasiveness` into category `bra
"id": 1,
"category": "brainstorming",
"metrics": {
"persuasiveness": "说服力(1-5):XXX"
"persuasiveness": "persuasiveness(1-5):a short description for persuasiveness"
},
"CoT": {
"persuasiveness": "XXX\n\n说服力:"
"persuasiveness": "CoT for persuasiveness\n\npersuasiveness:"
},
"prompt": "你是一个好助手。请你为下面“头脑风暴”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
"prompt": "You are a good assistant. Please rate the given answer to the \"brainstorming\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
}
}
```

</details>

<details><summary><b>How can I add a new UniEval evaluation metric?</b></summary>

For example, if you want to add a new metric `persuasiveness` to task `data2text`, you should add a Boolean QA question about the metric in the function `add_question` in `unieval/utils.py`. Note that how well the model evaluates this metric is unknown, so you may need some experiments to test whether the model is capable of evaluating it.

```python
if task == 'data2text':
    if dimension == 'persuasiveness':
        cur_input = 'question: Is this a persuasive utterance </s> utterance: ' + output[i]
```

</details>

## To Do

- [x] Add evaluation for English capability
- [ ] Support UniEval
- [x] Support UniEval
- [x] Support GPT-4 evaluation
- [ ] Support GPT evaluation with reference in the prompt

## Citations

@@ -327,4 +375,13 @@
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@misc{zhong2022unified,
title={Towards a Unified Multi-Dimensional Evaluator for Text Generation},
author={Ming Zhong and Yang Liu and Da Yin and Yuning Mao and Yizhu Jiao and Pengfei Liu and Chenguang Zhu and Heng Ji and Jiawei Han},
year={2022},
eprint={2210.07197},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```