`applications/Chat/evaluate/README.md` (117 changes: 59 additions & 58 deletions)
# Evaluation

In this directory, we introduce how you can evaluate your model with GPT-4.

## Evaluation Pipeline

The whole evaluation process consists of the following three steps:
1. Prepare the questions following the internal data structure described in the Data Format section below.
2. Generate answers from different models:
* Generate answers using GPT-3.5: [`generate_gpt35_answers.py`](generate_gpt35_answers.py).
* Generate answers using your own models: [`generate_answers.py`](generate_answers.py).
3. Evaluate models using GPT-4: [`evaluate.py`](evaluate.py).

### Generate Answers
#### Generate Answers Using GPT-3.5
You can provide your own OpenAI key to generate answers from GPT-3.5 using [`generate_gpt35_answers.py`](./generate_gpt35_answers.py).

An example script is provided as follows:
```shell
python generate_gpt35_answers.py \
--dataset "path to the question dataset" \
--answer_path "path to answer folder" \
--num_workers 4 \
--openai_key "your openai key" \
--max_tokens 512 \
```

#### Generate Answers Using Your Own Model
You can also generate answers using your own models. The generation process is divided into two stages:
1. Generate answers using multiple GPUs (optional) with batch processing: [`generate_answers.py`](./generate_answers.py).
2. Merge multiple shards and output a single file: [`merge.py`](./merge.py).

In [`generate_answers.py`](./generate_answers.py), the model generates answers in batches, and different GPU processes run inference on different shards of the given questions. Once all GPU processes have generated their answers, [`merge.py`](./merge.py) merges the answer shards and outputs a single answer file. Finally, the script removes the answer shards.
An example script is given as follows:

```shell
device_number=number of your devices
# ...
done

```

### Evaluate Answers

In [`evaluate.py`](./evaluate.py), GPT-4 reviews and scores the answers of two different models. Here `Model 1` refers to the first model you specify in `--answer_file_list` and `Model 2` refers to the second. The script prints several metrics and outputs the corresponding JSON files.

The metrics include:

...

We would like to mention that the evaluation of model answers using the GPT-3.5 ...

## Data Format

### Questions

The file [`questions.json`](./sample/questions.json) shows example questions used to evaluate the performance of the model. Each question record has the following fields:
* `id` (int, compulsory): The ID of the instruction / question.
* `instruction` (str, compulsory): The instruction / question for the LLM.
* `input` (str, optional): The additional context of the instruction / question.
* `output` (str, optional): The sample output of the instruction / question.
* `category` (str, compulsory): The category of the instruction / question.

Example:
```
{
    "id": 0,
    "instruction": "Help me summarize the following short story?",
    "input": "{story}",
    "output": "{summarized story}",
    "category": "closed qa"
}
```

### Answers

We store model answers in `{model_name}_answers.json`. The JSON file contains one list. Each element in the list is an answer record to one question.

An answer record has the following fields (see the example after the list):

* `category` (str, compulsory): The category of the instruction / question.
* `instruction` (str, compulsory): The instruction / question for the LLM.
* `input` (str, optional): The additional context of the instruction / question.
* `output` (str, compulsory): The output from the LLM.
* `id` (int, compulsory): The ID of the instruction / question.
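
An example answer record, mirroring the question example above, might look as follows (the `output` value is a placeholder for the model's actual answer):
```
{
    "category": "closed qa",
    "instruction": "Help me summarize the following short story?",
    "input": "{story}",
    "output": "{model-generated summary of the story}",
    "id": 0
}
```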

### Results

We store evaluation results in `results.json`. The JSON file contains one dictionary. The key in the dictionary is formatted as `{model 1}_vs_{model 2}` and the value is another dictionary that contains metrics about the evaluation.

The value has the following fields (see the example after the list):

* `model` (list, compulsory): The names of the two models.
* `better` (int, compulsory): The number of reviews where Model 2 receives a higher score.
* `worse` (int, compulsory): The number of reviews where Model 2 receives a lower score.
* `tie` (int, compulsory): The number of reviews where the two models tie.
* `win_rate` (float, compulsory): Win rate of Model 2.
* `score` (list, compulsory): Average score of the two models.
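
For example, a `results.json` comparing two hypothetical models might look like the following (the model names and all numbers below are made up for illustration):
```
{
    "model_A_vs_model_B": {
        "model": ["model_A", "model_B"],
        "better": 40,
        "worse": 30,
        "tie": 10,
        "win_rate": 0.5,
        "score": [7.2, 7.6]
    }
}
```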

### Better, Worse, Tie, Invalid, Review

To help better compare the model answers, we store JSON files whose names end with `_better`, `_worse`, `_tie`, `_invalid` or `_review`. Each JSON file contains one list. Each element in the list is a review record for a better, worse, tie or invalid case (the `_review` file contains all cases).

A record has the following fields (see the example after the list):

* `review_id` (str, optional): Random UUID, not in use.
* `id` (int, compulsory): The ID of the instruction / question.
* `reviewer_id` (int, compulsory): A unique ID for a reviewer. Different reviewer IDs use different prompts.
* `metadata` (dict, optional): Currently empty.
* `review` (str, optional): GPT-4's review.
* `score` (list, compulsory): The scores of two models.
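
An illustrative record might look like the following (the UUID, review text, and scores are placeholders):
```
{
    "review_id": "9a1f3a2e-5b6c-4d7e-8f90-123456789abc",
    "id": 0,
    "reviewer_id": 1,
    "metadata": {},
    "review": "Assistant 1 gives a concise and accurate summary, while Assistant 2 omits a key detail ...",
    "score": [8.0, 7.5]
}
```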

### Prompts

The data format is the same as [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl) prompts.

### Reviewer

The data format is the same as [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl) reviewers.

## Citations

...

`applications/Chat/evaluate/format_questions.py` (31 changes: 0 additions & 31 deletions)

This file was deleted.

`applications/Chat/evaluate/format_questions.sh` (3 changes: 0 additions & 3 deletions)

This file was deleted.

`applications/Chat/evaluate/sample/questions.json` (9 changes: 9 additions & 0 deletions)
```
[
    {
        "id": 0,
        "instruction": "Help me summarize the following news?",
        "input": "National Commercial Bank (NCB), Saudi Arabia's largest lender by assets, agreed to buy rival Samba Financial Group for $15 billion in the biggest banking takeover this year.NCB will pay 28.45 riyals ($7.58) for each Samba share, according to a statement on Sunday, valuing it at about 55.7 billion riyals. NCB will offer 0.739 new shares for each Samba share, at the lower end of the 0.736-0.787 ratio the banks set when they signed an initial framework agreement in June.The offer is a 3.5% premium to Samba's Oct. 8 closing price of 27.50 riyals and about 24% higher than the level the shares traded at before the talks were made public. Bloomberg News first reported the merger discussions.The new bank will have total assets of more than $220 billion, creating the Gulf region's third-largest lender. The entity's $46 billion market capitalization nearly matches that of Qatar National Bank QPSC, which is still the Middle East's biggest lender with about $268 billion of assets.",
        "output": "NCB to pay 28.45 riyals for each Samba share. Deal will create Gulf region's third-largest lender",
        "category": "closed qa"
    }
]
```