`applications/Chat/evaluate/README.md` (117 changes: 59 additions & 58 deletions)
# Evaluation

In this directory, we introduce how you can evaluate your model with GPT-4.

## Evaluation Pipeline

The whole evaluation process consists of the following three steps:
1. Prepare the questions following the internal data structure described in the Data Format section below.
2. Generate answers from different models:
* Generate answers using GPT-3.5: [`generate_gpt35_answers.py`](generate_gpt35_answers.py).
* Generate answers using your own models: [`generate_answers.py`](generate_answers.py).
3. Evaluate models using GPT-4: [`evaluate.py`](evaluate.py).

### Generate Answers
#### Generate Answers Using GPT-3.5
You can provide your own OpenAI key to generate answers from GPT-3.5 using [`generate_gpt35_answers.py`](./generate_gpt35_answers.py).

An example script is provided as follows:
```shell
python generate_gpt35_answers.py \
--dataset "path to the question dataset" \
--answer_path "path to answer folder" \
--num_workers 4 \
--openai_key "your openai key" \
--max_tokens 512 \
```

#### Generate Answers Using Your Own Model
You can also generate answers using your own models. The generation process is divided into two stages:
1. Generate answers using multiple GPUs (optional) with batch processing: [`generate_answers.py`](./generate_answers.py).
2. Merge multiple shards and output a single file: [`merge.py`](./merge.py).

In [`generate_answers.py`](./generate_answers.py), the model generates answers in batches, and different GPU processes run inference on different shards of the given questions. Once all GPU processes have generated their answers, [`merge.py`](./merge.py) merges the answer shards and outputs a single answer file. Finally, the script removes the answer shards.
An example script is given as follows:

```shell
device_number=number of your devices
# ...
done

```

### Evaluate Answers

In [`evaluate.py`](./evaluate.py), GPT-4 reviews and scores the answers of two different models. Here `Model 1` refers to the first model you specify in `--answer_file_list` and `Model 2` refers to the second. The script prints several metrics and outputs the corresponding JSON files.

The metrics include:

...

We would like to mention that the evaluation of model answers using the GPT-3.5 ...

## Data Format

### Questions

The file [`questions.json`](./sample/questions.json) shows example questions used to evaluate the performance of the model. Each question record has the following fields:
* `id` (int, compulsory): The ID of the instruction / question.
* `instruction` (str, compulsory): The instruction / question for the LLM.
* `input` (str, optional): The additional context of the instruction / question.
* `output` (str, optional): The sample output of the instruction / question.
* `category` (str, compulsory): The category of the instruction / question.

Example:
```
{
    "id": 0,
    "instruction": "Help me summarize the following short story?",
    "input": "{story}",
    "output": "{summarized story}",
    "category": "closed qa"
}
```

### Answers

We store model answers in `{model_name}_answers.json`. The JSON file contains one list. Each element in the list is an answer record to one question.

An answer record has the following fields (see the example after the list):

* `category` (str, compulsory): The category of the instruction / question.
* `instruction` (str, compulsory): The instruction / question for the LLM.
* `input` (str, optional): The additional context of the instruction / question.
* `output` (str, compulsory): The output from the LLM.
* `id` (int, compulsory): The ID of the instruction / question.
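
An example answer record, mirroring the question example above, might look as follows (the `output` value is a placeholder for the model's actual answer):
```
{
    "category": "closed qa",
    "instruction": "Help me summarize the following short story?",
    "input": "{story}",
    "output": "{model-generated summary of the story}",
    "id": 0
}
```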

### Results

We store evaluation results in `results.json`. The JSON file contains one dictionary. The key in the dictionary is formatted as `{model 1}_vs_{model 2}` and the value is another dictionary that contains metrics about the evaluation.

The value has the following fields (see the example after the list):

* `model` (list, compulsory): The names of the two models.
* `better` (int, compulsory): The number of reviews where Model 2 receives a higher score.
* `worse` (int, compulsory): The number of reviews where Model 2 receives a lower score.
* `tie` (int, compulsory): The number of reviews where the two models tie.
* `win_rate` (float, compulsory): Win rate of Model 2.
* `score` (list, compulsory): Average score of the two models.
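
For example, a `results.json` comparing two hypothetical models might look like the following (the model names and all numbers below are made up for illustration):
```
{
    "model_A_vs_model_B": {
        "model": ["model_A", "model_B"],
        "better": 40,
        "worse": 30,
        "tie": 10,
        "win_rate": 0.5,
        "score": [7.2, 7.6]
    }
}
```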

### Better, Worse, Tie, Invalid, Review

To help better compare the model answers, we store JSON files whose names end with `_better`, `_worse`, `_tie`, `_invalid` or `_review`. Each JSON file contains one list. Each element in the list is a review record for a better, worse, tie or invalid case (the `_review` file contains all cases).

A record has the following fields (see the example after the list):

* `review_id` (str, optional): Random UUID, not in use.
* `id` (int, compulsory): The ID of the instruction / question.
* `reviewer_id` (int, compulsory): A unique ID for a reviewer. Different reviewer IDs use different prompts.
* `metadata` (dict, optional): Currently empty.
* `review` (str, optional): GPT-4's review.
* `score` (list, compulsory): The scores of two models.
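
An illustrative record might look like the following (the UUID, review text, and scores are placeholders):
```
{
    "review_id": "9a1f3a2e-5b6c-4d7e-8f90-123456789abc",
    "id": 0,
    "reviewer_id": 1,
    "metadata": {},
    "review": "Assistant 1 gives a concise and accurate summary, while Assistant 2 omits a key detail ...",
    "score": [8.0, 7.5]
}
```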

### Prompts

The data format is the same as [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl) prompts.

### Reviewer

The data format is the same as [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl) reviewers.

## Citations

...

`applications/Chat/evaluate/format_questions.py` (31 changes: 0 additions & 31 deletions)

This file was deleted.

`applications/Chat/evaluate/format_questions.sh` (3 changes: 0 additions & 3 deletions)

This file was deleted.

`applications/Chat/evaluate/sample/questions.json` (9 changes: 9 additions & 0 deletions)
```
[
    {
        "id": 0,
        "instruction": "Help me summarize the following news?",
        "input": "National Commercial Bank (NCB), Saudi Arabia's largest lender by assets, agreed to buy rival Samba Financial Group for $15 billion in the biggest banking takeover this year.NCB will pay 28.45 riyals ($7.58) for each Samba share, according to a statement on Sunday, valuing it at about 55.7 billion riyals. NCB will offer 0.739 new shares for each Samba share, at the lower end of the 0.736-0.787 ratio the banks set when they signed an initial framework agreement in June.The offer is a 3.5% premium to Samba's Oct. 8 closing price of 27.50 riyals and about 24% higher than the level the shares traded at before the talks were made public. Bloomberg News first reported the merger discussions.The new bank will have total assets of more than $220 billion, creating the Gulf region's third-largest lender. The entity's $46 billion market capitalization nearly matches that of Qatar National Bank QPSC, which is still the Middle East's biggest lender with about $268 billion of assets.",
        "output": "NCB to pay 28.45 riyals for each Samba share. Deal will create Gulf region's third-largest lender",
        "category": "closed qa"
    }
]
```