Describe the feature
Technical Design of the Evaluation Module
Data Format
Questions
Each question record should have the following fields:
id (int, compulsory): The ID of the instruction.
instruction (str, compulsory): The instruction for the LLM.
category (str, compulsory): The category of the instruction.
input (str, optional): The additional context of the instruction.
output (str, optional): A sample output for the instruction (generated by GPT-3.5 by default).
target (str, optional): The target answer for the instruction.
Note: if the question comes with a gold-standard answer, the output can be empty and target holds that gold standard. Otherwise, we generate an answer from GPT-3.5 as the output, and the target field is left empty.
When evaluating performance, if target is empty, use the value from output instead.
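For illustration, a question record might look like the following (all concrete values are made up for this sketch; this is the gold-standard case, where target is filled and output is left empty):

{
    "id": 1,
    "instruction": "Summarise the following paragraph in one sentence.",
    "category": "summarization",
    "input": "Large language models are ...",
    "output": "",
    "target": "A one-sentence gold-standard summary of the paragraph."
}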
Answers
The JSON file contains one list; each element in the list is an answer record for one question.
An answer record has the following fields (a sample file is shown after the list):
id (int, compulsory): The ID of the instruction.
instruction (str, compulsory): The instruction for the LLM.
category (str, compulsory): The category of the instruction.
input (str, optional): The additional context of the instruction.
output (str, compulsory): The output from the LLM.
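For illustration, a minimal answer file could look like this (the record content is again made up):

[
    {
        "id": 1,
        "instruction": "Summarise the following paragraph in one sentence.",
        "category": "summarization",
        "input": "Large language models are ...",
        "output": "The paragraph explains what large language models are and how they are trained."
    }
]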
Evaluation
Configuration
We assume that all answers have been generated and saved following the data format described above.
Configuration file for the evaluator module: config_eval.json.
This file controls how we evaluate the performance of the model.
{
"language": "eng",
"category": {
"role play": {
"GPT-3.5": ["fluency", "coherence", "consistency", "relevance"],
"GPT-4": ["fluency", "coherence", "consistency", "relevance"],
"Metrics": ["BLEU", "ROUGE", "F1 score", "Distinct", "MAUVE"]
},
"Multi-turn conversation": {
"GPT-3.5": ["fluency", "coherence", "consistency", "relevance"],
"GPT-4": ["fluency", "coherence", "consistency", "relevance"],
"Metrics": ["BLEU", "ROUGE", "F1 score", "Distinct", "MAUVE"]
},
"Open QA": {
"GPT-3.5": ["fluency", "coherence", "consistency", "relevance"],
"GPT-4": ["fluency", "coherence", "consistency", "relevance"],
"Metrics": ["BLEU", "ROUGE", "F1 score", "Distinct", "MAUVE"]
}
}
}
The values for GPT-3.5 and GPT-4 can be empty lists, and the value for Metrics can also be empty. For example, for classification tasks you only need to put Precision, Recall and F1 score under Metrics.
We currently support eng and ch.
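For instance, a classification entry in config_eval.json could rely on metrics only; the category name "classification" here is just an illustration:

"classification": {
    "GPT-3.5": [],
    "GPT-4": [],
    "Metrics": ["Precision", "Recall", "F1 score"]
}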
Code Architecture
evaluator.py: Main class for the evaluator.
from typing import Dict


class Evaluator(object):
    def __init__(self, params: Dict) -> None:
        self.params = params
        self.stats = dict()

    def battle(self, answers1: Dict, answers2: Dict) -> None:
        """
        Comparison between two models using GPT-4 as the reviewer.
        """
        pass

    def evaluate(self, answers: Dict) -> None:
        """
        A comprehensive evaluation of the answers from the model.
        The function evaluates the model's performance from different perspectives
        using GPT-3.5, GPT-4, and off-the-shelf evaluation metrics.
        The metrics to use are determined by the config file.
        """
        pass

    def save(self, path: str) -> None:
        pass
Results will be saved as a JSON file. Please save all files in a separate folder.
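A minimal sketch of what save could do, assuming the statistics are kept in self.stats and a file name such as evaluation_results.json, neither of which is fixed by this design:

import json
import os

def save(self, path: str) -> None:
    # Make sure the results folder exists, then dump the collected
    # statistics as a JSON file inside it.
    os.makedirs(path, exist_ok=True)
    result_file = os.path.join(path, "evaluation_results.json")
    with open(result_file, "w", encoding="utf-8") as f:
        json.dump(self.stats, f, ensure_ascii=False, indent=4)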
metrics.py: the module that contains all metric functions. One function defines one metric.
from typing import Dict, List


def rouge_score(preds: List, target: List) -> Dict:
    rouge_scores = {"rouge1": 0, "rouge2": 0, "rougeL": 0}
    # calculate scores
    return rouge_scores
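As an illustration of the one-function-per-metric convention, a Distinct-n style metric can be sketched without external dependencies; the function name distinct_score and the averaging choice are assumptions:

from typing import Dict, List

def distinct_score(preds: List[str]) -> Dict:
    # Ratio of unique n-grams to total n-grams, averaged over all predictions.
    distinct = {"distinct1": 0.0, "distinct2": 0.0}
    for n in (1, 2):
        ratios = []
        for pred in preds:
            tokens = pred.split()
            ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            if ngrams:
                ratios.append(len(set(ngrams)) / len(ngrams))
        if ratios:
            distinct[f"distinct{n}"] = sum(ratios) / len(ratios)
    return distinct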
eval.py: driver function that initialises the evaluator.
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # load config
    # initialize evaluator
If two answer files are provided, we use battle; otherwise, we call evaluate. A possible driver sketch is given below.
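One possible shape for the driver, assuming the argument names --config_file, --answer_file, --answer_file_2 and --save_path, none of which are fixed by this design:

import argparse
import json

from evaluator import Evaluator

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--config_file', type=str, required=True)
    parser.add_argument('--answer_file', type=str, required=True)
    parser.add_argument('--answer_file_2', type=str, default=None)
    parser.add_argument('--save_path', type=str, default='results')
    args = parser.parse_args()

    # Load the evaluation configuration (config_eval.json).
    with open(args.config_file, encoding='utf-8') as f:
        config = json.load(f)
    evaluator = Evaluator(config)

    # Load the first answer file.
    with open(args.answer_file, encoding='utf-8') as f:
        answers = json.load(f)

    if args.answer_file_2:
        # Two answer files: pairwise comparison with GPT-4 as the reviewer.
        with open(args.answer_file_2, encoding='utf-8') as f:
            answers_2 = json.load(f)
        evaluator.battle(answers, answers_2)
    else:
        # One answer file: full evaluation with the configured metrics.
        evaluator.evaluate(answers)

    evaluator.save(args.save_path)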
Existing functions for generating answers can be moved to a separate folder. Please see below for the folder structure:
eval
- eval.py
- metrics.py
- gpt_evaluate.py
- evaluator.py
- utils.py
- results
- generate_answers
- generate_gpt35_answers.py
- ...