Describe the feature
Technical Design of the Evaluation Module
Data Format
Questions
Each question record should have the following fields:
id (int, compulsory): The ID of the instruction.
instruction (str, compulsory): The instruction for the LLM.
category (str, compulsory): The category of the instruction.
input (str, optional): The additional context of the instruction.
output (str, optional): A sample output for the instruction (generated by GPT-3.5 by default).
target (str, optional): The target answer for the instruction.
Note: if the question comes with a gold-standard answer, the output can be empty and target holds that gold standard. Otherwise, we generate an answer from GPT-3.5 as the output, and the target field is left empty.
When evaluating performance, if target is empty, use the value from output instead.
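For illustration, a question record might look like the following (all concrete values are made up for this sketch; this is the gold-standard case, where target is filled and output is left empty):

{
    "id": 1,
    "instruction": "Summarise the following paragraph in one sentence.",
    "category": "summarization",
    "input": "Large language models are ...",
    "output": "",
    "target": "A one-sentence gold-standard summary of the paragraph."
}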
Answers
The JSON file contains one list; each element in the list is an answer record for one question.
An answer record has the following fields (a sample file is shown after the list):
id (int, compulsory): The ID of the instruction.
instruction (str, compulsory): The instruction for the LLM.
category (str, compulsory): The category of the instruction.
input (str, optional): The additional context of the instruction.
output (str, compulsory): The output from the LLM.
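For illustration, a minimal answer file could look like this (the record content is again made up):

[
    {
        "id": 1,
        "instruction": "Summarise the following paragraph in one sentence.",
        "category": "summarization",
        "input": "Large language models are ...",
        "output": "The paragraph explains what large language models are and how they are trained."
    }
]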
Evaluation
Configuration
We assume that all answers have been generated and saved following the data format described above.
Configuration file for the evaluator module: config_eval.json.
This file controls how we evaluate the performance of the model.
{
"language": "eng",
"category": {
"role play": {
"GPT-3.5": ["fluency", "coherence", "consistency", "relevance"],
"GPT-4": ["fluency", "coherence", "consistency", "relevance"],
"Metrics": ["BLEU", "ROUGE", "F1 score", "Distinct", "MAUVE"]
},
"Multi-turn conversation": {
"GPT-3.5": ["fluency", "coherence", "consistency", "relevance"],
"GPT-4": ["fluency", "coherence", "consistency", "relevance"],
"Metrics": ["BLEU", "ROUGE", "F1 score", "Distinct", "MAUVE"]
},
"Open QA": {
"GPT-3.5": ["fluency", "coherence", "consistency", "relevance"],
"GPT-4": ["fluency", "coherence", "consistency", "relevance"],
"Metrics": ["BLEU", "ROUGE", "F1 score", "Distinct", "MAUVE"]
}
}
}
The values for GPT-3.5 and GPT-4 can be empty lists, and the value for Metrics can also be empty. For example, for classification tasks you only need to put Precision, Recall and F1 score under Metrics.
We currently support eng and ch.
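For instance, a classification entry in config_eval.json could rely on metrics only; the category name "classification" here is just an illustration:

"classification": {
    "GPT-3.5": [],
    "GPT-4": [],
    "Metrics": ["Precision", "Recall", "F1 score"]
}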
Code Architecture
evaluator.py: Main class for the evaluator.
from typing import Dict


class Evaluator(object):
    def __init__(self, params: Dict) -> None:
        self.params = params
        self.stats = dict()

    def battle(self, answers1: Dict, answers2: Dict) -> None:
        """
        Comparison between two models using GPT-4 as the reviewer.
        """
        pass

    def evaluate(self, answers: Dict) -> None:
        """
        A comprehensive evaluation of the answers from the model.
        The function evaluates the model's performance from different perspectives
        using GPT-3.5, GPT-4, and off-the-shelf evaluation metrics.
        The metrics to use are determined by the config file.
        """
        pass

    def save(self, path: str) -> None:
        pass
Results will be saved as a JSON file. Please save all files in a separate folder.
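A minimal sketch of what save could do, assuming the statistics are kept in self.stats and a file name such as evaluation_results.json, neither of which is fixed by this design:

import json
import os

def save(self, path: str) -> None:
    # Make sure the results folder exists, then dump the collected
    # statistics as a JSON file inside it.
    os.makedirs(path, exist_ok=True)
    result_file = os.path.join(path, "evaluation_results.json")
    with open(result_file, "w", encoding="utf-8") as f:
        json.dump(self.stats, f, ensure_ascii=False, indent=4)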
metrics.py: the module that contains all metric functions. One function defines one metric.
from typing import Dict, List


def rouge_score(preds: List, target: List) -> Dict:
    rouge_scores = {"rouge1": 0, "rouge2": 0, "rougeL": 0}
    # calculate scores
    return rouge_scores
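As an illustration of the one-function-per-metric convention, a Distinct-n style metric can be sketched without external dependencies; the function name distinct_score and the averaging choice are assumptions:

from typing import Dict, List

def distinct_score(preds: List[str]) -> Dict:
    # Ratio of unique n-grams to total n-grams, averaged over all predictions.
    distinct = {"distinct1": 0.0, "distinct2": 0.0}
    for n in (1, 2):
        ratios = []
        for pred in preds:
            tokens = pred.split()
            ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            if ngrams:
                ratios.append(len(set(ngrams)) / len(ngrams))
        if ratios:
            distinct[f"distinct{n}"] = sum(ratios) / len(ratios)
    return distinct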
eval.py: driver function that initialises the evaluator.
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # load config
    # initialize evaluator
If two answer files are provided, we use battle; otherwise, we call evaluate. A possible driver sketch is given below.
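One possible shape for the driver, assuming the argument names --config_file, --answer_file, --answer_file_2 and --save_path, none of which are fixed by this design:

import argparse
import json

from evaluator import Evaluator

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--config_file', type=str, required=True)
    parser.add_argument('--answer_file', type=str, required=True)
    parser.add_argument('--answer_file_2', type=str, default=None)
    parser.add_argument('--save_path', type=str, default='results')
    args = parser.parse_args()

    # Load the evaluation configuration (config_eval.json).
    with open(args.config_file, encoding='utf-8') as f:
        config = json.load(f)
    evaluator = Evaluator(config)

    # Load the first answer file.
    with open(args.answer_file, encoding='utf-8') as f:
        answers = json.load(f)

    if args.answer_file_2:
        # Two answer files: pairwise comparison with GPT-4 as the reviewer.
        with open(args.answer_file_2, encoding='utf-8') as f:
            answers_2 = json.load(f)
        evaluator.battle(answers, answers_2)
    else:
        # One answer file: full evaluation with the configured metrics.
        evaluator.evaluate(answers)

    evaluator.save(args.save_path)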
Existing functions for generating answers can be moved to a separate folder. Please see below for the folder structure:
eval
- eval.py
- metrics.py
- gpt_evaluate.py
- evaluator.py
- utils.py
- results
- generate_answers
- generate_gpt35_answers.py
- ...