Feature/implement load & save for benchmark reports #999
Merged
chakravarthik27 merged 23 commits into release/2.0.1 on Apr 1, 2024
Conversation
…ng/loading files and keys
…and updating summary
Prikshit7766 approved these changes on Apr 1, 2024
Description
This pull request introduces a significant upgrade to LangTest's evaluation capabilities, focused on report management and leaderboards. The enhancements include:

- Streamlined Reporting and Tracking: save and load detailed evaluation reports directly from the command line with `langtest eval`, enabling performance tracking and comparative analysis over time. Saved report files can also be reviewed manually in the `~/.langtest` or `./.langtest` folder.
- Enhanced Leaderboards: the new `langtest show-leaderboard` command displays existing leaderboards, providing a centralized view of ranked model performance across evaluations.
- Average Model Ranking: leaderboards now include the average ranking for each evaluated model, giving a comprehensive view of performance across datasets and tests.
How it works:
First, create a `parameter.json` or `parameter.yaml` file in the working directory.

JSON format:

```json
{
  "task": "question-answering",
  "model": {
    "model": "http://localhost:1234/v1/chat/completions",
    "hub": "lm-studio"
  },
  "data": [
    { "data_source": "MedMCQA" },
    { "data_source": "PubMedQA" },
    { "data_source": "MMLU" },
    { "data_source": "MedQA" }
  ],
  "config": {
    "model_parameters": {
      "max_tokens": 64
    },
    "tests": {
      "defaults": { "min_pass_rate": 1.0 },
      "robustness": {
        "add_typo": { "min_pass_rate": 0.70 }
      },
      "accuracy": {
        "llm_eval": { "min_score": 0.60 }
      }
    }
  }
}
```

YAML format:
Next, open a terminal (or Command Prompt on Windows) in that directory and run the evaluation.
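A minimal invocation sketch, assuming `langtest` is installed and picks up the parameter file from the current working directory:

```sh
# Run the evaluation defined in parameter.json / parameter.yaml
# (the parameter file is expected in the current working directory)
langtest eval
```

As described above, the resulting report is saved to the `~/.langtest` or `./.langtest` folder, where it can be reviewed manually or reloaded later for comparison.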
Finally, the leaderboard and the rank of the evaluated model are displayed.

To view the leaderboard again at any time, use the CLI command:
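```sh
# Display the existing leaderboard, including each model's average ranking
langtest show-leaderboard
```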