Feature/implement load & save for benchmark reports #999

Merged
chakravarthik27 merged 23 commits into release/2.0.1 from
feature/Implement-Load-&-Save-for-Benchmark-Reports
Apr 1, 2024

Conversation

chakravarthik27 (Collaborator) commented Mar 27, 2024

Description

This pull request introduces a significant upgrade to LangTest's evaluation capabilities, focused on report management and leaderboards. These enhancements provide:

  • Streamlined Reporting and Tracking: Effortlessly save and load detailed evaluation reports directly from the command line using langtest eval, enabling efficient performance tracking and comparative analysis over time, with manual file review options in the ~/.langtest or ./.langtest folder.

  • Enhanced Leaderboards: Gain valuable insights with the new langtest show-leaderboard command. This command displays existing leaderboards, providing a centralized view of ranked model performance across evaluations.

  • Average Model Ranking: Leaderboards now include the average ranking for each evaluated model. This metric gives a comprehensive picture of model performance across multiple datasets and tests.
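As a sketch of the manual-review path mentioned above, the report folders named in the description could be inspected like this (the folder locations come from the PR text; the idea that reports are stored as individual files, and any file names, are assumptions):

```python
from pathlib import Path

# Candidate report folders named in the PR description; which one is
# used, and the saved report's file format, are assumptions here.
candidates = [Path.home() / ".langtest", Path("./.langtest")]

for folder in candidates:
    if folder.is_dir():
        # List whatever report files `langtest eval` has saved so far.
        for report in sorted(folder.iterdir()):
            print(report.name)
```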

How it works:

First, create a parameter.json or parameter.yaml file in the working directory.

JSON Format

{
    "task": "question-answering",
    "model": {
        "model": "http://localhost:1234/v1/chat/completions",
        "hub": "lm-studio"
    },
    "data": [
        {
            "data_source": "MedMCQA"
        },
        {
            "data_source": "PubMedQA"
        },
        {
            "data_source": "MMLU"
        },
        {
            "data_source": "MedQA"
        }
    ],
    "config": {
        "model_parameters": {
            "max_tokens": 64
        },
        "tests": {
            "defaults": {
                "min_pass_rate": 1.0
            },
            "robustness": {
                "add_typo": {
                    "min_pass_rate": 0.70
                }
            },
            "accuracy": {
                "llm_eval": {
                    "min_score": 0.60
            }
        }
        }
    }
}
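To sanity-check the structure above, the JSON form can be loaded with the standard library (a minimal sketch using only `json`; langtest does its own parsing, but the keys below are exactly the ones shown in the example config):

```python
import json

# The same config as above, inlined for a self-contained example.
config_text = """
{
    "task": "question-answering",
    "model": {"model": "http://localhost:1234/v1/chat/completions", "hub": "lm-studio"},
    "data": [
        {"data_source": "MedMCQA"},
        {"data_source": "PubMedQA"},
        {"data_source": "MMLU"},
        {"data_source": "MedQA"}
    ],
    "config": {
        "model_parameters": {"max_tokens": 64},
        "tests": {
            "defaults": {"min_pass_rate": 1.0},
            "robustness": {"add_typo": {"min_pass_rate": 0.70}},
            "accuracy": {"llm_eval": {"min_score": 0.60}}
        }
    }
}
"""
params = json.loads(config_text)

# Pull out the pieces an evaluation run would need.
datasets = [d["data_source"] for d in params["data"]]
print(datasets)  # ['MedMCQA', 'PubMedQA', 'MMLU', 'MedQA']
```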

YAML Format

task: question-answering
model:
  model: http://localhost:1234/v1/chat/completions
  hub: lm-studio
data:
- data_source: MedMCQA
- data_source: PubMedQA
- data_source: MMLU
- data_source: MedQA
config:
  model_parameters:
    max_tokens: 64
  tests:
    defaults:
      min_pass_rate: 1
    robustness:
      add_typo:
        min_pass_rate: 0.7
    accuracy:
      llm_eval:
        min_score: 0.6

Then open a terminal (or cmd) on your system and run:

langtest eval --model <your model name or endpoint> \
              --hub <model hub, e.g. hugging face, lm-studio, web ...> \
              -c <your configuration file, e.g. parameter.json or parameter.yaml>

Finally, the leaderboard and the model's rank are reported.


To view the leaderboard at any time, run the CLI command:

langtest show-leaderboard
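The average-ranking metric described earlier can be illustrated with a small sketch (this is not langtest's internal code; the model names and scores below are made-up example numbers):

```python
# Made-up per-dataset scores for two hypothetical models.
scores = {
    "model-a": {"MedMCQA": 0.62, "PubMedQA": 0.71, "MMLU": 0.55},
    "model-b": {"MedMCQA": 0.58, "PubMedQA": 0.74, "MMLU": 0.60},
}
datasets = ["MedMCQA", "PubMedQA", "MMLU"]

# Rank models per dataset (highest score gets rank 1), then average
# each model's ranks across datasets.
ranks = {m: [] for m in scores}
for ds in datasets:
    ordered = sorted(scores, key=lambda m: scores[m][ds], reverse=True)
    for rank, model in enumerate(ordered, start=1):
        ranks[model].append(rank)

avg_rank = {m: sum(r) / len(r) for m, r in ranks.items()}
print(avg_rank["model-a"], avg_rank["model-b"])
```

Here model-b wins two of the three datasets, so its average rank (4/3) beats model-a's (5/3), which is the kind of cross-dataset summary the leaderboard's average-ranking column provides.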


@chakravarthik27 chakravarthik27 linked an issue Mar 31, 2024 that may be closed by this pull request
@chakravarthik27 chakravarthik27 added the v2.1.0 Issue or request to be done in v2.1.0 release label Apr 1, 2024
@chakravarthik27 chakravarthik27 requested review from Prikshit7766 and removed request for ArshaanNazir April 1, 2024 13:42
@chakravarthik27 chakravarthik27 merged commit ff9bf2a into release/2.0.1 Apr 1, 2024
@chakravarthik27 chakravarthik27 deleted the feature/Implement-Load-&-Save-for-Benchmark-Reports branch August 30, 2024 15:59

Development

Successfully merging this pull request may close these issues.

Implement Load and Save Functionality for Benchmark Reports
