# EvalSmith

A local, model-agnostic prompt regression runner for teams shipping LLM features.


## What it does

EvalSmith runs your prompts against real models, scores the outputs, and — most importantly — tells you when a new prompt version performs worse than the previous one before it reaches production.

It is not a benchmark suite. It is not an eval framework. It is a regression gate you drop into your CI pipeline.


## Philosophy

Prompt engineering is iterative. Every iteration is a potential regression. EvalSmith makes regression visible by treating prompt versions the same way engineers treat code: with a diff, a score, and a pass/fail decision.

It runs 100% locally. You bring your own API keys. No data leaves your machine except through the provider APIs you explicitly configure.


## Installation

```bash
# Clone the repository
git clone https://github.com/your-org/evalsmith.git
cd evalsmith

# Install dependencies
pip install -r evalsmith/requirements.txt

# Install the CLI (editable mode)
pip install -e .
```

## API keys

EvalSmith reads provider keys from environment variables:

```bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AIza...
```

## Dataset format

EvalSmith accepts a CSV with at least two columns: one for model inputs and one for expected outputs.

```csv
question,answer
"What is the capital of France?","Paris"
"Who wrote 1984?","George Orwell"
"What year did WWII end?","1945"
```

You specify which columns to use with `--input` and `--ground-truth`.
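
For clarity, here is a rough sketch of what the per-row substitution amounts to: each row's input column is dropped into the prompt template's `{input}` placeholder (described under `--prompt` below). The function and file names are illustrative, not EvalSmith's actual API:

```python
import csv

def render_prompts(dataset_path: str, template_path: str, input_col: str):
    """Illustrative sketch: substitute each row's input column into the
    {input} placeholder of a prompt template, one prompt per CSV row."""
    with open(template_path) as f:
        template = f.read()
    with open(dataset_path, newline="") as f:
        for row in csv.DictReader(f):
            yield template.format(input=row[input_col])

# e.g. render each row of tests.csv through prompt_v3.txt
for prompt in render_prompts("tests.csv", "prompt_v3.txt", "question"):
    print(prompt)
```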


## CLI usage

### `evalsmith run` — Score models against a dataset

```bash
evalsmith run \
  --models openai:gpt-4o,anthropic:claude-3-haiku \
  --dataset tests.csv \
  --prompt prompt_v3.txt \
  --input question \
  --ground-truth answer
```

Options:

| Flag | Description |
|------|-------------|
| `--models` | Comma-separated `provider:model` list |
| `--dataset` | Path to CSV |
| `--prompt` | Path to prompt template file (`{input}` is substituted per row) |
| `--input` | Input column name |
| `--ground-truth` | Ground truth column name |
| `--output report.json` | Save full JSON report to disk |
| `--verbose` | Enable debug logging |

### `evalsmith compare` — Detect regressions between prompt versions

```bash
evalsmith compare \
  --model openai:gpt-4o \
  --dataset tests.csv \
  --baseline prompt_v2.txt \
  --candidate prompt_v3.txt \
  --input question \
  --ground-truth answer \
  --fail-on-regression \
  --output report.json
```

Options:

| Flag | Description |
|------|-------------|
| `--model` | Single `provider:model` to use for both runs |
| `--baseline` | Path to baseline prompt file |
| `--candidate` | Path to candidate prompt file |
| `--fail-on-regression` | Exit code 1 if accuracy drops or hallucination rate increases |
| `--output report.json` | Save baseline + candidate + regression JSON |

## Scoring

| Metric | Method |
|--------|--------|
| Accuracy | Case-insensitive exact match (whitespace-stripped) |
| Hallucination rate | % of outputs containing capitalized entities absent from the ground truth |
| Avg latency | Mean request time in milliseconds |
| Total cost | Estimated USD based on token usage and provider pricing |
| Cost / correct | Total cost divided by the number of correct answers |
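
The exact-match and hallucination checks are simple enough to sketch in a few lines. The following is one plausible reading of the rules in the table above, not the actual code in `scoring.py` or `hallucination.py`:

```python
import re

def is_correct(output: str, truth: str) -> bool:
    # Accuracy rule: case-insensitive exact match, whitespace-stripped.
    return output.strip().lower() == truth.strip().lower()

def hallucinated_entities(output: str, truth: str) -> set[str]:
    # Heuristic: treat any capitalized word in the output that never
    # appears in the ground truth as a hallucinated entity.
    entities = set(re.findall(r"\b[A-Z][a-zA-Z]+\b", output))
    return {e for e in entities if e.lower() not in truth.lower()}

def score(outputs: list[str], truths: list[str]) -> dict:
    correct = sum(is_correct(o, t) for o, t in zip(outputs, truths))
    hallucinated = sum(bool(hallucinated_entities(o, t))
                       for o, t in zip(outputs, truths))
    return {
        "accuracy": correct / len(outputs),
        "hallucination_rate": hallucinated / len(outputs),
    }
```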

## Regression detection

When running `compare`, EvalSmith identifies every case where:

- A previously correct answer became incorrect
- A new hallucinated entity appeared that was not in the ground truth

Example terminal output:

```
⚠️  3 regression(s) detected:
  • Case #12: Previously correct, now incorrect
  • Case #34: New hallucinated entity detected
  • Case #47: Previously correct, now incorrect; new hallucinated entity detected
```

A regression is triggered (for `--fail-on-regression`) if:

- Accuracy decreases, OR
- Hallucination rate increases
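
In code, these rules reduce to a small comparison over the two runs. A sketch under assumed data shapes (per-case dicts with `correct` and `hallucinations` keys), not the contents of `regression.py`:

```python
def find_regressions(baseline: list[dict], candidate: list[dict]) -> list[str]:
    """Assumed shape: each per-case dict has 'correct' (bool) and
    'hallucinations' (set of entity strings absent from the ground truth)."""
    regressions = []
    for i, (b, c) in enumerate(zip(baseline, candidate), start=1):
        reasons = []
        if b["correct"] and not c["correct"]:
            reasons.append("previously correct, now incorrect")
        if c["hallucinations"] - b["hallucinations"]:
            reasons.append("new hallucinated entity detected")
        if reasons:
            regressions.append(f"Case #{i}: {'; '.join(reasons)}")
    return regressions

def exit_code(base: dict, cand: dict) -> int:
    # --fail-on-regression semantics: exit 1 if accuracy drops
    # or the hallucination rate rises, otherwise 0.
    regressed = (cand["accuracy"] < base["accuracy"]
                 or cand["hallucination_rate"] > base["hallucination_rate"])
    return 1 if regressed else 0
```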

## Supported providers

| Provider | Key env var | Example model |
|----------|-------------|---------------|
| OpenAI | `OPENAI_API_KEY` | `openai:gpt-4o` |
| Anthropic | `ANTHROPIC_API_KEY` | `anthropic:claude-3-haiku` |
| Google | `GOOGLE_API_KEY` | `google:gemini-1.5-flash` |
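
Model specs follow the `provider:model` convention shown above. A minimal sketch of how such a spec might be parsed and checked against the key env vars (the helper name is hypothetical):

```python
import os

ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}

def parse_spec(spec: str) -> tuple[str, str]:
    # Split e.g. "openai:gpt-4o" into ("openai", "gpt-4o"), failing
    # early if the provider is unknown or its API key is unset.
    provider, _, model = spec.partition(":")
    if provider not in ENV_VARS:
        raise ValueError(f"unknown provider: {provider!r}")
    if not os.environ.get(ENV_VARS[provider]):
        raise RuntimeError(f"{ENV_VARS[provider]} is not set")
    return provider, model
```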

## CI integration

Add EvalSmith to your CI pipeline to gate deployments on prompt quality.

### GitHub Actions example

```yaml
name: Prompt Regression Check

on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  evalsmith:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install EvalSmith
        run: |
          pip install -r evalsmith/requirements.txt
          pip install -e .

      - name: Run regression check
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          evalsmith compare \
            --model openai:gpt-4o-mini \
            --dataset tests/regression_suite.csv \
            --baseline prompts/baseline.txt \
            --candidate prompts/candidate.txt \
            --input question \
            --ground-truth answer \
            --fail-on-regression \
            --output regression_report.json

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: regression-report
          path: regression_report.json
```

## Project structure

```
evalsmith/
├── adapters/
│   ├── base.py               # BaseModelAdapter ABC + ModelResponse
│   ├── openai_adapter.py
│   ├── anthropic_adapter.py
│   └── gemini_adapter.py
├── core/
│   ├── runner.py             # Dataset loading + model execution
│   ├── regression.py         # Baseline vs candidate comparison
│   ├── scoring.py            # Accuracy + ScoringReport
│   ├── hallucination.py      # Entity-extraction heuristic
│   └── cost.py               # Token-based cost calculation
├── cli.py                    # Click CLI entry point
├── config.py                 # API key helpers + COST_TABLE
└── requirements.txt
```
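
The adapter layer is the extension point for new providers. A hedged sketch of what `adapters/base.py`'s `BaseModelAdapter` ABC and `ModelResponse` might look like; the field and method names are assumptions, not the repository's actual interface:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class ModelResponse:
    # Assumed fields: just enough to feed the scoring, latency,
    # and cost metrics described above.
    text: str
    input_tokens: int
    output_tokens: int
    latency_ms: float

class BaseModelAdapter(ABC):
    """One subclass per provider (OpenAI, Anthropic, Google)."""

    def __init__(self, model: str):
        self.model = model

    @abstractmethod
    def generate(self, prompt: str) -> ModelResponse:
        """Send a single prompt and return a normalized response."""
```

Under this shape, supporting a new provider only requires a subclass that implements `generate`.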
