A local, model-agnostic prompt regression runner for teams shipping LLM features.
EvalSmith runs your prompts against real models, scores the outputs, and — most importantly — tells you when a new prompt version performs worse than the previous one before it reaches production.
It is not a benchmark suite. It is not an eval framework. It is a regression gate you drop into your CI pipeline.
Prompt engineering is iterative. Every iteration is a potential regression. EvalSmith makes regression visible by treating prompt versions the same way engineers treat code: with a diff, a score, and a pass/fail decision.
It runs 100% locally. You bring your own API keys. No data leaves your machine except through the provider APIs you explicitly configure.
```bash
# Clone the repository
git clone https://github.com/your-org/evalsmith.git
cd evalsmith

# Install dependencies
pip install -r evalsmith/requirements.txt

# Install the CLI (editable mode)
pip install -e .
```

EvalSmith reads provider keys from environment variables:
```bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AIza...
```

EvalSmith accepts a CSV with at minimum two columns: one for model inputs and one for expected outputs.
```csv
question,answer
"What is the capital of France?","Paris"
"Who wrote 1984?","George Orwell"
"What year did WWII end?","1945"
```

You specify which columns to use with --input and --ground-truth.
```bash
evalsmith run \
  --models openai:gpt-4o,anthropic:claude-3-haiku \
  --dataset tests.csv \
  --prompt prompt_v3.txt \
  --input question \
  --ground-truth answer
```

Options:
| Flag | Description |
|---|---|
| --models | Comma-separated provider:model list |
| --dataset | Path to CSV |
| --prompt | Path to prompt template file ({input} is substituted per row) |
| --input | Input column name |
| --ground-truth | Ground truth column name |
| --output report.json | Save full JSON report to disk |
| --verbose | Enable debug logging |
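The --prompt flag points at a plain text file; {input} is replaced with the input column value for each row. A minimal illustrative template (the wording here is made up; only the placeholder is required):

```text
Answer the following question with a single word or short phrase.

Question: {input}
Answer:
```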
```bash
evalsmith compare \
  --model openai:gpt-4o \
  --dataset tests.csv \
  --baseline prompt_v2.txt \
  --candidate prompt_v3.txt \
  --input question \
  --ground-truth answer \
  --fail-on-regression \
  --output report.json
```

Options:
| Flag | Description |
|---|---|
| --model | Single provider:model to use for both runs |
| --baseline | Path to baseline prompt file |
| --candidate | Path to candidate prompt file |
| --fail-on-regression | Exit code 1 if accuracy drops or hallucination rate increases |
| --output report.json | Save baseline + candidate + regression JSON |
| Metric | Method |
|---|---|
| Accuracy | Case-insensitive exact match (whitespace-stripped) |
| Hallucination rate | % of outputs containing capitalized entities absent from ground truth |
| Avg latency | Mean request time in milliseconds |
| Total cost | Estimated USD based on token usage and provider pricing |
| Cost / correct | USD per successful answer |
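The quality metrics are deliberately simple heuristics. The sketch below illustrates the logic described in the table; it is not the actual code in `core/scoring.py`, `core/hallucination.py`, or `core/cost.py`, and the function names, entity regex, and pricing numbers are assumptions:

```python
import re

def is_correct(output: str, ground_truth: str) -> bool:
    # Accuracy: case-insensitive exact match, whitespace-stripped.
    return output.strip().lower() == ground_truth.strip().lower()

# Hypothetical entity pattern: capitalized words, optionally multi-word
# ("Paris", "George Orwell").
ENTITY_RE = re.compile(r"\b[A-Z][a-zA-Z]*(?:\s+[A-Z][a-zA-Z]*)*")

def hallucinated_entities(output: str, ground_truth: str) -> set[str]:
    # An output hallucinates when it contains a capitalized entity
    # that never appears in the ground-truth answer.
    return {e for e in ENTITY_RE.findall(output)
            if e.lower() not in ground_truth.lower()}

# Illustrative per-million-token USD prices; the real COST_TABLE in
# config.py may use different numbers and keys.
COST_TABLE = {
    "openai:gpt-4o": {"input": 2.50, "output": 10.00},
    "anthropic:claude-3-haiku": {"input": 0.25, "output": 1.25},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = COST_TABLE[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000
```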
When running compare, EvalSmith identifies every case where:
- A previously correct answer became incorrect
- A new hallucinated entity appeared that was not in the ground truth
Example terminal output:
```
⚠️ 3 regression(s) detected:
  • Case #12: Previously correct, now incorrect
  • Case #34: New hallucinated entity detected
  • Case #47: Previously correct, now incorrect; new hallucinated entity detected
```
A regression is triggered (for --fail-on-regression), as sketched below, if:
- Accuracy decreases, OR
- Hallucination rate increases
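In code, that gate reduces to a comparison of two aggregate reports. A minimal sketch, assuming a simple report shape (the `RunReport` fields are hypothetical, not EvalSmith's actual report schema):

```python
import sys
from dataclasses import dataclass

@dataclass
class RunReport:
    accuracy: float            # fraction of exact-match answers
    hallucination_rate: float  # fraction of outputs with new entities

def has_regression(baseline: RunReport, candidate: RunReport) -> bool:
    # Either condition alone is enough to fail the gate.
    return (candidate.accuracy < baseline.accuracy
            or candidate.hallucination_rate > baseline.hallucination_rate)

# With --fail-on-regression, a detected regression becomes exit code 1,
# which fails the CI job.
if has_regression(RunReport(accuracy=0.92, hallucination_rate=0.04),
                  RunReport(accuracy=0.88, hallucination_rate=0.04)):
    sys.exit(1)
```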
| Provider | Key env var | Example model |
|---|---|---|
| OpenAI | OPENAI_API_KEY | openai:gpt-4o |
| Anthropic | ANTHROPIC_API_KEY | anthropic:claude-3-haiku |
| Google | GOOGLE_API_KEY | google:gemini-1.5-flash |
Add EvalSmith to your CI pipeline to gate deployments on prompt quality.
```yaml
name: Prompt Regression Check

on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  evalsmith:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install EvalSmith
        run: |
          pip install -r evalsmith/requirements.txt
          pip install -e .

      - name: Run regression check
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          evalsmith compare \
            --model openai:gpt-4o-mini \
            --dataset tests/regression_suite.csv \
            --baseline prompts/baseline.txt \
            --candidate prompts/candidate.txt \
            --input question \
            --ground-truth answer \
            --fail-on-regression \
            --output regression_report.json

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: regression-report
          path: regression_report.json
```

Project layout:

```
evalsmith/
├── adapters/
│   ├── base.py               # BaseModelAdapter ABC + ModelResponse
│   ├── openai_adapter.py
│   ├── anthropic_adapter.py
│   └── gemini_adapter.py
├── core/
│   ├── runner.py             # Dataset loading + model execution
│   ├── regression.py         # Baseline vs candidate comparison
│   ├── scoring.py            # Accuracy + ScoringReport
│   ├── hallucination.py      # Entity-extraction heuristic
│   └── cost.py               # Token-based cost calculation
├── cli.py                    # Click CLI entry point
├── config.py                 # API key helpers + COST_TABLE
└── requirements.txt
```
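The adapter layer is what keeps the runner model-agnostic: each provider is wrapped behind one interface. Below is a rough sketch of the shape `adapters/base.py` implies; everything beyond the names `BaseModelAdapter` and `ModelResponse` (the fields, the `generate` method) is an assumption, not the actual API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class ModelResponse:
    # Hypothetical fields: enough to drive accuracy, latency, and cost metrics.
    text: str
    latency_ms: float
    input_tokens: int
    output_tokens: int

class BaseModelAdapter(ABC):
    def __init__(self, model: str):
        self.model = model  # e.g. "gpt-4o" from "openai:gpt-4o"

    @abstractmethod
    def generate(self, prompt: str) -> ModelResponse:
        """Call the provider SDK and normalize the result."""
        ...
```

Under a shape like this, supporting a new provider means adding one subclass that implements generate().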