A local, model-agnostic prompt regression runner for teams shipping LLM features.
EvalSmith runs your prompts against real models, scores the outputs, and — most importantly — tells you when a new prompt version performs worse than the previous one before it reaches production.
It is not a benchmark suite. It is not an eval framework. It is a regression gate you drop into your CI pipeline.
Prompt engineering is iterative. Every iteration is a potential regression. EvalSmith makes regression visible by treating prompt versions the same way engineers treat code: with a diff, a score, and a pass/fail decision.
It runs 100% locally. You bring your own API keys. No data leaves your machine except through the provider APIs you explicitly configure.
```bash
# Clone the repository
git clone https://github.com/your-org/evalsmith.git
cd evalsmith

# Install dependencies
pip install -r evalsmith/requirements.txt

# Install the CLI (editable mode)
pip install -e .
```

EvalSmith reads provider keys from environment variables:
```bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AIza...
```

EvalSmith accepts a CSV with at minimum two columns: one for model inputs and one for expected outputs.
```csv
question,answer
"What is the capital of France?","Paris"
"Who wrote 1984?","George Orwell"
"What year did WWII end?","1945"
```

You specify which columns to use with --input and --ground-truth.
```bash
evalsmith run \
  --models openai:gpt-4o,anthropic:claude-3-haiku \
  --dataset tests.csv \
  --prompt prompt_v3.txt \
  --input question \
  --ground-truth answer
```

Options:
| Flag | Description |
|---|---|
| --models | Comma-separated provider:model list |
| --dataset | Path to CSV |
| --prompt | Path to prompt template file ({input} is substituted per row) |
| --input | Input column name |
| --ground-truth | Ground truth column name |
| --output report.json | Save full JSON report to disk |
| --verbose | Enable debug logging |
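The --prompt flag points at a plain text file; {input} is replaced with the input column value for each row. A minimal illustrative template (the wording here is made up; only the placeholder is required):

```text
Answer the following question with a single word or short phrase.

Question: {input}
Answer:
```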
```bash
evalsmith compare \
  --model openai:gpt-4o \
  --dataset tests.csv \
  --baseline prompt_v2.txt \
  --candidate prompt_v3.txt \
  --input question \
  --ground-truth answer \
  --fail-on-regression \
  --output report.json
```

Options:
| Flag | Description |
|---|---|
| --model | Single provider:model to use for both runs |
| --baseline | Path to baseline prompt file |
| --candidate | Path to candidate prompt file |
| --fail-on-regression | Exit code 1 if accuracy drops or hallucination rate increases |
| --output report.json | Save baseline + candidate + regression JSON |
| Metric | Method |
|---|---|
| Accuracy | Case-insensitive exact match (whitespace-stripped) |
| Hallucination rate | % of outputs containing capitalized entities absent from ground truth |
| Avg latency | Mean request time in milliseconds |
| Total cost | Estimated USD based on token usage and provider pricing |
| Cost / correct | USD per successful answer |
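The quality metrics are deliberately simple heuristics. The sketch below illustrates the logic described in the table; it is not the actual code in `core/scoring.py`, `core/hallucination.py`, or `core/cost.py`, and the function names, entity regex, and pricing numbers are assumptions:

```python
import re

def is_correct(output: str, ground_truth: str) -> bool:
    # Accuracy: case-insensitive exact match, whitespace-stripped.
    return output.strip().lower() == ground_truth.strip().lower()

# Hypothetical entity pattern: capitalized words, optionally multi-word
# ("Paris", "George Orwell").
ENTITY_RE = re.compile(r"\b[A-Z][a-zA-Z]*(?:\s+[A-Z][a-zA-Z]*)*")

def hallucinated_entities(output: str, ground_truth: str) -> set[str]:
    # An output hallucinates when it contains a capitalized entity
    # that never appears in the ground-truth answer.
    return {e for e in ENTITY_RE.findall(output)
            if e.lower() not in ground_truth.lower()}

# Illustrative per-million-token USD prices; the real COST_TABLE in
# config.py may use different numbers and keys.
COST_TABLE = {
    "openai:gpt-4o": {"input": 2.50, "output": 10.00},
    "anthropic:claude-3-haiku": {"input": 0.25, "output": 1.25},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = COST_TABLE[model]
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000
```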
When running compare, EvalSmith identifies every case where:
- A previously correct answer became incorrect
- A new hallucinated entity appeared that was not in the ground truth
Example terminal output:
```
⚠️ 3 regression(s) detected:
  • Case #12: Previously correct, now incorrect
  • Case #34: New hallucinated entity detected
  • Case #47: Previously correct, now incorrect; new hallucinated entity detected
```
A regression is triggered (for --fail-on-regression), as sketched below, if:
- Accuracy decreases, OR
- Hallucination rate increases
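In code, that gate reduces to a comparison of two aggregate reports. A minimal sketch, assuming a simple report shape (the `RunReport` fields are hypothetical, not EvalSmith's actual report schema):

```python
import sys
from dataclasses import dataclass

@dataclass
class RunReport:
    accuracy: float            # fraction of exact-match answers
    hallucination_rate: float  # fraction of outputs with new entities

def has_regression(baseline: RunReport, candidate: RunReport) -> bool:
    # Either condition alone is enough to fail the gate.
    return (candidate.accuracy < baseline.accuracy
            or candidate.hallucination_rate > baseline.hallucination_rate)

# With --fail-on-regression, a detected regression becomes exit code 1,
# which fails the CI job.
if has_regression(RunReport(accuracy=0.92, hallucination_rate=0.04),
                  RunReport(accuracy=0.88, hallucination_rate=0.04)):
    sys.exit(1)
```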
| Provider | Key env var | Example model |
|---|---|---|
| OpenAI | OPENAI_API_KEY | openai:gpt-4o |
| Anthropic | ANTHROPIC_API_KEY | anthropic:claude-3-haiku |
| Google | GOOGLE_API_KEY | google:gemini-1.5-flash |
Add EvalSmith to your CI pipeline to gate deployments on prompt quality.
```yaml
name: Prompt Regression Check

on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  evalsmith:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install EvalSmith
        run: |
          pip install -r evalsmith/requirements.txt
          pip install -e .

      - name: Run regression check
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          evalsmith compare \
            --model openai:gpt-4o-mini \
            --dataset tests/regression_suite.csv \
            --baseline prompts/baseline.txt \
            --candidate prompts/candidate.txt \
            --input question \
            --ground-truth answer \
            --fail-on-regression \
            --output regression_report.json

      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: regression-report
          path: regression_report.json
```

Project layout:

```
evalsmith/
├── adapters/
│   ├── base.py               # BaseModelAdapter ABC + ModelResponse
│   ├── openai_adapter.py
│   ├── anthropic_adapter.py
│   └── gemini_adapter.py
├── core/
│   ├── runner.py             # Dataset loading + model execution
│   ├── regression.py         # Baseline vs candidate comparison
│   ├── scoring.py            # Accuracy + ScoringReport
│   ├── hallucination.py      # Entity-extraction heuristic
│   └── cost.py               # Token-based cost calculation
├── cli.py                    # Click CLI entry point
├── config.py                 # API key helpers + COST_TABLE
└── requirements.txt
```
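The adapter layer is what keeps the runner model-agnostic: each provider is wrapped behind one interface. Below is a rough sketch of the shape `adapters/base.py` implies; everything beyond the names `BaseModelAdapter` and `ModelResponse` (the fields, the `generate` method) is an assumption, not the actual API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class ModelResponse:
    # Hypothetical fields: enough to drive accuracy, latency, and cost metrics.
    text: str
    latency_ms: float
    input_tokens: int
    output_tokens: int

class BaseModelAdapter(ABC):
    def __init__(self, model: str):
        self.model = model  # e.g. "gpt-4o" from "openai:gpt-4o"

    @abstractmethod
    def generate(self, prompt: str) -> ModelResponse:
        """Call the provider SDK and normalize the result."""
        ...
```

Under a shape like this, supporting a new provider means adding one subclass that implements generate().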