Changes from all commits (62 commits)
7aa3b49
feat(evals): ShortCircuit — skip expensive evaluators on early fail
renaudcepre Mar 30, 2026
93c85f5
feat(core): eval-aware types, events, and DI fixes
renaudcepre Mar 25, 2026
5041457
feat(evals): native eval system with @session.eval()
renaudcepre Mar 26, 2026
4310e57
feat(reporters): Rich eval table, multi-model history, console.print
renaudcepre Mar 27, 2026
82f736b
feat: tests, examples, and documentation
renaudcepre Mar 29, 2026
29204bc
chore: entity exports, pyproject config
renaudcepre Mar 29, 2026
bc4d16d
Merge branch 'main' into feat/evals-native
renaudcepre Mar 30, 2026
3ed68a4
fix ci
renaudcepre Mar 30, 2026
7abeb1d
Merge branch 'feat/evals-native' of github.com:renaudcepre/protest in…
renaudcepre Mar 30, 2026
ad7a207
chore: fix all lint — move imports to top-level, no lazy imports
renaudcepre Mar 30, 2026
5f5e9a0
feat(evals): Judge protocol — LLM-as-judge via inversion of dependency
renaudcepre Mar 31, 2026
015c451
fix(reporters): show in/out token split in eval usage summary
renaudcepre Mar 31, 2026
8e748ce
fix(history): exclude error-only runs from stats, propagate is_error …
renaudcepre Mar 31, 2026
6149633
refactor: remove getattr abuse — proper typing and Protocol contracts
renaudcepre Mar 31, 2026
c081255
fix(hashing): fail-hard canonicalization, evaluator_identity() protocol
renaudcepre Mar 31, 2026
d7fbba3
refactor: replace kind string literals with SuiteKind StrEnum
renaudcepre Mar 31, 2026
905d3c8
refactor: move lazy imports to top-level, remove PLC0415 per-file ign…
renaudcepre Mar 31, 2026
b703fa8
fix: resolve all 32 mypy errors, type EvalContext generics properly
renaudcepre Mar 31, 2026
39bd555
refactor: remove dead duck-typed evaluator markers, add typed examples
renaudcepre Mar 31, 2026
155db22
ci: update workflow to install dependencies and fix mypy invocation
renaudcepre Mar 31, 2026
96d3632
refactor: remove redundant type ignores, update dependency management
renaudcepre Mar 31, 2026
752ddbc
refactor(evals): replace `session.eval` with `EvalSuite` for cleaner API
renaudcepre Apr 3, 2026
62a12a3
refactor(evals): replace `dict` with `EvalCase` for eval cases, updat…
renaudcepre Apr 3, 2026
d3f542c
refactor(evals): enhance docstrings for EvalSuite and EvalSession wit…
renaudcepre Apr 4, 2026
6b3c203
feat(reporting): add eval suite and case payloads to web reporting
renaudcepre Apr 4, 2026
924615f
refactor(evals): remove EvalSession, merge history plugins, always-on…
renaudcepre Apr 14, 2026
9c58302
refactor(evals): replace evaluator function wrapper with `Evaluator` …
renaudcepre Apr 14, 2026
6f6d16a
refactor(examples): replace dict-based eval cases with `EvalCase` obj…
renaudcepre Apr 14, 2026
1d42252
docs(evals): clarify that `EvalCase` must replace plain dicts
renaudcepre Apr 14, 2026
9a4ce43
refactor(reporting): centralize shared formatting logic and add CLI o…
renaudcepre Apr 24, 2026
ef0c176
refactor(reporting, examples, core): add `_safe_repr` for JSON-safe s…
renaudcepre Apr 24, 2026
67c4887
tests: add coverage for `EvalCase` invariants, `history --compare` lo…
renaudcepre Apr 24, 2026
909ac72
tests(evals): add tests for `EvalCaseResult.from_test_result` and ref…
renaudcepre Apr 24, 2026
fee2bf6
tests(evals): add tests for `EvalCase.metadata['tags']` wiring and en…
renaudcepre Apr 24, 2026
46c54d3
tests(history): add concurrency tests for `append_entry` and implemen…
renaudcepre Apr 24, 2026
f2909b2
tests(history): add isolation tests for `DEFAULT_HISTORY_DIR` and ove…
renaudcepre Apr 24, 2026
acdacfd
tests(execution): add tests for `real_stdout` / `real_stderr` and rep…
renaudcepre Apr 24, 2026
715857e
refactor(console, capture): improve type annotations and clarify even…
renaudcepre Apr 24, 2026
594bb54
feat(history): version JSONL entries via `schema_version` with skip+w…
renaudcepre Apr 25, 2026
4276e5d
fix(evals): use `statistics.quantiles` for true p5/p95 in `ScoreStats`
renaudcepre Apr 25, 2026
a7f29cc
chore: address review minors (m2, m3, m4, m6, m7, m10, m11)
renaudcepre Apr 25, 2026
18078d4
ci: ensure matrix Python version consistency and add verification step
renaudcepre Apr 25, 2026
6b9cc83
refactor: replace `StrEnum` with `str, Enum` for Python 3.10 compatib…
renaudcepre Apr 25, 2026
ef5a65b
chore: remove `pydantic-evals` dependency and related code
renaudcepre Apr 25, 2026
72a8457
tests(history): ensure `--runs` displays newest entries first
renaudcepre Apr 25, 2026
8b64322
refactor(evals): replace `keyword_check` with `contains_keywords` and…
renaudcepre Apr 25, 2026
bf27f4c
tests(evals): add stricter `contains_keywords` threshold tests and im…
renaudcepre Apr 25, 2026
bfa9d14
tests(evals): add `not_empty` tests for Sized containers and clarify …
renaudcepre Apr 25, 2026
e54f179
tests(evals): add precision tests for sub-millisecond durations and a…
renaudcepre Apr 25, 2026
1779d4a
tests(console): add payload, prefix handling, and reporter tests
renaudcepre Apr 25, 2026
0f25a1b
tests(console): surface handler errors and add fallback handling tests
renaudcepre Apr 25, 2026
7a78560
tests(history): ensure clean_dirty concurrency preserves all appends
renaudcepre Apr 25, 2026
2289485
docs(evals): add details on native LLM support and evaluator enhancem…
renaudcepre Apr 25, 2026
2f0bfcb
refactor(evals): migrate `tags` from `metadata` to first-class `EvalC…
renaudcepre Apr 26, 2026
fa5a7ee
tests(evals): add validation for multiple `EvalCase` params and CLI f…
renaudcepre Apr 26, 2026
53d4813
fix(executor): raise builtin TimeoutError to match Python 3.10 semantics
renaudcepre Apr 26, 2026
4564380
fix(evals): tier-1 polish from naive-agent feedback
renaudcepre Apr 26, 2026
3d1fe48
fix(history,cli,docs): tier-2 polish from naive-agent v2 feedback
renaudcepre Apr 27, 2026
db671a6
fix(history): refuse cross-model compare to avoid phantom regressions
renaudcepre Apr 28, 2026
37d5c09
refactor(evals): split Evaluator __call__/run, require @evaluator at …
renaudcepre Apr 28, 2026
8e388ca
fix(evals,history): polish from naive-agent v4 feedback
renaudcepre Apr 28, 2026
99d512f
refactor(examples): rename yorkshire dataset.py to cases.py
renaudcepre Apr 28, 2026
15 changes: 13 additions & 2 deletions .github/workflows/ci.yml
@@ -46,8 +46,11 @@ jobs:
with:
python-version: "3.12"

- name: Install dependencies
run: uv sync --all-extras --group dev

- name: Type check
run: uvx mypy --strict protest
run: uv run mypy protest

test:
needs: lint
@@ -70,6 +73,11 @@ jobs:
- os: windows-latest
python-version: "3.12"
runs-on: ${{ matrix.os }}
env:
# Force uv to honor the matrix Python version. Without this, uv picks
# the newest interpreter satisfying `requires-python` (often the system
# 3.12), making the matrix cosmetic.
UV_PYTHON: ${{ matrix.python-version }}

steps:
- uses: actions/checkout@v6
@@ -87,6 +95,9 @@
- name: Install dependencies
run: uv sync --dev

- name: Verify Python version
run: uv run python -c "import sys; v = '${{ matrix.python-version }}'; assert sys.version.startswith(v), f'expected {v}, got {sys.version}'"

- name: Run tests
if: matrix.os != 'ubuntu-latest' || matrix.python-version != '3.12'
run: uv run pytest -vv
@@ -103,7 +114,7 @@ jobs:
files: coverage.xml
fail_ci_if_error: false

c docs:
docs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
19 changes: 19 additions & 0 deletions README.md
@@ -62,6 +62,24 @@ CODES = ForEach([200, 201])
def test_status(code: Annotated[int, From(CODES)]): ...
```

### Native LLM Evals

Score model outputs alongside your tests — same fixtures, same parallelism, same `protest` CLI. Cases get pass/fail + numeric metrics, persisted to JSONL for run-over-run comparison.

```python
@chatbot_suite.eval(evaluators=[contains_keywords(keywords=["paris"])])
async def chatbot(case: Annotated[EvalCase, From(cases)]) -> str:
return await my_agent(case.inputs)
```

```bash
protest eval evals.session:session
protest history --runs # recent runs
protest history --compare # current vs previous
```

See the [Evals docs](https://renaudcepre.github.io/protest/evals/) for evaluators, judges, and history tracking.

---

## Quick Start
@@ -120,6 +138,7 @@ protest run module:session --ctrf-output r.json # CTRF report for CI/CD
- **Plugin system** - Custom reporters, filters
- **Last-failed mode** - Re-run only failed tests with `--lf`
- **CTRF reports** - Standardized JSON for CI/CD integration
- **Native LLM evals** - Scored cases, JSONL history, `protest eval` (see [evals docs](https://renaudcepre.github.io/protest/evals/))

## Why Not pytest?

171 changes: 171 additions & 0 deletions docs/cli.md
@@ -13,6 +13,8 @@ protest <command> [options] <target>
| Command | Description |
|---------|-------------|
| `run` | Run tests |
| `eval` | Run evaluations |
| `history` | Browse run history (tests and evals) |
| `live` | Start live reporter server |
| `tags list` | List tags in a session |

@@ -276,6 +278,175 @@

---

## protest eval

Run evaluations from a session.

`protest eval` is the eval-suite counterpart of `protest run`. It shares
the same target format, filters, capture flags and reporting options as
`run`; the differences are listed below.

### Syntax

```bash
protest eval <target> [options]
```

### Options

`protest eval` accepts every option from `protest run` (see above:
`-n/--concurrency`, `--collect-only`, `-x/--exitfirst`, `-s/--no-capture`,
`-q/--quiet`, `-v/--verbose`, `--show-logs`, `-t/--tag`, `--no-tag`,
`-k/--keyword`, `--lf`, `--cache-clear`, `--no-color`, `--ctrf-output`,
`--no-log-file`, `--app-dir`), plus one eval-only flag:

| Option | Description | Default |
|--------|-------------|---------|
| `--show-output` | Print `inputs` / `output` / `expected` for **every** case (failed cases always print these). | off |

### Examples

```bash
# Run all evals in a session
protest eval evals.session:session

# One specific suite
protest eval evals.session:session::helpdesk_struct

# One ticket by name
protest eval evals.session:session -k T001

# All cases tagged "cat:hardware"
protest eval evals.session:session --tag cat:hardware

# Re-run only the cases that failed last time
protest eval evals.session:session --lf

# Show the input/output of every case (not just failures)
protest eval evals.session:session --show-output
```

### Output

Each case prints one line:

```
✓ classify_ticket_struct[T011] (2ms) category_is_allowed=✓ summary_keyword_recall=1.00 …
```

After every suite, an aggregate-stats table summarizes the `Metric`
fields across cases (mean / p50 / p5 / p95). `Verdict` and `Reason`
fields don't appear in this table — only numeric `Metric` fields do.
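
The p5/p95 values are true percentiles rather than min/max: per commit 4276e5d in this PR, `ScoreStats` computes them with `statistics.quantiles`. A minimal sketch of that computation, assuming plain float scores (the actual `ScoreStats` internals may differ):

```python
from statistics import mean, median, quantiles

def summarize(scores: list[float]) -> dict[str, float]:
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%;
    # index 0 is the 5th percentile, index 18 the 95th.
    # Requires at least two scores.
    cuts = quantiles(scores, n=20)
    return {
        "mean": mean(scores),
        "p50": median(scores),
        "p5": cuts[0],
        "p95": cuts[18],
    }
```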

Per-case markdown artifacts are written to
`.protest/results/<suite>_<timestamp>/<case-id>.md`, with the full
input, output, expected, and per-evaluator scores.

---

## protest history

Browse persisted run history (tests and evals).

Every run appends one entry to `.protest/history.jsonl`; `protest history`
queries that file via sub-commands.

### Syntax

```bash
protest history <subcommand> [filters]
```

If no sub-command is given, `list` runs by default — so
`protest history --tail 5` is equivalent to
`protest history list --tail 5`.

### Sub-commands

| Sub-command | Description |
|-------------|-------------|
| `list` | Per-suite trend table: pass-rate trend + score arrows. **Default** when no sub-command is given. |
| `runs` | Run-by-run pass rates, most recent first. |
| `show [N]` | Detailed panel for the Nth most recent run (`N=0` = latest, the default). |
| `compare` | Compare the two most recent runs of the same model. |
| `clean` | Remove entries from runs made on a dirty working tree. **Dry-run by default** — pass `--apply` to actually modify the file. |

### Filters (shared by every sub-command)

| Flag | Description | Default |
|------|-------------|---------|
| `--tail N`, `-n N` | Limit to the N most recent entries | 10 |
| `--evals` | Show eval runs only | _all kinds_ |
| `--tests` | Show test runs only | _all kinds_ |
| `--model NAME` | Keep only suites whose `ModelLabel.name` matches | _all_ |
| `--suite NAME` | Keep only the suite with this name | _all_ |
| `--path DIR` | Use a custom history directory | `.protest/` |

`--model` and `--suite` filter at the **suite level**: a run that
contains *several* suites with different models stays in the listing;
only the non-matching suites are pruned from the displayed view.

### Reading `--compare`

`--compare` reports five kinds of change between the two most recent
runs of the same model:

| Marker | Label | Meaning |
|--------|-------|---------|
| `+` | Fixed | Case was failing in the previous run, passes now |
| `-` | Regressions | Case was passing in the previous run, fails now |
| `⟳` | Modified | Case is recognizable (same name) but its content changed |
| `*` | New | Case did not exist in the previous run |
| `✗` | Deleted | Case existed in the previous run, gone now |

The `Modified` line tells you **what** changed by suffixing the case
name:

- `T001 (case modified)` — `inputs` or `expected` changed (`case_hash`
diff)
- `T001 (scoring modified)` — only the evaluator configuration changed
(`eval_hash` diff). Inputs and expected output are intact; you've
edited an evaluator or its parameters.
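
The classification can be pictured as a small decision ladder over the two content hashes the docs describe: `case_hash` (inputs + expected) and `eval_hash` (evaluator configuration). A hedged sketch; the record fields below and the precedence of content changes over pass/fail transitions are assumptions, not protest's actual code:

```python
def classify(old: dict | None, new: dict | None) -> str:
    # `old` / `new` are hypothetical per-case records with `case_hash`,
    # `eval_hash`, and `passed` fields; real history entries are internal.
    if old is None:
        return "new"                # case did not exist in the previous run
    if new is None:
        return "deleted"            # existed before, gone now
    if new["case_hash"] != old["case_hash"]:
        return "case modified"      # inputs or expected changed
    if new["eval_hash"] != old["eval_hash"]:
        return "scoring modified"   # evaluator config changed
    if not old["passed"] and new["passed"]:
        return "fixed"              # was failing, passes now
    if old["passed"] and not new["passed"]:
        return "regression"         # was passing, fails now
    return "unchanged"
```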

### Examples

```bash
# Per-suite trend across last 10 eval runs (default sub-command: list)
protest history --evals

# Run-by-run breakdown of the last 5 eval runs
protest history runs --evals --tail 5

# Detailed panel for the most recent eval run
protest history show --evals

# Detailed panel for the run before that (1 = next-most-recent)
protest history show 1 --evals

# Compare the two most recent runs of the same model
protest history compare --evals

# Filter to one model — only suites with this model are shown
protest history list --evals --model qwen-2.5

# Preview which entries `clean` would remove (no file changes)
protest history clean --evals

# Actually remove dirty entries
protest history clean --apply
```

### Notes

- When the project is not a git repo, the per-run commit / dirty
columns display `?`. `clean` is a no-op in that case.
- `--evals` and `--tests` are mutually exclusive; omit both to see
every kind.
- Per-case detail (input, output, expected, evaluator scores) lives in
`.protest/results/`, not in the history file.
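
Because the history file is plain JSONL (one JSON object per run), ad-hoc inspection with the standard library is straightforward. The `schema_version` field and its skip-and-warn handling come from commit 594bb54 in this PR; everything else about the entry layout in this sketch is assumed:

```python
import json
from pathlib import Path

KNOWN_VERSIONS = {1}  # assumption: set this to the versions your file contains

def load_history(path: str = ".protest/history.jsonl") -> list[dict]:
    entries = []
    for lineno, line in enumerate(Path(path).read_text().splitlines(), 1):
        if not line.strip():
            continue
        entry = json.loads(line)
        # Mirror protest's skip+warn policy for unknown schema versions.
        if entry.get("schema_version") not in KNOWN_VERSIONS:
            print(f"line {lineno}: unknown schema_version, skipping")
            continue
        entries.append(entry)
    return entries
```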

---

## protest live

Start a persistent live reporter server for real-time test visualization.
49 changes: 49 additions & 0 deletions docs/core-concepts/console.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# Console Output

Print progress and debug messages that bypass test capture.

## The Problem

`print()` inside tests and fixtures is captured by ProTest. During long-running fixtures (pipeline imports, graph seeding), there's no visible feedback.

## `console.print`

```python
from protest import console

@fixture()
async def pipeline():
for i, scene in enumerate(scenes):
console.print(f"[cyan]pipeline:[/] importing {scene.name} ({i+1}/{len(scenes)})")
await import_scene(scene)
return driver
```

Messages appear inline in the reporter output, between test results.

## Rich Markup

`console.print` supports Rich markup. The Rich reporter renders colors; the ASCII reporter strips tags.

```python
console.print(f"[bold green]done[/] in {duration:.1f}s")
console.print(f"[yellow]warning:[/] slow query ({elapsed:.2f}s)")
```

## Raw Mode

Skip markup processing with `raw=True`:

```python
console.print("debug: raw bytes here", raw=True)
```

The message is passed as-is to both reporters.

## How It Works

`console.print` sends a `USER_PRINT` event through the event bus. The reporter receives it and writes to the real stdout (bypassing test capture). This means:

- Messages appear immediately, not buffered until test end
- Works with `-n 4` (concurrent tests) — the event bus serializes per plugin
- No interference with test capture or `result.output`
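
A toy sketch of this flow; the class and function names below are illustrative, not protest's internals:

```python
import sys
from typing import Any, Callable

REAL_STDOUT = sys.stdout  # grabbed at startup, before capture swaps sys.stdout

Handler = Callable[[str, dict[str, Any]], None]

class EventBus:
    def __init__(self) -> None:
        self._handlers: list[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self._handlers.append(handler)

    def emit(self, event: str, payload: dict[str, Any]) -> None:
        # Handlers run one at a time, so messages from concurrent
        # tests never interleave inside a single handler.
        for handler in self._handlers:
            handler(event, payload)

bus = EventBus()

def reporter(event: str, payload: dict[str, Any]) -> None:
    if event == "USER_PRINT":
        # Write to the real stdout, bypassing test capture entirely.
        REAL_STDOUT.write(payload["message"] + "\n")

bus.subscribe(reporter)

def console_print(message: str) -> None:
    bus.emit("USER_PRINT", {"message": message})
```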
18 changes: 18 additions & 0 deletions docs/core-concepts/dependency-injection.md
@@ -24,6 +24,24 @@ async def test_query(db: Annotated[Database, Use(database)]):

The `Use` marker takes a **function reference**, not a string. This makes dependencies explicit and enables IDE navigation.

### `Type` is a hint, not a runtime check

In `Annotated[Type, Use(fixture)]`, `Type` is a **type hint for your IDE and static checkers** — ProTest does not validate at runtime that `fixture()` actually returns a `Type`. This matches FastAPI's behavior with `Annotated[Type, Depends(fn)]`: the type is taken on faith, not enforced.

```python
@fixture()
def returns_str() -> str:
return "hello"

@session.test()
def test_mismatch(value: Annotated[int, Use(returns_str)]):
# `value` is actually a `str` at runtime — ProTest will not warn.
# The mismatch surfaces only when `value` is used as an `int`.
...
```

In practice this is rarely a problem: keep your fixture return types and your call-site annotations aligned, and rely on `mypy`/`pyright` for the static check on the fixture itself.
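
If a runtime guard matters for a particular test, an ordinary assertion does the job; nothing ProTest-specific is needed. A minimal sketch, reusing `returns_str` from the example above (this guard would trip immediately on the mismatch shown there):

```python
@session.test()
def test_with_guard(value: Annotated[int, Use(returns_str)]):
    # Fail fast with a clear message instead of a confusing TypeError
    # somewhere deeper in the test body.
    assert isinstance(value, int), f"expected int, got {type(value).__name__}"
```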

## Why Function References?

Using function references instead of string names has benefits: