Changes from all commits (62 commits)
7aa3b49
feat(evals): ShortCircuit — skip expensive evaluators on early fail
renaudcepre Mar 30, 2026
93c85f5
feat(core): eval-aware types, events, and DI fixes
renaudcepre Mar 25, 2026
5041457
feat(evals): native eval system with @session.eval()
renaudcepre Mar 26, 2026
4310e57
feat(reporters): Rich eval table, multi-model history, console.print
renaudcepre Mar 27, 2026
82f736b
feat: tests, examples, and documentation
renaudcepre Mar 29, 2026
29204bc
chore: entity exports, pyproject config
renaudcepre Mar 29, 2026
bc4d16d
Merge branch 'main' into feat/evals-native
renaudcepre Mar 30, 2026
3ed68a4
fix ci
renaudcepre Mar 30, 2026
7abeb1d
Merge branch 'feat/evals-native' of github.com:renaudcepre/protest in…
renaudcepre Mar 30, 2026
ad7a207
chore: fix all lint — move imports to top-level, no lazy imports
renaudcepre Mar 30, 2026
5f5e9a0
feat(evals): Judge protocol — LLM-as-judge via inversion of dependency
renaudcepre Mar 31, 2026
015c451
fix(reporters): show in/out token split in eval usage summary
renaudcepre Mar 31, 2026
8e748ce
fix(history): exclude error-only runs from stats, propagate is_error …
renaudcepre Mar 31, 2026
6149633
refactor: remove getattr abuse — proper typing and Protocol contracts
renaudcepre Mar 31, 2026
c081255
fix(hashing): fail-hard canonicalization, evaluator_identity() protocol
renaudcepre Mar 31, 2026
d7fbba3
refactor: replace kind string literals with SuiteKind StrEnum
renaudcepre Mar 31, 2026
905d3c8
refactor: move lazy imports to top-level, remove PLC0415 per-file ign…
renaudcepre Mar 31, 2026
b703fa8
fix: resolve all 32 mypy errors, type EvalContext generics properly
renaudcepre Mar 31, 2026
39bd555
refactor: remove dead duck-typed evaluator markers, add typed examples
renaudcepre Mar 31, 2026
155db22
ci: update workflow to install dependencies and fix mypy invocation
renaudcepre Mar 31, 2026
96d3632
refactor: remove redundant type ignores, update dependency management
renaudcepre Mar 31, 2026
752ddbc
refactor(evals): replace `session.eval` with `EvalSuite` for cleaner API
renaudcepre Apr 3, 2026
62a12a3
refactor(evals): replace `dict` with `EvalCase` for eval cases, updat…
renaudcepre Apr 3, 2026
d3f542c
refactor(evals): enhance docstrings for EvalSuite and EvalSession wit…
renaudcepre Apr 4, 2026
6b3c203
feat(reporting): add eval suite and case payloads to web reporting
renaudcepre Apr 4, 2026
924615f
refactor(evals): remove EvalSession, merge history plugins, always-on…
renaudcepre Apr 14, 2026
9c58302
refactor(evals): replace evaluator function wrapper with `Evaluator` …
renaudcepre Apr 14, 2026
6f6d16a
refactor(examples): replace dict-based eval cases with `EvalCase` obj…
renaudcepre Apr 14, 2026
1d42252
docs(evals): clarify that `EvalCase` must replace plain dicts
renaudcepre Apr 14, 2026
9a4ce43
refactor(reporting): centralize shared formatting logic and add CLI o…
renaudcepre Apr 24, 2026
ef0c176
refactor(reporting, examples, core): add `_safe_repr` for JSON-safe s…
renaudcepre Apr 24, 2026
67c4887
tests: add coverage for `EvalCase` invariants, `history --compare` lo…
renaudcepre Apr 24, 2026
909ac72
tests(evals): add tests for `EvalCaseResult.from_test_result` and ref…
renaudcepre Apr 24, 2026
fee2bf6
tests(evals): add tests for `EvalCase.metadata['tags']` wiring and en…
renaudcepre Apr 24, 2026
46c54d3
tests(history): add concurrency tests for `append_entry` and implemen…
renaudcepre Apr 24, 2026
f2909b2
tests(history): add isolation tests for `DEFAULT_HISTORY_DIR` and ove…
renaudcepre Apr 24, 2026
acdacfd
tests(execution): add tests for `real_stdout` / `real_stderr` and rep…
renaudcepre Apr 24, 2026
715857e
refactor(console, capture): improve type annotations and clarify even…
renaudcepre Apr 24, 2026
594bb54
feat(history): version JSONL entries via `schema_version` with skip+w…
renaudcepre Apr 25, 2026
4276e5d
fix(evals): use `statistics.quantiles` for true p5/p95 in `ScoreStats`
renaudcepre Apr 25, 2026
a7f29cc
chore: address review minors (m2, m3, m4, m6, m7, m10, m11)
renaudcepre Apr 25, 2026
18078d4
ci: ensure matrix Python version consistency and add verification step
renaudcepre Apr 25, 2026
6b9cc83
refactor: replace `StrEnum` with `str, Enum` for Python 3.10 compatib…
renaudcepre Apr 25, 2026
ef5a65b
chore: remove `pydantic-evals` dependency and related code
renaudcepre Apr 25, 2026
72a8457
tests(history): ensure `--runs` displays newest entries first
renaudcepre Apr 25, 2026
8b64322
refactor(evals): replace `keyword_check` with `contains_keywords` and…
renaudcepre Apr 25, 2026
bf27f4c
tests(evals): add stricter `contains_keywords` threshold tests and im…
renaudcepre Apr 25, 2026
bfa9d14
tests(evals): add `not_empty` tests for Sized containers and clarify …
renaudcepre Apr 25, 2026
e54f179
tests(evals): add precision tests for sub-millisecond durations and a…
renaudcepre Apr 25, 2026
1779d4a
tests(console): add payload, prefix handling, and reporter tests
renaudcepre Apr 25, 2026
0f25a1b
tests(console): surface handler errors and add fallback handling tests
renaudcepre Apr 25, 2026
7a78560
tests(history): ensure clean_dirty concurrency preserves all appends
renaudcepre Apr 25, 2026
2289485
docs(evals): add details on native LLM support and evaluator enhancem…
renaudcepre Apr 25, 2026
2f0bfcb
refactor(evals): migrate `tags` from `metadata` to first-class `EvalC…
renaudcepre Apr 26, 2026
fa5a7ee
tests(evals): add validation for multiple `EvalCase` params and CLI f…
renaudcepre Apr 26, 2026
53d4813
fix(executor): raise builtin TimeoutError to match Python 3.10 semantics
renaudcepre Apr 26, 2026
4564380
fix(evals): tier-1 polish from naive-agent feedback
renaudcepre Apr 26, 2026
3d1fe48
fix(history,cli,docs): tier-2 polish from naive-agent v2 feedback
renaudcepre Apr 27, 2026
db671a6
fix(history): refuse cross-model compare to avoid phantom regressions
renaudcepre Apr 28, 2026
37d5c09
refactor(evals): split Evaluator __call__/run, require @evaluator at …
renaudcepre Apr 28, 2026
8e388ca
fix(evals,history): polish from naive-agent v4 feedback
renaudcepre Apr 28, 2026
99d512f
refactor(examples): rename yorkshire dataset.py to cases.py
renaudcepre Apr 28, 2026
15 changes: 13 additions & 2 deletions .github/workflows/ci.yml
@@ -46,8 +46,11 @@ jobs:
with:
python-version: "3.12"

- name: Install dependencies
run: uv sync --all-extras --group dev

- name: Type check
run: uvx mypy --strict protest
run: uv run mypy protest

test:
needs: lint
@@ -70,6 +73,11 @@ jobs:
- os: windows-latest
python-version: "3.12"
runs-on: ${{ matrix.os }}
env:
# Force uv to honor the matrix Python version. Without this, uv picks
# the newest interpreter satisfying `requires-python` (often the system
# 3.12), making the matrix cosmetic.
UV_PYTHON: ${{ matrix.python-version }}

steps:
- uses: actions/checkout@v6
@@ -87,6 +95,9 @@
- name: Install dependencies
run: uv sync --dev

- name: Verify Python version
run: uv run python -c "import sys; v = '${{ matrix.python-version }}'; assert sys.version.startswith(v), f'expected {v}, got {sys.version}'"

- name: Run tests
if: matrix.os != 'ubuntu-latest' || matrix.python-version != '3.12'
run: uv run pytest -vv
@@ -103,7 +114,7 @@ jobs:
files: coverage.xml
fail_ci_if_error: false

c docs:
docs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
19 changes: 19 additions & 0 deletions README.md
@@ -62,6 +62,24 @@ CODES = ForEach([200, 201])
def test_status(code: Annotated[int, From(CODES)]): ...
```

### Native LLM Evals

Score model outputs alongside your tests — same fixtures, same parallelism, same `protest` CLI. Cases get pass/fail + numeric metrics, persisted to JSONL for run-over-run comparison.

```python
@chatbot_suite.eval(evaluators=[contains_keywords(keywords=["paris"])])
async def chatbot(case: Annotated[EvalCase, From(cases)]) -> str:
return await my_agent(case.inputs)
```

```bash
protest eval evals.session:session
protest history --runs # recent runs
protest history --compare # current vs previous
```

See the [Evals docs](https://renaudcepre.github.io/protest/evals/) for evaluators, judges, and history tracking.

---

## Quick Start
@@ -120,6 +138,7 @@ protest run module:session --ctrf-output r.json # CTRF report for CI/CD
- **Plugin system** - Custom reporters, filters
- **Last-failed mode** - Re-run only failed tests with `--lf`
- **CTRF reports** - Standardized JSON for CI/CD integration
- **Native LLM evals** - Scored cases, JSONL history, `protest eval` (see [evals docs](https://renaudcepre.github.io/protest/evals/))

## Why Not pytest?

171 changes: 171 additions & 0 deletions docs/cli.md
@@ -13,6 +13,8 @@ protest <command> [options] <target>
| Command | Description |
|---------|-------------|
| `run` | Run tests |
| `eval` | Run evaluations |
| `history` | Browse run history (tests and evals) |
| `live` | Start live reporter server |
| `tags list` | List tags in a session |

@@ -276,6 +278,175 @@

---

## protest eval

Run evaluations from a session.

`protest eval` is the eval-suite counterpart of `protest run`. It shares
the same target format, filters, capture flags and reporting options as
`run`; the differences are listed below.

### Syntax

```bash
protest eval <target> [options]
```

### Options

`protest eval` accepts every option from `protest run` (see above:
`-n/--concurrency`, `--collect-only`, `-x/--exitfirst`, `-s/--no-capture`,
`-q/--quiet`, `-v/--verbose`, `--show-logs`, `-t/--tag`, `--no-tag`,
`-k/--keyword`, `--lf`, `--cache-clear`, `--no-color`, `--ctrf-output`,
`--no-log-file`, `--app-dir`), plus one eval-only flag:

| Option | Description | Default |
|--------|-------------|---------|
| `--show-output` | Print `inputs` / `output` / `expected` for **every** case (failed cases always print these). | off |

### Examples

```bash
# Run all evals in a session
protest eval evals.session:session

# One specific suite
protest eval evals.session:session::helpdesk_struct

# One ticket by name
protest eval evals.session:session -k T001

# All cases tagged "cat:hardware"
protest eval evals.session:session --tag cat:hardware

# Re-run only the cases that failed last time
protest eval evals.session:session --lf

# Show the input/output of every case (not just failures)
protest eval evals.session:session --show-output
```

### Output

Each case prints one line:

```
✓ classify_ticket_struct[T011] (2ms) category_is_allowed=✓ summary_keyword_recall=1.00 …
```

After every suite, an aggregate-stats table summarizes the `Metric`
fields across cases (mean / p50 / p5 / p95). `Verdict` and `Reason`
fields don't appear in this table — only numeric `Metric` fields do.
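
The p5/p95 values are true percentiles rather than min/max: per commit 4276e5d in this PR, `ScoreStats` computes them with `statistics.quantiles`. A minimal sketch of that computation, assuming plain float scores (the actual `ScoreStats` internals may differ):

```python
from statistics import mean, median, quantiles

def summarize(scores: list[float]) -> dict[str, float]:
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%;
    # index 0 is the 5th percentile, index 18 the 95th.
    # Requires at least two scores.
    cuts = quantiles(scores, n=20)
    return {
        "mean": mean(scores),
        "p50": median(scores),
        "p5": cuts[0],
        "p95": cuts[18],
    }
```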

Per-case markdown artifacts are written to
`.protest/results/<suite>_<timestamp>/<case-id>.md`, with the full
input, output, expected, and per-evaluator scores.

---

## protest history

Browse persisted run history (tests and evals).

Every run appends one entry to `.protest/history.jsonl`; `protest history`
queries that file via sub-commands.

### Syntax

```bash
protest history <subcommand> [filters]
```

If no sub-command is given, `list` runs by default — so
`protest history --tail 5` is equivalent to
`protest history list --tail 5`.

### Sub-commands

| Sub-command | Description |
|-------------|-------------|
| `list` | Per-suite trend table: pass-rate trend + score arrows. **Default** when no sub-command is given. |
| `runs` | Run-by-run pass rates, most recent first. |
| `show [N]` | Detailed panel for the Nth most recent run (`N=0` = latest, the default). |
| `compare` | Compare the two most recent runs of the same model. |
| `clean` | Remove entries from runs made on a dirty working tree. **Dry-run by default** — pass `--apply` to actually modify the file. |

### Filters (shared by every sub-command)

| Flag | Description | Default |
|------|-------------|---------|
| `--tail N`, `-n N` | Limit to the N most recent entries | 10 |
| `--evals` | Show eval runs only | _all kinds_ |
| `--tests` | Show test runs only | _all kinds_ |
| `--model NAME` | Keep only suites whose `ModelLabel.name` matches | _all_ |
| `--suite NAME` | Keep only the suite with this name | _all_ |
| `--path DIR` | Use a custom history directory | `.protest/` |

`--model` and `--suite` filter at the **suite level**: a run that
contains *several* suites with different models stays in the listing;
only the non-matching suites are pruned from the displayed view.

### Reading `--compare`

`--compare` reports five kinds of change between the two most recent
runs of the same model:

| Marker | Label | Meaning |
|--------|-------|---------|
| `+` | Fixed | Case was failing in the previous run, passes now |
| `-` | Regressions | Case was passing in the previous run, fails now |
| `⟳` | Modified | Case is recognizable (same name) but its content changed |
| `*` | New | Case did not exist in the previous run |
| `✗` | Deleted | Case existed in the previous run, gone now |

The `Modified` line tells you **what** changed by suffixing the case
name:

- `T001 (case modified)` — `inputs` or `expected` changed (`case_hash`
diff)
- `T001 (scoring modified)` — only the evaluator configuration changed
(`eval_hash` diff). Inputs and expected output are intact; you've
edited an evaluator or its parameters.
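
The classification can be pictured as a small decision ladder over the two content hashes the docs describe: `case_hash` (inputs + expected) and `eval_hash` (evaluator configuration). A hedged sketch; the record fields below and the precedence of content changes over pass/fail transitions are assumptions, not protest's actual code:

```python
def classify(old: dict | None, new: dict | None) -> str:
    # `old` / `new` are hypothetical per-case records with `case_hash`,
    # `eval_hash`, and `passed` fields; real history entries are internal.
    if old is None:
        return "new"                # case did not exist in the previous run
    if new is None:
        return "deleted"            # existed before, gone now
    if new["case_hash"] != old["case_hash"]:
        return "case modified"      # inputs or expected changed
    if new["eval_hash"] != old["eval_hash"]:
        return "scoring modified"   # evaluator config changed
    if not old["passed"] and new["passed"]:
        return "fixed"              # was failing, passes now
    if old["passed"] and not new["passed"]:
        return "regression"         # was passing, fails now
    return "unchanged"
```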

### Examples

```bash
# Per-suite trend across last 10 eval runs (default sub-command: list)
protest history --evals

# Run-by-run breakdown of the last 5 eval runs
protest history runs --evals --tail 5

# Detailed panel for the most recent eval run
protest history show --evals

# Detailed panel for the run before that (1 = next-most-recent)
protest history show 1 --evals

# Compare the two most recent runs of the same model
protest history compare --evals

# Filter to one model — only suites with this model are shown
protest history list --evals --model qwen-2.5

# Preview which entries `clean` would remove (no file changes)
protest history clean --evals

# Actually remove dirty entries
protest history clean --apply
```

### Notes

- When the project is not a git repo, the per-run commit / dirty
columns display `?`. `clean` is a no-op in that case.
- `--evals` and `--tests` are mutually exclusive; omit both to see
every kind.
- Per-case detail (input, output, expected, evaluator scores) lives in
`.protest/results/`, not in the history file.
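
Because the history file is plain JSONL (one JSON object per run), ad-hoc inspection with the standard library is straightforward. The `schema_version` field and its skip-and-warn handling come from commit 594bb54 in this PR; everything else about the entry layout in this sketch is assumed:

```python
import json
from pathlib import Path

KNOWN_VERSIONS = {1}  # assumption: set this to the versions your file contains

def load_history(path: str = ".protest/history.jsonl") -> list[dict]:
    entries = []
    for lineno, line in enumerate(Path(path).read_text().splitlines(), 1):
        if not line.strip():
            continue
        entry = json.loads(line)
        # Mirror protest's skip+warn policy for unknown schema versions.
        if entry.get("schema_version") not in KNOWN_VERSIONS:
            print(f"line {lineno}: unknown schema_version, skipping")
            continue
        entries.append(entry)
    return entries
```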

---

## protest live

Start a persistent live reporter server for real-time test visualization.
49 changes: 49 additions & 0 deletions docs/core-concepts/console.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# Console Output

Print progress and debug messages that bypass test capture.

## The Problem

`print()` inside tests and fixtures is captured by ProTest. During long-running fixtures (pipeline imports, graph seeding), there's no visible feedback.

## `console.print`

```python
from protest import console

@fixture()
async def pipeline():
for i, scene in enumerate(scenes):
console.print(f"[cyan]pipeline:[/] importing {scene.name} ({i+1}/{len(scenes)})")
await import_scene(scene)
return driver
```

Messages appear inline in the reporter output, between test results.

## Rich Markup

`console.print` supports Rich markup. The Rich reporter renders colors; the ASCII reporter strips tags.

```python
console.print(f"[bold green]done[/] in {duration:.1f}s")
console.print(f"[yellow]warning:[/] slow query ({elapsed:.2f}s)")
```

## Raw Mode

Skip markup processing with `raw=True`:

```python
console.print("debug: raw bytes here", raw=True)
```

The message is passed as-is to both reporters.

## How It Works

`console.print` sends a `USER_PRINT` event through the event bus. The reporter receives it and writes to the real stdout (bypassing test capture). This means:

- Messages appear immediately, not buffered until test end
- Works with `-n 4` (concurrent tests) — the event bus serializes per plugin
- No interference with test capture or `result.output`
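
A toy sketch of this flow; the class and function names below are illustrative, not protest's internals:

```python
import sys
from typing import Any, Callable

REAL_STDOUT = sys.stdout  # grabbed at startup, before capture swaps sys.stdout

Handler = Callable[[str, dict[str, Any]], None]

class EventBus:
    def __init__(self) -> None:
        self._handlers: list[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self._handlers.append(handler)

    def emit(self, event: str, payload: dict[str, Any]) -> None:
        # Handlers run one at a time, so messages from concurrent
        # tests never interleave inside a single handler.
        for handler in self._handlers:
            handler(event, payload)

bus = EventBus()

def reporter(event: str, payload: dict[str, Any]) -> None:
    if event == "USER_PRINT":
        # Write to the real stdout, bypassing test capture entirely.
        REAL_STDOUT.write(payload["message"] + "\n")

bus.subscribe(reporter)

def console_print(message: str) -> None:
    bus.emit("USER_PRINT", {"message": message})
```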
18 changes: 18 additions & 0 deletions docs/core-concepts/dependency-injection.md
@@ -24,6 +24,24 @@ async def test_query(db: Annotated[Database, Use(database)]):

The `Use` marker takes a **function reference**, not a string. This makes dependencies explicit and enables IDE navigation.

### `Type` is a hint, not a runtime check

In `Annotated[Type, Use(fixture)]`, `Type` is a **type hint for your IDE and static checkers** — ProTest does not validate at runtime that `fixture()` actually returns a `Type`. This matches FastAPI's behavior with `Annotated[Type, Depends(fn)]`: the type is taken on faith, not enforced.

```python
@fixture()
def returns_str() -> str:
return "hello"

@session.test()
def test_mismatch(value: Annotated[int, Use(returns_str)]):
# `value` is actually a `str` at runtime — ProTest will not warn.
# The mismatch surfaces only when `value` is used as an `int`.
...
```

In practice this is rarely a problem: keep your fixture return types and your call-site annotations aligned, and rely on `mypy`/`pyright` for the static check on the fixture itself.
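
If a runtime guard matters for a particular test, an ordinary assertion does the job; nothing ProTest-specific is needed. A minimal sketch, reusing `returns_str` from the example above (this guard would trip immediately on the mismatch shown there):

```python
@session.test()
def test_with_guard(value: Annotated[int, Use(returns_str)]):
    # Fail fast with a clear message instead of a confusing TypeError
    # somewhere deeper in the test body.
    assert isinstance(value, int), f"expected int, got {type(value).__name__}"
```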

## Why Function References?

Using function references instead of string names has benefits: