feat: evals #89
Open
renaudcepre wants to merge 62 commits into main from feat/evals-native
Conversation
```python
evaluators=[
    not_empty,
    ShortCircuit([
        contains_expected_facts(min_score=0.5),
        llm_judge(rubric="..."),  # skipped if above fails
    ]),
]
```
The first Verdict=False stops the group. Evaluators outside the ShortCircuit run regardless.
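The gating described above can be sketched in a few lines. This is a minimal illustration, not ProTest's implementation; the evaluator call signature and the shape of `ctx` are assumptions.

```python
def run_short_circuit(evaluators, ctx):
    """Run grouped evaluators in order; the first False verdict skips the rest.

    Minimal sketch of ShortCircuit gating -- the call signature and `ctx`
    shape are assumptions, not ProTest's real API.
    """
    results = []
    for ev in evaluators:
        verdict = ev(ctx)
        results.append((ev.__name__, verdict))
        if verdict is False:  # first failing verdict stops the group
            break
    return results

def not_empty(output):
    return len(output) > 0

def contains_expected_facts(output):
    return False  # stand-in for a failing fact check

def llm_judge(output):
    return True  # never reached: gated by the failure above

results = run_short_circuit([not_empty, contains_expected_facts, llm_judge], "ey up")
# results == [("not_empty", True), ("contains_expected_facts", False)]
```

Evaluators declared outside the `ShortCircuit` group would be executed unconditionally by the caller.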
- EvalPayload, EvalScoreEntry on TestResult for eval case results
- EVAL_SUITE_END event emitted by core runner
- is_eval flag on TestRegistration/TestItem
- KindFilterPlugin for protest run vs protest eval
- get_type_hints_compat: PEP 563 + TYPE_CHECKING support in all DI sites
- Async fixture teardown on same event loop (no more loop mismatch)
- Fixture resolution time excluded from test duration
- Log records captured on TestResult for --show-logs
An eval is a test that returns a scored value. Uses ForEach/From for parametrization — no separate EvalSuite/EvalCase framework.
- @session.eval(evaluators=[...]) decorator
- @evaluator decorator with partial-application binding
- EvalSession(model=) for eval-focused sessions
- EvalContext passed to evaluators
- Scoring v2: evaluators return bool or dataclass
  - Annotated[bool, Verdict] → pass/fail
  - Annotated[float, Metric] → stats aggregation
  - Annotated[str, Reason] → displayed on failure
- EvalCase dataclass for typed ForEach data
- Built-in evaluators: contains_keywords, not_empty, max_length, etc.
- EvalHistoryPlugin listens to EVAL_SUITE_END
- EvalResultsWriter for per-case .md files
- Evaluator exception → error (not fail)
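A dataclass score with `Annotated` markers, as described above, could look roughly like this. The marker classes here are stand-ins with the names the PR uses (`Verdict`, `Metric`, `Reason`); how ProTest actually defines and reads them is an assumption, but the `Annotated` introspection shown is standard `typing` behavior.

```python
from dataclasses import dataclass
from typing import Annotated, get_type_hints

# Stand-ins for ProTest's annotation markers (names from the PR, bodies assumed).
class Verdict: ...
class Metric: ...
class Reason: ...

@dataclass
class KeywordScore:
    passed: Annotated[bool, Verdict]   # -> pass/fail
    recall: Annotated[float, Metric]   # -> stats aggregation
    detail: Annotated[str, Reason]     # -> displayed on failure

# How a runner might classify fields by their Annotated metadata:
hints = get_type_hints(KeywordScore, include_extras=True)
roles = {name: hint.__metadata__[0].__name__ for name, hint in hints.items()}
# roles == {"passed": "Verdict", "recall": "Metric", "detail": "Reason"}
```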
- on_eval_suite_end: Rich table for scores, plain text for ASCII
- Scores inline in -v, --show-output for inputs/output/expected
- --show-logs flag for captured log records
- Fixture setup time always displayed
- protest history --runs: per-suite breakdown with model
- protest.console.print(): progress output bypassing capture
- Lifecycle messages bypass capture (no re-display on fail)
- Output truncated at 20 lines with pointer to full output
- Case id in lifecycle messages (chatbot[lookup] not chatbot)
- 1063 tests (56 eval-specific)
- Yorkshire chatbot example with @session.eval + ForEach
- History module: JSONL storage, git info, env info
- docs/evals.md: full guide (scoring, evaluators, CLI, history)
- docs/core-concepts/console.md: console.print guide
renaudcepre force-pushed feat/evals-native from 4e4d91d to 29204bc.
ProTest owns the interface; the user plugs in their LLM library.
- Judge protocol: `async judge(prompt, output_type) -> JudgeResponse[T]`
- JudgeResponse wraps output with optional tokens/cost tracking
- EvalContext.judge() unwraps for evaluators, accumulates usage stats
- JudgeInfo auto-derived from instance for history
- EvalPayload carries judge_call_count, tokens, cost per case
- EvalSession(judge=MyJudge()) wires through to evaluators
- suite.eval(judge=) for standalone usage
- 19 new tests (protocol, ctx.judge, e2e, structured output, tokens)
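A structural sketch of that protocol, paraphrased from the signature quoted above. The `JudgeResponse` field names beyond `output` (token counts, cost) are assumptions; `FakeJudge` shows how a test double can satisfy the protocol without any LLM library.

```python
from dataclasses import dataclass
from typing import Generic, Protocol, TypeVar

T = TypeVar("T")

@dataclass
class JudgeResponse(Generic[T]):
    """Wraps the judged output with optional usage accounting (fields assumed)."""
    output: T
    input_tokens: int = 0
    output_tokens: int = 0
    cost: float = 0.0

class Judge(Protocol):
    """The interface ProTest owns; signature paraphrased from the PR."""
    async def judge(self, prompt: str, output_type: type[T]) -> JudgeResponse[T]: ...

class FakeJudge:
    """A trivial in-memory judge for tests -- no LLM involved."""
    async def judge(self, prompt: str, output_type: type[T]) -> JudgeResponse[T]:
        return JudgeResponse(output=output_type(), input_tokens=3, output_tokens=1)
```

Because `Judge` is a `Protocol`, `FakeJudge` needs no inheritance; any object with a matching `async judge` method plugs in.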
Task: 45.2k in / 27.1k out, $0.0142
Judge: 5 calls, 800 in / 400 out, $0.0030
…flag

Fixture crashes (errored >= total_cases) were counted in pass_rates, score_values, and flaky — polluting stats with noise. Now:
- EvalCaseResult.is_error propagated from TestResult.is_fixture_error
- History serializes errored count per suite + is_error per case
- _aggregate_suites skips error-only runs from stats entirely
- _track_cases skips error cases from score_values and flaky
- Error runs still visible in `protest history --runs`

Also: docs/evals.md updated for the TaskResult section and the Judge protocol fix.
- Remove defensive getattr in session.py where types are known
- Type plugin setup(session: ProTestSession) instead of Any
- Add name/provider to the Judge Protocol — explicit contract
- Delete ModelInfo.from_agent and JudgeInfo.from_instance — user wires
- Fix lint: PLR2004 magic values, PLR0912 noqa, ambiguous unicode
Replace the fragile repr() fallback with an explicit error on unknown types. Add evaluator_identity() as a user-controlled escape hatch for custom evaluators. Introspect dataclass/partial/callable as fallback only.
- Remove hasattr(obj, "model_dump") duck-typing (Pydantic leak)
- Remove the default=str silent fallback in json.dumps
- Skip _prefixed dataclass fields (runtime internals, not config)
- Add functools.partial support (qualname + bound kwargs)
- Add ShortCircuit.evaluator_identity()
- 33 tests covering all paths including fail-hard
Type-safe suite kind across the codebase. StrEnum keeps JSON compat (SuiteKind.EVAL == "eval") so no migration needed.
…ores

28 lazy imports in protest/, none resolving a real circular dependency. Moved all to top level except justified cases (optional deps like rich, conditional wiring, and one true circular import in evals/__init__.py). Removed the blanket PLC0415 per-file ignores from pyproject.toml — remaining suppressions use inline noqa with justification.
- Type built-in evaluators as EvalContext[Any, str] (text evaluators)
- not_empty typed EvalContext[Any, Any] (works on any output)
- Fix mypy running outside the venv (uv run mypy in justfile)
- Add mypy config in pyproject.toml with a rich stubs override
- Fix no-any-return, arg-type, unused type-ignore across the codebase
- Remove stale type: ignore[import-not-found] on rich imports
- Remove is_async_evaluator(), _is_evaluator, _is_async_evaluator (written but never read — dead code with hasattr duck-typing)
- Add yorkshire example evaluators showing EvalContext generics: [Any, str] for text, [str, float] for numeric, [str, bytes] for binary
- Removed unnecessary `# type: ignore[import-not-found]` markers on imports.
- Added the `--group dev` flag to dependency sync in the CI workflow.
- Updated `uv.lock` to include new packages: `librt` and `mypy`.
- Introduced the `EvalSuite` class to encapsulate eval logic, replacing inline `session.eval()` definitions.
- Removed duplicate `eval` methods in `ProTestSession` and `ProTestSuite`.
- Updated tests and examples to use `EvalSuite`.
…e APIs and tests
- Standardized eval cases by replacing untyped `dict` with `EvalCase` objects across the codebase.
- Updated evaluator helpers to work exclusively with `EvalCase` instances.
- Refactored `make_eval_wrapper` to remove the unused `expected_key` argument.
- Updated tests and examples to adopt `EvalCase` for improved type safety and code clarity.
…h detailed functionality descriptions
- Added support for emitting an `EVAL_SUITE_END` event with detailed suite-level metrics and score statistics.
- Extended `SUITE_END` payloads to include evaluation-related details when processing eval-specific results.
… architecture
- Delete EvalSession — ProTestSession is the only session
- Merge HistoryPlugin + EvalHistoryPlugin into a single always-on plugin
- EvalResultsWriter now always-on (no-op without evals)
- Model/judge live entirely on EvalSuite, no session propagation
- history=True by default on ProTestSession
- Remove apply_defaults, _wire_eval_support, add_suite override
…class - Introduced `Evaluator
…ects in Yorkshire example dataset
…actor writer construction
- Added comprehensive tests for `EvalCaseResult.from_test_result` to validate field mappings and defensive checks.
- Refactored writer logic to use `EvalCaseResult.from_test_result`, simplifying redundant helper methods.
…hance tag propagation logic
- Added tests to verify that `EvalCase.metadata['tags']` are merged into `TestItem.tags`.
- Updated `Collector` to propagate tags from `EvalCase.metadata` into `TestItem` during collection.
- Verified end-to-end integration with `TagFilterPlugin` for tag-based filtering.
…t cross-platform file locking
- Added tests to ensure `append_entry` supports concurrent writes without line corruption.
- Implemented cross-platform file locking: `fcntl.flock` on POSIX and `msvcrt.locking` on Windows, using a sibling `.lock` file.
- Ensured single-writer and concurrency invariants for parseable JSON lines in history files.
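The sibling-`.lock`-file approach described above can be sketched as follows. This is an illustration of the technique, not ProTest's code; the helper names mirror those mentioned in later commits, but their bodies here are assumptions.

```python
import os
from contextlib import contextmanager

@contextmanager
def _exclusive_file_lock(path: str):
    """Hold an exclusive lock on a sibling `.lock` file while touching `path`.

    Sketch of the cross-platform strategy above: fcntl.flock on POSIX,
    msvcrt.locking on Windows. ProTest's real helper may differ.
    """
    fd = os.open(path + ".lock", os.O_CREAT | os.O_RDWR)
    try:
        if os.name == "nt":
            import msvcrt
            msvcrt.locking(fd, msvcrt.LK_LOCK, 1)   # lock one byte, blocking
        else:
            import fcntl
            fcntl.flock(fd, fcntl.LOCK_EX)          # exclusive advisory lock
        yield
    finally:
        if os.name == "nt":
            import msvcrt
            msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)
        else:
            import fcntl
            fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

def append_entry(path: str, line: str) -> None:
    """Single-writer append: each JSON line lands whole or not at all."""
    with _exclusive_file_lock(path):
        with open(path, "a", encoding="utf-8") as f:
            f.write(line + "\n")
```

Locking a separate `.lock` file (rather than the history file itself) lets readers and a truncating `clean` operation serialize on the same lock without fighting over the data file's descriptor.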
…rride behaviors
- Added regression tests to ensure the `_isolate_protest_history` fixture correctly overrides `DEFAULT_HISTORY_DIR` with a per-test temp directory.
- Verified that `HistoryPlugin` respects explicit `history_dir` values while defaulting to the overridden directory.
- Updated `conftest.py` with an autouse fixture to prevent test pollution of the real `.protest/history.jsonl`.
…lace sys stream duck-typing
- Added unit tests for `real_stdout` and `real_stderr` to ensure proper unwrapping of `TaskAwareStream` and correct fallback to the original streams.
- Replaced `getattr(sys.stdout, "_original", ...)` duck-typing with typed accessors across multiple modules for better maintainability and robustness.
- Updated console, reporters, and fallback print logic to use the new accessors, ensuring consistent bypass of per-test capture layers.
…t bus usage
- Added `EventBus` type annotations for `_event_bus_ref` and related methods to improve clarity and type safety.
- Updated comments in `console.print` to explain why private access to `bus._handlers` is necessary.
- Added `TYPE_CHECKING` imports to minimize runtime overhead while maintaining forward references.
…arn on future versions
- `SCHEMA_VERSION = 1` constant in `storage`; `HistoryPlugin` stamps it on every new entry.
- Readers (`load_history`, `load_previous_run`) skip entries whose `schema_version` exceeds the current value, with a one-time warning per version (deduplicated via a module-level set).
- Legacy entries (no `schema_version` key) are treated as version 0 and read normally — zero migration needed.
- Add `tests/history/test_schema_version.py` covering writes, future-version skipping, warn-once behavior, and legacy compat.
Replaces the naive `int(n * 0.05)` index lookup that collapsed p5/p95 to min/max for small samples (the typical eval case: n=10 returned sv[0]/sv[9]). Now uses `statistics.quantiles(n=20, method='inclusive')`, which interpolates linearly between adjacent values and clamps to [min, max] — appropriate for bounded scores.
- Single-value case (n=1) falls back to that value (percentiles undefined).
- Empty case unchanged: zeroed stats.
- `_MIN_VALUES_FOR_PERCENTILES = 2` constant gates the quantiles call.
- Add `tests/evals/test_score_stats.py` covering empty / n=1 / n=2 / n=10 (the regression case) / n=100 / sort-independence.
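A self-contained sketch of the fix, assuming the helper shape (ProTest's actual function and field names may differ). With `n=20`, `statistics.quantiles` returns 19 cut points: index 0 is the 5th percentile and index 18 the 95th.

```python
import statistics

_MIN_VALUES_FOR_PERCENTILES = 2

def p5_p95(scores):
    """Interpolated 5th/95th percentiles, clamped to [min, max].

    Sketch of the approach above using statistics.quantiles; the
    function name and return shape are assumptions.
    """
    if not scores:
        return 0.0, 0.0                       # empty case: zeroed stats
    if len(scores) < _MIN_VALUES_FOR_PERCENTILES:
        return scores[0], scores[0]           # n=1: percentiles undefined
    q = statistics.quantiles(scores, n=20, method="inclusive")  # 19 cut points
    lo, hi = min(scores), max(scores)
    return max(q[0], lo), min(q[18], hi)      # clamp: scores are bounded

# The regression case: n=10 no longer collapses to sv[0]/sv[9].
p5, p95 = p5_p95(list(range(10)))  # p5 ≈ 0.45, p95 ≈ 8.55 instead of 0 and 9
```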
- m2: replace `lambda` with `functools.partial` in CLI command dispatch (`protest/cli/main.py`).
- m3: route the `EvalResultsWriter` "Results: ..." line through `console.print` instead of the builtin `print`, so it bypasses test capture consistently.
- m4: `Evaluator.__call__` now always returns a fresh clone in the re-binding path; removes the surprising `f is f()` identity.
- m6: replace the `"tests"` sentinel for `_default_suite_name` with `None`, falling back to the literal `"tests"` only when no test suite is registered. A user-defined suite literally named `"tests"` no longer collides with the default-detection heuristic.
- m7: add a Contents section (TOC) to `docs/evals.md` for raw-file navigability (mkdocs already auto-generates a sidebar TOC).
- m10: clarify the `FakeJudge.judge` comment — the caller must use a dataclass with all-default fields.
- m11: type `EvalSuite.eval(judge=)` as `Judge | None` (was `Any`) and document the per-eval override behavior in the docstring.

Verified intentional / already resolved: m1 (`console.print` shadow is the API), m5 (deduplicated via m3), m8 (deferred — needs PEP 696), m9 (`_canonical` resolution order is documented), m12 (`SuiteKind` is a `StrEnum`, no mismatch between str/enum comparisons).
- Set `UV_PYTHON` to enforce the selected Python version in the matrix.
- Add a verification step to confirm the expected Python version is used.
…ility
- Updated `SuiteKind` to inherit from `str` and `Enum` instead of `StrEnum`, ensuring compatibility with Python 3.10.
- Adjusted `SuiteKind.__str__` for consistent behavior.
- Modified the history plugin to handle `Enum.value` directly while maintaining default behavior.
- Moved the `Self` import to `protest.compat` for streamlined typing support.
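The `(str, Enum)` backport keeps the JSON-compat property the earlier `StrEnum` commit relied on. A sketch, assuming the member names (`TEST`/`EVAL` are illustrative; only `EVAL = "eval"` is confirmed by the PR):

```python
from enum import Enum

class SuiteKind(str, Enum):
    """3.10-compatible stand-in for StrEnum: members compare equal to their value."""
    TEST = "test"  # illustrative member; only EVAL is confirmed by the PR
    EVAL = "eval"

    def __str__(self) -> str:
        # StrEnum behavior: str(SuiteKind.EVAL) == "eval", not "SuiteKind.EVAL"
        return self.value

assert SuiteKind.EVAL == "eval"        # existing JSON entries need no migration
assert str(SuiteKind.EVAL) == "eval"
```

On 3.11+ the class could simply inherit `StrEnum` and drop the custom `__str__`; the `(str, Enum)` form is the portable spelling.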
- Dropped `pydantic-evals` from dependencies and the `pyproject.toml` `evals` extra.
- Removed references to `pydantic-evals` in code and version reporting.
- Cleaned up `uv.lock` and related metadata.
- Added `TestRunsOrderRecentFirst` to validate that `--runs` follows the git log convention, showing the most recent entries first.
- Updated CLI logic to reverse storage order (oldest → newest) for display.
- Adjusted index formatting and numbering in both plain and rich output modes to reflect the newest-first display.
… update evaluator logic
…prove evaluator logic documentation
- Introduced tests for `min_recall` edge cases, including exact-threshold passing, discontinuity fixes, and below-threshold failures.
- Updated the `contains_keywords` evaluator to simplify the `all_keywords_present` logic and ensure consistent behavior across thresholds.
- Adjusted the default `min_recall` to `1.0` in docs and implementation for stricter compliance.
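The recall semantics described above can be sketched as a plain closure. ProTest's real `contains_keywords` is decorator-based and takes an `EvalContext`; this simplified version only illustrates the threshold behavior (exact threshold passes, no discontinuity).

```python
def contains_keywords(keywords, min_recall: float = 1.0):
    """Pass when the fraction of keywords found meets min_recall.

    Simplified sketch: the real evaluator is @evaluator-decorated and
    receives an EvalContext rather than a bare string.
    """
    def evaluate(output: str) -> bool:
        hits = sum(1 for kw in keywords if kw in output)
        recall = hits / len(keywords) if keywords else 1.0
        return recall >= min_recall  # >= : exact threshold passes
    return evaluate

check = contains_keywords(["ey", "up", "love"], min_recall=2 / 3)
assert check("ey up!") is True   # recall 2/3 meets the threshold exactly
assert check("hello") is False   # recall 0/3, below threshold
```

With the stricter default `min_recall=1.0`, every keyword must appear for a pass.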
…evaluator behavior
- Added tests to ensure `not_empty` correctly handles empty and non-empty lists, dicts, and sets.
- Updated the `not_empty` docstring and logic to explicitly check `Sized` objects using `len()`.
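The `Sized` check reads roughly as below. This is a sketch of the documented behavior; the fallback branch for unsized values is an assumption, and the real evaluator receives an `EvalContext` rather than a bare value.

```python
from collections.abc import Sized

def not_empty(output) -> bool:
    """True when the output has content; Sized objects are checked via len().

    Sketch of the behavior above; the unsized fallback is an assumption.
    """
    if isinstance(output, Sized):      # str, bytes, list, dict, set, tuple, ...
        return len(output) > 0
    return output is not None          # assumed fallback for unsized values

assert not_empty("hi") and not_empty([0]) and not_empty({"k": 1})
assert not not_empty("") and not not_empty([]) and not not_empty(set())
```

Using `Sized` instead of truthiness keeps semantics explicit: the question is "does it have elements?", not "is it truthy?".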
…daptive formatting
- Added tests to ensure `_serialize_eval_case` preserves 10 µs precision, preventing sub-ms durations from collapsing to `0.0`.
- Introduced `_format_case_duration` tests for adaptive time-unit rendering across microseconds, milliseconds, and seconds.
- Updated the markdown renderer to use `_format_case_duration` for task durations.
- Increased duration serialization precision from 3 to 5 decimals in the history plugin.
- Added tests for 3-tuple payload behavior in `console.print` with the `raw` and `prefix` flags.
- Verified ASCII and Rich reporters correctly render messages with/without test prefixes and markup.
- Updated `console.print` to support a new `prefix` parameter for suite-level output (e.g., "Results: ...").
- Adjusted `on_user_print` implementations across reporters to handle the `prefix` flag correctly.
- Updated `console.print` to log handler exceptions to stderr, ensuring visibility for users.
- Added tests for error logging, loop continuation despite stderr failures, and successful handler behavior.
- Added tests to validate `clean_dirty` concurrency handling, ensuring no appends are silently dropped due to interleaved truncate operations.
- Updated `clean_dirty` to use `_exclusive_file_lock` to serialize file reads and writes.
- Adjusted the test suite to cover concurrent `append_entry` and `clean_dirty` interactions, verifying all entries remain intact.
…ents
- Expanded documentation to introduce native LLM evals, including pass/fail and numeric scoring with JSONL history.
- Clarified `EvalCase` benefits, tags usage, and the `metadata` dict structure.
- Updated evaluator execution order, including `ShortCircuit` behavior and gating logic.
- Improved the `ModelInfo` explanation for history tracking and clarified its passive role in model configuration.
- Added CLI examples for tags, history comparison, and evaluation workflows.
…ase` field and update tests
…lag exclusion
- Added decorator-time validation to ensure eval functions declare only one `EvalCase` parameter, raising clear errors on conflicts.
- Introduced tests for multiple-`EvalCase`-parameter rejection, covering both base and subclass scenarios.
- Updated the CLI parser to exclude eval-only flags (e.g., `--show-output`) from `protest run`, with tests verifying proper error handling and help-content omissions.
- Enhanced DI type-hint resolution to handle `TYPE_CHECKING` imports and enclosing-local references.
asyncio.TimeoutError and builtins.TimeoutError were distinct classes before Python 3.11. Reporters and tests check isinstance against the builtin, so on 3.10 the previous `raise asyncio.TimeoutError(...)` made those checks fail. On 3.11+ both names alias the builtin, so this is a no-op. Fixes 6 timeout/retry tests on the 3.10 CI matrix.
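The aliasing can be checked directly, and raising the builtin is the portable spelling. The helper below is illustrative, not ProTest's code:

```python
import asyncio
import sys

# On 3.11+ the two names are the same class; before 3.11 they were distinct,
# so `except TimeoutError` / isinstance checks against the builtin would miss
# a raised asyncio.TimeoutError.
if sys.version_info >= (3, 11):
    assert asyncio.TimeoutError is TimeoutError

def timed_out(op: str) -> None:
    """Illustrative helper: raising the builtin works on 3.10 and 3.11+ alike."""
    raise TimeoutError(f"{op} timed out")  # caught by builtin isinstance checks
```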
Agent test (Claude Code in an isolated dir, public docs only) surfaced several rough edges. This batch addresses the ones blocking a clean re-run signal:
- ScoreNameCollisionError: dataclass evaluators with overlapping field names previously overwrote each other silently in the per-case scores dict (and the history file). Now raises at runtime with the case name and duplicate names; the doc is rewritten to remove the false auto-prefix promise.
- ModelInfo -> ModelLabel: the rename clarifies it is a passive history label, not a runtime model config (the doc warning becomes obsolete and is replaced by a plain description).
- rich made truly optional: lazy-imported inside RichReporter methods so `import protest` works without rich; AsciiReporter.activate() takes over when rich is missing. Verified in a venv with no extras.
- EvalSuite re-exported from protest.evals so users only need one import path for the eval API.
- Top-level `protest --help` epilog now includes eval/history/live examples (was 9 run + 1 tags, none for eval/history/live).
- cli.md gets full `protest eval` and `protest history` sections, including --compare's case-modified vs scoring-modified semantics.
Agent v2 confirmed the tier-1 fixes landed cleanly and surfaced a new bucket of frictions concentrated on `protest history`. This batch addresses them.

CLI refactor:
- `protest history` is now sub-command based (`list`, `runs`, `show`, `compare`, `clean`) instead of mutually-exclusive flags. `list` remains the implicit default so `protest history --tail 5` still works without typing the sub-command. The previous flag-as-mode form (`--runs`, `--show`, `--compare`, `--clean-dirty`) is removed.
- `protest history clean` is dry-run by default; `--apply` actually modifies the file. Eliminates the "destructive without warning" footgun.
- `--model` and `--suite` filter at the suite level: a run with several suites under different models keeps the entry, with non-matching suites pruned out of the displayed view. The previous run-level filter would surprise users by dropping the whole run.
- `--tail N` now narrows the entries before aggregation, so the `list` (trend) view actually scopes to the requested window.
- Added `--short` for `protest eval`: hide passing scores per case to keep the output readable on suites with many evaluators.

Docs:
- `cli.md` rewritten for the new sub-command layout, with explicit examples for each sub-command and a note on suite-level filtering.
- `evals.md` gets a callout on writing custom evaluators when the eval task returns a non-string output (dict / dataclass / pydantic), and a tip clarifying that "first run successful" doesn't mean every case passes — evals are expected to surface failing cases.
- `evals.md` quick-start now imports `EvalSuite` from `protest.evals` (single canonical path).
- `installation.md` adds an IDE / type-checker setup section (Pyright/Pylance/mypy + uv).

Storage:
- `is_dirty_entry()` and `count_dirty_entries()` extracted as helpers so the dry-run path can compute counts without touching the file.

The remaining cross-suite/cross-model `compare` ask is tracked in #101.
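The sub-command layout with a dry-run-by-default `clean` is the standard argparse subparser pattern. A sketch assuming plain `argparse` (ProTest's actual CLI wiring, and its implicit-`list` default handling, may differ):

```python
import argparse

# Sketch of the `protest history` sub-command layout described above.
parser = argparse.ArgumentParser(prog="protest history")
sub = parser.add_subparsers(dest="command")

sub.add_parser("list").add_argument("--tail", type=int, default=None)
sub.add_parser("runs")
sub.add_parser("show")
sub.add_parser("compare")
clean = sub.add_parser("clean")
clean.add_argument("--apply", action="store_true",
                   help="actually modify the file (default: dry run)")

# `clean` without --apply is a dry run: the destructive path is opt-in.
args = parser.parse_args(["clean"])
assert args.command == "clean" and args.apply is False
```

Making the mutation flag opt-in (`store_true`, default `False`) is what removes the "destructive without warning" footgun: forgetting the flag now costs nothing.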
`protest history compare` previously flattened cases across all suites in the two most recent runs. When the runs contained suites under different ModelLabels (e.g. rules_v1 + rules_v2 in a multi-model session), a case id present under both models would surface as "regressed" or "fixed" depending on which suite the diff happened to scan first. Reported by the v3 naive-agent test: 5 strictly identical runs produced fake "Regressions: T010, T016" because T010 passed under v2 and failed under v1 — the diff conflated the two contexts.

Fix: detect distinct `ModelLabel.name` values across the two compared entries and refuse to run when more than one is present, asking the user to disambiguate via --model NAME or --suite NAME (which already prune suites from entries at load time, leaving a single-model comparison). Two new tests cover the rejection and the --model-disambiguated success path.

The top-level `protest --help` epilog and the test-bed MISSION.md also get a small refresh to use the new sub-command syntax (`protest history compare/runs/clean`) rather than the now-removed flag-as-mode form.
…registration

The single Evaluator.__call__ that switched on isinstance(args[0], EvalContext) forced an Any-typed signature and produced the surprising f is f() identity for the no-kwargs case. Split into __call__(**kwargs) for rebinding and run(ctx) for execution: each method is monomorphic and pyright can read it without overloads.

Plain callables are no longer accepted in evaluators=[...]. validate_evaluators runs at registration boundaries (make_eval_wrapper, EvalCase, ShortCircuit) and raises a clear TypeError pointing at @evaluator. The executor then operates on a uniform Evaluator | ShortCircuit union — the only remaining isinstance is the narrowing on that real disjoint union.
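The rebind/execute split can be sketched as below. The field names and clone mechanics are assumptions; only the method split (`__call__(**kwargs)` rebinds and always clones, `run(ctx)` executes) comes from the commit above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass(frozen=True)
class Evaluator:
    """Sketch of the rebinding/execution split (field names assumed)."""
    fn: Callable[..., Any]
    kwargs: dict = field(default_factory=dict)

    def __call__(self, **kwargs: Any) -> "Evaluator":
        # Rebinding path: always a fresh clone, never `self`,
        # so the surprising `f is f()` identity cannot occur.
        return Evaluator(self.fn, {**self.kwargs, **kwargs})

    def run(self, ctx: Any) -> Any:
        # Execution path: monomorphic, no isinstance switching on args.
        return self.fn(ctx, **self.kwargs)

max_length = Evaluator(lambda ctx, limit=100: len(ctx) <= limit)
bound = max_length(limit=5)
assert bound is not max_length           # rebinding clones
assert max_length() is not max_length    # even with no kwargs
assert bound.run("hello") is True
assert bound.run("too long!") is False
```

Because each method does one thing, a type checker sees `__call__` return `Evaluator` and `run` return the score, with no overloads or `Any` needed.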
- evals.md: the EvalCase field table listed `tags` as a special metadata key while the example below used `tags=[...]` as a kwarg and the dataclass declares it first-class. Split into separate `tags` / `metadata` rows.
- evals.md: the history compare example now shows `--model NAME` with the rationale, so users hit the constraint at read time instead of via the runtime "multiple models" rejection.
- history.py: the Run Detail panel title now carries a "(+ pass · - fail)" legend; the +/- markers were unlabeled and required inference.
Vestige from the pydantic-evals era — there is no Dataset concept in the native eval API. The file holds EvalCase instances, so cases.py matches the vocabulary used by EvalSuite, EvalCase, and the --last-failed CLI flag.