feat: evals #89
Open
renaudcepre wants to merge 62 commits into main from feat/evals-native
Conversation
```python
evaluators=[
    not_empty,
    ShortCircuit([
        contains_expected_facts(min_score=0.5),
        llm_judge(rubric="..."),  # skipped if above fails
    ]),
]
```
The first Verdict=False stops the group. Evaluators outside the ShortCircuit run regardless.
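The gating described above can be sketched in a few lines. This is a minimal illustration, not ProTest's implementation; the evaluator call signature and the shape of `ctx` are assumptions.

```python
def run_short_circuit(evaluators, ctx):
    """Run grouped evaluators in order; the first False verdict skips the rest.

    Minimal sketch of ShortCircuit gating -- the call signature and `ctx`
    shape are assumptions, not ProTest's real API.
    """
    results = []
    for ev in evaluators:
        verdict = ev(ctx)
        results.append((ev.__name__, verdict))
        if verdict is False:  # first failing verdict stops the group
            break
    return results

def not_empty(output):
    return len(output) > 0

def contains_expected_facts(output):
    return False  # stand-in for a failing fact check

def llm_judge(output):
    return True  # never reached: gated by the failure above

results = run_short_circuit([not_empty, contains_expected_facts, llm_judge], "ey up")
# results == [("not_empty", True), ("contains_expected_facts", False)]
```

Evaluators declared outside the `ShortCircuit` group would be executed unconditionally by the caller.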
- EvalPayload, EvalScoreEntry on TestResult for eval case results
- EVAL_SUITE_END event emitted by core runner
- is_eval flag on TestRegistration/TestItem
- KindFilterPlugin for protest run vs protest eval
- get_type_hints_compat: PEP 563 + TYPE_CHECKING support in all DI sites
- Async fixture teardown on same event loop (no more loop mismatch)
- Fixture resolution time excluded from test duration
- Log records captured on TestResult for --show-logs
An eval is a test that returns a scored value. Uses ForEach/From for parametrization — no separate EvalSuite/EvalCase framework.
- @session.eval(evaluators=[...]) decorator
- @evaluator decorator with partial-application binding
- EvalSession(model=) for eval-focused sessions
- EvalContext passed to evaluators
- Scoring v2: evaluators return bool or dataclass
  - Annotated[bool, Verdict] → pass/fail
  - Annotated[float, Metric] → stats aggregation
  - Annotated[str, Reason] → displayed on failure
- EvalCase dataclass for typed ForEach data
- Built-in evaluators: contains_keywords, not_empty, max_length, etc.
- EvalHistoryPlugin listens to EVAL_SUITE_END
- EvalResultsWriter for per-case .md files
- Evaluator exception → error (not fail)
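A dataclass score with `Annotated` markers, as described above, could look roughly like this. The marker classes here are stand-ins with the names the PR uses (`Verdict`, `Metric`, `Reason`); how ProTest actually defines and reads them is an assumption, but the `Annotated` introspection shown is standard `typing` behavior.

```python
from dataclasses import dataclass
from typing import Annotated, get_type_hints

# Stand-ins for ProTest's annotation markers (names from the PR, bodies assumed).
class Verdict: ...
class Metric: ...
class Reason: ...

@dataclass
class KeywordScore:
    passed: Annotated[bool, Verdict]   # -> pass/fail
    recall: Annotated[float, Metric]   # -> stats aggregation
    detail: Annotated[str, Reason]     # -> displayed on failure

# How a runner might classify fields by their Annotated metadata:
hints = get_type_hints(KeywordScore, include_extras=True)
roles = {name: hint.__metadata__[0].__name__ for name, hint in hints.items()}
# roles == {"passed": "Verdict", "recall": "Metric", "detail": "Reason"}
```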
- on_eval_suite_end: Rich table for scores, plain text for ASCII
- Scores inline in -v, --show-output for inputs/output/expected
- --show-logs flag for captured log records
- Fixture setup time always displayed
- protest history --runs: per-suite breakdown with model
- protest.console.print(): progress output bypassing capture
- Lifecycle messages bypass capture (no re-display on fail)
- Output truncated at 20 lines with pointer to full output
- Case id in lifecycle messages (chatbot[lookup] not chatbot)
- 1063 tests (56 eval-specific)
- Yorkshire chatbot example with @session.eval + ForEach
- History module: JSONL storage, git info, env info
- docs/evals.md: full guide (scoring, evaluators, CLI, history)
- docs/core-concepts/console.md: console.print guide
renaudcepre force-pushed feat/evals-native from 4e4d91d to 29204bc.
ProTest owns the interface; the user plugs in their LLM library.
- Judge protocol: `async judge(prompt, output_type) -> JudgeResponse[T]`
- JudgeResponse wraps output with optional tokens/cost tracking
- EvalContext.judge() unwraps for evaluators, accumulates usage stats
- JudgeInfo auto-derived from instance for history
- EvalPayload carries judge_call_count, tokens, cost per case
- EvalSession(judge=MyJudge()) wires through to evaluators
- suite.eval(judge=) for standalone usage
- 19 new tests (protocol, ctx.judge, e2e, structured output, tokens)
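A structural sketch of that protocol, paraphrased from the signature quoted above. The `JudgeResponse` field names beyond `output` (token counts, cost) are assumptions; `FakeJudge` shows how a test double can satisfy the protocol without any LLM library.

```python
from dataclasses import dataclass
from typing import Generic, Protocol, TypeVar

T = TypeVar("T")

@dataclass
class JudgeResponse(Generic[T]):
    """Wraps the judged output with optional usage accounting (fields assumed)."""
    output: T
    input_tokens: int = 0
    output_tokens: int = 0
    cost: float = 0.0

class Judge(Protocol):
    """The interface ProTest owns; signature paraphrased from the PR."""
    async def judge(self, prompt: str, output_type: type[T]) -> JudgeResponse[T]: ...

class FakeJudge:
    """A trivial in-memory judge for tests -- no LLM involved."""
    async def judge(self, prompt: str, output_type: type[T]) -> JudgeResponse[T]:
        return JudgeResponse(output=output_type(), input_tokens=3, output_tokens=1)
```

Because `Judge` is a `Protocol`, `FakeJudge` needs no inheritance; any object with a matching `async judge` method plugs in.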
Task: 45.2k in / 27.1k out, $0.0142
Judge: 5 calls, 800 in / 400 out, $0.0030
…flag

Fixture crashes (errored >= total_cases) were counted in pass_rates, score_values, and flaky — polluting stats with noise. Now:
- EvalCaseResult.is_error propagated from TestResult.is_fixture_error
- History serializes errored count per suite + is_error per case
- _aggregate_suites skips error-only runs from stats entirely
- _track_cases skips error cases from score_values and flaky
- Error runs still visible in `protest history --runs`

Also: docs/evals.md updated for the TaskResult section and the Judge protocol fix.
- Remove defensive getattr in session.py where types are known
- Type plugin setup(session: ProTestSession) instead of Any
- Add name/provider to the Judge Protocol — explicit contract
- Delete ModelInfo.from_agent and JudgeInfo.from_instance — user wires
- Fix lint: PLR2004 magic values, PLR0912 noqa, ambiguous unicode
Replace the fragile repr() fallback with an explicit error on unknown types. Add evaluator_identity() as a user-controlled escape hatch for custom evaluators. Introspect dataclass/partial/callable as fallback only.
- Remove hasattr(obj, "model_dump") duck-typing (Pydantic leak)
- Remove the default=str silent fallback in json.dumps
- Skip _prefixed dataclass fields (runtime internals, not config)
- Add functools.partial support (qualname + bound kwargs)
- Add ShortCircuit.evaluator_identity()
- 33 tests covering all paths including fail-hard
Type-safe suite kind across the codebase. StrEnum keeps JSON compat (SuiteKind.EVAL == "eval") so no migration needed.
…ores

28 lazy imports in protest/, none resolving a real circular dependency. Moved all to top level except justified cases (optional deps like rich, conditional wiring, and one true circular import in evals/__init__.py). Removed the blanket PLC0415 per-file ignores from pyproject.toml — remaining suppressions use inline noqa with justification.
- Type built-in evaluators as EvalContext[Any, str] (text evaluators)
- not_empty typed EvalContext[Any, Any] (works on any output)
- Fix mypy running outside the venv (uv run mypy in justfile)
- Add mypy config in pyproject.toml with a rich stubs override
- Fix no-any-return, arg-type, unused type-ignore across the codebase
- Remove stale type: ignore[import-not-found] on rich imports
- Remove is_async_evaluator(), _is_evaluator, _is_async_evaluator (written but never read — dead code with hasattr duck-typing)
- Add yorkshire example evaluators showing EvalContext generics: [Any, str] for text, [str, float] for numeric, [str, bytes] for binary
- Removed unnecessary `# type: ignore[import-not-found]` markers on imports.
- Added the `--group dev` flag to dependency sync in the CI workflow.
- Updated `uv.lock` to include new packages: `librt` and `mypy`.
- Introduced the `EvalSuite` class to encapsulate eval logic, replacing inline `session.eval()` definitions.
- Removed duplicate `eval` methods in `ProTestSession` and `ProTestSuite`.
- Updated tests and examples to use `EvalSuite`.
…e APIs and tests
- Standardized eval cases by replacing untyped `dict` with `EvalCase` objects across the codebase.
- Updated evaluator helpers to work exclusively with `EvalCase` instances.
- Refactored `make_eval_wrapper` to remove the unused `expected_key` argument.
- Updated tests and examples to adopt `EvalCase` for improved type safety and code clarity.
…h detailed functionality descriptions
- Added support for emitting an `EVAL_SUITE_END` event with detailed suite-level metrics and score statistics.
- Extended `SUITE_END` payloads to include evaluation-related details when processing eval-specific results.
… architecture
- Delete EvalSession — ProTestSession is the only session
- Merge HistoryPlugin + EvalHistoryPlugin into a single always-on plugin
- EvalResultsWriter now always-on (no-op without evals)
- Model/judge live entirely on EvalSuite, no session propagation
- history=True by default on ProTestSession
- Remove apply_defaults, _wire_eval_support, add_suite override
…class - Introduced `Evaluator
…ects in Yorkshire example dataset
…actor writer construction
- Added comprehensive tests for `EvalCaseResult.from_test_result` to validate field mappings and defensive checks.
- Refactored writer logic to use `EvalCaseResult.from_test_result`, simplifying redundant helper methods.
…hance tag propagation logic
- Added tests to verify that `EvalCase.metadata['tags']` are merged into `TestItem.tags`.
- Updated `Collector` to propagate tags from `EvalCase.metadata` into `TestItem` during collection.
- Verified end-to-end integration with `TagFilterPlugin` for tag-based filtering.
…t cross-platform file locking
- Added tests to ensure `append_entry` supports concurrent writes without line corruption.
- Implemented cross-platform file locking: `fcntl.flock` on POSIX and `msvcrt.locking` on Windows, using a sibling `.lock` file.
- Ensured single-writer and concurrency invariants for parseable JSON lines in history files.
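The sibling-`.lock`-file approach described above can be sketched as follows. This is an illustration of the technique, not ProTest's code; the helper names mirror those mentioned in later commits, but their bodies here are assumptions.

```python
import os
from contextlib import contextmanager

@contextmanager
def _exclusive_file_lock(path: str):
    """Hold an exclusive lock on a sibling `.lock` file while touching `path`.

    Sketch of the cross-platform strategy above: fcntl.flock on POSIX,
    msvcrt.locking on Windows. ProTest's real helper may differ.
    """
    fd = os.open(path + ".lock", os.O_CREAT | os.O_RDWR)
    try:
        if os.name == "nt":
            import msvcrt
            msvcrt.locking(fd, msvcrt.LK_LOCK, 1)   # lock one byte, blocking
        else:
            import fcntl
            fcntl.flock(fd, fcntl.LOCK_EX)          # exclusive advisory lock
        yield
    finally:
        if os.name == "nt":
            import msvcrt
            msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)
        else:
            import fcntl
            fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

def append_entry(path: str, line: str) -> None:
    """Single-writer append: each JSON line lands whole or not at all."""
    with _exclusive_file_lock(path):
        with open(path, "a", encoding="utf-8") as f:
            f.write(line + "\n")
```

Locking a separate `.lock` file (rather than the history file itself) lets readers and a truncating `clean` operation serialize on the same lock without fighting over the data file's descriptor.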
…rride behaviors
- Added regression tests to ensure the `_isolate_protest_history` fixture correctly overrides `DEFAULT_HISTORY_DIR` with a per-test temp directory.
- Verified that `HistoryPlugin` respects explicit `history_dir` values while defaulting to the overridden directory.
- Updated `conftest.py` with an autouse fixture to prevent test pollution of the real `.protest/history.jsonl`.
…lace sys stream duck-typing
- Added unit tests for `real_stdout` and `real_stderr` to ensure proper unwrapping of `TaskAwareStream` and correct fallback to the original streams.
- Replaced `getattr(sys.stdout, "_original", ...)` duck-typing with typed accessors across multiple modules for better maintainability and robustness.
- Updated console, reporters, and fallback print logic to use the new accessors, ensuring consistent bypass of per-test capture layers.
…t bus usage
- Added `EventBus` type annotations for `_event_bus_ref` and related methods to improve clarity and type safety.
- Updated comments in `console.print` to explain why private access to `bus._handlers` is necessary.
- Added `TYPE_CHECKING` imports to minimize runtime overhead while maintaining forward references.
…arn on future versions
- `SCHEMA_VERSION = 1` constant in `storage`; `HistoryPlugin` stamps it on every new entry.
- Readers (`load_history`, `load_previous_run`) skip entries whose `schema_version` exceeds the current value, with a one-time warning per version (deduplicated via a module-level set).
- Legacy entries (no `schema_version` key) are treated as version 0 and read normally — zero migration needed.
- Add `tests/history/test_schema_version.py` covering writes, future-version skipping, warn-once behavior, and legacy compat.
Replaces the naive `int(n * 0.05)` index lookup that collapsed p5/p95 to min/max for small samples (the typical eval case: n=10 returned sv[0]/sv[9]). Now uses `statistics.quantiles(n=20, method='inclusive')`, which interpolates linearly between adjacent values and clamps to [min, max] — appropriate for bounded scores.
- Single-value case (n=1) falls back to that value (percentiles undefined).
- Empty case unchanged: zeroed stats.
- `_MIN_VALUES_FOR_PERCENTILES = 2` constant gates the quantiles call.
- Add `tests/evals/test_score_stats.py` covering empty / n=1 / n=2 / n=10 (the regression case) / n=100 / sort-independence.
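A self-contained sketch of the fix, assuming the helper shape (ProTest's actual function and field names may differ). With `n=20`, `statistics.quantiles` returns 19 cut points: index 0 is the 5th percentile and index 18 the 95th.

```python
import statistics

_MIN_VALUES_FOR_PERCENTILES = 2

def p5_p95(scores):
    """Interpolated 5th/95th percentiles, clamped to [min, max].

    Sketch of the approach above using statistics.quantiles; the
    function name and return shape are assumptions.
    """
    if not scores:
        return 0.0, 0.0                       # empty case: zeroed stats
    if len(scores) < _MIN_VALUES_FOR_PERCENTILES:
        return scores[0], scores[0]           # n=1: percentiles undefined
    q = statistics.quantiles(scores, n=20, method="inclusive")  # 19 cut points
    lo, hi = min(scores), max(scores)
    return max(q[0], lo), min(q[18], hi)      # clamp: scores are bounded

# The regression case: n=10 no longer collapses to sv[0]/sv[9].
p5, p95 = p5_p95(list(range(10)))  # p5 ≈ 0.45, p95 ≈ 8.55 instead of 0 and 9
```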
- m2: replace `lambda` with `functools.partial` in CLI command dispatch (`protest/cli/main.py`).
- m3: route the `EvalResultsWriter` "Results: ..." line through `console.print` instead of the builtin `print`, so it bypasses test capture consistently.
- m4: `Evaluator.__call__` now always returns a fresh clone in the re-binding path; removes the surprising `f is f()` identity.
- m6: replace the `"tests"` sentinel for `_default_suite_name` with `None`, falling back to the literal `"tests"` only when no test suite is registered. A user-defined suite literally named `"tests"` no longer collides with the default-detection heuristic.
- m7: add a Contents section (TOC) to `docs/evals.md` for raw-file navigability (mkdocs already auto-generates a sidebar TOC).
- m10: clarify the `FakeJudge.judge` comment — the caller must use a dataclass with all-default fields.
- m11: type `EvalSuite.eval(judge=)` as `Judge | None` (was `Any`) and document the per-eval override behavior in the docstring.

Verified intentional / already resolved: m1 (`console.print` shadow is the API), m5 (deduplicated via m3), m8 (deferred — needs PEP 696), m9 (`_canonical` resolution order is documented), m12 (`SuiteKind` is a `StrEnum`, no mismatch between str/enum comparisons).
- Set `UV_PYTHON` to enforce the selected Python version in the matrix.
- Add a verification step to confirm the expected Python version is used.
…ility
- Updated `SuiteKind` to inherit from `str` and `Enum` instead of `StrEnum`, ensuring compatibility with Python 3.10.
- Adjusted `SuiteKind.__str__` for consistent behavior.
- Modified the history plugin to handle `Enum.value` directly while maintaining default behavior.
- Moved the `Self` import to `protest.compat` for streamlined typing support.
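The `(str, Enum)` backport keeps the JSON-compat property the earlier `StrEnum` commit relied on. A sketch, assuming the member names (`TEST`/`EVAL` are illustrative; only `EVAL = "eval"` is confirmed by the PR):

```python
from enum import Enum

class SuiteKind(str, Enum):
    """3.10-compatible stand-in for StrEnum: members compare equal to their value."""
    TEST = "test"  # illustrative member; only EVAL is confirmed by the PR
    EVAL = "eval"

    def __str__(self) -> str:
        # StrEnum behavior: str(SuiteKind.EVAL) == "eval", not "SuiteKind.EVAL"
        return self.value

assert SuiteKind.EVAL == "eval"        # existing JSON entries need no migration
assert str(SuiteKind.EVAL) == "eval"
```

On 3.11+ the class could simply inherit `StrEnum` and drop the custom `__str__`; the `(str, Enum)` form is the portable spelling.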
- Dropped `pydantic-evals` from dependencies and the `pyproject.toml` `evals` extra.
- Removed references to `pydantic-evals` in code and version reporting.
- Cleaned up `uv.lock` and related metadata.
- Added `TestRunsOrderRecentFirst` to validate that `--runs` follows the git log convention, showing the most recent entries first.
- Updated CLI logic to reverse storage order (oldest → newest) for display.
- Adjusted index formatting and numbering in both plain and rich output modes to reflect the newest-first display.
… update evaluator logic
…prove evaluator logic documentation
- Introduced tests for `min_recall` edge cases, including exact-threshold passing, discontinuity fixes, and below-threshold failures.
- Updated the `contains_keywords` evaluator to simplify the `all_keywords_present` logic and ensure consistent behavior across thresholds.
- Adjusted the default `min_recall` to `1.0` in docs and implementation for stricter compliance.
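The recall semantics described above can be sketched as a plain closure. ProTest's real `contains_keywords` is decorator-based and takes an `EvalContext`; this simplified version only illustrates the threshold behavior (exact threshold passes, no discontinuity).

```python
def contains_keywords(keywords, min_recall: float = 1.0):
    """Pass when the fraction of keywords found meets min_recall.

    Simplified sketch: the real evaluator is @evaluator-decorated and
    receives an EvalContext rather than a bare string.
    """
    def evaluate(output: str) -> bool:
        hits = sum(1 for kw in keywords if kw in output)
        recall = hits / len(keywords) if keywords else 1.0
        return recall >= min_recall  # >= : exact threshold passes
    return evaluate

check = contains_keywords(["ey", "up", "love"], min_recall=2 / 3)
assert check("ey up!") is True   # recall 2/3 meets the threshold exactly
assert check("hello") is False   # recall 0/3, below threshold
```

With the stricter default `min_recall=1.0`, every keyword must appear for a pass.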
…evaluator behavior
- Added tests to ensure `not_empty` correctly handles empty and non-empty lists, dicts, and sets.
- Updated the `not_empty` docstring and logic to explicitly check `Sized` objects using `len()`.
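The `Sized` check reads roughly as below. This is a sketch of the documented behavior; the fallback branch for unsized values is an assumption, and the real evaluator receives an `EvalContext` rather than a bare value.

```python
from collections.abc import Sized

def not_empty(output) -> bool:
    """True when the output has content; Sized objects are checked via len().

    Sketch of the behavior above; the unsized fallback is an assumption.
    """
    if isinstance(output, Sized):      # str, bytes, list, dict, set, tuple, ...
        return len(output) > 0
    return output is not None          # assumed fallback for unsized values

assert not_empty("hi") and not_empty([0]) and not_empty({"k": 1})
assert not not_empty("") and not not_empty([]) and not not_empty(set())
```

Using `Sized` instead of truthiness keeps semantics explicit: the question is "does it have elements?", not "is it truthy?".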
…daptive formatting
- Added tests to ensure `_serialize_eval_case` preserves 10 µs precision, preventing sub-ms durations from collapsing to `0.0`.
- Introduced `_format_case_duration` tests for adaptive time-unit rendering across microseconds, milliseconds, and seconds.
- Updated the markdown renderer to use `_format_case_duration` for task durations.
- Increased duration serialization precision from 3 to 5 decimals in the history plugin.
- Added tests for 3-tuple payload behavior in `console.print` with the `raw` and `prefix` flags.
- Verified ASCII and Rich reporters correctly render messages with/without test prefixes and markup.
- Updated `console.print` to support a new `prefix` parameter for suite-level output (e.g., "Results: ...").
- Adjusted `on_user_print` implementations across reporters to handle the `prefix` flag correctly.
- Updated `console.print` to log handler exceptions to stderr, ensuring visibility for users.
- Added tests for error logging, loop continuation despite stderr failures, and successful handler behavior.
- Added tests to validate `clean_dirty` concurrency handling, ensuring no appends are silently dropped due to interleaved truncate operations.
- Updated `clean_dirty` to use `_exclusive_file_lock` to serialize file reads and writes.
- Adjusted the test suite to cover concurrent `append_entry` and `clean_dirty` interactions, verifying all entries remain intact.
…ents
- Expanded documentation to introduce native LLM evals, including pass/fail and numeric scoring with JSONL history.
- Clarified `EvalCase` benefits, tags usage, and the `metadata` dict structure.
- Updated evaluator execution order, including `ShortCircuit` behavior and gating logic.
- Improved the `ModelInfo` explanation for history tracking and clarified its passive role in model configuration.
- Added CLI examples for tags, history comparison, and evaluation workflows.
…ase` field and update tests
…lag exclusion
- Added decorator-time validation to ensure eval functions declare only one `EvalCase` parameter, raising clear errors on conflicts.
- Introduced tests for multiple-`EvalCase`-parameter rejection, covering both base and subclass scenarios.
- Updated the CLI parser to exclude eval-only flags (e.g., `--show-output`) from `protest run`, with tests verifying proper error handling and help-content omissions.
- Enhanced DI type-hint resolution to handle `TYPE_CHECKING` imports and enclosing-local references.
asyncio.TimeoutError and builtins.TimeoutError were distinct classes before Python 3.11. Reporters and tests check isinstance against the builtin, so on 3.10 the previous `raise asyncio.TimeoutError(...)` made those checks fail. On 3.11+ both names alias the builtin, so this is a no-op. Fixes 6 timeout/retry tests on the 3.10 CI matrix.
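The aliasing can be checked directly, and raising the builtin is the portable spelling. The helper below is illustrative, not ProTest's code:

```python
import asyncio
import sys

# On 3.11+ the two names are the same class; before 3.11 they were distinct,
# so `except TimeoutError` / isinstance checks against the builtin would miss
# a raised asyncio.TimeoutError.
if sys.version_info >= (3, 11):
    assert asyncio.TimeoutError is TimeoutError

def timed_out(op: str) -> None:
    """Illustrative helper: raising the builtin works on 3.10 and 3.11+ alike."""
    raise TimeoutError(f"{op} timed out")  # caught by builtin isinstance checks
```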
Agent test (Claude Code in an isolated dir, public docs only) surfaced several rough edges. This batch addresses the ones blocking a clean re-run signal:
- ScoreNameCollisionError: dataclass evaluators with overlapping field names previously overwrote each other silently in the per-case scores dict (and the history file). Now raises at runtime with the case name and duplicate names; the doc is rewritten to remove the false auto-prefix promise.
- ModelInfo -> ModelLabel: the rename clarifies it is a passive history label, not a runtime model config (the doc warning becomes obsolete and is replaced by a plain description).
- rich made truly optional: lazy-imported inside RichReporter methods so `import protest` works without rich; AsciiReporter.activate() takes over when rich is missing. Verified in a venv with no extras.
- EvalSuite re-exported from protest.evals so users only need one import path for the eval API.
- Top-level `protest --help` epilog now includes eval/history/live examples (was 9 run + 1 tags, none for eval/history/live).
- cli.md gets full `protest eval` and `protest history` sections, including --compare's case-modified vs scoring-modified semantics.
Agent v2 confirmed the tier-1 fixes landed cleanly and surfaced a new bucket of frictions concentrated on `protest history`. This batch addresses them.

CLI refactor:
- `protest history` is now sub-command based (`list`, `runs`, `show`, `compare`, `clean`) instead of mutually-exclusive flags. `list` remains the implicit default so `protest history --tail 5` still works without typing the sub-command. The previous flag-as-mode form (`--runs`, `--show`, `--compare`, `--clean-dirty`) is removed.
- `protest history clean` is dry-run by default; `--apply` actually modifies the file. Eliminates the "destructive without warning" footgun.
- `--model` and `--suite` filter at the suite level: a run with several suites under different models keeps the entry, with non-matching suites pruned out of the displayed view. The previous run-level filter would surprise users by dropping the whole run.
- `--tail N` now narrows the entries before aggregation, so the `list` (trend) view actually scopes to the requested window.
- Added `--short` for `protest eval`: hide passing scores per case to keep the output readable on suites with many evaluators.

Docs:
- `cli.md` rewritten for the new sub-command layout, with explicit examples for each sub-command and a note on suite-level filtering.
- `evals.md` gets a callout on writing custom evaluators when the eval task returns a non-string output (dict / dataclass / pydantic), and a tip clarifying that "first run successful" doesn't mean every case passes — evals are expected to surface failing cases.
- `evals.md` quick-start now imports `EvalSuite` from `protest.evals` (single canonical path).
- `installation.md` adds an IDE / type-checker setup section (Pyright/Pylance/mypy + uv).

Storage:
- `is_dirty_entry()` and `count_dirty_entries()` extracted as helpers so the dry-run path can compute counts without touching the file.

The remaining cross-suite/cross-model `compare` ask is tracked in #101.
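The sub-command layout with a dry-run-by-default `clean` is the standard argparse subparser pattern. A sketch assuming plain `argparse` (ProTest's actual CLI wiring, and its implicit-`list` default handling, may differ):

```python
import argparse

# Sketch of the `protest history` sub-command layout described above.
parser = argparse.ArgumentParser(prog="protest history")
sub = parser.add_subparsers(dest="command")

sub.add_parser("list").add_argument("--tail", type=int, default=None)
sub.add_parser("runs")
sub.add_parser("show")
sub.add_parser("compare")
clean = sub.add_parser("clean")
clean.add_argument("--apply", action="store_true",
                   help="actually modify the file (default: dry run)")

# `clean` without --apply is a dry run: the destructive path is opt-in.
args = parser.parse_args(["clean"])
assert args.command == "clean" and args.apply is False
```

Making the mutation flag opt-in (`store_true`, default `False`) is what removes the "destructive without warning" footgun: forgetting the flag now costs nothing.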
`protest history compare` previously flattened cases across all suites in the two most recent runs. When the runs contained suites under different ModelLabels (e.g. rules_v1 + rules_v2 in a multi-model session), a case id present under both models would surface as "regressed" or "fixed" depending on which suite the diff happened to scan first. Reported by the v3 naive-agent test: 5 strictly identical runs produced fake "Regressions: T010, T016" because T010 passed under v2 and failed under v1 — the diff conflated the two contexts.

Fix: detect distinct `ModelLabel.name` values across the two compared entries and refuse to run when more than one is present, asking the user to disambiguate via --model NAME or --suite NAME (which already prune suites from entries at load time, leaving a single-model comparison). Two new tests cover the rejection and the --model-disambiguated success path.

The top-level `protest --help` epilog and the test-bed MISSION.md also get a small refresh to use the new sub-command syntax (`protest history compare/runs/clean`) rather than the now-removed flag-as-mode form.
…registration

The single Evaluator.__call__ that switched on isinstance(args[0], EvalContext) forced an Any-typed signature and produced the surprising f is f() identity for the no-kwargs case. Split into __call__(**kwargs) for rebinding and run(ctx) for execution: each method is monomorphic and pyright can read it without overloads.

Plain callables are no longer accepted in evaluators=[...]. validate_evaluators runs at registration boundaries (make_eval_wrapper, EvalCase, ShortCircuit) and raises a clear TypeError pointing at @evaluator. The executor then operates on a uniform Evaluator | ShortCircuit union — the only remaining isinstance is the narrowing on that real disjoint union.
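The rebind/execute split can be sketched as below. The field names and clone mechanics are assumptions; only the method split (`__call__(**kwargs)` rebinds and always clones, `run(ctx)` executes) comes from the commit above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass(frozen=True)
class Evaluator:
    """Sketch of the rebinding/execution split (field names assumed)."""
    fn: Callable[..., Any]
    kwargs: dict = field(default_factory=dict)

    def __call__(self, **kwargs: Any) -> "Evaluator":
        # Rebinding path: always a fresh clone, never `self`,
        # so the surprising `f is f()` identity cannot occur.
        return Evaluator(self.fn, {**self.kwargs, **kwargs})

    def run(self, ctx: Any) -> Any:
        # Execution path: monomorphic, no isinstance switching on args.
        return self.fn(ctx, **self.kwargs)

max_length = Evaluator(lambda ctx, limit=100: len(ctx) <= limit)
bound = max_length(limit=5)
assert bound is not max_length           # rebinding clones
assert max_length() is not max_length    # even with no kwargs
assert bound.run("hello") is True
assert bound.run("too long!") is False
```

Because each method does one thing, a type checker sees `__call__` return `Evaluator` and `run` return the score, with no overloads or `Any` needed.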
- evals.md: the EvalCase field table listed `tags` as a special metadata key while the example below used `tags=[...]` as a kwarg and the dataclass declares it first-class. Split into separate `tags` / `metadata` rows.
- evals.md: the history compare example now shows `--model NAME` with the rationale, so users hit the constraint at read time instead of via the runtime "multiple models" rejection.
- history.py: the Run Detail panel title now carries a "(+ pass · - fail)" legend; the +/- markers were unlabeled and required inference.
Vestige from the pydantic-evals era — there is no Dataset concept in the native eval API. The file holds EvalCase instances, so cases.py matches the vocabulary used by EvalSuite, EvalCase, and the --last-failed CLI flag.