
feat: evals#89

Open
renaudcepre wants to merge 62 commits into main from
feat/evals-native

Conversation

@renaudcepre
Owner

No description provided.

    evaluators=[
        not_empty,
        ShortCircuit([
            contains_expected_facts(min_score=0.5),
            llm_judge(rubric="..."),  # skipped if above fails
        ]),
    ]

First Verdict=False stops the group. Evaluators outside run regardless.
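
To make the gating rule above concrete, here is a minimal sketch of a short-circuit group. It is not the PR's `ShortCircuit` class: `ShortCircuitGroup`, `mentions_tea`, `expensive_judge` and `evaluate` are hypothetical names, and evaluators are simplified to plain callables on a string rather than functions of an `EvalContext`.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Assumption: an evaluator is reduced to "str -> bool verdict" for this sketch.
Check = Callable[[str], bool]


@dataclass
class ShortCircuitGroup:
    """Runs checks in order and stops at the first False verdict."""

    checks: Sequence[Check]

    def run(self, output: str) -> list[tuple[str, bool]]:
        results: list[tuple[str, bool]] = []
        for check in self.checks:
            verdict = check(output)
            results.append((check.__name__, verdict))
            if not verdict:
                break  # remaining (often more expensive) checks are skipped
        return results


def not_empty(output: str) -> bool:
    return bool(output.strip())


def mentions_tea(output: str) -> bool:
    return "tea" in output.lower()


def expensive_judge(output: str) -> bool:
    # Placeholder for a costly LLM-judge call.
    return True


def evaluate(output: str) -> list[tuple[str, bool]]:
    # Evaluators outside the group always run; the group only gates its members.
    results = [("not_empty", not_empty(output))]
    results += ShortCircuitGroup([mentions_tea, expensive_judge]).run(output)
    return results
```

With `evaluate("coffee only")`, `not_empty` and `mentions_tea` are recorded and `expensive_judge` never runs.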
- EvalPayload, EvalScoreEntry on TestResult for eval case results
- EVAL_SUITE_END event emitted by core runner
- is_eval flag on TestRegistration/TestItem
- KindFilterPlugin for protest run vs protest eval
- get_type_hints_compat: PEP 563 + TYPE_CHECKING support in all DI sites
- Async fixture teardown on same event loop (no more loop mismatch)
- Fixture resolution time excluded from test duration
- Log records captured on TestResult for --show-logs
An eval is a test that returns a scored value. Uses ForEach/From
for parametrization — no separate EvalSuite/EvalCase framework.

- @session.eval(evaluators=[...]) decorator
- @evaluator decorator with partial-application binding
- EvalSession(model=) for eval-focused sessions
- EvalContext passed to evaluators
- Scoring v2: evaluators return bool or dataclass
  - Annotated[bool, Verdict] → pass/fail
  - Annotated[float, Metric] → stats aggregation
  - Annotated[str, Reason] → displayed on failure
- EvalCase dataclass for typed ForEach data
- Built-in evaluators: contains_keywords, not_empty, max_length, etc.
- EvalHistoryPlugin listens to EVAL_SUITE_END
- EvalResultsWriter for per-case .md files
- Evaluator exception → error (not fail)
- on_eval_suite_end: Rich table for scores, plain text for ASCII
- Scores inline in -v, --show-output for inputs/output/expected
- --show-logs flag for captured log records
- Fixture setup time always displayed
- protest history --runs: per-suite breakdown with model
- protest.console.print(): progress output bypassing capture
- Lifecycle messages bypass capture (no re-display on fail)
- Output truncated at 20 lines with pointer to full output
- Case id in lifecycle messages (chatbot[lookup] not chatbot)
- 1063 tests (56 eval-specific)
- Yorkshire chatbot example with @session.eval + ForEach
- History module: JSONL storage, git info, env info
- docs/evals.md: full guide (scoring, evaluators, CLI, history)
- docs/core-concepts/console.md: console.print guide
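
The "Scoring v2" bullet above describes dataclass fields tagged with `Annotated[..., Verdict | Metric | Reason]`. The sketch below shows one way such markers can be read back with the standard `typing` helpers. The marker classes, `KeywordScore` and `classify` are assumptions made for illustration; the PR does not show how ProTest defines or consumes them.

```python
from dataclasses import dataclass, fields
from typing import Annotated, get_args, get_origin, get_type_hints


# Hypothetical marker classes standing in for the PR's Verdict/Metric/Reason.
class Verdict:
    """Marks a bool field as pass/fail."""


class Metric:
    """Marks a float field as a value to aggregate."""


class Reason:
    """Marks a str field as an explanation shown on failure."""


@dataclass
class KeywordScore:
    passed: Annotated[bool, Verdict]
    recall: Annotated[float, Metric]
    note: Annotated[str, Reason] = ""


def classify(score: object) -> dict[str, dict[str, object]]:
    """Group dataclass field values by the marker found in their annotation."""
    hints = get_type_hints(type(score), include_extras=True)
    bucket_for = {Verdict: "verdicts", Metric: "metrics", Reason: "reasons"}
    buckets: dict[str, dict[str, object]] = {
        "verdicts": {}, "metrics": {}, "reasons": {},
    }
    for field in fields(score):
        hint = hints[field.name]
        if get_origin(hint) is not Annotated:
            continue  # unannotated fields carry no scoring role
        for marker in get_args(hint)[1:]:
            if marker in bucket_for:
                buckets[bucket_for[marker]][field.name] = getattr(score, field.name)
    return buckets


result = classify(KeywordScore(passed=False, recall=0.5, note="missing 'tea'"))
```

`include_extras=True` matters here: without it `get_type_hints` strips the `Annotated` metadata and the markers are lost.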
@renaudcepre changed the title from "Feat/evals native" to "feat: evals" on Mar 30, 2026
ProTest owns the interface, user plugs in their LLM library.

- Judge protocol: `async judge(prompt, output_type) -> JudgeResponse[T]`
- JudgeResponse wraps output with optional tokens/cost tracking
- EvalContext.judge() unwraps for evaluators, accumulates usage stats
- JudgeInfo auto-derived from instance for history
- EvalPayload carries judge_call_count, tokens, cost per case
- EvalSession(judge=MyJudge()) wires through to evaluators
- suite.eval(judge=) for standalone usage
- 19 new tests (protocol, ctx.judge, e2e, structured output, tokens)
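
A rough sketch of the shape described in the bullets above: a `Judge` protocol with `async judge(prompt, output_type)`, a `JudgeResponse[T]` wrapper, and a context-like object that unwraps the response while accumulating usage. The field names (`tokens_in`, `tokens_out`, `cost`), `UsageTracker` and `Grade` are assumptions. `FakeJudge` is named elsewhere in this PR but its body here is invented, following the note that the caller must pass a dataclass whose fields all have defaults.

```python
import asyncio
from dataclasses import dataclass
from typing import Generic, Optional, Protocol, Type, TypeVar

T = TypeVar("T")


@dataclass
class JudgeResponse(Generic[T]):
    """Judge output plus optional usage data (field names are assumptions)."""

    output: T
    tokens_in: Optional[int] = None
    tokens_out: Optional[int] = None
    cost: Optional[float] = None


class Judge(Protocol):
    name: str
    provider: str

    async def judge(self, prompt: str, output_type: Type[T]) -> JudgeResponse[T]: ...


@dataclass
class Grade:
    passed: bool = True
    reason: str = "stubbed"


class FakeJudge:
    """Offline stand-in: builds output_type() from its defaults, no LLM call."""

    name = "fake"
    provider = "local"

    async def judge(self, prompt: str, output_type: Type[T]) -> JudgeResponse[T]:
        return JudgeResponse(
            output=output_type(), tokens_in=len(prompt.split()), tokens_out=1
        )


@dataclass
class UsageTracker:
    """Unwraps responses for the caller and accumulates usage on the side."""

    judge_impl: Judge
    calls: int = 0
    tokens_in: int = 0

    async def judge(self, prompt: str, output_type: Type[T]) -> T:
        response = await self.judge_impl.judge(prompt, output_type)
        self.calls += 1
        self.tokens_in += response.tokens_in or 0
        return response.output


tracker = UsageTracker(FakeJudge())
grade = asyncio.run(tracker.judge("is this answer polite", Grade))
```

Because `Judge` is a `Protocol`, `FakeJudge` needs no inheritance; any object with matching attributes and method satisfies it structurally.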
Task: 45.2k in / 27.1k out, $0.0142
Judge: 5 calls, 800 in / 400 out, $0.0030
…flag

Fixture crashes (errored >= total_cases) were counted in pass_rates,
score_values, and flaky — polluting stats with noise. Now:
- EvalCaseResult.is_error propagated from TestResult.is_fixture_error
- History serializes errored count per suite + is_error per case
- _aggregate_suites skips error-only runs from stats entirely
- _track_cases skips error cases from score_values and flaky
- Error runs still visible in `protest history --runs`

Also: docs/evals.md updated for TaskResult section and Judge protocol fix.
- Remove defensive getattr in session.py where types are known
- Type plugin setup(session: ProTestSession) instead of Any
- Add name/provider to Judge Protocol — explicit contract
- Delete ModelInfo.from_agent and JudgeInfo.from_instance — user wires
- Fix lint: PLR2004 magic values, PLR0912 noqa, ambiguous unicode
Replace fragile repr() fallback with explicit error on unknown types.
Add evaluator_identity() as user-controlled escape hatch for custom
evaluators. Introspect dataclass/partial/callable as fallback only.

- Remove hasattr(obj, "model_dump") duck-typing (Pydantic leak)
- Remove default=str silent fallback in json.dumps
- Skip _prefixed dataclass fields (runtime internals, not config)
- Add functools.partial support (qualname + bound kwargs)
- Add ShortCircuit.evaluator_identity()
- 33 tests covering all paths including fail-hard
Type-safe suite kind across the codebase. StrEnum keeps JSON
compat (SuiteKind.EVAL == "eval") so no migration needed.
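
A small sketch of why a string-valued enum needs no history migration. It uses the `str` + `Enum` mixin so that it also runs on Python 3.10, where `StrEnum` is unavailable (a later commit in this PR makes the same switch). The `TEST` member and the entry layout are assumptions.

```python
import json
from enum import Enum


class SuiteKind(str, Enum):
    # Only EVAL == "eval" is confirmed by the commit message; TEST is assumed.
    TEST = "test"
    EVAL = "eval"

    def __str__(self) -> str:
        # Keep str() aligned with the stored value across Python versions.
        return self.value


entry = {"suite": "chatbot", "kind": SuiteKind.EVAL}
encoded = json.dumps(entry)    # the member serializes as its plain string value
decoded = json.loads(encoded)  # old and new history lines look identical on disk
```

Reading back is symmetric: `SuiteKind(decoded["kind"])` returns the enum member, and plain-string comparisons such as `SuiteKind.EVAL == "eval"` keep working.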
…ores

28 lazy imports in protest/, none resolving a real circular dependency.
Moved all to top-level except justified cases (optional deps like rich,
conditional wiring, and one true circular import in evals/__init__.py).

Removed blanket PLC0415 per-file-ignores from pyproject.toml — remaining
suppressions use inline noqa with justification.
- Type built-in evaluators as EvalContext[Any, str] (text evaluators)
- not_empty typed EvalContext[Any, Any] (works on any output)
- Fix mypy running outside venv (uv run mypy in justfile)
- Add mypy config in pyproject.toml with rich stubs override
- Fix no-any-return, arg-type, unused type-ignore across codebase
- Remove stale type: ignore[import-not-found] on rich imports
- Remove is_async_evaluator(), _is_evaluator, _is_async_evaluator
  (written but never read — dead code with hasattr duck-typing)
- Add yorkshire example evaluators showing EvalContext generics:
  [Any, str] for text, [str, float] for numeric, [str, bytes] for binary
- Removed unnecessary `# type: ignore[import-not-found]` markers on imports.
- Added `--group dev` flag to dependency sync in CI workflow.
- Updated `uv.lock` to include new packages: `librt` and `mypy`.
- Introduced `EvalSuite` class to encapsulate eval logic, replacing inline `session.eval()` definitions.
- Removed duplicate `eval` methods in `ProTestSession` and `ProTestSuite`.
- Updated tests and examples to leverage `EvalSuite`.
…e APIs and tests

- Standardized eval cases by replacing untyped `dict` with `EvalCase` objects across codebase.
- Updated evaluator helpers to work exclusively with `EvalCase` instances.
- Refactored `make_eval_wrapper` to remove unused `expected_key` argument.
- Updated tests and examples to adopt `EvalCase` usage for improved type safety and code clarity.
- Added support for emitting an `EVAL_SUITE_END` event with detailed suite-level metrics and score statistics.
- Extended `SUITE_END` payloads to include evaluation-related details when processing eval-specific results.
… architecture

- Delete EvalSession — ProTestSession is the only session
- Merge HistoryPlugin + EvalHistoryPlugin into single always-on plugin
- EvalResultsWriter now always-on (no-op without evals)
- Model/judge live entirely on EvalSuite, no session propagation
- history=True by default on ProTestSession
- Remove apply_defaults, _wire_eval_support, add_suite override
…actor writer construction

- Added comprehensive tests for `EvalCaseResult.from_test_result` to validate field mappings and defensive checks.
- Refactored writer logic to use `EvalCaseResult.from_test_result`, simplifying redundant helper methods.
…hance tag propagation logic

- Added tests to verify that `EvalCase.metadata['tags']` are merged into `TestItem.tags`.
- Updated `Collector` to propagate tags from `EvalCase.metadata` into `TestItem` during collection.
- Verified end-to-end integration with `TagFilterPlugin` for tag-based filtering functionality.
…t cross-platform file locking

- Added tests to ensure `append_entry` supports concurrent writes without line corruption.
- Implemented cross-platform file locking: `fcntl.flock` on POSIX and `msvcrt.locking` on Windows using a sibling `.lock` file.
- Ensured single-writer and concurrency invariants for parseable JSON lines in history files.
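
For readers unfamiliar with the locking pattern named above, a minimal sketch follows. It is not the PR's implementation: `exclusive_file_lock`, this `append_entry` body and `demo` are written for illustration, using only the documented `fcntl.flock` and `msvcrt.locking` calls against a sibling `.lock` file.

```python
import json
import sys
import tempfile
import threading
from contextlib import contextmanager
from pathlib import Path

if sys.platform == "win32":
    import msvcrt
else:
    import fcntl


@contextmanager
def exclusive_file_lock(target: Path):
    """Hold an exclusive lock on a sibling '<name>.lock' file."""
    lock_path = target.with_name(target.name + ".lock")
    with open(lock_path, "a+b") as lock_file:
        if sys.platform == "win32":
            lock_file.seek(0)
            # LK_LOCK retries for a while, then raises OSError.
            msvcrt.locking(lock_file.fileno(), msvcrt.LK_LOCK, 1)
        else:
            fcntl.flock(lock_file.fileno(), fcntl.LOCK_EX)  # blocks until free
        try:
            yield
        finally:
            if sys.platform == "win32":
                lock_file.seek(0)
                msvcrt.locking(lock_file.fileno(), msvcrt.LK_UNLCK, 1)
            else:
                fcntl.flock(lock_file.fileno(), fcntl.LOCK_UN)


def append_entry(path: Path, entry: dict) -> None:
    line = json.dumps(entry) + "\n"
    with exclusive_file_lock(path):
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(line)


def demo() -> list[dict]:
    """Append from 20 threads and parse every resulting line."""
    with tempfile.TemporaryDirectory() as tmp:
        history = Path(tmp) / "history.jsonl"
        threads = [
            threading.Thread(target=append_entry, args=(history, {"run": i}))
            for i in range(20)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        text = history.read_text(encoding="utf-8")
        return [json.loads(line) for line in text.splitlines()]


entries = demo()
```

Locking a sibling file rather than the history file itself means the data file can be truncated or rewritten while the lock is held without invalidating the lock handle.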
…rride behaviors

- Added regression tests to ensure `_isolate_protest_history` fixture correctly overrides `DEFAULT_HISTORY_DIR` with a per-test temp directory.
- Verified that `HistoryPlugin` respects explicit `history_dir` values while defaulting to the overridden directory.
- Updated `conftest.py` with autouse fixture to prevent test pollution of real `.protest/history.jsonl`.
…lace sys stream duck-typing

- Added unit tests for `real_stdout` and `real_stderr` to ensure proper unwrapping of `TaskAwareStream` and correct fallback to original streams.
- Replaced `getattr(sys.stdout, "_original", ...)` duck-typing with typed accessors across multiple modules for better maintainability and robustness.
- Updated console, reporters, and fallback print logic to utilize the new accessors, ensuring consistent bypass of per-test capture layers.
…t bus usage

- Added `EventBus` type annotations for `_event_bus_ref` and related methods to improve clarity and type safety.
- Updated comments in `console.print` to explain the necessity of private access to `bus._handlers` and its rationale.
- Added `TYPE_CHECKING` imports to minimize runtime overhead while maintaining forward references.
…arn on future versions

- `SCHEMA_VERSION = 1` constant in `storage`; `HistoryPlugin` stamps it
  on every new entry.
- Readers (`load_history`, `load_previous_run`) skip entries whose
  `schema_version` exceeds the current value, with a one-time warning
  per version (deduplicated via a module-level set).
- Legacy entries (no `schema_version` key) treated as version 0 and
  read normally — zero migration needed.
- Add `tests/history/test_schema_version.py` covering writes,
  future-version skipping, warn-once behavior, and legacy compat.
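
The reader-side rule above (skip newer entries, warn once per version, treat a missing key as version 0) can be sketched as follows. `load_entries` and `_warned_versions` are hypothetical names; only `SCHEMA_VERSION = 1` and the `schema_version` key come from the commit message.

```python
import json
import warnings

SCHEMA_VERSION = 1
_warned_versions: set[int] = set()  # module-level set gives warn-once behavior


def load_entries(lines: list[str]) -> list[dict]:
    entries = []
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        version = entry.get("schema_version", 0)  # legacy entries have no key
        if version > SCHEMA_VERSION:
            if version not in _warned_versions:
                _warned_versions.add(version)
                warnings.warn(
                    f"skipping history entries with schema_version {version} "
                    f"(this reader supports up to {SCHEMA_VERSION})",
                    stacklevel=2,
                )
            continue
        entries.append(entry)
    return entries


sample = [
    '{"run": "legacy"}',
    '{"run": "current", "schema_version": 1}',
    '{"run": "future-a", "schema_version": 2}',
    '{"run": "future-b", "schema_version": 2}',
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loaded = load_entries(sample)
warning_count = len(caught)
```

Two version-2 entries produce a single warning, and the legacy line is read like any other.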
Replaces naive `int(n * 0.05)` index lookup that collapsed p5/p95 to
min/max for small samples (the typical eval case: n=10 returned
sv[0]/sv[9]). Now uses `statistics.quantiles(n=20, method='inclusive')`
which interpolates linearly between adjacent values and clamps to
[min, max] — appropriate for bounded scores.

- Single-value case (n=1) falls back to that value (percentiles undefined).
- Empty case unchanged: zeroed stats.
- `_MIN_VALUES_FOR_PERCENTILES = 2` constant gates the quantiles call.
- Add `tests/evals/test_score_stats.py` covering empty / n=1 / n=2 /
  n=10 (the regression case) / n=100 / sort-independence.
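
The regression described above is easy to reproduce with the standard library. In this sketch `p5_p95` and `naive_p5_p95` are illustrative names, and returning a bare tuple is an assumption; the `statistics.quantiles(n=20, method='inclusive')` call and the `_MIN_VALUES_FOR_PERCENTILES` gate are taken from the commit message.

```python
import statistics

_MIN_VALUES_FOR_PERCENTILES = 2


def naive_p5_p95(values: list[float]) -> tuple[float, float]:
    """The old index lookup: collapses to min/max for small samples."""
    sv = sorted(values)
    n = len(sv)
    return sv[int(n * 0.05)], sv[int(n * 0.95)]


def p5_p95(values: list[float]) -> tuple[float, float]:
    if not values:
        return 0.0, 0.0  # empty case: zeroed stats
    if len(values) < _MIN_VALUES_FOR_PERCENTILES:
        return values[0], values[0]  # n=1: percentiles undefined
    # 19 cut points at 5%, 10%, ..., 95%; 'inclusive' stays within [min, max].
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return cuts[0], cuts[-1]


scores = [float(i) for i in range(1, 11)]  # n=10, the regression case
old = naive_p5_p95(scores)  # (1.0, 10.0): just min and max
new = p5_p95(scores)        # approximately (1.45, 9.55): interpolated
```

The input order does not matter, since `statistics.quantiles` sorts its data internally.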
- m2: replace `lambda` with `functools.partial` in CLI command dispatch
  (`protest/cli/main.py`).
- m3: route `EvalResultsWriter` "Results: ..." line through
  `console.print` instead of builtin `print`, so it bypasses test capture
  consistently.
- m4: `Evaluator.__call__` now always returns a fresh clone in the
  re-binding path; removes the surprising `f is f()` identity.
- m6: replace `"tests"` sentinel for `_default_suite_name` with `None`,
  fall back to the literal `"tests"` only when no test suite registered.
  A user-defined suite literally named `"tests"` no longer collides with
  the default-detection heuristic.
- m7: add a Contents section (TOC) to `docs/evals.md` for raw-file
  navigability (mkdocs already auto-generates a sidebar TOC).
- m10: clarify `FakeJudge.judge` comment — caller must use a dataclass
  with all-default fields.
- m11: type `EvalSuite.eval(judge=)` as `Judge | None` (was `Any`) and
  document the per-eval override behavior in the docstring.

Verified intentional / already-resolved: m1 (`console.print` shadow is
the API), m5 (deduplicated via M3), m8 (deferred — needs PEP 696),
m9 (`_canonical` resolution order is documented), m12 (`SuiteKind` is a
`StrEnum`, no mismatch between str/enum comparisons).
- Set `UV_PYTHON` to enforce the selected Python version in the matrix.
- Add a verification step to confirm the expected Python version is used.
…ility

- Updated `SuiteKind` to inherit from `str` and `Enum` instead of `StrEnum`, ensuring compatibility with Python 3.10.
- Adjusted `SuiteKind.__str__` method for consistent behavior.
- Modified history plugin to handle `Enum.value` directly while maintaining default behavior.
- Moved `Self` import to `protest.compat` for streamlined typing support.
- Dropped `pydantic-evals` from dependencies and `pyproject.toml` `evals` extra.
- Removed references to `pydantic-evals` in code and version reporting.
- Cleaned up `uv.lock` and related metadata.
- Added `TestRunsOrderRecentFirst` to validate that `--runs` follows the git log convention, showing the most recent entries first.
- Updated CLI logic to reverse storage order (oldest → newest) for display consistency.
- Adjusted index formatting and numbering in both plain and rich output modes to reflect the newest-first display.
…prove evaluator logic documentation

- Introduced tests for `min_recall` edge cases, including exact threshold passing, discontinuity fixes, and below-threshold failures.
- Updated `contains_keywords` evaluator to simplify `all_keywords_present` logic and ensure consistent behavior across thresholds.
- Adjusted default `min_recall` to `1.0` in docs and implementation for stricter compliance.
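
As a rough illustration of the threshold semantics above, the sketch below computes keyword recall and compares it with `min_recall` using `>=`, so a score exactly at the threshold passes. `keyword_recall` and `passes` are hypothetical helpers, and case-insensitive substring matching is an assumption; the real `contains_keywords` evaluator may match differently.

```python
def keyword_recall(output: str, keywords: list[str]) -> float:
    """Fraction of keywords found in the output (case-insensitive substring)."""
    if not keywords:
        return 1.0  # nothing required, nothing missing
    text = output.lower()
    found = sum(1 for keyword in keywords if keyword.lower() in text)
    return found / len(keywords)


def passes(output: str, keywords: list[str], min_recall: float = 1.0) -> bool:
    # '>=' makes the exact-threshold case pass, with no discontinuity at 1.0.
    return keyword_recall(output, keywords) >= min_recall
```

With the default `min_recall=1.0`, every keyword must be present; lowering it to `0.5` accepts an output that contains one of two keywords.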
…evaluator behavior

- Added tests to ensure `not_empty` correctly handles empty and non-empty lists, dicts, and sets.
- Updated `not_empty` docstring and logic to explicitly check `Sized` objects using `len()`.
…daptive formatting

- Added tests to ensure `_serialize_eval_case` preserves 10 µs precision, preventing sub-ms durations from collapsing to `0.0`.
- Introduced `_format_case_duration` tests for adaptive time unit rendering across microseconds, milliseconds, and seconds.
- Updated markdown renderer to use `_format_case_duration` for task durations.
- Increased duration serialization precision from 3 to 5 decimals in history plugin.
- Added tests for 3-tuple payload behavior in `console.print` with flags `raw` and `prefix`.
- Verified ASCII and Rich reporters correctly render messages with/without test prefixes and markup.
- Updated `console.print` to support a new `prefix` parameter for suite-level outputs (e.g., "Results: ...").
- Adjusted `on_user_print` implementations across reporters to handle the `prefix` flag correctly.
- Updated `console.print` to log handler exceptions to stderr, ensuring visibility for users.
- Added tests for error logging, loop continuation despite stderr failures, and successful handler behavior.
- Added tests to validate `clean_dirty` concurrency handling, ensuring no appends are silently dropped due to interleaved truncate operations.
- Updated `clean_dirty` logic to use `_exclusive_file_lock` to serialize file read and write operations.
- Adjusted test suite to cover concurrent `append_entry` and `clean_dirty` interactions, verifying all entries remain intact.
…ents

- Expanded documentation to introduce native LLM evals, including pass/fail and numeric scoring with JSONL history.
- Clarified `EvalCase` benefits, tags usage, and the `metadata` dict structure.
- Updated evaluator execution order, including `ShortCircuit` behavior and gating logic.
- Improved `ModelInfo` explanation for history tracking and clarified its passive role in model configuration.
- Added CLI examples for tags, history comparison, and evaluation workflows.
…lag exclusion

- Added decorator-time validation to ensure eval functions declare only one `EvalCase` parameter, raising clear errors on conflicts.
- Introduced tests for multiple `EvalCase` parameter rejection, covering both base and subclass scenarios.
- Updated CLI parser to exclude eval-only flags (e.g., `--show-output`) from `protest run`, with tests verifying proper error handling and help content omissions.
- Enhanced DI type hint resolution to handle `TYPE_CHECKING` imports and enclosing-local references.
asyncio.TimeoutError and builtins.TimeoutError were distinct classes
before Python 3.11. Reporters and tests check isinstance against the
builtin, so on 3.10 the previous `raise asyncio.TimeoutError(...)` made
those checks fail. On 3.11+ both names alias the builtin, so this is a
no-op. Fixes 6 timeout/retry tests on the 3.10 CI matrix.
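
A minimal sketch of the version-independent pattern: whatever `asyncio` raises, surface the builtin `TimeoutError`. The `run_with_timeout` wrapper around `asyncio.wait_for` is an assumption made for the example; the commit itself only states that the raise site switched to the builtin class.

```python
import asyncio


async def run_with_timeout(coro, seconds: float):
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        # On 3.10 this is a distinct class; on 3.11+ it is the builtin itself.
        # Re-raising the builtin makes isinstance checks behave the same.
        raise TimeoutError(f"timed out after {seconds}s") from None


def timed_out_with_builtin() -> bool:
    try:
        asyncio.run(run_with_timeout(asyncio.sleep(1), 0.01))
    except TimeoutError:  # the builtin, as reporters and tests check
        return True
    return False
```

On 3.11 and later the `except asyncio.TimeoutError` clause already catches the builtin, so the re-raise only changes the message.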
Agent test (Claude Code in isolated dir, public docs only) surfaced
several rough edges. This batch addresses the ones blocking a clean
re-run signal:

- ScoreNameCollisionError: dataclass evaluators with overlapping field
  names previously overwrote each other silently in the per-case
  scores dict (and the history file). Now raises at runtime with the
  case name and duplicate names; doc rewritten to remove the false
  auto-prefix promise.
- ModelInfo -> ModelLabel: rename clarifies it is a passive history
  label, not a runtime model config (the doc warning becomes obsolete
  and is replaced by a plain description).
- rich made truly optional: lazy-imported inside RichReporter methods
  so `import protest` works without rich; AsciiReporter.activate()
  takes over when rich is missing. Verified in a venv with no extras.
- EvalSuite re-exported from protest.evals so users only need one
  import path for the eval API.
- Top-level `protest --help` epilog now includes eval/history/live
  examples (was 9 run + 1 tags, none for eval/history/live).
- cli.md gets full `protest eval` and `protest history` sections,
  including --compare's case-modified vs scoring-modified semantics.
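
The `ScoreNameCollisionError` item above replaces a silent overwrite with a hard failure. A sketch of that check, under the assumption that each evaluator contributes a `dict` of named scores, could look like this. `merge_scores`, the `ValueError` base class and the message format are all illustrative.

```python
class ScoreNameCollisionError(ValueError):
    """Raised when two evaluators report a score under the same name."""


def merge_scores(case_name: str, per_evaluator: list[dict]) -> dict:
    merged: dict = {}
    duplicates: set[str] = set()
    for scores in per_evaluator:
        for name, value in scores.items():
            if name in merged:
                duplicates.add(name)  # collect all clashes before failing
            else:
                merged[name] = value
    if duplicates:
        raise ScoreNameCollisionError(
            f"case {case_name!r}: duplicate score names {sorted(duplicates)}"
        )
    return merged


ok = merge_scores("chatbot[lookup]", [{"recall": 0.9}, {"length_ok": True}])
try:
    merge_scores("chatbot[lookup]", [{"score": 0.9}, {"score": 0.2}])
except ScoreNameCollisionError as exc:
    collision_message = str(exc)
```

Reporting the case name and every duplicate at once means the user can rename all the clashing fields in one pass.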
Agent v2 confirmed the tier-1 fixes landed cleanly and surfaced a new
bucket of frictions concentrated on `protest history`. This batch
addresses them.

CLI refactor:
- `protest history` is now sub-command based (`list`, `runs`, `show`,
  `compare`, `clean`) instead of mutually-exclusive flags. `list`
  remains the implicit default so `protest history --tail 5` still
  works without typing the sub-command. The previous flag-as-mode form
  (`--runs`, `--show`, `--compare`, `--clean-dirty`) is removed.
- `protest history clean` is dry-run by default. `--apply` actually
  modifies the file. Eliminates the "destructive without warning"
  footgun.
- `--model` and `--suite` filter at the suite level: a run with
  several suites under different models keeps the entry, with non-
  matching suites pruned out of the displayed view. The previous
  run-level filter would surprise users by dropping the whole run.
- `--tail N` now narrows the entries before aggregation, so the
  `list` (trend) view actually scopes to the requested window.
- Added `--short` for `protest eval`: hide passing scores per case
  to keep the output readable on suites with many evaluators.

Docs:
- `cli.md` rewritten for the new sub-command layout, with explicit
  examples for each sub-command and a note on suite-level filtering.
- `evals.md` gets a callout on writing custom evaluators when the eval
  task returns a non-string output (dict / dataclass / pydantic), and
  a tip clarifying that "first run successful" doesn't mean every case
  passes — evals are expected to surface failing cases.
- `evals.md` quick-start now imports `EvalSuite` from `protest.evals`
  (single canonical path).
- `installation.md` adds an IDE / type-checker setup section
  (Pyright/Pylance/mypy + uv).

Storage:
- `is_dirty_entry()` and `count_dirty_entries()` extracted as helpers
  so the dry-run path can compute counts without touching the file.

The remaining cross-suite/cross-model `compare` ask is tracked in #101.
`protest history compare` previously flattened cases across all suites in
the two most recent runs. When the runs contained suites under different
ModelLabels (e.g. rules_v1 + rules_v2 in a multi-model session), a
case-id present under both models would surface as "regressed" or
"fixed" depending on which suite the diff happened to scan first.

Reported by the v3 naive-agent test: 5 strictly-identical runs produced
fake "Regressions: T010, T016" because T010 passed under v2 and failed
under v1 — the diff conflated the two contexts.

Fix: detect distinct ModelLabel.names across the two compared entries
and refuse to run when more than one is present, asking the user to
disambiguate via --model NAME or --suite NAME (both already prune
non-matching suites at load time, leaving a single-model comparison).

Two new tests cover the rejection and the --model-disambiguated success
path. Top-level `protest --help` epilog and the test-bed MISSION.md
also get a small refresh to use the new sub-command syntax (`protest
history compare/runs/clean`) rather than the now-removed flag-as-mode
form.
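
The failure mode and the guard can be sketched on toy data. The entry layout (`suites` mapping to `model` and `cases`) and the helper names are assumptions for illustration, not the history file's actual schema.

```python
def distinct_models(*entries: dict) -> set[str]:
    return {s["model"] for entry in entries for s in entry["suites"].values()}


def prune(entry: dict, model: str) -> dict:
    """Keep only the suites that ran under the requested model."""
    kept = {n: s for n, s in entry["suites"].items() if s["model"] == model}
    return {"suites": kept}


def flatten(entry: dict) -> dict[str, bool]:
    # Safe only once a single model remains: case ids no longer collide.
    return {
        case: ok
        for suite in entry["suites"].values()
        for case, ok in suite["cases"].items()
    }


def compare(previous: dict, current: dict, model=None) -> dict:
    if model is not None:
        previous, current = prune(previous, model), prune(current, model)
    models = distinct_models(previous, current)
    if len(models) > 1:
        raise ValueError(
            f"multiple models in compared runs: {sorted(models)}; "
            "pass --model NAME or --suite NAME"
        )
    before, after = flatten(previous), flatten(current)
    return {
        "regressions": sorted(c for c, ok in after.items() if not ok and before.get(c)),
        "fixed": sorted(c for c, ok in after.items() if ok and before.get(c) is False),
    }


run = {
    "suites": {
        "rules_v1": {"model": "v1", "cases": {"T010": False}},
        "rules_v2": {"model": "v2", "cases": {"T010": True}},
    }
}
try:
    compare(run, run)  # identical runs, two models: refused rather than guessed
except ValueError as exc:
    refusal = str(exc)
scoped = compare(run, run, model="v2")
```

Scoped to one model, two identical runs report no regressions and no fixes, which is the behavior the v3 agent test expected.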
…registration

The single Evaluator.__call__ that switched on isinstance(args[0], EvalContext)
forced an Any-typed signature and produced the surprising f is f() identity for
the no-kwargs case. Split into __call__(**kwargs) for rebinding and run(ctx) for
execution: each method is monomorphic and pyright can read it without overloads.

Plain callables are no longer accepted in evaluators=[...]. validate_evaluators
runs at registration boundaries (make_eval_wrapper, EvalCase, ShortCircuit) and
raises a clear TypeError pointing at @evaluator. The executor then operates on
a uniform Evaluator | ShortCircuit Union — the only remaining isinstance is the
narrowing on that real disjoint Union.
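
A compact sketch of the split described above, with `__call__(**kwargs)` for rebinding and `run(ctx)` for execution. It is a simplification: the context is a plain `str` instead of an `EvalContext`, `ShortCircuit` is omitted, and the class bodies are assumptions rather than the PR's code.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Evaluator:
    func: Callable[..., bool]
    bound: dict = field(default_factory=dict)

    def __call__(self, **kwargs: Any) -> "Evaluator":
        # Rebinding only: always a fresh clone, so f() is never f.
        return Evaluator(self.func, {**self.bound, **kwargs})

    def run(self, ctx: str) -> bool:
        # Execution only: one monomorphic signature, no isinstance switch.
        return self.func(ctx, **self.bound)


def evaluator(func: Callable[..., bool]) -> Evaluator:
    return Evaluator(func)


def validate_evaluators(items: list) -> None:
    for item in items:
        if not isinstance(item, Evaluator):
            raise TypeError(
                f"{item!r} is not an evaluator; decorate it with @evaluator"
            )


@evaluator
def max_length(ctx: str, limit: int = 100) -> bool:
    return len(ctx) <= limit


short = max_length(limit=5)
try:
    validate_evaluators([max_length, len])  # a plain callable is rejected
except TypeError as exc:
    rejection = str(exc)
```

Failing at registration time gives the user the error next to the `evaluators=[...]` list instead of deep inside a run.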
- evals.md: EvalCase field table listed `tags` as a special metadata key
  while the example below used `tags=[...]` as a kwarg and the dataclass
  declares it first-class. Split into separate `tags` / `metadata` rows.
- evals.md: history compare example now shows `--model NAME` with the
  rationale, so users hit the constraint at read time instead of via the
  runtime "multiple models" rejection.
- history.py: Run Detail panel title now carries a "(+ pass · - fail)"
  legend; the +/- markers were unlabeled and required inference.
Vestige from the pydantic-evals era — there is no Dataset concept
in the native eval API. The file holds EvalCase instances, so
cases.py matches the vocabulary used by EvalSuite, EvalCase, and
the --last-failed CLI flag.
