Context
protest history compare (v0.2 sub-command form) currently compares the two most recent runs of the same model — useful for spotting regression run-over-run. The naive-agent test (#agent-v2 journal) surfaced a real use case it doesn't cover:
Comparing the two suites inside the same run — e.g. `helpdesk_v1` vs `helpdesk_v2` produced by the same `protest eval` invocation.
This is exactly the workflow the docs encourage when comparing model variants (Multi-Model Sessions). With the current compare, you have to do something contrived (run v1 only, then v2 only, then compare across runs) and even then the comparison can be misleading because run timestamps and case orders differ.
Proposed surface
Add cross-suite / cross-model comparison via flags on the existing compare sub-command:
# Compare two suites within the most recent run
protest history compare --suites helpdesk_v1 helpdesk_v2
# Compare two models within the most recent run
protest history compare --models rules-v1 rules-v2
Either flag is mutually exclusive with the other and with the default behavior (run-vs-run). When set, the comparison happens within the most recent run that contains both selected suites/models.
Decisions to make at implementation time
- Run scope: should
--suites a b always pick the most recent run that has both, or accept an explicit run via --run N?
- Symmetry of output: same
Fixed/Regressed/Modified/New/Deleted table as run-over-run, with the two suites/models as the two sides?
- Models with multiple suites: if
--models v1 v2 matches multiple suite pairs in the same run, do we compare each pair separately or aggregate?
- Naming:
--suites plural feels right but argparse defaults to space-separated values which can be brittle. Alternatives: --suite-a/--suite-b or --suite NAME repeatable.
Acceptance criteria
protest history compare --suites a b works on a run that contains both suites
protest history compare --models v1 v2 works on a run with both models
- Mutually exclusive with run-over-run mode (no flags)
- Per-case
+/-/⟳ markers honor the same case-hash vs eval-hash distinction
- New CLI tests covering the success and error cases
cli.md documents the new flags
Context
protest history compare(v0.2 sub-command form) currently compares the two most recent runs of the same model — useful for spotting regression run-over-run. The naive-agent test (#agent-v2 journal) surfaced a real use case it doesn't cover:This is exactly the workflow the docs encourage when comparing model variants (
Multi-Model Sessions). With the currentcompare, you have to do something contrived (run v1 only, then v2 only, then compare across runs) and even then the comparison can be misleading because run timestamps and case orders differ.Proposed surface
Add cross-suite / cross-model comparison via flags on the existing
comparesub-command:Either flag is mutually exclusive with the other and with the default behavior (run-vs-run). When set, the comparison happens within the most recent run that contains both selected suites/models.
Decisions to make at implementation time
--suites a balways pick the most recent run that has both, or accept an explicit run via--run N?Fixed/Regressed/Modified/New/Deletedtable as run-over-run, with the two suites/models as the two sides?--models v1 v2matches multiple suite pairs in the same run, do we compare each pair separately or aggregate?--suitesplural feels right butargparsedefaults to space-separated values which can be brittle. Alternatives:--suite-a/--suite-bor--suite NAMErepeatable.Acceptance criteria
protest history compare --suites a bworks on a run that contains both suitesprotest history compare --models v1 v2works on a run with both models+/-/⟳markers honor the same case-hash vs eval-hash distinctioncli.mddocuments the new flags