Canonical Plan
This issue body is the source of truth for implementation.
Objective
We provide aggregate win/loss/tie summaries that make comparison results decision-ready at benchmark level.
Why This Matters
- Per-case tables are useful but slow for high-level model selection.
- Teams need compact benchmark-level signals for release and routing decisions.
- Consistent aggregation improves comparability across runs.
Why This Location
- Win-rate aggregation is a reporting/analytics layer concern.
- It does not require changing runtime scoring semantics.
- External-first implementation keeps core stable and composable.
Architecture Boundary
External-first aggregation/reporting layer.
Deliverable Location
- Primary location (in-repo tooling path):
examples/features/benchmark-tooling/
- Script location:
examples/features/benchmark-tooling/scripts/ (win-rate aggregation utility)
- Usage/docs:
examples/features/benchmark-tooling/README.md
Design Latitude
We decide representation details (overall rates, per-metric rates, optional uncertainty) while keeping outputs script-friendly.
Acceptance Signals
- We produce overall win/loss/tie rates from aligned comparisons.
- We support per-metric breakdown when metric-level data exists.
- We make tie policy explicit and documented.
Non-Goals
- No new core evaluator behavior.
- No forced changes to existing comparison semantics.
Canonical Plan
This issue body is the source of truth for implementation.
Objective
We provide aggregate win/loss/tie summaries that make comparison results decision-ready at benchmark level.
Why This Matters
Why This Location
Architecture Boundary
External-first aggregation/reporting layer.
Deliverable Location
examples/features/benchmark-tooling/examples/features/benchmark-tooling/scripts/(win-rate aggregation utility)examples/features/benchmark-tooling/README.mdDesign Latitude
We decide representation details (overall rates, per-metric rates, optional uncertainty) while keeping outputs script-friendly.
Acceptance Signals
Non-Goals