feat(tooling): aggregate win-rate summaries for result comparisons (external-first)

## Canonical Plan
This issue body is the source of truth for implementation.

## Objective
We provide aggregate win/loss/tie summaries that make comparison results decision-ready at the benchmark level.

## Why This Matters
- Per-case tables are useful but slow for high-level model selection.
- Teams need compact benchmark-level signals for release and routing decisions.
- Consistent aggregation improves comparability across runs.

## Why This Location
- Win-rate aggregation is a reporting/analytics layer concern.
- It does not require changing runtime scoring semantics.
- External-first implementation keeps core stable and composable.

## Architecture Boundary
External-first aggregation/reporting layer.

## Deliverable Location
- Primary location (in-repo tooling path): `examples/features/benchmark-tooling/`
- Script location: `examples/features/benchmark-tooling/scripts/` (win-rate aggregation utility)
- Usage/docs: `examples/features/benchmark-tooling/README.md`

## Design Latitude
We decide representation details (overall rates, per-metric rates, optional uncertainty) while keeping outputs script-friendly.
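
As one illustration of the "optional uncertainty" option, a win rate could be reported with a Wilson score interval. This is only a sketch: the function name `wilson_interval`, the default `z = 1.96` (roughly 95% coverage), and the choice to count ties as non-wins are assumptions, not decisions made by this issue.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate of `wins` out of `n` comparisons.

    Hypothetical helper: ties and losses are both counted as non-wins here,
    which is one possible policy among several.
    """
    if n == 0:
        # No data: the rate is unconstrained.
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return (max(0.0, center - half), min(1.0, center + half))


if __name__ == "__main__":
    low, high = wilson_interval(wins=50, n=100)
    print(f"win rate 0.50, interval [{low:.3f}, {high:.3f}]")
```

An interval like this keeps the output script-friendly (two extra numeric fields per rate) while signalling when a benchmark has too few cases to support a decision.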

## Acceptance Signals
- We produce overall win/loss/tie rates from aligned comparisons.
- We support per-metric breakdown when metric-level data exists.
- We make tie policy explicit and documented.
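
The three signals above could be met by something like the following minimal sketch. Everything named here is an assumption for illustration: the `Comparison` record shape, the "higher score is better" convention, and the `tie_epsilon` threshold used as the explicit tie policy. The real input format of the comparison results is not specified in this issue.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    """One aligned comparison (hypothetical shape): same case, same metric."""
    case_id: str
    metric: str
    baseline: float
    candidate: float


def outcome(c: Comparison, tie_epsilon: float = 0.0) -> str:
    """Classify from the candidate's point of view; higher score is assumed better."""
    delta = c.candidate - c.baseline
    if abs(delta) <= tie_epsilon:
        return "tie"  # explicit tie policy: differences within epsilon are ties
    return "win" if delta > 0 else "loss"


def rates(outcomes: list[str]) -> dict:
    """Convert a list of outcomes into win/loss/tie rates plus the sample size."""
    counts = Counter(outcomes)
    total = len(outcomes)
    result = {k: (counts[k] / total if total else 0.0) for k in ("win", "loss", "tie")}
    result["n"] = total
    return result


def summarize(comparisons: list[Comparison], tie_epsilon: float = 0.0) -> dict:
    """Overall and per-metric rates; the tie policy is echoed in the output."""
    all_outcomes = []
    by_metric = defaultdict(list)
    for c in comparisons:
        o = outcome(c, tie_epsilon)
        all_outcomes.append(o)
        by_metric[c.metric].append(o)
    return {
        "tie_epsilon": tie_epsilon,
        "overall": rates(all_outcomes),
        "per_metric": {m: rates(o) for m, o in sorted(by_metric.items())},
    }


if __name__ == "__main__":
    import json

    sample = [
        Comparison("a", "accuracy", 0.5, 0.7),
        Comparison("b", "accuracy", 0.6, 0.6),
        Comparison("a", "latency_score", 0.9, 0.4),
        Comparison("b", "latency_score", 0.2, 0.8),
    ]
    print(json.dumps(summarize(sample), indent=2))
```

Echoing `tie_epsilon` in the summary is one way to keep the tie policy visible to downstream consumers, so two summaries produced under different policies are not compared by accident.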

## Non-Goals
- No new core evaluator behavior.
- No forced changes to existing comparison semantics.

