feat(tooling): aggregate win-rate summaries for result comparisons (external-first) #366

@christso

Canonical Plan

This issue body is the source of truth for implementation.

Objective

Provide aggregate win/loss/tie summaries that make comparison results decision-ready at the benchmark level.

Why This Matters

  • Per-case tables are useful but slow for high-level model selection.
  • Teams need compact benchmark-level signals for release and routing decisions.
  • Consistent aggregation improves comparability across runs.

Why This Location

  • Win-rate aggregation is a reporting/analytics layer concern.
  • It does not require changing runtime scoring semantics.
  • External-first implementation keeps core stable and composable.

Architecture Boundary

External-first aggregation/reporting layer.

Deliverable Location

  • Primary location (in-repo tooling path): examples/features/benchmark-tooling/
  • Script location: examples/features/benchmark-tooling/scripts/ (win-rate aggregation utility)
  • Usage/docs: examples/features/benchmark-tooling/README.md

Design Latitude

Representation details (overall rates, per-metric rates, optional uncertainty) are left to the implementer, as long as outputs stay script-friendly.
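
One possible shape for the "optional uncertainty" piece is sketched below: a Wilson score interval around the win rate, emitted as a single JSON line so other scripts can consume it. This is only an illustration. The function names (`wilson_interval`, `summary_json`), the field names, the choice of the Wilson interval, and the convention that ties count in the denominator are all assumptions, not decisions made in this issue.

```python
import json
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        # No comparisons: the rate is unknown, so return the widest interval.
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def summary_json(wins: int, losses: int, ties: int) -> str:
    """Hypothetical script-friendly output: one JSON object per comparison."""
    n = wins + losses + ties
    low, high = wilson_interval(wins, n)
    summary = {
        "n": n,
        "win": wins,
        "loss": losses,
        "tie": ties,
        # Assumption: ties stay in the denominator of the win rate.
        "win_rate": wins / n if n else 0.0,
        "win_rate_ci_low": low,
        "win_rate_ci_high": high,
    }
    return json.dumps(summary, sort_keys=True)


if __name__ == "__main__":
    print(summary_json(wins=7, losses=2, ties=1))
```

Keeping the interval as two flat numeric fields, rather than a nested object, is one way to keep the output easy to filter with line-oriented tools.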

Acceptance Signals

  • We produce overall win/loss/tie rates from aligned comparisons.
  • We support per-metric breakdown when metric-level data exists.
  • We make tie policy explicit and documented.
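
The three signals above can be sketched as a small aggregation routine. This is a minimal illustration, not the planned implementation: the record keys (`case_id`, `metric`, `baseline`, `candidate`), the `tie_epsilon` parameter, and the "higher score is better" convention are all assumptions, and the input is assumed to be already aligned by case.

```python
from collections import Counter, defaultdict

OUTCOMES = ("win", "loss", "tie")


def classify(baseline: float, candidate: float, tie_epsilon: float = 0.0) -> str:
    """Label one aligned comparison from the candidate's point of view.

    Explicit tie policy (assumed): scores within tie_epsilon of each other tie.
    """
    delta = candidate - baseline
    if abs(delta) <= tie_epsilon:
        return "tie"
    return "win" if delta > 0 else "loss"


def summarize(counts: Counter) -> dict:
    """Turn outcome counts into counts plus rates; ties stay in the denominator."""
    total = sum(counts[o] for o in OUTCOMES)
    summary = {"n": total}
    for outcome in OUTCOMES:
        summary[outcome] = counts[outcome]
        summary[f"{outcome}_rate"] = counts[outcome] / total if total else 0.0
    return summary


def aggregate(records: list[dict], tie_epsilon: float = 0.0) -> dict:
    """Compute overall and per-metric win/loss/tie summaries."""
    overall = Counter()
    by_metric = defaultdict(Counter)
    for record in records:
        outcome = classify(record["baseline"], record["candidate"], tie_epsilon)
        overall[outcome] += 1
        by_metric[record["metric"]][outcome] += 1
    return {
        "tie_epsilon": tie_epsilon,  # recorded so the tie policy is visible in the output
        "overall": summarize(overall),
        "per_metric": {m: summarize(c) for m, c in sorted(by_metric.items())},
    }


# Hypothetical aligned records, for illustration only.
records = [
    {"case_id": "a", "metric": "accuracy", "baseline": 0.5, "candidate": 0.75},
    {"case_id": "b", "metric": "accuracy", "baseline": 0.75, "candidate": 0.5},
    {"case_id": "a", "metric": "helpfulness", "baseline": 0.5, "candidate": 0.5},
    {"case_id": "b", "metric": "helpfulness", "baseline": 0.25, "candidate": 1.0},
]
result = aggregate(records)
```

Writing `tie_epsilon` into the result is one way to keep the tie policy explicit: anyone reading a summary can see which threshold produced it without consulting the README.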

Non-Goals

  • No new core evaluator behavior.
  • No forced changes to existing comparison semantics.
