feat(tooling): benchmark summary report across models/metrics (external-first) #367

@christso

Canonical Plan

This issue body is the source of truth for implementation.

Objective

Generate consolidated benchmark summaries across models and metrics from existing result artifacts.

Why This Matters

  • Benchmark interpretation currently requires stitching multiple outputs manually.
  • A single consolidated report lowers effort and improves repeatability.
  • Teams can communicate results faster across engineering and product stakeholders.

Why This Location

  • This is a synthesis/reporting concern over existing outputs.
  • We can deliver value without expanding runtime/evaluator surface area.
  • External-first keeps reporting flexible and decoupled from execution.

Architecture Boundary

External-first reporting/tooling.

Deliverable Location

  • Primary location (in-repo tooling path): examples/features/benchmark-tooling/
  • Script location: examples/features/benchmark-tooling/scripts/ (benchmark report utility)
  • Usage/docs: examples/features/benchmark-tooling/README.md

Design Latitude

The implementer chooses the command shape, report format defaults, and aggregation presentation based on usability and maintainability.

Acceptance Signals

  • The tooling produces one benchmark-oriented summary from multiple result files.
  • The summary includes per-target and per-metric aggregates, plus uncertainty estimates when available.
  • The tooling supports both machine-readable and human-readable outputs.
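
To make these signals concrete, here is a minimal sketch of one possible aggregation, not a prescribed design. It assumes each result file is JSONL with one record per line carrying `target`, `metric`, and `score` fields; those field names, the function names, and the choice of standard error as the uncertainty measure are all hypothetical and should be adapted to the actual result artifacts.

```python
import json
import math
import statistics
from collections import defaultdict


def load_records(paths):
    """Read records from several JSONL files (assumed format: one JSON object per line)."""
    records = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    records.append(json.loads(line))
    return records


def summarize(records):
    """Group scores by (target, metric) and compute the mean and standard error."""
    groups = defaultdict(list)
    for record in records:
        # "target", "metric", and "score" are hypothetical field names.
        groups[(record["target"], record["metric"])].append(float(record["score"]))

    rows = []
    for (target, metric), scores in sorted(groups.items()):
        n = len(scores)
        # A standard error needs at least two samples; otherwise report None.
        stderr = statistics.stdev(scores) / math.sqrt(n) if n > 1 else None
        rows.append({
            "target": target,
            "metric": metric,
            "n": n,
            "mean": statistics.fmean(scores),
            "stderr": stderr,
        })
    return rows


def to_markdown(rows):
    """Render the summary rows as a human-readable Markdown table."""
    lines = [
        "| target | metric | n | mean | stderr |",
        "| --- | --- | --- | --- | --- |",
    ]
    for row in rows:
        stderr = "n/a" if row["stderr"] is None else f"{row['stderr']:.3f}"
        lines.append(
            f"| {row['target']} | {row['metric']} | {row['n']} "
            f"| {row['mean']:.3f} | {stderr} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # In-memory sample records stand in for the output of load_records([...]).
    sample = [
        {"target": "model-a", "metric": "accuracy", "score": 1.0},
        {"target": "model-a", "metric": "accuracy", "score": 0.0},
        {"target": "model-b", "metric": "accuracy", "score": 0.5},
    ]
    rows = summarize(sample)
    print(json.dumps(rows, indent=2))  # machine-readable output
    print(to_markdown(rows))           # human-readable output
```

The same `rows` structure feeds both outputs, so the JSON and Markdown reports cannot drift apart. Groups with a single sample report no uncertainty rather than a misleading zero.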

Non-Goals

  • No evaluator runtime changes.
  • No requirement to add a new core command if a tooling path is cleaner.
