feat(tooling): benchmark summary report across models/metrics (external-first)

## Canonical Plan
This issue body is the source of truth for implementation.

## Objective
We generate consolidated benchmark summaries across models and metrics from existing result artifacts.

## Why This Matters
- Benchmark interpretation currently requires stitching multiple outputs manually.
- A single consolidated report lowers effort and improves repeatability.
- Teams can communicate results faster across engineering and product stakeholders.

## Why This Location
- This is a synthesis/reporting concern over existing outputs.
- We can deliver value without expanding runtime/evaluator surface area.
- External-first keeps reporting flexible and decoupled from execution.

## Architecture Boundary
External-first reporting/tooling: the report utility lives outside the core runtime and only reads existing result artifacts.

## Deliverable Location
- Primary location (in-repo tooling path): `examples/features/benchmark-tooling/`
- Script location: `examples/features/benchmark-tooling/scripts/` (benchmark report utility)
- Usage/docs: `examples/features/benchmark-tooling/README.md`

## Design Latitude
We choose command shape, report format defaults, and aggregation presentation based on usability and maintainability.

## Acceptance Signals
- We can produce one benchmark-oriented summary from multiple result files.
- We include per-target/per-metric aggregates and uncertainty when available.
- We support both machine-readable and human-readable outputs.
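
The signals above could be met by something like the following minimal sketch. It is not a prescribed implementation: the JSONL input format, the `target`/`metric`/`score` field names, the function names, and the choice of standard error as the uncertainty measure are all assumptions made for illustration.

```python
import json
import math
import statistics
from collections import defaultdict


def load_records(paths):
    """Yield one record per non-empty line from several JSONL result files (assumed format)."""
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)


def aggregate(records):
    """Group scores by (target, metric) and compute the mean plus standard error."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["target"], rec["metric"])].append(float(rec["score"]))

    rows = []
    for (target, metric), scores in sorted(groups.items()):
        n = len(scores)
        # Uncertainty needs at least two samples; otherwise it is reported as None.
        stderr = statistics.stdev(scores) / math.sqrt(n) if n >= 2 else None
        rows.append({
            "target": target,
            "metric": metric,
            "n": n,
            "mean": statistics.fmean(scores),
            "stderr": stderr,
        })
    return rows


def to_json(rows):
    """Machine-readable output."""
    return json.dumps(rows, indent=2)


def to_markdown(rows):
    """Human-readable output as a markdown table."""
    lines = [
        "| target | metric | n | mean | stderr |",
        "| --- | --- | --- | --- | --- |",
    ]
    for r in rows:
        stderr = "n/a" if r["stderr"] is None else f"{r['stderr']:.3f}"
        lines.append(
            f"| {r['target']} | {r['metric']} | {r['n']} | {r['mean']:.3f} | {stderr} |"
        )
    return "\n".join(lines)
```

Keeping aggregation separate from rendering means both output formats are produced from the same rows, so the JSON and markdown views cannot drift apart.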

## Non-Goals
- No evaluator runtime changes.
- No requirement to add a new core command if a tooling path is cleaner.


feat(tooling): benchmark summary report across models/metrics (external-first) #367
