Canonical Plan
This issue body is the source of truth for implementation.
Objective
We generate consolidated benchmark summaries across models and metrics from existing result artifacts.
Why This Matters
- Interpreting benchmark results currently requires manually stitching together multiple outputs.
- A single consolidated report lowers effort and improves repeatability.
- Teams can communicate results faster across engineering and product stakeholders.
Why This Location
- This is a synthesis/reporting concern over existing outputs.
- We can deliver value without expanding runtime/evaluator surface area.
- An external-first approach keeps reporting flexible and decoupled from execution.
Architecture Boundary
External-first reporting/tooling.
Deliverable Location
- Primary location (in-repo tooling path):
examples/features/benchmark-tooling/
- Script location:
examples/features/benchmark-tooling/scripts/ (benchmark report utility)
- Usage/docs:
examples/features/benchmark-tooling/README.md
Design Latitude
We are free to choose the command shape, default report format, and presentation of aggregates, guided by usability and maintainability.
Acceptance Signals
- We can produce one benchmark-oriented summary from multiple result files.
- We include per-target/per-metric aggregates and uncertainty when available.
- We support both machine-readable and human-readable outputs.
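The acceptance signals above could be met by something like the following minimal sketch. It is only an illustration under stated assumptions: the record fields `target`, `metric`, and `score`, and the helper names `aggregate`, `to_json`, and `to_text`, are hypothetical and not taken from any existing result format or script in the repository. Sample standard deviation stands in for "uncertainty" here; the real utility may use a different measure.

```python
import json
import statistics
from collections import defaultdict


def aggregate(records):
    """Group scores by (target, metric) and compute count, mean, and sample stdev."""
    groups = defaultdict(list)
    for rec in records:
        # Field names are assumptions for this sketch, not a documented schema.
        groups[(rec["target"], rec["metric"])].append(float(rec["score"]))

    summary = []
    for (target, metric), scores in sorted(groups.items()):
        summary.append({
            "target": target,
            "metric": metric,
            "n": len(scores),
            "mean": statistics.fmean(scores),
            # Uncertainty is reported only when at least two samples exist.
            "stdev": statistics.stdev(scores) if len(scores) >= 2 else None,
        })
    return summary


def to_json(summary):
    """Machine-readable output."""
    return json.dumps(summary, indent=2)


def to_text(summary):
    """Human-readable output as a fixed-width table."""
    lines = [f"{'target':<12} {'metric':<12} {'n':>3} {'mean':>8} {'stdev':>8}"]
    for row in summary:
        stdev = "n/a" if row["stdev"] is None else f"{row['stdev']:.3f}"
        lines.append(
            f"{row['target']:<12} {row['metric']:<12} "
            f"{row['n']:>3} {row['mean']:>8.3f} {stdev:>8}"
        )
    return "\n".join(lines)


# Records as they might look after loading and concatenating several result files.
records = [
    {"target": "model-a", "metric": "accuracy", "score": 0.5},
    {"target": "model-a", "metric": "accuracy", "score": 1.0},
    {"target": "model-b", "metric": "accuracy", "score": 0.75},
]
summary = aggregate(records)
print(to_json(summary))
print(to_text(summary))
```

Keeping aggregation separate from rendering means both output formats come from the same summary rows, so the two reports cannot drift apart.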
Non-Goals
- No evaluator runtime changes.
- No requirement to add a new core command if a tooling path is cleaner.