tracking: multi-model x multi-metric x variability gap closure roadmap #371

## SOURCE OF TRUTH FOR ORCHESTRATOR HANDOFF
This issue body is the canonical execution plan for this roadmap.

## 0) Execution Philosophy
- We keep design freedom within hard scope boundaries.
- We prefer iterative delivery and incorporate implementation learnings.
- We avoid speculative abstractions that do not serve immediate acceptance signals.

## 1) Architecture Boundary

### Core runtime (allowed/required)
- #369 suite-level total cost budget guardrail

### Docs/examples only (non-core)
- #370 multi-model benchmark showcase

### External-first (tooling/plugin/post-processing)
- #364 per-target JSONL splitting
- #365 paired statistical significance testing
- #366 aggregate win-rate summaries
- #367 benchmark summary report
- #368 trial output-consistency metric (plugin/reference)

If we implement an external-first item inside `agentv`, we keep it as optional utility/report tooling and avoid runtime/schema coupling.
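As an illustration of what external-first post-processing can look like for #364, here is a minimal sketch of per-target JSONL splitting that lives entirely outside the runtime. The function name `split_jsonl_by_target`, the `_unknown` bucket, and the output naming scheme are assumptions for illustration; the only detail taken from this plan is that result records carry a target field, and the real field name and schema may differ.

```python
import json
import re
from collections import defaultdict
from pathlib import Path


def split_jsonl_by_target(src: Path, out_dir: Path, field: str = "target") -> dict[str, int]:
    """Write one JSONL file per distinct value of `field`; return record counts per value.

    Hypothetical helper: the field name and file naming are assumptions, not agentv behavior.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    with src.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # Records without the field go to a catch-all bucket (assumed policy).
            key = str(record.get(field, "_unknown"))
            groups[key].append(line)

    out_dir.mkdir(parents=True, exist_ok=True)
    for key, lines in groups.items():
        # Replace characters that are unsafe in file names. Two targets that
        # sanitize to the same name would overwrite each other in this sketch.
        safe = re.sub(r"[^A-Za-z0-9._-]", "_", key)
        path = out_dir / f"{src.stem}.{safe}.jsonl"
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return {key: len(lines) for key, lines in groups.items()}
```

Because the sketch only reads a finished results file, it needs no runtime or schema changes, which is the coupling boundary this section asks for.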

## 2) Current Implementation Status (snapshot)
- #364: partially implemented foundation (matrix eval + target field), no auto split-by-target mode.
- #365: not implemented (compare has deltas only; no significance tests).
- #366: partially implemented (counts exist; rates/per-metric aggregation missing).
- #367: partially implemented building blocks (matrix summary, trace stats, wrappers), no benchmark report mode.
- #368: partially implemented adjacent tooling; no embedding-based trial consistency workflow yet.
- #369: partially implemented adjacent controls (trial cost limit + provider budget), no suite-level total budget.
- #370: partially implemented via separate examples; no unified showcase.
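To make the #366 gap concrete (counts exist, rates and per-metric aggregation do not), the sketch below turns paired scores into win/loss/tie counts and a win rate for each metric. Everything here is an assumption for illustration: the `WinSummary` type, the input shape (metric name mapped to per-test scores), higher-is-better scoring, and the convention that ties count in the denominator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WinSummary:
    wins: int
    losses: int
    ties: int

    @property
    def total(self) -> int:
        return self.wins + self.losses + self.ties

    @property
    def win_rate(self) -> float:
        # Ties count in the denominator (assumed convention).
        return self.wins / self.total if self.total else 0.0


def summarize(baseline: dict[str, float], candidate: dict[str, float],
              tol: float = 1e-9) -> WinSummary:
    """Compare candidate vs. baseline on the test ids both runs share."""
    wins = losses = ties = 0
    for test_id in baseline.keys() & candidate.keys():
        delta = candidate[test_id] - baseline[test_id]
        if delta > tol:
            wins += 1
        elif delta < -tol:
            losses += 1
        else:
            ties += 1
    return WinSummary(wins, losses, ties)


def summarize_per_metric(baseline: dict[str, dict[str, float]],
                         candidate: dict[str, dict[str, float]]) -> dict[str, WinSummary]:
    """Aggregate per metric; metrics present in only one run are skipped."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {metric: summarize(baseline[metric], candidate[metric]) for metric in shared}
```

Only test ids present in both runs are compared, so a partial run does not inflate either side.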

## 3) Delivery Tracks

### Track A: AgentV core/docs
- #369 (core runtime)
- #370 (docs/examples)
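For orientation on #369, here is a language-agnostic sketch (written in Python) of a suite-level total cost guardrail. The names `SuiteBudget` and `run_suite`, the USD unit, and the skip-remaining-cases policy are all hypothetical; the actual runtime design is left to the issue.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class SuiteBudget:
    """Tracks cumulative spend across a whole suite (hypothetical type)."""
    limit_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        if cost_usd < 0:
            raise ValueError("cost must be non-negative")
        self.spent_usd += cost_usd

    @property
    def exhausted(self) -> bool:
        return self.spent_usd >= self.limit_usd


def run_suite(cases: Iterable[str],
              run_case: Callable[[str], tuple[str, float]],
              budget: SuiteBudget) -> tuple[list[str], list[str]]:
    """Run cases in order; once the budget is exhausted, skip the rest.

    `run_case` returns (result, cost_usd). The check happens before each
    case, so total spend can overshoot the limit by at most one case.
    """
    results: list[str] = []
    skipped: list[str] = []
    for case in cases:
        if budget.exhausted:
            skipped.append(case)
            continue
        result, cost = run_case(case)
        budget.charge(cost)
        results.append(result)
    return results, skipped
```

The point of the sketch is the scope of the counter: it spans the whole suite, which is what the existing per-trial limit and provider budget do not cover.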

### Track B: External analytics/plugin
- #364, #365, #366, #367, #368

## 4) Dependency Graph
- #364 improves ergonomics for #370 and #367 outputs.
- #366 should land before #365 (aggregate baseline before significance layer).
- #365 + #366 feed #367 reporting.
- #368 is mostly independent (plugin-first).
- #369 is independent of compare/report stack.
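The ordering of #366 before #365 can be illustrated with a small sketch: once aggregate win/loss counts exist, a significance layer can consume them directly. The example uses an exact two-sided sign test, which is one possible paired test chosen here for simplicity; #365 may specify different tests, and the function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Exact two-sided sign test on paired outcomes; ties are excluded upstream.

    Under the null hypothesis each non-tied pair is a win with probability 0.5,
    so the win count follows Binomial(n, 0.5) with n = wins + losses.
    """
    if wins < 0 or losses < 0:
        raise ValueError("counts must be non-negative")
    n = wins + losses
    if n == 0:
        return 1.0  # no informative pairs, so no evidence either way
    k = min(wins, losses)
    # Probability of a result at least as extreme in one tail, then doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, 8 wins and 2 losses give a p-value of about 0.109, so a lopsided-looking win rate on ten paired cases would not clear a 0.05 threshold under this test.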

## 5) Parallel Waves (default)

### Wave 1 (parallel)
- Agent A: #369
- Agent B: #364
- Agent C: #368

### Wave 2 (parallel, once Wave 1 PRs are open)
- Agent D: #366
- Agent E: #370 (draft early, finalize against stable outputs)

### Wave 3
- Agent F: #365 (after #366)

### Wave 4
- Agent G: #367 (after #364/#366/#365)

## 6) Merge Order (low-conflict default)
1. #369
2. #364
3. #368
4. #366
5. #365
6. #367
7. #370 (can merge earlier if independent of unmerged behavior)
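As a quick sanity check, the merge order above can be validated against the dependency graph from section 4. This is a minimal sketch: the `DEPENDS_ON` mapping is our reading of that section (the #370 entry is a soft, ergonomics-only dependency), and the helper name is hypothetical.

```python
# Issue -> issues that should merge first, as read from section 4 (assumed encoding).
DEPENDS_ON: dict[int, set[int]] = {
    365: {366},            # aggregate baseline before significance layer
    367: {364, 365, 366},  # reporting consumes split outputs, rates, and tests
    370: {364},            # soft: showcase benefits from per-target outputs
}

MERGE_ORDER: list[int] = [369, 364, 368, 366, 365, 367, 370]


def order_violations(order: list[int],
                     depends_on: dict[int, set[int]]) -> list[tuple[int, int]]:
    """Return (issue, dependency) pairs where the dependency merges after the issue."""
    position = {issue: index for index, issue in enumerate(order)}
    violations: list[tuple[int, int]] = []
    for issue, deps in depends_on.items():
        for dep in sorted(deps):
            if position[dep] > position[issue]:
                violations.append((issue, dep))
    return violations
```

With the default order, `order_violations(MERGE_ORDER, DEPENDS_ON)` returns an empty list. If #370 merges earlier, as item 7 allows, only its soft entry would be flagged.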

## 7) Subagent Contract
- We read `CLAUDE.md` in `agentv` first.
- We use branch + PR only (no direct push to `main`).
- We keep one issue per PR and avoid unrelated refactors.
- We include the tests/docs required by each issue's acceptance signals.
- We use squash merge and link each PR back to this tracking issue.

Suggested prompt:
"Implement EntityProcess/agentv issue #<N>. First read CLAUDE.md in repo root. Create a feature branch, implement only this issue, add/update tests and docs, push, and open a PR that closes #<N>. Keep changes scoped and avoid unrelated refactors."

## 8) Completion Criteria
- Core runtime: #369 merged.
- Docs/examples: #370 merged.
- External analytics/plugin deliverables available for #364/#365/#366/#367/#368.
- End-to-end benchmark workflow executes without bespoke one-off scripts.

