tracking: multi-model x multi-metric x variability gap closure roadmap #371

@christso

SOURCE OF TRUTH FOR ORCHESTRATOR HANDOFF

This issue body is the canonical execution plan for this roadmap.

0) Execution Philosophy

  • We keep design freedom with hard scope boundaries.
  • We prefer iterative delivery and incorporate implementation learnings.
  • We avoid speculative abstractions that do not serve immediate acceptance signals.

1) Architecture Boundary

Core runtime (allowed/required)

Docs/examples only (non-core)

External-first (tooling/plugin/post-processing)

If we implement an external-first item inside agentv, we keep it as optional utility/report tooling and avoid runtime/schema coupling.

2) Current Implementation Status (snapshot)

3) Delivery Tracks

Track A: AgentV core/docs

Track B: External analytics/plugin

4) Dependency Graph

5) Parallel Waves (default)

Wave 1 (parallel)

Wave 2 (parallel, once Wave 1 PRs are open)

Wave 3

Wave 4

6) Merge Order (low-conflict default)

  1. feat(eval): suite-level total cost budget guardrail (core runtime) #369
  2. feat(tooling): split combined results JSONL by target (external-first) #364
  3. feat(plugin): trial output-consistency metric via embedding similarity #368
  4. feat(tooling): aggregate win-rate summaries for result comparisons (external-first) #366
  5. feat(tooling): paired significance testing for result-file comparisons (external-first) #365
  6. feat(tooling): benchmark summary report across models/metrics (external-first) #367
  7. docs/examples: add multi-model x multi-metric x variability benchmark showcase #370 (can merge earlier if independent of unmerged behavior)

7) Subagent Contract

  • We read CLAUDE.md in agentv first.
  • We use branch + PR only (no direct push to main).
  • We keep one issue per PR and avoid unrelated refactors.
  • We include the tests and docs required by each issue's acceptance signals.
  • We use squash merge and link each PR back to this tracking issue.

Suggested prompt:
"Implement EntityProcess/agentv issue #. First read CLAUDE.md in repo root. Create a feature branch, implement only this issue, add/update tests and docs, push, and open a PR that closes #. Keep changes scoped and avoid unrelated refactors."
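As a minimal sketch, the suggested prompt could be filled in per issue when handing work to subagents. It assumes both bare `#` placeholders in the prompt stand for the same issue number. The names `PROMPT_TEMPLATE`, `MERGE_ORDER`, `build_prompt`, and `build_prompts` are hypothetical helpers for illustration, not part of agentv. The issue numbers are taken from the merge order in section 6.

```python
# Hypothetical helper: fills the suggested subagent prompt for a given issue.
# Assumption: both "#" placeholders in the prompt refer to the same issue number.
PROMPT_TEMPLATE = (
    "Implement EntityProcess/agentv issue #{issue}. "
    "First read CLAUDE.md in repo root. "
    "Create a feature branch, implement only this issue, "
    "add/update tests and docs, push, and open a PR that closes #{issue}. "
    "Keep changes scoped and avoid unrelated refactors."
)

# Issue numbers in the low-conflict merge order listed in section 6.
MERGE_ORDER = [369, 364, 368, 366, 365, 367, 370]


def build_prompt(issue: int) -> str:
    """Return the handoff prompt for a single issue number."""
    if issue <= 0:
        raise ValueError("issue number must be a positive integer")
    return PROMPT_TEMPLATE.format(issue=issue)


def build_prompts(issues: list[int]) -> dict[int, str]:
    """Return one prompt per issue, keyed by issue number (one issue per PR)."""
    return {issue: build_prompt(issue) for issue in issues}


if __name__ == "__main__":
    for number, prompt in build_prompts(MERGE_ORDER).items():
        print(f"--- issue #{number} ---")
        print(prompt)
```

Generating one prompt per issue keeps the "one issue per PR" rule explicit. The order of `MERGE_ORDER` only reflects the merge sequence, not the parallel waves in which work may start.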

8) Completion Criteria
