Canonical Plan
This issue body is the source of truth for implementation.
Objective
We introduce a trial-output consistency metric as a plugin/reference capability.
Why This Matters
- Trial score variability does not directly describe semantic output consistency: two trials can receive the same score while producing different responses.
- We need a dedicated signal for response stability across repeated attempts.
- This unlocks stronger diagnostics for non-deterministic agent behavior.
Why This Location
- Consistency scoring method choice is specialized and likely to evolve.
- Plugin-first lets us experiment without hardcoding a narrow built-in.
- This aligns with AgentV’s extensibility-first architecture.
Architecture Boundary
Plugin-first (aligned with AgentV principles).
Deliverable Location
- Primary location: examples/features/trial-output-consistency/
- Plugin/judge location: examples/features/trial-output-consistency/judges/
- Runnable eval location: examples/features/trial-output-consistency/evals/
- Usage/docs: examples/features/trial-output-consistency/README.md
Design Latitude
We can choose embedding backend, similarity strategy, and artifact flow, provided the solution stays extension-oriented.
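One way to keep the embedding backend and similarity strategy swappable is to pass both in as plain functions. The sketch below is illustrative only: `Embedder`, `Similarity`, `char_count_embedder`, and `compare` are hypothetical names, not AgentV APIs, and the letter-count embedder is a toy stand-in for a real embedding backend.

```python
import math
from typing import Callable, Sequence

# Hypothetical type aliases for illustration; not part of AgentV.
Embedder = Callable[[str], Sequence[float]]
Similarity = Callable[[Sequence[float], Sequence[float]], float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def char_count_embedder(text: str) -> list[float]:
    """Toy stand-in embedder: counts of the ASCII letters a-z, case-insensitive."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def compare(
    a: str,
    b: str,
    embed: Embedder = char_count_embedder,
    similarity: Similarity = cosine_similarity,
) -> float:
    """Embed both outputs with the chosen backend, then score the pair."""
    return similarity(embed(a), embed(b))
```

Swapping in a real embedding model or a different similarity measure then only requires passing another function to `compare`, which keeps the choice out of any core code path.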
Acceptance Signals
- We provide a runnable reference implementation that computes consistency across repeated trial outputs.
- We expose the result as a named metric in evaluation workflows.
- We define explicit behavior for low-trial and edge-case inputs.
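As a starting point for the reference implementation, the sketch below computes consistency as the mean pairwise similarity over all trial outputs. Everything in it is an assumption for discussion: the function names are hypothetical, token-set Jaccard overlap stands in for whatever similarity strategy is eventually chosen, and the edge-case rules (no score for fewer than two trials, two empty outputs treated as identical) are one possible policy, not a decided one.

```python
from itertools import combinations
from typing import Optional


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; two empty outputs count as identical."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def trial_output_consistency(outputs: list[str]) -> Optional[float]:
    """Mean pairwise similarity across trial outputs, in [0, 1].

    Assumed low-trial policy: with fewer than two outputs there is no pair
    to compare, so the metric is reported as None rather than as a score.
    """
    if len(outputs) < 2:
        return None
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Returning `None` instead of 0 or 1 for the low-trial case keeps "not measurable" distinct from "fully inconsistent" or "fully consistent" in downstream reports. Note that the number of pairs grows quadratically with the trial count, which is acceptable for small trial counts but worth revisiting if trials become numerous.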
Non-Goals
- No built-in core evaluator unless plugin-first proves insufficient.