
feat(plugin): trial output-consistency metric via embedding similarity #368

@christso

Canonical Plan

This issue body is the source of truth for implementation.

Objective

We introduce a trial-output consistency metric, based on embedding similarity, as a plugin with a reference implementation rather than a built-in evaluator.

Why This Matters

  • Variability in trial scores does not directly describe semantic output consistency: two trials can receive the same score for substantively different outputs.
  • We need a dedicated signal for response stability across repeated attempts.
  • This unlocks stronger diagnostics for non-deterministic agent behavior.

Why This Location

  • The choice of consistency scoring method is specialized and likely to evolve.
  • Plugin-first lets us experiment without hardcoding a narrow built-in.
  • This aligns with AgentV’s extensibility-first architecture.

Architecture Boundary

Plugin-first (aligned with AgentV principles).

Deliverable Location

  • Primary location: examples/features/trial-output-consistency/
  • Plugin/judge location: examples/features/trial-output-consistency/judges/
  • Runnable eval location: examples/features/trial-output-consistency/evals/
  • Usage/docs: examples/features/trial-output-consistency/README.md

Design Latitude

We can choose the embedding backend, similarity strategy, and artifact flow, provided the solution stays extension-oriented.
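One way to keep the backend and strategy swappable is to treat both as plain callables. The sketch below is illustrative only: `Embedder`, `Strategy`, `toy_embedder`, `mean_pairwise`, `centroid_similarity`, and `consistency` are hypothetical names, not AgentV APIs, and the hashed bag-of-words embedder is a stand-in that captures token overlap, not semantics. A real embedding model would replace it. Both strategies assume at least two outputs.

```python
import hashlib
import math
from itertools import combinations
from typing import Callable, Sequence

Vector = list[float]
Embedder = Callable[[Sequence[str]], list[Vector]]  # hypothetical extension seam
Strategy = Callable[[list[Vector]], float]          # hypothetical extension seam


def toy_embedder(texts: Sequence[str], dim: int = 64) -> list[Vector]:
    """Stand-in backend: hashed bag-of-words counts. Not a semantic model."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
        vectors.append(vec)
    return vectors


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_pairwise(vectors: list[Vector]) -> float:
    """Average cosine similarity over all unordered pairs (needs >= 2 vectors)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def centroid_similarity(vectors: list[Vector]) -> float:
    """Average cosine similarity of each vector to the mean vector."""
    centroid = [sum(col) / len(vectors) for col in zip(*vectors)]
    return sum(cosine(v, centroid) for v in vectors) / len(vectors)


def consistency(
    outputs: Sequence[str],
    embed: Embedder = toy_embedder,
    strategy: Strategy = mean_pairwise,
) -> float:
    """Embed the trial outputs, then reduce them to a single score."""
    return strategy(embed(outputs))
```

Swapping either piece is then a one-argument change, for example `consistency(outputs, strategy=centroid_similarity)`, which is the kind of experimentation the plugin-first approach is meant to allow.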

Acceptance Signals

  • We provide a runnable reference implementation that computes consistency across repeated trial outputs.
  • We expose the result as a named metric in evaluation workflows.
  • We define explicit behavior for low-trial and edge-case inputs.
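As a minimal sketch of what these signals could look like in code, the function below computes mean pairwise cosine similarity over per-trial embeddings and returns a named result. Everything here is an assumption for illustration: the metric name `trial_output_consistency`, the `MetricResult` shape, and the edge-case policy (drop missing or all-zero embeddings, return no value when fewer than two usable trials remain, reject mismatched dimensions). How the result is wired into AgentV's evaluation workflow is left open.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import Optional, Sequence

METRIC_NAME = "trial_output_consistency"  # hypothetical metric name


@dataclass
class MetricResult:
    name: str
    value: Optional[float]  # None when the metric is undefined
    trials_used: int
    note: str


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def trial_output_consistency(
    embeddings: Sequence[Optional[Sequence[float]]],
    min_trials: int = 2,
) -> MetricResult:
    """Mean pairwise cosine similarity across usable trial embeddings."""
    # Assumed policy: skip trials with no embedding (None) or an all-zero one,
    # e.g. a failed trial or an empty output.
    usable = [list(v) for v in embeddings
              if v is not None and any(x != 0.0 for x in v)]

    # Assumed policy: mixed dimensions indicate mixed backends, so fail loudly.
    if len({len(v) for v in usable}) > 1:
        raise ValueError("embeddings must share one dimension")

    # Low-trial case: a pairwise metric needs at least two usable trials.
    if len(usable) < max(min_trials, 2):
        return MetricResult(METRIC_NAME, None, len(usable), "insufficient_trials")

    pairs = list(combinations(usable, 2))
    mean = sum(_cosine(a, b) for a, b in pairs) / len(pairs)
    return MetricResult(METRIC_NAME, mean, len(usable), "ok")
```

Returning `None` with a note, rather than defaulting to 0.0 or 1.0, keeps an undefined score from being read as "fully inconsistent" or "fully consistent". That is one possible choice, not the only reasonable one.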

Non-Goals

  • No built-in core evaluator unless plugin-first proves insufficient.
