feat(eval): iteration tracking, termination taxonomy, and cross-run regression detection #335

@christso

Status Update (2026-04-09)

This issue remains relevant, but its scope should continue to center on core run data and CLI analysis, not a larger assistant-runtime or Studio expansion.

Important note: AgentV already ships agentv trend, so part of the original "cross-run regression detection" goal already exists. The remaining core work is mainly iteration metadata, a termination taxonomy, and artifact-backed run history that other optimization workflows can build on.

Revised Scope

In scope

  • expected_iterations on tests
  • completion_signal support where appropriate
  • termination_reason / related loop metadata in results
  • iteration_efficiency assertion type (still depends on #320)
  • strengthening/reusing CLI regression analysis primitives where needed
  • additive, non-breaking metadata that helps optimization workflows resume and inspect prior cycles
  • artifact-backed iteration history primitives where they clearly support CLI analysis and other workflow layers
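
To make the in-scope items concrete, here is a minimal sketch of how iteration metadata might sit alongside an existing result record. Everything in it is an assumption for illustration: the TerminationReason values, the IterationMetadata field layout, the "iteration" key, and the expected/actual ratio used for iteration_efficiency are hypothetical. The issue does not define the taxonomy or the formula, and the real assertion still depends on #320.

```python
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional


class TerminationReason(str, Enum):
    # Hypothetical taxonomy; the issue does not enumerate the actual values.
    COMPLETED = "completed"
    MAX_ITERATIONS = "max_iterations"
    TIMEOUT = "timeout"
    ERROR = "error"


@dataclass
class IterationMetadata:
    iterations: int                             # iterations the loop actually ran
    termination_reason: TerminationReason       # why the loop stopped
    expected_iterations: Optional[int] = None   # taken from the test, if set
    completion_signal: Optional[str] = None     # signal that ended the loop, if any


def iteration_efficiency(meta: IterationMetadata) -> Optional[float]:
    """Assumed definition: expected / actual iterations, capped at 1.0.

    Returns None when the test declares no expectation, so the metric
    stays optional rather than defaulting to a misleading number.
    """
    if meta.expected_iterations is None or meta.iterations <= 0:
        return None
    return min(1.0, meta.expected_iterations / meta.iterations)


def attach_metadata(result: dict, meta: IterationMetadata) -> dict:
    """Return a copy of `result` with metadata under one new optional key.

    Existing keys are left untouched, which keeps the change additive
    for consumers that do not know about the new key.
    """
    enriched = dict(result)
    payload = asdict(meta)
    payload["termination_reason"] = meta.termination_reason.value
    enriched["iteration"] = payload
    return enriched
```

Placing all new fields under a single optional key is one way to keep the metadata additive: readers that ignore unknown keys continue to work unchanged.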

Deprioritized

  • Studio regression alert feed
  • Studio regression timeline as a major workstream
  • auto-clustering UI inside Studio
  • making this issue dependent on a bigger Studio rollout
  • persistent personal memory or session-search infrastructure for the core product

Design Boundary

This issue should provide portable run/result primitives that skills, plugins, wrappers, or future workflow tooling can consume. It should not turn AgentV into a long-lived self-improving assistant runtime.

Dependencies

  • #320 remains the important dependency for iteration_efficiency

Acceptance Signals

  • iteration metadata is represented cleanly in result data, without breaking existing consumers
  • termination reasons are captured for loop-based workflows
  • iteration_efficiency can be evaluated once #320 is available
  • any regression UX added later can reuse existing CLI/data primitives rather than invent a new subsystem
  • result metadata is sufficient for higher-level optimization loops to resume, inspect, and compare prior cycles without requiring chat-memory features
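
As an illustration of the last signal, the sketch below shows how a higher-level loop might compare two cycles using only stored result records. It is not how agentv trend works; the JSONL layout, the field names (test_id, score, termination_reason), and the tolerance default are all hypothetical.

```python
import json
from typing import Iterable


def load_runs(lines: Iterable[str]) -> list[dict]:
    """Parse one JSON record per non-empty line (assumed JSONL artifact)."""
    return [json.loads(line) for line in lines if line.strip()]


def compare_cycles(
    previous: list[dict],
    current: list[dict],
    tolerance: float = 0.05,
) -> list[dict]:
    """Flag tests whose score dropped by more than `tolerance`, or whose
    termination reason changed, between two cycles.

    Tests that appear only in `current` are skipped because there is
    nothing to compare them against.
    """
    prev_by_id = {record["test_id"]: record for record in previous}
    findings = []
    for record in current:
        before = prev_by_id.get(record["test_id"])
        if before is None:
            continue
        score_drop = round(before["score"] - record["score"], 4)
        reason_before = before.get("termination_reason")
        reason_after = record.get("termination_reason")
        if score_drop > tolerance or reason_before != reason_after:
            findings.append(
                {
                    "test_id": record["test_id"],
                    "score_drop": score_drop,
                    "termination_reason": (reason_before, reason_after),
                }
            )
    return findings
```

Because the comparison needs nothing beyond the stored records, a loop can resume from artifacts on disk without any chat-memory or session state.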

Labels: enhancement

Status: Done