feat(tooling): split combined results JSONL by target (external-first) #364

@christso

Canonical Plan

This issue body is the source of truth for implementation.

Objective

Improve multi-model benchmarking ergonomics by producing one result artifact per target from a combined JSONL results file.

Why This Matters

  • Multi-target runs are harder to compare quickly when all records stay in one combined file.
  • Teams repeatedly write ad-hoc filtering scripts before they can run model-vs-model analysis.
  • Standardizing this utility reduces operational friction in benchmark workflows.

Why This Location

  • This is an output-shaping concern, not scoring/execution behavior.
  • We can solve it as tooling without changing core evaluator/runtime semantics.
  • Keeping it external-first preserves AgentV’s lightweight-core design.

Architecture Boundary

External-first. We prefer wrapper/tooling and avoid core evaluator/runtime changes.

Deliverable Location

  • Primary location (in-repo tooling path): examples/features/benchmark-tooling/
  • Script location: examples/features/benchmark-tooling/scripts/ (for split-by-target utility)
  • Usage/docs: examples/features/benchmark-tooling/README.md

Design Latitude

The exact utility interface and filename strategy are open, as long as the output is deterministic and easy to use.

Acceptance Signals

  • We can derive one deterministic JSONL per target from a combined results file.
  • We handle target names that require safe filename normalization.
  • We document the downstream compare workflow.
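
The first two signals above could be met by something like the following. This is a minimal sketch, not the final utility. It assumes each result record is a JSON object with a top-level `target` string field; that field name is an assumption and should be checked against the actual result schema. The names `safe_name` and `split_by_target`, the normalization rule, and the collision suffix are all hypothetical choices made for illustration.

```python
import json
import re
from collections import defaultdict
from pathlib import Path


def safe_name(target: str) -> str:
    """Map a target name to a filesystem-safe stem (assumed policy)."""
    # Collapse every run of characters outside [A-Za-z0-9._-] into one dash.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "-", target).strip("-.")
    return stem or "unknown"


def split_by_target(combined: Path, out_dir: Path, key: str = "target") -> dict[str, Path]:
    """Write one JSONL file per target and return {target: output path}."""
    groups: dict[str, list[str]] = defaultdict(list)
    with combined.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the combined file
            record = json.loads(line)
            # `key` is an assumed field name; records without it go to "unknown".
            groups[str(record.get(key, "unknown"))].append(line)

    out_dir.mkdir(parents=True, exist_ok=True)
    written: dict[str, Path] = {}
    seen: dict[str, int] = {}
    # Sorted iteration keeps file naming stable across runs.
    for target in sorted(groups):
        stem = safe_name(target)
        count = seen.get(stem, 0)
        seen[stem] = count + 1
        if count:
            # Two targets normalized to the same stem: add a numeric suffix.
            # (Does not guard against a target whose own name equals the suffixed stem.)
            stem = f"{stem}-{count + 1}"
        path = out_dir / f"{stem}.jsonl"
        # Records keep their original order and are written back unmodified.
        path.write_text("\n".join(groups[target]) + "\n", encoding="utf-8")
        written[target] = path
    return written
```

Under these assumptions, a target named `azure/gpt-4o` would land in `azure-gpt-4o.jsonl`. Because records are copied through as-is, no schema change is needed for existing result records.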

Non-Goals

  • No runtime scoring semantics changes.
  • No mandatory schema changes for existing result records.
