fix(examples): make benchmark-tooling example runnable out of the box #380

@christso

Problem

The examples/features/benchmark-tooling/ example added in #373 does not run out of the box:

  1. No sample input data — there's no fixture JSONL to run the split script against
  2. No EVAL.yaml — unlike other examples under examples/features/, nothing to run with bun agentv eval
  3. Stale model references — README references gpt-4.1 and claude-sonnet-4 instead of current recommended models

Examples should be self-contained. A user cloning the repo should be able to run any example without external dependencies or prior state.

Context: Why Split-by-Target Exists

When an EVAL.yaml specifies multiple targets, agentv eval writes all results into a single combined JSONL (one record per test × target, each with a target field). The split-by-target.ts script splits that combined file into per-target files so you can feed them to agentv compare.

The example needs to demonstrate this full flow: multi-target eval → combined JSONL → split → compare.
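The grouping step can be sketched as below. This is an illustration only and not the contents of scripts/split-by-target.ts; the function name `groupLinesByTarget` and the sample records are hypothetical, and the only assumption taken from this issue is that each JSONL record carries a string `target` field.

```typescript
// Illustrative sketch only; this is NOT the repository's split-by-target.ts.
// Assumption: one JSON object per line, each with a string "target" field.
function groupLinesByTarget(combinedJsonl: string): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const rawLine of combinedJsonl.split("\n")) {
    const line = rawLine.trim();
    if (line === "") continue; // tolerate blank lines and a trailing newline
    const record = JSON.parse(line) as { target?: unknown };
    if (typeof record.target !== "string") {
      throw new Error(`Record is missing a string "target" field: ${line}`);
    }
    const bucket = groups.get(record.target) ?? [];
    bucket.push(line); // keep the original line so the record passes through unchanged
    groups.set(record.target, bucket);
  }
  return groups;
}

// Hypothetical sample with two targets (mock values, not real eval output).
const sampleCombined = [
  '{"test_id":"greeting","target":"gpt-4.1","score":1}',
  '{"test_id":"greeting","target":"gpt-5-mini","score":1}',
  '{"test_id":"summarization","target":"gpt-4.1","score":1}',
].join("\n");

groupLinesByTarget(sampleCombined).forEach((lines, target) => {
  console.log(`${target}: ${lines.length} record(s)`);
});
```

Each group would then be written to its own file, giving one JSONL per target. How the real script names those files is not stated here, so the sketch stops at the in-memory grouping.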

Files to Create/Update

1. examples/features/benchmark-tooling/evals/benchmark.eval.yaml

Multi-target eval config with 3 targets. Pattern follows examples/features/matrix-evaluation/:

execution:
  targets:
    - gemini-3-flash-preview
    - gpt-4.1
    - gpt-5-mini

tests:
  - id: greeting
    input: "Say hello"
    criteria: "The response should contain a greeting"

  - id: code-generation
    input: "Write a fibonacci function in Python"
    criteria: "The response should contain a valid Python function"

  - id: summarization
    input: "Summarize the key benefits of automated testing"
    criteria: "The response should mention reliability, speed, or regression detection"

2. examples/features/benchmark-tooling/fixtures/combined-results.jsonl

Sample combined output (what agentv eval would produce). One record per test × target: 3 tests × 3 targets = 9 records total. Each record needs at minimum: test_id, target, score, input, answer. Use realistic but short mock data.
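One way to produce a fixture of that shape is a small generator like the sketch below. It is hypothetical: `buildMockCombinedJsonl` is not part of the repo, and the `score` and `answer` values are placeholders, not real model output. The score scale is also an assumption, since this issue does not define it. Only the five field names, the three test ids and inputs, and the three target names come from this issue.

```typescript
// Hypothetical fixture generator; scores and answers are placeholders,
// not real model output. Field names follow the minimum set listed above.
const fixtureTargets = ["gemini-3-flash-preview", "gpt-4.1", "gpt-5-mini"];
const fixtureTests = [
  { id: "greeting", input: "Say hello" },
  { id: "code-generation", input: "Write a fibonacci function in Python" },
  { id: "summarization", input: "Summarize the key benefits of automated testing" },
];

function buildMockCombinedJsonl(): string {
  const lines: string[] = [];
  for (const test of fixtureTests) {
    for (const target of fixtureTargets) {
      lines.push(
        JSON.stringify({
          test_id: test.id,
          target,
          score: 1, // placeholder; the real score scale is an assumption
          input: test.input,
          answer: `(mock answer from ${target})`,
        }),
      );
    }
  }
  return lines.join("\n") + "\n"; // one record per line, trailing newline
}

const mockFixture = buildMockCombinedJsonl();
console.log(mockFixture.trim().split("\n").length); // 9
```

A hand-written fixture with varied scores would make the later compare step more interesting. The generator only guarantees the record count and the field set.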

3. Update examples/features/benchmark-tooling/README.md

Replace current content. Structure:

  1. What this demonstrates — the split-by-target workflow for multi-model benchmarking
  2. Quick start — run the split script against the fixture:
    bun examples/features/benchmark-tooling/scripts/split-by-target.ts \
      examples/features/benchmark-tooling/fixtures/combined-results.jsonl \
      ./split-output
  3. Full workflow — run the eval yourself, then split, then compare:
    bun agentv eval examples/features/benchmark-tooling/evals/benchmark.eval.yaml
    # split the combined output
    bun examples/features/benchmark-tooling/scripts/split-by-target.ts .agentv/results/<output>.jsonl ./by-target
    # compare any two targets
    bun agentv compare ./by-target/<file-a>.jsonl ./by-target/<file-b>.jsonl
  4. Update all model references to gemini-3-flash-preview, gpt-4.1, gpt-5-mini

4. No changes to scripts/split-by-target.ts

The script itself is fine. Only the surrounding example scaffolding is missing.

Acceptance

  • split-by-target.ts runs against the fixture JSONL with no prior setup
  • The split produces 3 output files (one per target)
  • README shows both the quick-start (fixture) and full workflow (live eval)
  • Model references are gemini-3-flash-preview, gpt-4.1, gpt-5-mini throughout
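The file-count criterion could be checked with a helper along these lines. It is a sketch under stated assumptions: `checkSplitFileCount` is a hypothetical name, and because this issue does not specify the output file names, only the number of .jsonl files is checked. The listing below uses placeholder names; in practice the names could come from something like `readdirSync` in `node:fs`.

```typescript
// Hypothetical acceptance helper. The split script's output file names are not
// specified in this issue, so only the number of .jsonl files is checked.
function checkSplitFileCount(fileNames: string[], expectedTargets: number): string[] {
  const jsonlFiles = fileNames.filter((name) => name.endsWith(".jsonl"));
  if (jsonlFiles.length !== expectedTargets) {
    throw new Error(
      `Expected ${expectedTargets} .jsonl files, found ${jsonlFiles.length}`,
    );
  }
  return jsonlFiles;
}

// Placeholder names standing in for a real directory listing.
const exampleListing = ["target-a.jsonl", "target-b.jsonl", "target-c.jsonl", "notes.txt"];
console.log(checkSplitFileCount(exampleListing, 3).length); // 3
```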

Architecture Boundary

External-first. No core changes — stays in examples/features/benchmark-tooling/.
