Problem
The examples/features/benchmark-tooling/ example added in #373 does not run out of the box:
- No sample input data: there is no fixture JSONL to run the split script against
- No EVAL.yaml: unlike other examples under examples/features/, there is nothing to run with bun agentv eval
- Stale model references: the README references gpt-4.1 and claude-sonnet-4 instead of the current recommended models
Examples should be self-contained. A user cloning the repo should be able to run any example without external dependencies or prior state.
Context: Why Split-by-Target Exists
When an EVAL.yaml specifies multiple targets, agentv eval writes all results into a single combined JSONL (one record per test × target, each with a target field). The split-by-target.ts script splits that combined file into per-target files so you can feed them to agentv compare.
The example needs to demonstrate this full flow: multi-target eval → combined JSONL → split → compare.
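For readers unfamiliar with the split step, here is a minimal sketch of the grouping logic. It is not the contents of scripts/split-by-target.ts (which stays unchanged); the function name splitByTarget and the ResultRecord type are hypothetical, and the only assumption about the data is the one stated above: each JSONL line is a JSON object with a target field.

```typescript
// Illustrative sketch only; NOT the actual scripts/split-by-target.ts.
// Assumption: every non-blank line is a JSON object with a string `target` field.
type ResultRecord = { target: string; [key: string]: unknown };

// Groups the raw JSONL lines of a combined results file by their `target` value.
function splitByTarget(combinedJsonl: string): Record<string, string[]> {
  const byTarget: Record<string, string[]> = {};
  for (const line of combinedJsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines, e.g. a trailing newline
    const record = JSON.parse(trimmed) as ResultRecord;
    const bucket = byTarget[record.target] ?? [];
    bucket.push(trimmed); // keep the original line so records pass through unchanged
    byTarget[record.target] = bucket;
  }
  return byTarget;
}

// Usage with a tiny hand-written sample (mock values, not real eval output).
const sampleCombined = [
  '{"test_id":"greeting","target":"gpt-4.1","score":1}',
  '{"test_id":"greeting","target":"gpt-5-mini","score":1}',
  '{"test_id":"summarization","target":"gpt-4.1","score":0.5}',
].join("\n");

const sampleGroups = splitByTarget(sampleCombined);
for (const target of Object.keys(sampleGroups)) {
  console.log(`${target}: ${sampleGroups[target].length} record(s)`);
}
```

A real script would then write each group to its own file (one per target), so that any two of them can be passed to agentv compare.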
Files to Create/Update
1. examples/features/benchmark-tooling/evals/benchmark.eval.yaml
Multi-target eval config with 3 targets. Pattern follows examples/features/matrix-evaluation/:
execution:
  targets:
    - gemini-3-flash-preview
    - gpt-4.1
    - gpt-5-mini
tests:
  - id: greeting
    input: "Say hello"
    criteria: "The response should contain a greeting"
  - id: code-generation
    input: "Write a fibonacci function in Python"
    criteria: "The response should contain a valid Python function"
  - id: summarization
    input: "Summarize the key benefits of automated testing"
    criteria: "The response should mention reliability, speed, or regression detection"
2. examples/features/benchmark-tooling/fixtures/combined-results.jsonl
Sample combined output (what agentv eval would produce). One record per test × target = 9 records total. Each record needs at minimum: test_id, target, score, input, answer. Use realistic but short mock data.
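One possible way to draft the fixture is to generate the test × target cross product and then hand-edit the mock values. The sketch below assumes only the five minimum fields listed above; the helper name buildFixture, the FixtureRecord type, and the placeholder score and answer values are hypothetical, and real agentv eval output may carry additional fields.

```typescript
// Hypothetical helper for drafting fixtures/combined-results.jsonl.
// Scores and answers are placeholders to be replaced with short, realistic mock data.
type FixtureRecord = {
  test_id: string;
  target: string;
  score: number;
  input: string;
  answer: string;
};

const fixtureTargets = ["gemini-3-flash-preview", "gpt-4.1", "gpt-5-mini"];
const fixtureTests = [
  { id: "greeting", input: "Say hello" },
  { id: "code-generation", input: "Write a fibonacci function in Python" },
  { id: "summarization", input: "Summarize the key benefits of automated testing" },
];

// Builds one record per test × target (3 × 3 = 9 records).
function buildFixture(): FixtureRecord[] {
  const records: FixtureRecord[] = [];
  for (const test of fixtureTests) {
    for (const target of fixtureTargets) {
      records.push({
        test_id: test.id,
        target: target,
        score: 1, // placeholder score
        input: test.input,
        answer: `<mock answer from ${target}>`, // placeholder answer
      });
    }
  }
  return records;
}

// Serialize as JSONL: one JSON object per line.
const fixtureJsonl = buildFixture()
  .map((record) => JSON.stringify(record))
  .join("\n");
console.log(fixtureJsonl);
```

The committed fixture can just as well be written by hand; what matters is that each of the 3 targets appears once per test.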
3. Update examples/features/benchmark-tooling/README.md
Replace current content. Structure:
- What this demonstrates — the split-by-target workflow for multi-model benchmarking
- Quick start — run the split script against the fixture:
bun examples/features/benchmark-tooling/scripts/split-by-target.ts \
examples/features/benchmark-tooling/fixtures/combined-results.jsonl \
./split-output
- Full workflow — run the eval yourself, then split, then compare:
bun agentv eval examples/features/benchmark-tooling/evals/benchmark.eval.yaml
# split the combined output
bun examples/features/benchmark-tooling/scripts/split-by-target.ts .agentv/results/<output>.jsonl ./by-target
# compare any two targets
bun agentv compare ./by-target/<file-a>.jsonl ./by-target/<file-b>.jsonl
- Update all model references to gemini-3-flash-preview, gpt-4.1, gpt-5-mini
4. No changes to scripts/split-by-target.ts
The script itself is fine. Only the surrounding example scaffolding is missing.
Acceptance
- split-by-target.ts runs against the fixture JSONL with no prior setup
- Output produces 3 files (one per target)
- README shows both the quick-start (fixture) and full workflow (live eval)
- Model references are gemini-3-flash-preview, gpt-4.1, gpt-5-mini throughout
Architecture Boundary
External-first. No core changes — stays in examples/features/benchmark-tooling/.