fix(examples): make benchmark-tooling example runnable out of the box

## Problem

The `examples/features/benchmark-tooling/` example added in #373 does not run out of the box:

1. **No sample input data** — there's no fixture JSONL to run the split script against
2. **No EVAL.yaml** — unlike other examples under `examples/features/`, nothing to run with `bun agentv eval`
3. **Stale model references** — README references `gpt-4.1` and `claude-sonnet-4` instead of current recommended models

Examples should be self-contained. A user cloning the repo should be able to run any example without external dependencies or prior state.

## Context: Why Split-by-Target Exists

When an EVAL.yaml specifies multiple targets, `agentv eval` writes **all results into a single combined JSONL** (one record per test × target, each with a `target` field). The `split-by-target.ts` script splits that combined file into per-target files so you can feed them to `agentv compare`.

The example needs to demonstrate this full flow: multi-target eval → combined JSONL → split → compare.
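
For readers unfamiliar with the split step, here is a minimal sketch of the idea: group the lines of a combined JSONL by their `target` field. It is not the actual `split-by-target.ts` implementation. The `splitByTarget` name and the `ResultRecord` type are hypothetical, and the sketch works on in-memory strings rather than files.

```typescript
// Hypothetical record shape: this sketch only relies on the `target` field.
interface ResultRecord {
  target: string;
  [key: string]: unknown;
}

// Group the lines of a combined JSONL string by `target`.
// Returns a map from target name to that target's JSONL content.
function splitByTarget(combinedJsonl: string): Map<string, string> {
  const groups = new Map<string, string[]>();
  for (const line of combinedJsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    const record = JSON.parse(line) as ResultRecord;
    const bucket = groups.get(record.target) ?? [];
    bucket.push(line); // keep the original line text untouched
    groups.set(record.target, bucket);
  }
  const out = new Map<string, string>();
  groups.forEach((lines, target) => {
    out.set(target, lines.join("\n") + "\n");
  });
  return out;
}
```

Each value in the returned map corresponds to one per-target file, which is the shape `agentv compare` is described as consuming above.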

## Files to Create/Update

### 1. `examples/features/benchmark-tooling/evals/benchmark.eval.yaml`

Multi-target eval config with 3 targets. Pattern follows `examples/features/matrix-evaluation/`:

```yaml
execution:
  targets:
    - gemini-3-flash-preview
    - gpt-4.1
    - gpt-5-mini

tests:
  - id: greeting
    input: "Say hello"
    criteria: "The response should contain a greeting"

  - id: code-generation
    input: "Write a fibonacci function in Python"
    criteria: "The response should contain a valid Python function"

  - id: summarization
    input: "Summarize the key benefits of automated testing"
    criteria: "The response should mention reliability, speed, or regression detection"
```

### 2. `examples/features/benchmark-tooling/fixtures/combined-results.jsonl`

Sample combined output (what `agentv eval` would produce). One record per test × target = 9 records total. Each record needs at minimum: `test_id`, `target`, `score`, `input`, `answer`. Use realistic but short mock data.
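
One way to produce the fixture is a small generator, sketched below. The test IDs, inputs, and target names come from the eval config above. The `buildFixture` name, the placeholder `score` of `1`, and the mock `answer` strings are assumptions for illustration, not real eval output, and a hand-written fixture with more varied scores would serve equally well.

```typescript
// Targets and tests mirror benchmark.eval.yaml above.
const fixtureTargets = ["gemini-3-flash-preview", "gpt-4.1", "gpt-5-mini"];
const fixtureTests = [
  { id: "greeting", input: "Say hello" },
  { id: "code-generation", input: "Write a fibonacci function in Python" },
  { id: "summarization", input: "Summarize the key benefits of automated testing" },
];

// Build one JSONL line per test × target (3 × 3 = 9 records).
function buildFixture(): string {
  const lines: string[] = [];
  for (const test of fixtureTests) {
    for (const target of fixtureTargets) {
      lines.push(
        JSON.stringify({
          test_id: test.id,
          target,
          score: 1, // placeholder score, not a real result
          input: test.input,
          answer: `(mock answer from ${target})`, // placeholder text
        }),
      );
    }
  }
  return lines.join("\n") + "\n";
}
```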

### 3. Update `examples/features/benchmark-tooling/README.md`

Replace current content. Structure:

1. **What this demonstrates** — the split-by-target workflow for multi-model benchmarking
2. **Quick start** — run the split script against the fixture:
   ```bash
   bun examples/features/benchmark-tooling/scripts/split-by-target.ts \
     examples/features/benchmark-tooling/fixtures/combined-results.jsonl \
     ./split-output
   ```
3. **Full workflow** — run the eval yourself, then split, then compare:
   ```bash
   bun agentv eval examples/features/benchmark-tooling/evals/benchmark.eval.yaml
   # split the combined output
   bun examples/features/benchmark-tooling/scripts/split-by-target.ts .agentv/results/<output>.jsonl ./by-target
   # compare any two targets
   bun agentv compare ./by-target/<file-a>.jsonl ./by-target/<file-b>.jsonl
   ```
4. Update all model references to `gemini-3-flash-preview`, `gpt-4.1`, `gpt-5-mini`

### 4. No changes to `scripts/split-by-target.ts`

The script itself is fine. Only the surrounding example scaffolding is missing.

## Acceptance

- `split-by-target.ts` runs against the fixture JSONL with no prior setup
- The split produces 3 output files (one per target)
- README shows both the quick-start (fixture) and full workflow (live eval)
- Model references are `gemini-3-flash-preview`, `gpt-4.1`, `gpt-5-mini` throughout
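
The first two criteria could be checked mechanically along these lines. This is a hedged sketch: `checkSplitOutput` is a hypothetical helper, and it takes file contents as an in-memory object because the output file names the script produces are not specified here.

```typescript
// Returns a list of problems; an empty list means the split output looks right.
function checkSplitOutput(
  files: Record<string, string>, // hypothetical: file name -> JSONL content
  expectedFileCount: number,
): string[] {
  const problems: string[] = [];
  const entries = Object.entries(files);
  if (entries.length !== expectedFileCount) {
    problems.push(`expected ${expectedFileCount} files, found ${entries.length}`);
  }
  for (const [name, content] of entries) {
    // Every record in a per-target file should share a single target.
    const targets = new Set<string>();
    for (const line of content.split("\n")) {
      if (line.trim() === "") continue;
      targets.add((JSON.parse(line) as { target: string }).target);
    }
    if (targets.size !== 1) {
      problems.push(`${name} contains ${targets.size} distinct targets, expected 1`);
    }
  }
  return problems;
}
```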

## Architecture Boundary

External-first. No core changes — stays in `examples/features/benchmark-tooling/`.
