feat(compare): support combined JSONL input and N-way multi-model comparison #381

## Problem

The current `agentv compare` command is strictly pairwise: it takes two pre-split JSONL files and computes deltas between them. With 3+ models this creates compounding friction:

1. **Manual splitting required** — matrix eval produces one combined JSONL, but compare needs separate files. Users must run split-by-target first.
2. **Combinatorial pairwise runs** — 3 models = 3 comparisons, 4 models = 6, 5 models = 10. Each is a separate command.

This is the most common workflow after a multi-target eval and it should be one command.

## Industry Research

The eval frameworks surveyed below treat **N-way matrix comparison as the default**, not pairwise:

| Framework | Approach |
|---|---|
| **promptfoo** | Matrix view: prompts × providers × tests. All models evaluated simultaneously. No separate compare command — comparison *is* the eval output. |
| **Braintrust** | N-way experiment comparison in UI. PR comments show deltas vs baseline. |
| **LangSmith** | `evaluate_comparative()` accepts 2+ experiments. Pairwise annotation queues for human review. |
| **Arize Phoenix** | Side-by-side experiments with baseline diffing. |
| **wandb** | Compare 50+ runs instantly. Parallel coordinates plots. |
| **MLflow** | Select N runs → compare. Table + chart views. |

**None of these frameworks uses a dedicated CLI command that takes two pre-split files.** The standard pattern is: run eval with multiple targets, see all results in a matrix.

Key takeaways:

- **N-way matrix is the standard** — show all models side by side per test case
- **Baseline designation** drives CI exit codes — one model is the reference, regressions against it fail the pipeline
- **Pairwise summaries** are derived from the matrix, not a separate workflow

Full research: [agentevals-research/comparisons/multi-model-comparison-and-baseline-regression.md](https://github.com/agentevals/agentevals-research/blob/main/research/comparisons/multi-model-comparison-and-baseline-regression.md)

---

## Proposed Enhancement

### Phase 1: Combined JSONL with target filtering (pairwise shortcut)

```bash
# Filter a combined results file by target — no pre-splitting needed
agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
```

- Read one JSONL file, filter records by `target` field
- Reuse existing pairwise comparison logic
- Backward compatible — two-file positional args still work
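
A minimal sketch of the Phase 1 filtering step, assuming records have already been parsed into the `{ testId, score, target }` shape described later in this issue. `selectPair` is a hypothetical helper name, and throwing on an unknown target is an assumption rather than documented behaviour:

```typescript
// Hypothetical record shape; the real loader may carry more fields.
interface EvalResult {
  testId: string;
  score: number;
  target: string;
}

// Split a combined result set into the two inputs the pairwise logic expects.
// Throwing when a target has no records is an assumption for illustration.
function selectPair(
  results: EvalResult[],
  baselineTarget: string,
  candidateTarget: string,
): { baseline: EvalResult[]; candidate: EvalResult[] } {
  const baseline = results.filter((r) => r.target === baselineTarget);
  const candidate = results.filter((r) => r.target === candidateTarget);
  if (baseline.length === 0) throw new Error(`No records for target "${baselineTarget}"`);
  if (candidate.length === 0) throw new Error(`No records for target "${candidateTarget}"`);
  return { baseline, candidate };
}

const combined: EvalResult[] = [
  { testId: "greeting", score: 0.9, target: "gemini-3-flash-preview" },
  { testId: "greeting", score: 0.85, target: "gpt-4.1" },
  { testId: "greeting", score: 0.95, target: "gpt-5-mini" },
];

const pair = selectPair(combined, "gpt-4.1", "gpt-5-mini");
console.log(pair.baseline.length, pair.candidate.length); // 1 1
```

The two arrays can then be handed to the existing pairwise comparison unchanged.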

### Phase 2: N-way matrix comparison

```bash
# Auto-detect all targets in a combined file, show matrix
agentv compare results.jsonl

# Specify which targets to include
agentv compare results.jsonl --targets gemini-3-flash-preview gpt-4.1 gpt-5-mini

# Designate a baseline for CI exit code
agentv compare results.jsonl --baseline gpt-4.1
```

Illustrative output: a matrix table with one score per test × target, followed by pairwise summaries shown as baseline → candidate, where Δ is the mean of candidate score minus baseline score. The win/loss/tie counts assume a `--threshold` below 0.05, so every nonzero delta counts as a win or a loss:

```
Test ID              gemini-3-flash-preview  gpt-4.1  gpt-5-mini
─────────────────────────────────────────────────────────────────
greeting                              0.90     0.85        0.95
code-generation                       0.70     0.80        0.75
summarization                         0.85     0.90        0.80

Pairwise Summary:
  gemini-3-flash-preview → gpt-4.1:    2 wins, 1 loss, 0 ties  (Δ +0.033)
  gemini-3-flash-preview → gpt-5-mini: 2 wins, 1 loss, 0 ties  (Δ +0.017)
  gpt-4.1 → gpt-5-mini:                1 win, 2 losses, 0 ties (Δ -0.017)
```

JSON output (`--json`) includes the full matrix and all pairwise comparisons.

### Exit code behavior

| Mode | Exit Code |
|---|---|
| Two-file pairwise (existing) | Same as today — exit 1 on regression |
| Combined JSONL with `--baseline` | Exit 1 if any target regresses vs baseline |
| Combined JSONL without `--baseline` | Exit 0 (informational) |

---

## Implementation Guide

### Current implementation

File: `apps/cli/src/commands/compare/index.ts`

Current structure:
- **CLI args:** Two positional `string` args (`result1`, `result2`), `--threshold`, `--format`/`--json`
- **`loadJsonlResults(filePath)`** — reads JSONL, extracts `test_id` + `score` (ignores `target` field)
- **`compareResults(results1, results2, threshold)`** — matches by `test_id`, computes deltas, classifies win/loss/tie
- **`formatTable(comparison, file1, file2)`** — renders pairwise table with ANSI colors
- **`determineExitCode(meanDelta)`** — exit 0 if candidate >= baseline, else exit 1
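
For orientation, the pairwise logic summarized above might look roughly like the sketch below. This is not the code in `index.ts`: the strict `>` / `<` threshold test, the handling of unmatched tests, and the return shape are all assumptions made for illustration.

```typescript
interface EvalResult {
  testId: string;
  score: number;
}

type Outcome = "win" | "loss" | "tie";

// Assumed rule: a delta must exceed the threshold in magnitude to count.
function classify(delta: number, threshold: number): Outcome {
  if (delta > threshold) return "win";
  if (delta < -threshold) return "loss";
  return "tie";
}

// Match results by test id; tests missing from either side are skipped (assumption).
function compareResults(baseline: EvalResult[], candidate: EvalResult[], threshold: number) {
  const baselineScores = new Map<string, number>();
  baseline.forEach((r) => baselineScores.set(r.testId, r.score));

  const rows = candidate
    .filter((r) => baselineScores.has(r.testId))
    .map((r) => {
      const delta = r.score - (baselineScores.get(r.testId) as number);
      return { testId: r.testId, delta, outcome: classify(delta, threshold) };
    });

  const meanDelta =
    rows.length === 0 ? 0 : rows.reduce((sum, r) => sum + r.delta, 0) / rows.length;
  return { rows, meanDelta };
}

const gpt41: EvalResult[] = [
  { testId: "greeting", score: 0.85 },
  { testId: "code-generation", score: 0.8 },
  { testId: "summarization", score: 0.9 },
];
const gpt5mini: EvalResult[] = [
  { testId: "greeting", score: 0.95 },
  { testId: "code-generation", score: 0.75 },
  { testId: "summarization", score: 0.8 },
];

const summary = compareResults(gpt41, gpt5mini, 0.01);
console.log(summary.rows.map((r) => r.outcome).join(",")); // win,loss,loss
console.log(summary.meanDelta < 0); // true
```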

### What to change

#### 1. CLI args

Make `result2` optional. Add new flags:

```typescript
// Existing (keep)
result1: positional({ type: string, description: 'Path to JSONL result file' }),
result2: positional({ type: optional(string), description: 'Path to second JSONL (pairwise mode)' }),

// New
baseline: option({ type: optional(string), long: 'baseline', short: 'b',
  description: 'Target name to use as baseline (filters combined JSONL)' }),
candidate: option({ type: optional(string), long: 'candidate', short: 'c',
  description: 'Target name to use as candidate (filters combined JSONL)' }),
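// Assumption: this CLI is built on cmd-ts. There, `restPositionals` parses trailing
// positionals and is not an option type, so a repeatable flag is sketched with
// `multioption` + `array(string)`: `--targets a --targets b`. The space-separated
// form (`--targets a b`) would need custom parsing.
targets: multioption({ type: array(string), long: 'targets',
  description: 'Target names to include in matrix comparison' }),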
```

#### 2. Mode detection in handler

```
if result2 is provided → existing pairwise mode (two files)
else if --baseline and --candidate → pairwise mode from combined JSONL
else → N-way matrix mode from combined JSONL
```
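
The branching above could be expressed as a small pure function, which keeps the handler easy to unit-test. `CompareArgs`, `CompareMode`, and `detectMode` are hypothetical names used only for this sketch:

```typescript
type CompareMode = "two-file" | "combined-pairwise" | "matrix";

// Hypothetical view of the parsed CLI arguments relevant to mode selection.
interface CompareArgs {
  result1: string;
  result2?: string;
  baseline?: string;
  candidate?: string;
}

function detectMode(args: CompareArgs): CompareMode {
  // A second positional file always means the existing two-file pairwise mode.
  if (args.result2 !== undefined) return "two-file";
  // Both target names given: pairwise comparison filtered from one combined file.
  if (args.baseline !== undefined && args.candidate !== undefined) return "combined-pairwise";
  // Everything else, including --baseline alone, is N-way matrix mode.
  return "matrix";
}

console.log(detectMode({ result1: "a.jsonl", result2: "b.jsonl" })); // two-file
console.log(detectMode({ result1: "results.jsonl", baseline: "gpt-4.1" })); // matrix
```

Note that `--candidate` without `--baseline` also falls through to matrix mode under these rules; an implementation may prefer to reject that combination with an error.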

#### 3. New functions

**`loadCombinedResults(filePath: string): Map<string, EvalResult[]>`**
- Reads JSONL, groups records by `target` field
- Each `EvalResult` needs `target` added: `{ testId, score, target }`
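
A sketch of the grouping step, under the assumption that each JSONL line carries `test_id`, `score`, and `target`. To stay self-contained it takes the file contents as a string rather than a path, and `groupByTarget` is a hypothetical name for the core of `loadCombinedResults`:

```typescript
interface EvalResult {
  testId: string;
  score: number;
  target: string;
}

// Group parsed records by their `target` field, preserving first-seen order.
function groupByTarget(jsonlText: string): Map<string, EvalResult[]> {
  const groups = new Map<string, EvalResult[]>();
  for (const line of jsonlText.split("\n")) {
    if (line.trim() === "") continue; // tolerate blank lines and a trailing newline
    const raw = JSON.parse(line) as { test_id: string; score: number; target: string };
    const result: EvalResult = { testId: raw.test_id, score: raw.score, target: raw.target };
    const bucket = groups.get(result.target) ?? [];
    bucket.push(result);
    groups.set(result.target, bucket);
  }
  return groups;
}

const sampleJsonl = [
  '{"test_id":"greeting","target":"gpt-4.1","score":0.85}',
  '{"test_id":"greeting","target":"gpt-5-mini","score":0.95}',
  '{"test_id":"summarization","target":"gpt-4.1","score":0.9}',
  "",
].join("\n");

const groups = groupByTarget(sampleJsonl);
console.log(Array.from(groups.keys()).join(",")); // gpt-4.1,gpt-5-mini
```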

**`compareMatrix(groups: Map<string, EvalResult[]>, threshold: number): MatrixOutput`**
- For each test_id, collect scores across all targets
- Run pairwise comparisons across all target pairs (reuse existing `compareResults`)
- Return: `{ matrix: TestRow[], pairwise: ComparisonOutput[], targets: string[] }`
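
One possible shape for the matrix step is sketched below. N targets yield N(N-1)/2 unordered pairs, so 3 targets produce 3 pairwise summaries. The inline mean-delta computation is a stand-in for reusing `compareResults`, and `PairSummary`, `buildMatrix`, and the sorted target order are assumptions for illustration:

```typescript
interface EvalResult { testId: string; score: number; target: string; }
interface TestRow { testId: string; scores: Record<string, number | undefined>; }
interface PairSummary { baseline: string; candidate: string; meanDelta: number; }

function buildMatrix(groups: Map<string, EvalResult[]>) {
  const targets = Array.from(groups.keys()).sort();

  // One row per test id, one score column per target.
  const byTest = new Map<string, TestRow>();
  groups.forEach((results, target) => {
    for (const r of results) {
      const row = byTest.get(r.testId) ?? { testId: r.testId, scores: {} };
      row.scores[target] = r.score;
      byTest.set(r.testId, row);
    }
  });
  const matrix = Array.from(byTest.values());

  // Every unordered pair once: earlier target is the baseline, later the candidate.
  const pairwise: PairSummary[] = [];
  for (let i = 0; i < targets.length; i++) {
    for (let j = i + 1; j < targets.length; j++) {
      const baseline = targets[i];
      const candidate = targets[j];
      const deltas = matrix
        .filter((row) => row.scores[baseline] !== undefined && row.scores[candidate] !== undefined)
        .map((row) => (row.scores[candidate] as number) - (row.scores[baseline] as number));
      const meanDelta =
        deltas.length === 0 ? 0 : deltas.reduce((sum, d) => sum + d, 0) / deltas.length;
      pairwise.push({ baseline, candidate, meanDelta });
    }
  }
  return { targets, matrix, pairwise };
}

// Scores taken from the sample matrix in this issue.
const scores: Record<string, number[]> = {
  "gemini-3-flash-preview": [0.9, 0.7, 0.85],
  "gpt-4.1": [0.85, 0.8, 0.9],
  "gpt-5-mini": [0.95, 0.75, 0.8],
};
const testIds = ["greeting", "code-generation", "summarization"];
const sampleGroups = new Map<string, EvalResult[]>();
Object.keys(scores).forEach((target) => {
  sampleGroups.set(target, testIds.map((testId, i) => ({ testId, target, score: scores[target][i] })));
});

const output = buildMatrix(sampleGroups);
console.log(output.pairwise.length); // 3
```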

**`formatMatrix(matrix: MatrixOutput, baselineTarget?: string): string`**
- Render the score matrix table (test_id rows × target columns)
- Below the matrix, render pairwise summaries
- If `baselineTarget` specified, highlight regressions vs that target
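
As a rough illustration of the table rendering only, the sketch below lays out test rows against target columns in plain text. It takes the targets and rows directly instead of a full `MatrixOutput`, and it omits ANSI colors, the pairwise summary, and baseline highlighting; those simplifications and the `-` placeholder for missing scores are assumptions:

```typescript
interface TestRow {
  testId: string;
  scores: Record<string, number | undefined>;
}

function formatMatrix(targets: string[], rows: TestRow[]): string {
  const idWidth = Math.max("Test ID".length, ...rows.map((r) => r.testId.length));
  // Each column is at least wide enough for a score such as "0.90".
  const colWidth = (target: string) => Math.max(target.length, 4);

  const header =
    "Test ID".padEnd(idWidth) + targets.map((t) => "  " + t.padStart(colWidth(t))).join("");

  const lines = rows.map((row) => {
    const cells = targets.map((t) => {
      const score = row.scores[t];
      const text = score === undefined ? "-" : score.toFixed(2);
      return "  " + text.padStart(colWidth(t)); // right-align under the target name
    });
    return row.testId.padEnd(idWidth) + cells.join("");
  });

  return [header, "-".repeat(header.length), ...lines].join("\n");
}

const table = formatMatrix(
  ["gpt-4.1", "gpt-5-mini"],
  [
    { testId: "greeting", scores: { "gpt-4.1": 0.85, "gpt-5-mini": 0.95 } },
    { testId: "summarization", scores: { "gpt-4.1": 0.9 } },
  ],
);
console.log(table);
```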

#### 4. Exit code for matrix mode

```typescript
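if (baselineTarget) {
  // Exit 1 if any target regresses vs baseline.
  // Pairs cover every target combination once, so the baseline may sit on either
  // side; flip the sign when it is the candidate. (`p.candidate` is assumed to
  // mirror `p.baseline` on each pairwise entry.)
  const anyRegression = pairwise.some(p => {
    if (p.baseline === baselineTarget) return p.summary.meanDelta < 0;
    if (p.candidate === baselineTarget) return p.summary.meanDelta > 0;
    return false;
  });
  process.exit(anyRegression ? 1 : 0);
} else {
  process.exit(0); // Informational
}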
```

---

## Update benchmark-tooling example

After implementing the compare enhancement, update `examples/features/benchmark-tooling/` to demonstrate the N-way workflow instead of the split workflow.

### Update `examples/features/benchmark-tooling/README.md`

Replace the current split-focused content with:

````markdown
# Benchmark Tooling

Multi-model benchmarking workflow with AgentV.

## Quick Start

### 1. Run a matrix evaluation

```bash
agentv eval examples/features/benchmark-tooling/evals/benchmark.eval.yaml
```

This evaluates all tests against 3 targets and writes a combined results JSONL.

### 2. Compare all targets

```bash
# N-way matrix — see all models side by side
agentv compare .agentv/results/<output>.jsonl

# Designate a baseline for CI regression gating
agentv compare .agentv/results/<output>.jsonl --baseline gpt-4.1

# JSON output for CI pipelines
agentv compare .agentv/results/<output>.jsonl --json
```

### 3. Pairwise comparison (optional)

```bash
# Compare two specific targets from the combined file
agentv compare .agentv/results/<output>.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
```
````

### Add `examples/features/benchmark-tooling/evals/benchmark.eval.yaml`

```yaml
execution:
  targets:
    - gemini-3-flash-preview
    - gpt-4.1
    - gpt-5-mini

tests:
  - id: greeting
    input: "Say hello"
    criteria: "The response should contain a greeting"

  - id: code-generation
    input: "Write a fibonacci function in Python"
    criteria: "The response should contain a valid Python function"

  - id: summarization
    input: "Summarize the key benefits of automated testing"
    criteria: "The response should mention reliability, speed, or regression detection"
```

### Add `examples/features/benchmark-tooling/fixtures/combined-results.jsonl`

Sample combined output (9 records: 3 tests × 3 targets) so the compare command can be demonstrated without running a live eval:

```bash
# Works out of the box — no API keys needed
agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl
```

Each record needs: `test_id`, `target`, `score`, `input`, `answer`. Use realistic mock data.
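
To keep the fixture honest, a small check like the one below could confirm that every line has the five required fields and that all test × target cells are present. The field types (strings plus a numeric `score`) are an assumption, since this issue only lists the field names; `isFixtureRecord` and `missingCells` are hypothetical helpers, and the sample answer text is mock data:

```typescript
interface FixtureRecord {
  test_id: string;
  target: string;
  score: number;
  input: string;
  answer: string;
}

// Type guard for one parsed JSONL line (assumed field types).
function isFixtureRecord(value: unknown): value is FixtureRecord {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.test_id === "string" &&
    typeof v.target === "string" &&
    typeof v.score === "number" &&
    typeof v.input === "string" &&
    typeof v.answer === "string"
  );
}

// List "test_id/target" cells that have no record, e.g. to verify 3 tests × 3 targets = 9.
function missingCells(records: FixtureRecord[], testIds: string[], targets: string[]): string[] {
  const present = new Set(records.map((r) => `${r.test_id}/${r.target}`));
  const missing: string[] = [];
  for (const testId of testIds) {
    for (const target of targets) {
      if (!present.has(`${testId}/${target}`)) missing.push(`${testId}/${target}`);
    }
  }
  return missing;
}

const line =
  '{"test_id":"greeting","target":"gpt-4.1","score":0.85,"input":"Say hello","answer":"Hello! How can I help?"}';
const parsed: unknown = JSON.parse(line);
const fixtureRecords = isFixtureRecord(parsed) ? [parsed] : [];
console.log(missingCells(fixtureRecords, ["greeting"], ["gpt-4.1", "gpt-5-mini"])); // [ 'greeting/gpt-5-mini' ]
```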

---

## Acceptance Criteria

- [ ] `agentv compare results.jsonl` reads a combined JSONL and shows N-way matrix
- [ ] `agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini` filters by target, shows pairwise
- [ ] `agentv compare results.jsonl --baseline gpt-4.1` shows matrix, exits 1 on regression vs baseline
- [ ] `agentv compare results.jsonl --targets t1 t2` limits matrix to specified targets
- [ ] `agentv compare results.jsonl --json` outputs machine-readable matrix + pairwise data
- [ ] Two-file pairwise mode (`agentv compare a.jsonl b.jsonl`) still works unchanged
- [ ] `examples/features/benchmark-tooling/` updated with `evals/benchmark.eval.yaml`, fixture, and N-way README
- [ ] Fixture runs out of the box: `agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl`

## Supersedes

Closes #380 — the split-by-target example is no longer needed as a primary workflow. The script stays as a niche utility.
