
feat(compare): support combined JSONL input and N-way multi-model comparison #381

@christso


Problem

The current agentv compare command is strictly pairwise: it takes two pre-split JSONL files and computes deltas between them. With 3+ models this creates compounding friction:

  1. Manual splitting required — matrix eval produces one combined JSONL, but compare needs separate files. Users must run split-by-target first.
  2. Combinatorial pairwise runs — 3 models = 3 comparisons, 4 models = 6, 5 models = 10. Each is a separate command.

This is the most common workflow after a multi-target eval and it should be one command.

Industry Research

The eval frameworks surveyed here all treat N-way matrix comparison as the default, not pairwise:

Framework      Approach
──────────────────────────────────────────────────────────────────────────────────────────────
promptfoo      Matrix view: prompts × providers × tests. All models evaluated simultaneously.
               No separate compare command; comparison is the eval output.
Braintrust     N-way experiment comparison in UI. PR comments show deltas vs baseline.
LangSmith      evaluate_comparative() accepts 2+ experiments. Pairwise annotation queues for human review.
Arize Phoenix  Side-by-side experiments with baseline diffing.
wandb          Compare 50+ runs instantly. Parallel coordinates plots.
MLflow         Select N runs → compare. Table + chart views.

None of these frameworks uses a dedicated CLI command that takes two pre-split files. The standard pattern is to run the eval with multiple targets and view all results in one matrix.

Key takeaways:

  • N-way matrix is the standard — show all models side by side per test case
  • Baseline designation drives CI exit codes — one model is the reference, regressions against it fail the pipeline
  • Pairwise summaries are derived from the matrix, not a separate workflow

Full research: agentevals-research/comparisons/multi-model-comparison-and-baseline-regression.md


Proposed Enhancement

Phase 1: Combined JSONL with target filtering (pairwise shortcut)

# Filter a combined results file by target — no pre-splitting needed
agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini

  • Read one JSONL file, filter records by target field
  • Reuse existing pairwise comparison logic
  • Backward compatible — two-file positional args still work

Phase 2: N-way matrix comparison

# Auto-detect all targets in a combined file, show matrix
agentv compare results.jsonl

# Specify which targets to include
agentv compare results.jsonl --targets gemini-3-flash-preview gpt-4.1 gpt-5-mini

# Designate a baseline for CI exit code
agentv compare results.jsonl --baseline gpt-4.1

Output — matrix table with scores per test × target, pairwise summaries below:

Test ID              gemini-3-flash-preview  gpt-4.1  gpt-5-mini
─────────────────────────────────────────────────────────────────
greeting                              0.90     0.85        0.95
code-generation                       0.70     0.80        0.75
summarization                         0.85     0.90        0.80

Pairwise Summary (counts assume a win/loss threshold below 0.05):
  gemini-3-flash-preview → gpt-4.1:    2 wins, 1 loss, 0 ties  (Δ +0.033)
  gemini-3-flash-preview → gpt-5-mini: 2 wins, 1 loss, 0 ties  (Δ +0.017)
  gpt-4.1 → gpt-5-mini:               1 win, 2 losses, 0 ties  (Δ -0.017)

JSON output (--json) includes the full matrix and all pairwise comparisons.

Exit code behavior

Mode                               Exit code
─────────────────────────────────────────────────────────────────────────────
Two-file pairwise (existing)       Same as today: exit 1 on regression
Combined JSONL with --baseline     Exit 1 if any target regresses vs baseline
Combined JSONL without --baseline  Exit 0 (informational)

Implementation Guide

Current implementation

File: apps/cli/src/commands/compare/index.ts

Current structure:

  • CLI args: Two positional string args (result1, result2), --threshold, --format/--json
  • loadJsonlResults(filePath) — reads JSONL, extracts test_id + score (ignores target field)
  • compareResults(results1, results2, threshold) — matches by test_id, computes deltas, classifies win/loss/tie
  • formatTable(comparison, file1, file2) — renders pairwise table with ANSI colors
  • determineExitCode(meanDelta) — exit 0 if candidate >= baseline, else exit 1

What to change

1. CLI args

Make result2 optional. Add new flags:

// Existing (keep)
result1: positional({ type: string, description: 'Path to JSONL result file' }),
result2: positional({ type: optional(string), description: 'Path to second JSONL (pairwise mode)' }),

// New
baseline: option({ type: optional(string), long: 'baseline', short: 'b',
  description: 'Target name to use as baseline (filters combined JSONL)' }),
candidate: option({ type: optional(string), long: 'candidate', short: 'c',
  description: 'Target name to use as candidate (filters combined JSONL)' }),
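// Caution (assuming this CLI is built on cmd-ts, as the combinators suggest): restPositionals
// is a positional-argument parser, not a value type, so it is unlikely to work inside option().
// A repeatable flag or a comma-separated string split in the handler may be needed instead.
targets: option({ type: optional(restPositionals(string)), long: 'targets',
  description: 'Target names to include in matrix comparison' }),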

2. Mode detection in handler

if result2 is provided → existing pairwise mode (two files)
else if --baseline and --candidate → pairwise mode from combined JSONL
else → N-way matrix mode from combined JSONL
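
The three branches above could be expressed as a small pure function, which keeps the handler easy to unit test. This is only a sketch: the CompareArgs and CompareMode shapes and the detectMode name are hypothetical, and rejecting --candidate without --baseline is an assumption the issue does not spell out.

```typescript
// Hypothetical shape of the parsed CLI args; field names mirror the flags proposed above.
interface CompareArgs {
  result1: string;
  result2?: string;
  baseline?: string;
  candidate?: string;
  targets?: string[];
}

// Hypothetical discriminated union describing which comparison path to run.
type CompareMode =
  | { kind: 'two-file'; file1: string; file2: string }
  | { kind: 'combined-pairwise'; file: string; baseline: string; candidate: string }
  | { kind: 'matrix'; file: string; baseline?: string; targets?: string[] };

function detectMode(args: CompareArgs): CompareMode {
  // A second positional always selects the existing two-file pairwise mode.
  if (args.result2 !== undefined) {
    return { kind: 'two-file', file1: args.result1, file2: args.result2 };
  }
  if (args.candidate !== undefined) {
    // Assumption: --candidate is meaningless without --baseline, so treat it as a usage error.
    if (args.baseline === undefined) {
      throw new Error('--candidate requires --baseline');
    }
    return {
      kind: 'combined-pairwise',
      file: args.result1,
      baseline: args.baseline,
      candidate: args.candidate,
    };
  }
  // Everything else is N-way matrix mode; --baseline alone only affects the exit code.
  return { kind: 'matrix', file: args.result1, baseline: args.baseline, targets: args.targets };
}
```

Returning a tagged union rather than branching inline lets the handler switch on kind and keeps each mode's required inputs explicit in the type.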

3. New functions

loadCombinedResults(filePath: string): Map<string, EvalResult[]>

  • Reads JSONL, groups records by target field
  • Each EvalResult needs target added: { testId, score, target }
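
As a rough illustration of the grouping step, the sketch below parses JSONL text and buckets records by target. It takes the file contents as a string to stay free of I/O; loadCombinedResults(filePath) would read the file and delegate to it. The groupByTarget name and the choice to throw on malformed records are assumptions, not part of the current code.

```typescript
interface EvalResult {
  testId: string;
  score: number;
  target: string;
}

// Groups JSONL records by their `target` field, preserving first-seen target order.
function groupByTarget(jsonl: string): Map<string, EvalResult[]> {
  const groups = new Map<string, EvalResult[]>();
  for (const line of jsonl.split('\n')) {
    const trimmed = line.trim();
    if (trimmed === '') continue; // tolerate blank lines and a trailing newline

    const record = JSON.parse(trimmed) as { test_id?: unknown; score?: unknown; target?: unknown };
    const testId = record.test_id;
    const score = record.score;
    const target = record.target;
    if (typeof testId !== 'string' || typeof score !== 'number' || typeof target !== 'string') {
      // Assumption: a record missing any of the three fields is an error rather than skipped.
      throw new Error(`Invalid result record: ${trimmed}`);
    }

    const bucket = groups.get(target);
    if (bucket) {
      bucket.push({ testId, score, target });
    } else {
      groups.set(target, [{ testId, score, target }]);
    }
  }
  return groups;
}
```

Because a Map preserves insertion order, the column order of the matrix would follow the order in which targets first appear in the file.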

compareMatrix(groups: Map<string, EvalResult[]>, threshold: number): MatrixOutput

  • For each test_id, collect scores across all targets
  • Run pairwise comparisons across all target pairs (reuse existing compareResults)
  • Return: { matrix: TestRow[], pairwise: ComparisonOutput[], targets: string[] }
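
A minimal sketch of how compareMatrix might assemble its output follows. The TestRow and ComparisonOutput shapes are guesses based on the fields referenced in this issue, and comparePair is a simplified stand-in for the existing compareResults (whose exact return type is not shown here), so treat the win/loss/tie rule as an assumption.

```typescript
interface EvalResult { testId: string; score: number; target: string }
interface TestRow { testId: string; scores: Record<string, number | undefined> }
interface ComparisonOutput {
  baseline: string;
  candidate: string;
  summary: { wins: number; losses: number; ties: number; meanDelta: number };
}
interface MatrixOutput { matrix: TestRow[]; pairwise: ComparisonOutput[]; targets: string[] }

// Stand-in for compareResults: delta = candidate - baseline, tie when |delta| <= threshold.
function comparePair(
  baseline: string, candidate: string,
  baselineResults: EvalResult[], candidateResults: EvalResult[], threshold: number,
): ComparisonOutput {
  const candidateScores = new Map<string, number>();
  for (const r of candidateResults) candidateScores.set(r.testId, r.score);

  let wins = 0, losses = 0, ties = 0, total = 0, matched = 0;
  for (const r of baselineResults) {
    const other = candidateScores.get(r.testId);
    if (other === undefined) continue; // only tests present on both sides are compared
    const delta = other - r.score;
    total += delta;
    matched += 1;
    if (delta > threshold) wins += 1;
    else if (delta < -threshold) losses += 1;
    else ties += 1;
  }
  const meanDelta = matched > 0 ? total / matched : 0;
  return { baseline, candidate, summary: { wins, losses, ties, meanDelta } };
}

function compareMatrix(groups: Map<string, EvalResult[]>, threshold: number): MatrixOutput {
  const entries = Array.from(groups.entries());
  const targets = entries.map(([target]) => target);

  // One row per test_id, one score cell per target.
  const rows = new Map<string, TestRow>();
  for (const [target, results] of entries) {
    for (const r of results) {
      let row = rows.get(r.testId);
      if (!row) {
        row = { testId: r.testId, scores: {} };
        rows.set(r.testId, row);
      }
      row.scores[target] = r.score;
    }
  }

  // Every unordered pair once, oriented earlier target -> later target.
  const pairwise: ComparisonOutput[] = [];
  entries.forEach(([baseline, baselineResults], i) => {
    for (const [candidate, candidateResults] of entries.slice(i + 1)) {
      pairwise.push(comparePair(baseline, candidate, baselineResults, candidateResults, threshold));
    }
  });

  return { matrix: Array.from(rows.values()), pairwise, targets };
}
```

Note that each pair is emitted in one orientation only, so any consumer that cares about a designated baseline has to handle pairs where that target appears on the candidate side.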

formatMatrix(matrix: MatrixOutput, baselineTarget?: string): string

  • Render the score matrix table (test_id rows × target columns)
  • Below the matrix, render pairwise summaries
  • If baselineTarget specified, highlight regressions vs that target
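
For the table portion only, a plain-text renderer could look like the sketch below. It omits ANSI colors, the pairwise summaries, and baseline highlighting, and the formatMatrixTable name and the "-" placeholder for missing scores are assumptions.

```typescript
interface TestRow { testId: string; scores: Record<string, number | undefined> }

// Renders test_id rows × target columns as right-aligned, fixed-width text.
function formatMatrixTable(rows: TestRow[], targets: string[]): string {
  const idHeader = 'Test ID';
  const idWidth = Math.max(idHeader.length, ...rows.map(r => r.testId.length));
  // Each column is at least wide enough for a score such as "0.85".
  const widths = targets.map(t => Math.max(t.length, 4));

  const header = [
    idHeader.padEnd(idWidth),
    ...targets.map((t, i) => t.padStart(widths[i] ?? t.length)),
  ].join('  ');

  const lines = rows.map(row => {
    const cells = targets.map((t, i) => {
      const score = row.scores[t];
      const text = score === undefined ? '-' : score.toFixed(2);
      return text.padStart(widths[i] ?? text.length);
    });
    return [row.testId.padEnd(idWidth), ...cells].join('  ');
  });

  return [header, '─'.repeat(header.length), ...lines].join('\n');
}
```

Keeping the table renderer separate from the summary renderer would let --json mode skip both without touching the comparison logic.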

4. Exit code for matrix mode

if (baselineTarget) {
  // Exit 1 if any target regresses vs baseline
  const baselinePairs = pairwise.filter(p => p.baseline === baselineTarget);
  const anyRegression = baselinePairs.some(p => p.summary.meanDelta < 0);
  process.exit(anyRegression ? 1 : 0);
} else {
  process.exit(0); // Informational
}
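
One caveat with the filter above: if compareMatrix emits each pair in a single orientation, the designated baseline may appear as the candidate of some pairs, and those would be skipped. The sketch below is one way to handle both orientations; it returns the exit code instead of calling process.exit so it can be tested, and the matrixExitCode name and ComparisonOutput shape are assumptions.

```typescript
interface ComparisonOutput {
  baseline: string;
  candidate: string;
  summary: { meanDelta: number };
}

// Returns 1 if any other target scores lower on average than the designated baseline.
function matrixExitCode(pairwise: ComparisonOutput[], baselineTarget?: string): number {
  if (baselineTarget === undefined) return 0; // informational mode

  const anyRegression = pairwise.some(p => {
    // meanDelta is assumed to be candidate minus baseline for the pair as stored.
    if (p.baseline === baselineTarget) return p.summary.meanDelta < 0;
    // Pair stored the other way round: a positive delta means the designated
    // baseline beats the other target, i.e. the other target regresses.
    if (p.candidate === baselineTarget) return p.summary.meanDelta > 0;
    return false; // pair does not involve the baseline
  });
  return anyRegression ? 1 : 0;
}
```

Like the existing determineExitCode, this treats any negative mean delta as a regression; whether --threshold should also apply here is an open design question.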

Update benchmark-tooling example

After implementing the compare enhancement, update examples/features/benchmark-tooling/ to demonstrate the N-way workflow instead of the split workflow.

Update examples/features/benchmark-tooling/README.md

Replace the current split-focused content with:

# Benchmark Tooling

Multi-model benchmarking workflow with AgentV.

## Quick Start

### 1. Run a matrix evaluation

\`\`\`bash
agentv eval examples/features/benchmark-tooling/evals/benchmark.eval.yaml
\`\`\`

This evaluates all tests against 3 targets and writes a combined results JSONL.

### 2. Compare all targets

\`\`\`bash
# N-way matrix — see all models side by side
agentv compare .agentv/results/<output>.jsonl

# Designate a baseline for CI regression gating
agentv compare .agentv/results/<output>.jsonl --baseline gpt-4.1

# JSON output for CI pipelines
agentv compare .agentv/results/<output>.jsonl --json
\`\`\`

### 3. Pairwise comparison (optional)

\`\`\`bash
# Compare two specific targets from the combined file
agentv compare .agentv/results/<output>.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
\`\`\`

Add examples/features/benchmark-tooling/evals/benchmark.eval.yaml

execution:
  targets:
    - gemini-3-flash-preview
    - gpt-4.1
    - gpt-5-mini

tests:
  - id: greeting
    input: "Say hello"
    criteria: "The response should contain a greeting"

  - id: code-generation
    input: "Write a fibonacci function in Python"
    criteria: "The response should contain a valid Python function"

  - id: summarization
    input: "Summarize the key benefits of automated testing"
    criteria: "The response should mention reliability, speed, or regression detection"

Add examples/features/benchmark-tooling/fixtures/combined-results.jsonl

Sample combined output (9 records: 3 tests × 3 targets) so the compare command can be demonstrated without running a live eval:

# Works out of the box — no API keys needed
agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl

Each record needs: test_id, target, score, input, answer. Use realistic mock data.
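
A small check like the following could guard the fixture in a test, verifying the five required fields and that every test has a record for every target. The validateFixture name is hypothetical, and it works on the file contents as a string so it stays independent of how the fixture is loaded.

```typescript
const REQUIRED_FIELDS = ['test_id', 'target', 'score', 'input', 'answer'] as const;

// Returns a list of human-readable problems; an empty list means the fixture looks complete.
function validateFixture(jsonl: string): string[] {
  const problems: string[] = [];
  const tests = new Set<string>();
  const targets = new Set<string>();
  const pairs = new Set<string>();

  const lines = jsonl.split('\n').filter(line => line.trim() !== '');
  lines.forEach((line, index) => {
    const where = `record ${index + 1}`;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      problems.push(`${where}: not valid JSON`);
      return;
    }
    if (typeof parsed !== 'object' || parsed === null) {
      problems.push(`${where}: not a JSON object`);
      return;
    }
    const record = parsed as Record<string, unknown>;
    for (const field of REQUIRED_FIELDS) {
      if (!(field in record)) problems.push(`${where}: missing ${field}`);
    }
    const testId = record['test_id'];
    const target = record['target'];
    if (typeof testId === 'string' && typeof target === 'string') {
      const key = `${testId}\u0000${target}`;
      if (pairs.has(key)) problems.push(`${where}: duplicate test/target pair`);
      pairs.add(key);
      tests.add(testId);
      targets.add(target);
    }
  });

  // A full matrix has exactly tests × targets records (9 for 3 tests × 3 targets).
  const expected = tests.size * targets.size;
  if (pairs.size !== expected) {
    problems.push(`expected ${expected} test/target pairs, found ${pairs.size}`);
  }
  return problems;
}
```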


Acceptance Criteria

  • agentv compare results.jsonl reads a combined JSONL and shows N-way matrix
  • agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini filters by target, shows pairwise
  • agentv compare results.jsonl --baseline gpt-4.1 shows matrix, exits 1 on regression vs baseline
  • agentv compare results.jsonl --targets t1 t2 limits matrix to specified targets
  • agentv compare results.jsonl --json outputs machine-readable matrix + pairwise data
  • Two-file pairwise mode (agentv compare a.jsonl b.jsonl) still works unchanged
  • examples/features/benchmark-tooling/ updated with evals/benchmark.eval.yaml, fixture, and N-way README
  • Fixture runs out of the box: agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl

Supersedes

Closes #380 — the split-by-target example is no longer needed as a primary workflow. The script stays as a niche utility.
