From cac071d6da31d15966fba56bf175fbef892bace0 Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Thu, 26 Feb 2026 02:45:48 +0000
Subject: [PATCH 1/2] docs: update compare command docs for N-way matrix mode

All references to `agentv compare` previously documented only the two-file
pairwise mode. Update them to show the N-way matrix as the primary workflow,
along with the --baseline and --candidate flags.

Updated files:
- README.md (root + CLI): matrix output example, baseline/pairwise commands
- docs/COMPARISON.md: CI example with --baseline regression gate
- examples/features/compare/: N-way matrix + pairwise examples with output
- examples/showcase/multi-model-benchmark/: combined JSONL workflow
- plugins/agentv-dev/skills/agentv-eval-builder/: compare command reference
---
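
Reviewer note (not part of the commit message): the sample output in this
patch shows a +0.10 delta reported as a tie (gpt-4.1 vs gpt-5-mini on
`greeting`) while another +0.10 delta counts as a win (gemini-3-flash-preview
vs gpt-4.1 on `code-generation`). The sketch below is one way to reproduce
the documented Pairwise Summary figures from the `matrix` records shown in
the `--json` example. It is an illustration only, not AgentV's implementation:
the function name, the `>=` comparison, the skipping of tests that lack a
score, and the rounding to three decimals are all assumptions. Under those
assumptions the apparent inconsistency comes from binary floating point,
since 0.95 - 0.85 lands just below 0.10 and 0.80 - 0.70 lands just above it.

```python
# Hypothetical re-implementation of the pairwise summary; not AgentV's code.
# Input rows follow the "matrix" shape from the --json example in this patch.
SAMPLE_MATRIX = [
    {"test_id": "code-generation",
     "scores": {"gemini-3-flash-preview": 0.70, "gpt-4.1": 0.80, "gpt-5-mini": 0.75}},
    {"test_id": "greeting",
     "scores": {"gemini-3-flash-preview": 0.90, "gpt-4.1": 0.85, "gpt-5-mini": 0.95}},
    {"test_id": "summarization",
     "scores": {"gemini-3-flash-preview": 0.85, "gpt-4.1": 0.90, "gpt-5-mini": 0.80}},
]


def pairwise_summary(matrix, baseline, candidate, threshold=0.10):
    """Classify per-test deltas (candidate minus baseline) and average them."""
    wins = losses = ties = 0
    deltas = []
    for row in matrix:
        scores = row["scores"]
        if baseline not in scores or candidate not in scores:
            continue  # assumption: tests missing either target are skipped
        delta = scores[candidate] - scores[baseline]
        deltas.append(delta)
        if delta >= threshold:        # assumption: inclusive comparison
            wins += 1
        elif delta <= -threshold:
            losses += 1
        else:
            ties += 1
    mean_delta = sum(deltas) / len(deltas) if deltas else 0.0
    return {"wins": wins, "losses": losses, "ties": ties,
            "mean_delta": round(mean_delta, 3)}


if __name__ == "__main__":
    for base, cand in [("gemini-3-flash-preview", "gpt-4.1"),
                       ("gemini-3-flash-preview", "gpt-5-mini"),
                       ("gpt-4.1", "gpt-5-mini")]:
        print(base, "->", cand, pairwise_summary(SAMPLE_MATRIX, base, cand))
```

With the sample matrix, the three pairs yield the same win/loss/tie counts and
mean deltas as the Pairwise Summary block in the README changes below.
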
 README.md                                     | 30 +++++--
 apps/cli/README.md                            | 30 +++++--
 docs/COMPARISON.md                            |  8 +-
 examples/features/compare/README.md           | 31 +++++--
 examples/features/compare/evals/README.md     | 90 ++++++++++++++-----
 .../showcase/multi-model-benchmark/README.md  | 69 ++++++--------
 .../skills/agentv-eval-builder/SKILL.md       |  7 +-
 7 files changed, 181 insertions(+), 84 deletions(-)

diff --git a/README.md b/README.md
index e079ed599..41fc36799 100644
--- a/README.md
+++ b/README.md
@@ -278,16 +278,34 @@ agentv create eval my-eval          # → evals/my-eval.eval.yaml + .cases.jsonl
 
 ### Compare Evaluation Results
 
-Run two evaluations and compare them:
+Compare a combined results file across all targets (N-way matrix):
 
 ```bash
-agentv eval evals/my-eval.yaml --out before.jsonl
-# ... make changes to your agent ...
-agentv eval evals/my-eval.yaml --out after.jsonl
-agentv compare before.jsonl after.jsonl --threshold 0.1
+agentv compare results.jsonl
 ```
 
-Output shows wins, losses, ties, and mean delta to identify improvements.
+```
+Score Matrix
+
+  Test ID          gemini-3-flash-preview  gpt-4.1  gpt-5-mini
+  ───────────────  ──────────────────────  ───────  ──────────
+  code-generation                    0.70     0.80        0.75
+  greeting                           0.90     0.85        0.95
+  summarization                      0.85     0.90        0.80
+
+Pairwise Summary:
+  gemini-3-flash-preview → gpt-4.1:     1 win, 0 losses, 2 ties  (Δ +0.033)
+  gemini-3-flash-preview → gpt-5-mini:  0 wins, 0 losses, 3 ties  (Δ +0.017)
+  gpt-4.1 → gpt-5-mini:                 0 wins, 0 losses, 3 ties  (Δ -0.017)
+```
+
+Designate a baseline for CI regression gating, or compare two specific targets:
+
+```bash
+agentv compare results.jsonl --baseline gpt-4.1                          # exit 1 on regression
+agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini  # pairwise
+agentv compare before.jsonl after.jsonl                                  # two-file pairwise
+```
 
 ## Targets Configuration
 
diff --git a/apps/cli/README.md b/apps/cli/README.md
index e079ed599..41fc36799 100644
--- a/apps/cli/README.md
+++ b/apps/cli/README.md
@@ -278,16 +278,34 @@ agentv create eval my-eval          # → evals/my-eval.eval.yaml + .cases.jsonl
 
 ### Compare Evaluation Results
 
-Run two evaluations and compare them:
+Compare a combined results file across all targets (N-way matrix):
 
 ```bash
-agentv eval evals/my-eval.yaml --out before.jsonl
-# ... make changes to your agent ...
-agentv eval evals/my-eval.yaml --out after.jsonl
-agentv compare before.jsonl after.jsonl --threshold 0.1
+agentv compare results.jsonl
 ```
 
-Output shows wins, losses, ties, and mean delta to identify improvements.
+```
+Score Matrix
+
+  Test ID          gemini-3-flash-preview  gpt-4.1  gpt-5-mini
+  ───────────────  ──────────────────────  ───────  ──────────
+  code-generation                    0.70     0.80        0.75
+  greeting                           0.90     0.85        0.95
+  summarization                      0.85     0.90        0.80
+
+Pairwise Summary:
+  gemini-3-flash-preview → gpt-4.1:     1 win, 0 losses, 2 ties  (Δ +0.033)
+  gemini-3-flash-preview → gpt-5-mini:  0 wins, 0 losses, 3 ties  (Δ +0.017)
+  gpt-4.1 → gpt-5-mini:                 0 wins, 0 losses, 3 ties  (Δ -0.017)
+```
+
+Designate a baseline for CI regression gating, or compare two specific targets:
+
+```bash
+agentv compare results.jsonl --baseline gpt-4.1                          # exit 1 on regression
+agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini  # pairwise
+agentv compare before.jsonl after.jsonl                                  # two-file pairwise
+```
 
 ## Targets Configuration
 
diff --git a/docs/COMPARISON.md b/docs/COMPARISON.md
index ac3730cb2..6fb0347e9 100644
--- a/docs/COMPARISON.md
+++ b/docs/COMPARISON.md
@@ -57,7 +57,9 @@ No network round-trips, no waiting for managed infrastructure:
 # AgentV workflow
 agentv eval evals/my-eval.yaml
 agentv eval evals/**/*.yaml --workers 10  # Parallel
-agentv compare before.jsonl after.jsonl   # A/B testing
+agentv compare results.jsonl              # N-way matrix comparison
+agentv compare results.jsonl --baseline gpt-4.1  # CI regression gate
+agentv compare before.jsonl after.jsonl   # Two-file pairwise A/B testing
 ```
 
 ```bash
@@ -140,8 +142,10 @@ Single eval run scores all three dimensions. Other approaches:
 ```yaml
 # .github/workflows/eval.yml
 - run: agentv eval evals/**/*.yaml --out results.jsonl
+- run: agentv compare results.jsonl --baseline gpt-4.1
+  # Exit 1 if any target regresses vs baseline (N-way matrix)
 - run: agentv compare baseline.jsonl results.jsonl --threshold 0.05
-  # Fail if performance drops > 5%
+  # Or two-file pairwise: fail if performance drops > 5%
 ```
 
 Other tools face challenges here:
diff --git a/examples/features/compare/README.md b/examples/features/compare/README.md
index 87346dc23..04e41cb8e 100644
--- a/examples/features/compare/README.md
+++ b/examples/features/compare/README.md
@@ -1,26 +1,42 @@
 # Baseline vs Candidate Comparison
 
-Demonstrates comparing evaluation results between baseline and candidate versions using the `agentv compare` command.
+Demonstrates comparing evaluation results using the `agentv compare` command.
 
 ## What This Shows
 
-- Comparing two evaluation result files
+- N-way matrix comparison from a combined JSONL file
+- Two-file pairwise comparison (baseline vs candidate)
 - Score delta calculation and win/loss classification
-- Regression detection via exit codes
+- Baseline regression detection via exit codes
 - Human-readable and JSON output formats
 
 ## Running
 
 ```bash
 # From repository root
-# Compare baseline vs candidate results
-bun agentv compare examples/features/compare/evals/baseline-results.jsonl examples/features/compare/evals/candidate-results.jsonl
+
+# N-way matrix from a combined results file (see ../benchmark-tooling/ for fixture)
+agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl
+
+# Pairwise from combined file
+agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl \
+  --baseline gpt-4.1 --candidate gpt-5-mini
+
+# CI regression gate: exit 1 if any target regresses vs baseline
+agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl \
+  --baseline gpt-4.1
+
+# Two-file pairwise comparison (legacy)
+agentv compare examples/features/compare/evals/baseline-results.jsonl \
+  examples/features/compare/evals/candidate-results.jsonl
 
 # With custom threshold for win/loss classification
-bun agentv compare examples/features/compare/evals/baseline-results.jsonl examples/features/compare/evals/candidate-results.jsonl --threshold 0.05
+agentv compare examples/features/compare/evals/baseline-results.jsonl \
+  examples/features/compare/evals/candidate-results.jsonl --threshold 0.05
 
 # JSON output for CI pipelines
-bun agentv compare examples/features/compare/evals/baseline-results.jsonl examples/features/compare/evals/candidate-results.jsonl --json
+agentv compare examples/features/compare/evals/baseline-results.jsonl \
+  examples/features/compare/evals/candidate-results.jsonl --json
 ```
 
 ## Key Files
@@ -28,3 +44,4 @@ bun agentv compare examples/features/compare/evals/baseline-results.jsonl exampl
 - `evals/baseline-results.jsonl` - Results from baseline configuration
 - `evals/candidate-results.jsonl` - Results from candidate configuration
 - `evals/README.md` - Detailed usage documentation
+- `../benchmark-tooling/fixtures/combined-results.jsonl` - Combined multi-target fixture for N-way matrix
diff --git a/examples/features/compare/evals/README.md b/examples/features/compare/evals/README.md
index be6c0f81a..c01a261f0 100644
--- a/examples/features/compare/evals/README.md
+++ b/examples/features/compare/evals/README.md
@@ -1,22 +1,71 @@
 # Compare Command Example
 
-This example demonstrates the `agentv compare` command for comparing evaluation results between two runs.
+The `agentv compare` command supports three modes: N-way matrix from a combined JSONL, pairwise from a combined JSONL, and two-file pairwise.
 
 ## Use Case
 
 Compare model performance across different configurations:
-- Baseline vs. candidate prompts
-- Different model versions (e.g., GPT-4.1 vs. GPT-5)
-- Before/after optimization runs
+- N-way matrix comparison across 3+ models from a single combined results file
+- Baseline regression gating in CI (exit 1 if any target regresses)
+- Head-to-head pairwise between two specific targets
+- Before/after optimization runs (two-file pairwise)
 
 ## Sample Files
 
 - `baseline-results.jsonl` - Results from baseline configuration (GPT-4.1)
 - `candidate-results.jsonl` - Results from candidate configuration (GPT-5)
+- `../../benchmark-tooling/fixtures/combined-results.jsonl` - Combined multi-target results (3 tests x 3 targets)
 
 ## Usage
 
-### Basic Comparison
+### N-Way Matrix (combined JSONL)
+
+```bash
+agentv compare combined-results.jsonl
+```
+
+Output:
+```
+Score Matrix
+
+  Test ID          gemini-3-flash-preview  gpt-4.1  gpt-5-mini
+  ───────────────  ──────────────────────  ───────  ──────────
+  code-generation                    0.70     0.80        0.75
+  greeting                           0.90     0.85        0.95
+  summarization                      0.85     0.90        0.80
+
+Pairwise Summary:
+  gemini-3-flash-preview → gpt-4.1:     1 win, 0 losses, 2 ties  (Δ +0.033)
+  gemini-3-flash-preview → gpt-5-mini:  0 wins, 0 losses, 3 ties  (Δ +0.017)
+  gpt-4.1 → gpt-5-mini:                 0 wins, 0 losses, 3 ties  (Δ -0.017)
+```
+
+### Baseline Regression Check
+
+```bash
+agentv compare combined-results.jsonl --baseline gpt-4.1
+# Exits 1 if any target regresses vs gpt-4.1
+```
+
+### Pairwise from Combined JSONL
+
+```bash
+agentv compare combined-results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
+```
+
+```
+Comparing: gpt-4.1 → gpt-5-mini
+
+  Test ID          Baseline  Candidate     Delta  Result
+  ───────────────  ────────  ─────────  ────────  ────────
+  greeting             0.85       0.95     +0.10  = tie
+  code-generation      0.80       0.75     -0.05  = tie
+  summarization        0.90       0.80     -0.10  = tie
+
+Summary: 0 wins, 0 losses, 3 ties | Mean Δ: -0.017 | Status: regressed
+```
+
+### Two-File Pairwise (legacy)
 
 ```bash
 agentv compare baseline-results.jsonl candidate-results.jsonl
@@ -50,38 +99,39 @@ agentv compare baseline-results.jsonl candidate-results.jsonl --threshold 0.05
 For machine-readable output (CI pipelines, scripts):
 
 ```bash
-agentv compare baseline-results.jsonl candidate-results.jsonl --json
+agentv compare combined-results.jsonl --json
 ```
 
 Output uses snake_case for Python ecosystem compatibility:
 
 ```json
 {
-  "matched": [
-    {"test_id": "code-review-001", "score1": 0.72, "score2": 0.88, "delta": 0.16, "outcome": "win"}
+  "matrix": [
+    {"test_id": "code-generation", "scores": {"gemini-3-flash-preview": 0.7, "gpt-4.1": 0.8, "gpt-5-mini": 0.75}}
+  ],
+  "pairwise": [
+    {"baseline": "gemini-3-flash-preview", "candidate": "gpt-4.1", "summary": {"wins": 1, "losses": 0, "ties": 2, "mean_delta": 0.033}}
   ],
-  "unmatched": {"file1": 0, "file2": 0},
-  "summary": {
-    "total": 10,
-    "matched": 5,
-    "wins": 1,
-    "losses": 0,
-    "ties": 4,
-    "mean_delta": 0.054
-  }
+  "targets": ["gemini-3-flash-preview", "gpt-4.1", "gpt-5-mini"]
 }
 ```
 
 ## Exit Codes
 
-- `0` - Candidate is equal or better (meanDelta >= 0)
-- `1` - Baseline is better (regression detected)
+| Mode | Exit Behavior |
+|---|---|
+| Two-file pairwise | Exit 1 on regression (meanDelta < 0) |
+| Combined with `--baseline` | Exit 1 if any target regresses vs baseline |
+| Combined without `--baseline` | Exit 0 (informational) |
 
 ## CI Integration
 
 Use exit codes for automated quality gates:
 
 ```bash
-# Fail CI if candidate regresses
+# N-way: fail if any target regresses vs baseline
+agentv compare results.jsonl --baseline gpt-4.1 || { echo "Regression detected!"; exit 1; }
+
+# Two-file: fail if candidate regresses
 agentv compare baseline.jsonl candidate.jsonl || echo "Regression detected!"
 ```
diff --git a/examples/showcase/multi-model-benchmark/README.md b/examples/showcase/multi-model-benchmark/README.md
index 6fd3879af..c14a036fd 100644
--- a/examples/showcase/multi-model-benchmark/README.md
+++ b/examples/showcase/multi-model-benchmark/README.md
@@ -49,51 +49,41 @@ To run against a single target first:
 bun agentv eval examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml --target copilot
 ```
 
-### Saving Results for Comparison
-
-Save per-target results to separate files for the compare workflow:
-
-```bash
-# Run each target and save results
-bun agentv eval examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml \
-  --target copilot --out results-copilot.jsonl
-
-bun agentv eval examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml \
-  --target claude --out results-claude.jsonl
-
-bun agentv eval examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml \
-  --target gemini_base --out results-gemini.jsonl
-```
-
 ## Comparing Models
 
-Use `agentv compare` to see score deltas between any two runs:
+The eval produces a combined results file with a `target` field per record. Use `agentv compare` to see all models side by side:
 
 ```bash
-# Compare copilot vs claude
-bun agentv compare results-copilot.jsonl results-claude.jsonl
+# N-way matrix — see all models at once
+agentv compare results.jsonl
+
+# Designate a baseline for CI regression gating
+agentv compare results.jsonl --baseline copilot
 
-# Compare copilot vs gemini
-bun agentv compare results-copilot.jsonl results-gemini.jsonl
+# Pairwise: compare two specific targets
+agentv compare results.jsonl --baseline copilot --candidate claude
 
 # JSON output for CI integration
-bun agentv compare results-copilot.jsonl results-claude.jsonl --json
+agentv compare results.jsonl --json
 ```
 
 ### Expected Output
 
 ```
-Comparing: results-copilot.jsonl → results-claude.jsonl
+Score Matrix
 
-  Test ID                Baseline  Candidate     Delta  Result
-  ─────────────────────  ────────  ─────────  ────────  ────────
-  factual-geography          0.92       0.95     +0.03  = tie
-  factual-science            0.88       0.91     +0.03  = tie
-  analytical-comparison      0.78       0.85     +0.07  = tie
-  creative-explanation       0.82       0.80     -0.02  = tie
-  structured-list            0.90       0.88     -0.02  = tie
+  Test ID                copilot  claude  gemini_base
+  ─────────────────────  ───────  ──────  ───────────
+  factual-geography         0.92    0.95         0.87
+  factual-science           0.88    0.91         0.85
+  analytical-comparison     0.78    0.85         0.80
+  creative-explanation      0.82    0.80         0.83
+  structured-list           0.90    0.88         0.86
 
-Summary: 0 wins, 0 losses, 5 ties | Mean Δ: +0.018 | Status: no change
+Pairwise Summary:
+  claude → copilot:       0 wins, 0 losses, 5 ties  (Δ -0.018)
+  claude → gemini_base:   0 wins, 0 losses, 5 ties  (Δ -0.036)
+  copilot → gemini_base:  0 wins, 0 losses, 5 ties  (Δ -0.018)
 ```
 
 > **Note:** Actual scores will vary — LLM outputs are non-deterministic. The trials configuration helps surface this variability. Scores above are illustrative.
@@ -144,12 +134,14 @@ This surfaces non-determinism — if a model passes on trial 1 but fails on tria
 
 ### 4. Compare
 
-The `agentv compare` command reads two JSONL result files and computes per-test score deltas:
+The `agentv compare` command reads a combined JSONL file (with a `target` field per record) and shows an N-way matrix with pairwise summaries. For each pair, per-test deltas are classified as:
 
 - **Win**: candidate score exceeds baseline by threshold (default 0.10)
 - **Loss**: baseline score exceeds candidate by threshold
 - **Tie**: scores within threshold
 
+With `--baseline`, exit code 1 signals regression (CI-friendly).
+
 ## Evaluation Flow
 
 ```
@@ -161,19 +153,14 @@ benchmark.eval.yaml
 │  (per target × trials)  │
 └────────┬────────────────┘
          │
-    ┌────┼────────┐
-    ▼    ▼        ▼
-copilot claude  gemini
-    │    │        │
-    ▼    ▼        ▼
- .jsonl .jsonl  .jsonl
-    │    │        │
-    └────┼────────┘
+         ▼
+   combined results.jsonl
+   (all targets in one file)
          │
          ▼
 ┌─────────────────────────┐
 │  agentv compare          │
-│  (pairwise deltas)      │
+│  (N-way matrix + deltas)│
 └─────────────────────────┘
 ```
 
diff --git a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md
index 04171100a..4b622255d 100644
--- a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md
+++ b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md
@@ -355,8 +355,11 @@ agentv prompt eval judge <file.yaml> --test-id <id> --answer-file f # judge prom
 # Validate eval file
 agentv validate <file.yaml>
 
-# Compare results between runs
-agentv compare <results1.jsonl> <results2.jsonl>
+# Compare results — N-way matrix from combined JSONL
+agentv compare <combined-results.jsonl>
+agentv compare <combined-results.jsonl> --baseline <target>                   # CI regression gate
+agentv compare <combined-results.jsonl> --baseline <target> --candidate <target>  # pairwise
+agentv compare <results1.jsonl> <results2.jsonl>                              # two-file pairwise
 
 # Generate rubrics from criteria
 agentv generate rubrics <file.yaml> [--target <name>]

From 065007899054b7b21f6eed3c6fea215b275905fd Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Thu, 26 Feb 2026 02:52:45 +0000
Subject: [PATCH 2/2] docs(comparison): rename execution.evaluators to assert

The evaluators block was renamed to assert in the eval YAML schema.
Update both code examples in COMPARISON.md to use the current syntax.
---
 docs/COMPARISON.md | 38 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/docs/COMPARISON.md b/docs/COMPARISON.md
index 6fb0347e9..9b2617839 100644
--- a/docs/COMPARISON.md
+++ b/docs/COMPARISON.md
@@ -23,15 +23,14 @@
 
 **1. Hybrid Judge System (Code + LLM with Custom Prompts)**
 ```yaml
-execution:
-  evaluators:
-    - name: format_check
-      type: code_judge           # Deterministic: checks concrete outputs
-      script: ./validators/check_format.py
-
-    - name: correctness
-      type: llm_judge            # Subjective: uses customizable judge prompt
-      prompt: ./judges/correctness.md  # Edit the prompt, not the code
+assert:
+  - name: format_check
+    type: code_judge           # Deterministic: checks concrete outputs
+    script: ./validators/check_format.py
+
+  - name: correctness
+    type: llm_judge            # Subjective: uses customizable judge prompt
+    prompt: ./judges/correctness.md  # Edit the prompt, not the code
 ```
 
 This is more powerful than:
@@ -119,17 +118,16 @@ Alternative approaches:
 ### Scenario: Deterministic + Subjective Evaluation
 
 ```yaml
-execution:
-  evaluators:
-    - name: syntax_check
-      type: code_judge
-      script: ["python", "check_syntax.py"]
-    - name: logic_check
-      type: code_judge
-      script: ["python", "check_logic.py"]
-    - name: explanation_quality
-      type: llm_judge
-      prompt: judges/explanation.md
+assert:
+  - name: syntax_check
+    type: code_judge
+    script: ["python", "check_syntax.py"]
+  - name: logic_check
+    type: code_judge
+    script: ["python", "check_logic.py"]
+  - name: explanation_quality
+    type: llm_judge
+    prompt: judges/explanation.md
 ```
 
 Single eval run scores all three dimensions. Other approaches: