EntityProcess · christso · Feb 26, 2026
diff --git a/README.md b/README.md
@@ -278,16 +278,34 @@ agentv create eval my-eval          # → evals/my-eval.eval.yaml + .cases.jsonl
 
 ### Compare Evaluation Results
 
-Run two evaluations and compare them:
+Compare a combined results file across all targets (N-way matrix):
 
 ```bash
-agentv eval evals/my-eval.yaml --out before.jsonl
-# ... make changes to your agent ...
-agentv eval evals/my-eval.yaml --out after.jsonl
-agentv compare before.jsonl after.jsonl --threshold 0.1
+agentv compare results.jsonl
 ```
 
-Output shows wins, losses, ties, and mean delta to identify improvements.
+```
+Score Matrix
+
+  Test ID          gemini-3-flash-preview  gpt-4.1  gpt-5-mini
+  ───────────────  ──────────────────────  ───────  ──────────
+  code-generation                    0.70     0.80        0.75
+  greeting                           0.90     0.85        0.95
+  summarization                      0.85     0.90        0.80
+
+Pairwise Summary:
+  gemini-3-flash-preview → gpt-4.1:     1 win, 0 losses, 2 ties  (Δ +0.033)
+  gemini-3-flash-preview → gpt-5-mini:  0 wins, 0 losses, 3 ties  (Δ +0.017)
+  gpt-4.1 → gpt-5-mini:                 0 wins, 0 losses, 3 ties  (Δ -0.017)
+```
+
+Designate a baseline for CI regression gating, or compare two specific targets:
+
+```bash
+agentv compare results.jsonl --baseline gpt-4.1                          # exit 1 on regression
+agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini  # pairwise
+agentv compare before.jsonl after.jsonl                                  # two-file pairwise
+```
 
 ## Targets Configuration
 

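The Pairwise Summary rows in the sample output can be reproduced from the Score Matrix with a short sketch. It assumes a win/loss threshold of 0.1 (the value the removed example passed via `--threshold`), a `>=` comparison, and plain float subtraction; `pairwise_summary` is an illustrative name, not part of agentv. Under those assumptions, float rounding puts 0.80 - 0.70 just above 0.1 and 0.95 - 0.85 just below it, which would be consistent with one +0.10 delta counting as a win and another as a tie.

```python
from itertools import combinations

# Scores copied from the sample Score Matrix above.
SCORES = {
    "code-generation": {"gemini-3-flash-preview": 0.70, "gpt-4.1": 0.80, "gpt-5-mini": 0.75},
    "greeting": {"gemini-3-flash-preview": 0.90, "gpt-4.1": 0.85, "gpt-5-mini": 0.95},
    "summarization": {"gemini-3-flash-preview": 0.85, "gpt-4.1": 0.90, "gpt-5-mini": 0.80},
}


def pairwise_summary(scores, baseline, candidate, threshold=0.1):
    """Classify each test as win/loss/tie for candidate vs baseline.

    The threshold default and the >= / <= rule are assumptions, not agentv's
    confirmed implementation.
    """
    wins = losses = ties = 0
    deltas = []
    for per_target in scores.values():
        delta = per_target[candidate] - per_target[baseline]
        deltas.append(delta)
        if delta >= threshold:
            wins += 1
        elif delta <= -threshold:
            losses += 1
        else:
            ties += 1
    return {
        "wins": wins,
        "losses": losses,
        "ties": ties,
        "mean_delta": round(sum(deltas) / len(deltas), 3),
    }


if __name__ == "__main__":
    targets = sorted({t for row in SCORES.values() for t in row})
    for baseline, candidate in combinations(targets, 2):
        print(baseline, "->", candidate, pairwise_summary(SCORES, baseline, candidate))
```

If an implementation rounded deltas before comparing them, `greeting` and `summarization` in the gpt-4.1 → gpt-5-mini pair would classify as a win and a loss instead of ties, so the documented output suggests unrounded comparison.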
diff --git a/apps/cli/README.md b/apps/cli/README.md
@@ -278,16 +278,34 @@ agentv create eval my-eval          # → evals/my-eval.eval.yaml + .cases.jsonl
 
 ### Compare Evaluation Results
 
-Run two evaluations and compare them:
+Compare a combined results file across all targets (N-way matrix):
 
 ```bash
-agentv eval evals/my-eval.yaml --out before.jsonl
-# ... make changes to your agent ...
-agentv eval evals/my-eval.yaml --out after.jsonl
-agentv compare before.jsonl after.jsonl --threshold 0.1
+agentv compare results.jsonl
 ```
 
-Output shows wins, losses, ties, and mean delta to identify improvements.
+```
+Score Matrix
+
+  Test ID          gemini-3-flash-preview  gpt-4.1  gpt-5-mini
+  ───────────────  ──────────────────────  ───────  ──────────
+  code-generation                    0.70     0.80        0.75
+  greeting                           0.90     0.85        0.95
+  summarization                      0.85     0.90        0.80
+
+Pairwise Summary:
+  gemini-3-flash-preview → gpt-4.1:     1 win, 0 losses, 2 ties  (Δ +0.033)
+  gemini-3-flash-preview → gpt-5-mini:  0 wins, 0 losses, 3 ties  (Δ +0.017)
+  gpt-4.1 → gpt-5-mini:                 0 wins, 0 losses, 3 ties  (Δ -0.017)
+```
+
+Designate a baseline for CI regression gating, or compare two specific targets:
+
+```bash
+agentv compare results.jsonl --baseline gpt-4.1                          # exit 1 on regression
+agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini  # pairwise
+agentv compare before.jsonl after.jsonl                                  # two-file pairwise
+```
 
 ## Targets Configuration
 

diff --git a/docs/COMPARISON.md b/docs/COMPARISON.md
@@ -23,15 +23,14 @@
 
 **1. Hybrid Judge System (Code + LLM with Custom Prompts)**
 ```yaml
-execution:
-  evaluators:
-    - name: format_check
-      type: code_judge           # Deterministic: checks concrete outputs
-      script: ./validators/check_format.py
-
-    - name: correctness
-      type: llm_judge            # Subjective: uses customizable judge prompt
-      prompt: ./judges/correctness.md  # Edit the prompt, not the code
+assert:
+  - name: format_check
+    type: code_judge           # Deterministic: checks concrete outputs
+    script: ./validators/check_format.py
+
+  - name: correctness
+    type: llm_judge            # Subjective: uses customizable judge prompt
+    prompt: ./judges/correctness.md  # Edit the prompt, not the code
 ```
 
 This is more powerful than:
@@ -57,7 +56,9 @@ No network round-trips, no waiting for managed infrastructure:
 # AgentV workflow
 agentv eval evals/my-eval.yaml
 agentv eval evals/**/*.yaml --workers 10  # Parallel
-agentv compare before.jsonl after.jsonl   # A/B testing
+agentv compare results.jsonl              # N-way matrix comparison
+agentv compare results.jsonl --baseline gpt-4.1  # CI regression gate
+agentv compare before.jsonl after.jsonl   # Two-file pairwise A/B testing
 ```
 
 ```bash
@@ -117,17 +118,16 @@ Alternative approaches:
 ### Scenario: Deterministic + Subjective Evaluation
 
 ```yaml
-execution:
-  evaluators:
-    - name: syntax_check
-      type: code_judge
-      script: ["python", "check_syntax.py"]
-    - name: logic_check
-      type: code_judge
-      script: ["python", "check_logic.py"]
-    - name: explanation_quality
-      type: llm_judge
-      prompt: judges/explanation.md
+assert:
+  - name: syntax_check
+    type: code_judge
+    script: ["python", "check_syntax.py"]
+  - name: logic_check
+    type: code_judge
+    script: ["python", "check_logic.py"]
+  - name: explanation_quality
+    type: llm_judge
+    prompt: judges/explanation.md
 ```
 
 Single eval run scores all three dimensions. Other approaches:
@@ -140,8 +140,10 @@ Single eval run scores all three dimensions. Other approaches:
 ```yaml
 # .github/workflows/eval.yml
 - run: agentv eval evals/**/*.yaml --out results.jsonl
+- run: agentv compare results.jsonl --baseline gpt-4.1
+  # Exit 1 if any target regresses vs baseline (N-way matrix)
 - run: agentv compare baseline.jsonl results.jsonl --threshold 0.05
-  # Fail if performance drops > 5%
+  # Or two-file pairwise: exit 1 if the mean delta is negative (--threshold only sets the win/loss margin)
 ```
 
 Other tools face challenges here:

diff --git a/examples/features/compare/README.md b/examples/features/compare/README.md
@@ -1,30 +1,47 @@
 # Baseline vs Candidate Comparison
 
-Demonstrates comparing evaluation results between baseline and candidate versions using the `agentv compare` command.
+Demonstrates comparing evaluation results using the `agentv compare` command.
 
 ## What This Shows
 
-- Comparing two evaluation result files
+- N-way matrix comparison from a combined JSONL file
+- Two-file pairwise comparison (baseline vs candidate)
 - Score delta calculation and win/loss classification
-- Regression detection via exit codes
+- Baseline regression detection via exit codes
 - Human-readable and JSON output formats
 
 ## Running
 
 ```bash
 # From repository root
-# Compare baseline vs candidate results
-bun agentv compare examples/features/compare/evals/baseline-results.jsonl examples/features/compare/evals/candidate-results.jsonl
+
+# N-way matrix from a combined results file (see ../benchmark-tooling/ for fixture)
+agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl
+
+# Pairwise from combined file
+agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl \
+  --baseline gpt-4.1 --candidate gpt-5-mini
+
+# CI regression gate: exit 1 if any target regresses vs baseline
+agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl \
+  --baseline gpt-4.1
+
+# Two-file pairwise comparison (legacy)
+agentv compare examples/features/compare/evals/baseline-results.jsonl \
+  examples/features/compare/evals/candidate-results.jsonl
 
 # With custom threshold for win/loss classification
-bun agentv compare examples/features/compare/evals/baseline-results.jsonl examples/features/compare/evals/candidate-results.jsonl --threshold 0.05
+agentv compare examples/features/compare/evals/baseline-results.jsonl \
+  examples/features/compare/evals/candidate-results.jsonl --threshold 0.05
 
 # JSON output for CI pipelines
-bun agentv compare examples/features/compare/evals/baseline-results.jsonl examples/features/compare/evals/candidate-results.jsonl --json
+agentv compare examples/features/compare/evals/baseline-results.jsonl \
+  examples/features/compare/evals/candidate-results.jsonl --json
 ```
 
 ## Key Files
 
 - `evals/baseline-results.jsonl` - Results from baseline configuration
 - `evals/candidate-results.jsonl` - Results from candidate configuration
 - `evals/README.md` - Detailed usage documentation
+- `../benchmark-tooling/fixtures/combined-results.jsonl` - Combined multi-target fixture for N-way matrix
diff --git a/examples/features/compare/evals/README.md b/examples/features/compare/evals/README.md
@@ -1,22 +1,71 @@
 # Compare Command Example
 
-This example demonstrates the `agentv compare` command for comparing evaluation results between two runs.
+The `agentv compare` command supports three modes: an N-way matrix from a combined JSONL file, a pairwise comparison from a combined JSONL file, and a two-file pairwise comparison.
 
 ## Use Case
 
 Compare model performance across different configurations:
-- Baseline vs. candidate prompts
-- Different model versions (e.g., GPT-4.1 vs. GPT-5)
-- Before/after optimization runs
+- N-way matrix comparison across 3+ models from a single combined results file
+- Baseline regression gating in CI (exit 1 if any target regresses)
+- Head-to-head pairwise between two specific targets
+- Before/after optimization runs (two-file pairwise)
 
 ## Sample Files
 
 - `baseline-results.jsonl` - Results from baseline configuration (GPT-4.1)
 - `candidate-results.jsonl` - Results from candidate configuration (GPT-5)
+- `../../benchmark-tooling/fixtures/combined-results.jsonl` - Combined multi-target results (3 tests x 3 targets)
 
 ## Usage
 
-### Basic Comparison
+### N-Way Matrix (combined JSONL)
+
+```bash
+agentv compare combined-results.jsonl
+```
+
+Output:
+```
+Score Matrix
+
+  Test ID          gemini-3-flash-preview  gpt-4.1  gpt-5-mini
+  ───────────────  ──────────────────────  ───────  ──────────
+  code-generation                    0.70     0.80        0.75
+  greeting                           0.90     0.85        0.95
+  summarization                      0.85     0.90        0.80
+
+Pairwise Summary:
+  gemini-3-flash-preview → gpt-4.1:     1 win, 0 losses, 2 ties  (Δ +0.033)
+  gemini-3-flash-preview → gpt-5-mini:  0 wins, 0 losses, 3 ties  (Δ +0.017)
+  gpt-4.1 → gpt-5-mini:                 0 wins, 0 losses, 3 ties  (Δ -0.017)
+```
+
+### Baseline Regression Check
+
+```bash
+agentv compare combined-results.jsonl --baseline gpt-4.1
+# Exits 1 if any target regresses vs gpt-4.1
+```
+
+### Pairwise from Combined JSONL
+
+```bash
+agentv compare combined-results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
+```
+
+```
+Comparing: gpt-4.1 → gpt-5-mini
+
+  Test ID          Baseline  Candidate     Delta  Result
+  ───────────────  ────────  ─────────  ────────  ────────
+  greeting             0.85       0.95     +0.10  = tie
+  code-generation      0.80       0.75     -0.05  = tie
+  summarization        0.90       0.80     -0.10  = tie
+
+Summary: 0 wins, 0 losses, 3 ties | Mean Δ: -0.017 | Status: regressed
+```
+
+### Two-File Pairwise (legacy)
 
 ```bash
 agentv compare baseline-results.jsonl candidate-results.jsonl
@@ -50,38 +99,39 @@ agentv compare baseline-results.jsonl candidate-results.jsonl --threshold 0.05
 For machine-readable output (CI pipelines, scripts):
 
 ```bash
-agentv compare baseline-results.jsonl candidate-results.jsonl --json
+agentv compare combined-results.jsonl --json
 ```
 
 Output uses snake_case for Python ecosystem compatibility:
 
 ```json
 {
-  "matched": [
-    {"test_id": "code-review-001", "score1": 0.72, "score2": 0.88, "delta": 0.16, "outcome": "win"}
+  "matrix": [
+    {"test_id": "code-generation", "scores": {"gemini-3-flash-preview": 0.7, "gpt-4.1": 0.8, "gpt-5-mini": 0.75}}
+  ],
+  "pairwise": [
+    {"baseline": "gemini-3-flash-preview", "candidate": "gpt-4.1", "summary": {"wins": 1, "losses": 0, "ties": 2, "mean_delta": 0.033}}
   ],
-  "unmatched": {"file1": 0, "file2": 0},
-  "summary": {
-    "total": 10,
-    "matched": 5,
-    "wins": 1,
-    "losses": 0,
-    "ties": 4,
-    "mean_delta": 0.054
-  }
+  "targets": ["gemini-3-flash-preview", "gpt-4.1", "gpt-5-mini"]
 }
 ```
 
 ## Exit Codes
 
-- `0` - Candidate is equal or better (meanDelta >= 0)
-- `1` - Baseline is better (regression detected)
+| Mode | Exit Code |
+|---|---|
+| Two-file pairwise | Exit 1 on regression (mean delta < 0), otherwise exit 0 |
+| Combined with `--baseline` | Exit 1 if any target regresses vs baseline |
+| Combined without `--baseline` | Exit 0 (informational) |
 
 ## CI Integration
 
 Use exit codes for automated quality gates:
 
 ```bash
-# Fail CI if candidate regresses
+# N-way: fail if any target regresses vs baseline
+agentv compare results.jsonl --baseline gpt-4.1 || { echo "Regression detected!"; exit 1; }
+
+# Two-file: report if candidate regresses (add `exit 1` as above to fail CI)
 agentv compare baseline.jsonl candidate.jsonl || echo "Regression detected!"
 ```
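
A script that consumes the `--json` report could apply its own gate. The sketch below assumes the report has the `pairwise` shape shown in the JSON example above, and that a target "regresses" when its `mean_delta` against the baseline is negative (the rule documented for two-file mode; whether the N-way gate uses the same rule is an assumption). `regressed_targets` is a hypothetical helper, and the second pairwise entry is transcribed from the text Pairwise Summary rather than from real JSON output.

```python
import json

# Fragment shaped like the documented --json output. The gpt-4.1 -> gpt-5-mini
# entry is transcribed from the text summary, not captured from a real run.
REPORT = json.loads("""
{
  "pairwise": [
    {"baseline": "gemini-3-flash-preview", "candidate": "gpt-4.1",
     "summary": {"wins": 1, "losses": 0, "ties": 2, "mean_delta": 0.033}},
    {"baseline": "gpt-4.1", "candidate": "gpt-5-mini",
     "summary": {"wins": 0, "losses": 0, "ties": 3, "mean_delta": -0.017}}
  ],
  "targets": ["gemini-3-flash-preview", "gpt-4.1", "gpt-5-mini"]
}
""")


def regressed_targets(report, baseline):
    """Return targets whose mean score is below `baseline` (assumed rule)."""
    regressed = []
    for pair in report["pairwise"]:
        delta = pair["summary"]["mean_delta"]
        if pair["baseline"] == baseline and delta < 0:
            regressed.append(pair["candidate"])
        elif pair["candidate"] == baseline and delta > 0:
            # Pair is listed in the other direction: the baseline scored
            # higher, so the other target trails it.
            regressed.append(pair["baseline"])
    return sorted(regressed)


if __name__ == "__main__":
    print("regressed vs gpt-4.1:", regressed_targets(REPORT, "gpt-4.1"))
```

In a CI script, a non-empty result could be turned into a failing step with `sys.exit(1)`; handling both pair directions matters because the report lists each pair only once.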