From cac071d6da31d15966fba56bf175fbef892bace0 Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Thu, 26 Feb 2026 02:45:48 +0000
Subject: [PATCH 1/2] docs: update compare command docs for N-way matrix mode

All references to `agentv compare` previously only documented two-file
pairwise mode. Updated to show the N-way matrix as the primary workflow
with --baseline, --candidate, and --targets flags.

Updated files:
- README.md (root + CLI): matrix output example, baseline/pairwise commands
- docs/COMPARISON.md: CI example with --baseline regression gate
- examples/features/compare/: N-way matrix + pairwise examples with output
- examples/showcase/multi-model-benchmark/: combined JSONL workflow
- plugins/agentv-dev/skills/agentv-eval-builder/: compare command reference
---
 README.md                                    | 30 +++++--
 apps/cli/README.md                           | 30 +++++--
 docs/COMPARISON.md                           |  8 +-
 examples/features/compare/README.md          | 31 +++++--
 examples/features/compare/evals/README.md    | 90 ++++++++++++++-----
 .../showcase/multi-model-benchmark/README.md | 69 ++++++--------
 .../skills/agentv-eval-builder/SKILL.md      |  7 +-
 7 files changed, 181 insertions(+), 84 deletions(-)

diff --git a/README.md b/README.md
index e079ed599..41fc36799 100644
--- a/README.md
+++ b/README.md
@@ -278,16 +278,34 @@ agentv create eval my-eval # → evals/my-eval.eval.yaml + .cases.jsonl
 
 ### Compare Evaluation Results
 
-Run two evaluations and compare them:
+Compare a combined results file across all targets (N-way matrix):
 
 ```bash
-agentv eval evals/my-eval.yaml --out before.jsonl
-# ... make changes to your agent ...
-agentv eval evals/my-eval.yaml --out after.jsonl
-agentv compare before.jsonl after.jsonl --threshold 0.1
+agentv compare results.jsonl
 ```
 
-Output shows wins, losses, ties, and mean delta to identify improvements.
+```
+Score Matrix
+
+  Test ID          gemini-3-flash-preview  gpt-4.1  gpt-5-mini
+  ───────────────  ──────────────────────  ───────  ──────────
+  code-generation                    0.70     0.80        0.75
+  greeting                           0.90     0.85        0.95
+  summarization                      0.85     0.90        0.80
+
+Pairwise Summary:
+  gemini-3-flash-preview → gpt-4.1: 1 win, 0 losses, 2 ties (Δ +0.033)
+  gemini-3-flash-preview → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ +0.017)
+  gpt-4.1 → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ -0.017)
+```
+
+Designate a baseline for CI regression gating, or compare two specific targets:
+
+```bash
+agentv compare results.jsonl --baseline gpt-4.1 # exit 1 on regression
+agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini # pairwise
+agentv compare before.jsonl after.jsonl # two-file pairwise
+```
 
 ## Targets Configuration
 
diff --git a/apps/cli/README.md b/apps/cli/README.md
index e079ed599..41fc36799 100644
--- a/apps/cli/README.md
+++ b/apps/cli/README.md
@@ -278,16 +278,34 @@ agentv create eval my-eval # → evals/my-eval.eval.yaml + .cases.jsonl
 
 ### Compare Evaluation Results
 
-Run two evaluations and compare them:
+Compare a combined results file across all targets (N-way matrix):
 
 ```bash
-agentv eval evals/my-eval.yaml --out before.jsonl
-# ... make changes to your agent ...
-agentv eval evals/my-eval.yaml --out after.jsonl
-agentv compare before.jsonl after.jsonl --threshold 0.1
+agentv compare results.jsonl
 ```
 
-Output shows wins, losses, ties, and mean delta to identify improvements.
+```
+Score Matrix
+
+  Test ID          gemini-3-flash-preview  gpt-4.1  gpt-5-mini
+  ───────────────  ──────────────────────  ───────  ──────────
+  code-generation                    0.70     0.80        0.75
+  greeting                           0.90     0.85        0.95
+  summarization                      0.85     0.90        0.80
+
+Pairwise Summary:
+  gemini-3-flash-preview → gpt-4.1: 1 win, 0 losses, 2 ties (Δ +0.033)
+  gemini-3-flash-preview → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ +0.017)
+  gpt-4.1 → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ -0.017)
+```
+
+Designate a baseline for CI regression gating, or compare two specific targets:
+
+```bash
+agentv compare results.jsonl --baseline gpt-4.1 # exit 1 on regression
+agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini # pairwise
+agentv compare before.jsonl after.jsonl # two-file pairwise
+```
 
 ## Targets Configuration
 
diff --git a/docs/COMPARISON.md b/docs/COMPARISON.md
index ac3730cb2..6fb0347e9 100644
--- a/docs/COMPARISON.md
+++ b/docs/COMPARISON.md
@@ -57,7 +57,9 @@ No network round-trips, no waiting for managed infrastructure:
 # AgentV workflow
 agentv eval evals/my-eval.yaml
 agentv eval evals/**/*.yaml --workers 10 # Parallel
-agentv compare before.jsonl after.jsonl # A/B testing
+agentv compare results.jsonl # N-way matrix comparison
+agentv compare results.jsonl --baseline gpt-4.1 # CI regression gate
+agentv compare before.jsonl after.jsonl # Two-file pairwise A/B testing
 ```
 
 ```bash
@@ -140,8 +142,10 @@ Single eval run scores all three dimensions. Other approaches:
 ```yaml
 # .github/workflows/eval.yml
 - run: agentv eval evals/**/*.yaml --out results.jsonl
+- run: agentv compare results.jsonl --baseline gpt-4.1
+  # Exit 1 if any target regresses vs baseline (N-way matrix)
 - run: agentv compare baseline.jsonl results.jsonl --threshold 0.05
-  # Fail if performance drops > 5%
+  # Or two-file pairwise: fail if performance drops > 5%
 ```
 
 Other tools face challenges here:
diff --git a/examples/features/compare/README.md b/examples/features/compare/README.md
index 87346dc23..04e41cb8e 100644
--- a/examples/features/compare/README.md
+++ b/examples/features/compare/README.md
@@ -1,26 +1,42 @@
 # Baseline vs Candidate Comparison
 
-Demonstrates comparing evaluation results between baseline and candidate versions using the `agentv compare` command.
+Demonstrates comparing evaluation results using the `agentv compare` command.
 
 ## What This Shows
 
-- Comparing two evaluation result files
+- N-way matrix comparison from a combined JSONL file
+- Two-file pairwise comparison (baseline vs candidate)
 - Score delta calculation and win/loss classification
-- Regression detection via exit codes
+- Baseline regression detection via exit codes
 - Human-readable and JSON output formats
 
 ## Running
 
 ```bash
 # From repository root
-# Compare baseline vs candidate results
-bun agentv compare examples/features/compare/evals/baseline-results.jsonl examples/features/compare/evals/candidate-results.jsonl
+
+# N-way matrix from a combined results file (see ../benchmark-tooling/ for fixture)
+agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl
+
+# Pairwise from combined file
+agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl \
+  --baseline gpt-4.1 --candidate gpt-5-mini
+
+# CI regression gate: exit 1 if any target regresses vs baseline
+agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl \
+  --baseline gpt-4.1
+
+# Two-file pairwise comparison (legacy)
+agentv compare examples/features/compare/evals/baseline-results.jsonl \
+  examples/features/compare/evals/candidate-results.jsonl
 
 # With custom threshold for win/loss classification
-bun agentv compare examples/features/compare/evals/baseline-results.jsonl examples/features/compare/evals/candidate-results.jsonl --threshold 0.05
+agentv compare examples/features/compare/evals/baseline-results.jsonl \
+  examples/features/compare/evals/candidate-results.jsonl --threshold 0.05
 
 # JSON output for CI pipelines
-bun agentv compare examples/features/compare/evals/baseline-results.jsonl examples/features/compare/evals/candidate-results.jsonl --json
+agentv compare examples/features/compare/evals/baseline-results.jsonl \
+  examples/features/compare/evals/candidate-results.jsonl --json
 ```
 
 ## Key Files
@@ -28,3 +44,4 @@ bun agentv compare examples/features/compare/evals/baseline-results.jsonl exampl
 - `evals/baseline-results.jsonl` - Results from baseline configuration
 - `evals/candidate-results.jsonl` - Results from candidate configuration
 - `evals/README.md` - Detailed usage documentation
+- `../benchmark-tooling/fixtures/combined-results.jsonl` - Combined multi-target fixture for N-way matrix
diff --git a/examples/features/compare/evals/README.md b/examples/features/compare/evals/README.md
index be6c0f81a..c01a261f0 100644
--- a/examples/features/compare/evals/README.md
+++ b/examples/features/compare/evals/README.md
@@ -1,22 +1,71 @@
 # Compare Command Example
 
-This example demonstrates the `agentv compare` command for comparing evaluation results between two runs.
+The `agentv compare` command supports three modes: N-way matrix from a combined JSONL, pairwise from a combined JSONL, and two-file pairwise.
 
 ## Use Case
 
 Compare model performance across different configurations:
-- Baseline vs. candidate prompts
-- Different model versions (e.g., GPT-4.1 vs. GPT-5)
-- Before/after optimization runs
+- N-way matrix comparison across 3+ models from a single combined results file
+- Baseline regression gating in CI (exit 1 if any target regresses)
+- Head-to-head pairwise between two specific targets
+- Before/after optimization runs (two-file pairwise)
 
 ## Sample Files
 
 - `baseline-results.jsonl` - Results from baseline configuration (GPT-4.1)
 - `candidate-results.jsonl` - Results from candidate configuration (GPT-5)
+- `../../benchmark-tooling/fixtures/combined-results.jsonl` - Combined multi-target results (3 tests x 3 targets)
 
 ## Usage
 
-### Basic Comparison
+### N-Way Matrix (combined JSONL)
+
+```bash
+agentv compare combined-results.jsonl
+```
+
+Output:
+```
+Score Matrix
+
+  Test ID          gemini-3-flash-preview  gpt-4.1  gpt-5-mini
+  ───────────────  ──────────────────────  ───────  ──────────
+  code-generation                    0.70     0.80        0.75
+  greeting                           0.90     0.85        0.95
+  summarization                      0.85     0.90        0.80
+
+Pairwise Summary:
+  gemini-3-flash-preview → gpt-4.1: 1 win, 0 losses, 2 ties (Δ +0.033)
+  gemini-3-flash-preview → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ +0.017)
+  gpt-4.1 → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ -0.017)
+```
+
+### Baseline Regression Check
+
+```bash
+agentv compare combined-results.jsonl --baseline gpt-4.1
+# Exits 1 if any target regresses vs gpt-4.1
+```
+
+### Pairwise from Combined JSONL
+
+```bash
+agentv compare combined-results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
+```
+
+```
+Comparing: gpt-4.1 → gpt-5-mini
+
+  Test ID          Baseline  Candidate     Delta  Result
+  ───────────────  ────────  ─────────  ────────  ────────
+  greeting             0.85       0.95     +0.10  = tie
+  code-generation      0.80       0.75     -0.05  = tie
+  summarization        0.90       0.80     -0.10  = tie
+
+Summary: 0 wins, 0 losses, 3 ties | Mean Δ: -0.017 | Status: regressed
+```
+
+### Two-File Pairwise (legacy)
 
 ```bash
 agentv compare baseline-results.jsonl candidate-results.jsonl
@@ -50,38 +99,39 @@ agentv compare baseline-results.jsonl candidate-results.jsonl --threshold 0.05
 For machine-readable output (CI pipelines, scripts):
 
 ```bash
-agentv compare baseline-results.jsonl candidate-results.jsonl --json
+agentv compare combined-results.jsonl --json
 ```
 
 Output uses snake_case for Python ecosystem compatibility:
 
 ```json
 {
-  "matched": [
-    {"test_id": "code-review-001", "score1": 0.72, "score2": 0.88, "delta": 0.16, "outcome": "win"}
+  "matrix": [
+    {"test_id": "code-generation", "scores": {"gemini-3-flash-preview": 0.7, "gpt-4.1": 0.8, "gpt-5-mini": 0.75}}
+  ],
+  "pairwise": [
+    {"baseline": "gemini-3-flash-preview", "candidate": "gpt-4.1", "summary": {"wins": 1, "losses": 0, "ties": 2, "mean_delta": 0.033}}
   ],
-  "unmatched": {"file1": 0, "file2": 0},
-  "summary": {
-    "total": 10,
-    "matched": 5,
-    "wins": 1,
-    "losses": 0,
-    "ties": 4,
-    "mean_delta": 0.054
-  }
+  "targets": ["gemini-3-flash-preview", "gpt-4.1", "gpt-5-mini"]
 }
 ```
 
 ## Exit Codes
 
-- `0` - Candidate is equal or better (meanDelta >= 0)
-- `1` - Baseline is better (regression detected)
+| Mode | Exit Code |
+|---|---|
+| Two-file pairwise | Exit 1 on regression (meanDelta < 0) |
+| Combined with `--baseline` | Exit 1 if any target regresses vs baseline |
+| Combined without `--baseline` | Exit 0 (informational) |
 
 ## CI Integration
 
 Use exit codes for automated quality gates:
 
 ```bash
-# Fail CI if candidate regresses
+# N-way: fail if any target regresses vs baseline
+agentv compare results.jsonl --baseline gpt-4.1 || echo "Regression detected!"
+
+# Two-file: fail if candidate regresses
 agentv compare baseline.jsonl candidate.jsonl || echo "Regression detected!"
 ```
diff --git a/examples/showcase/multi-model-benchmark/README.md b/examples/showcase/multi-model-benchmark/README.md
index 6fd3879af..c14a036fd 100644
--- a/examples/showcase/multi-model-benchmark/README.md
+++ b/examples/showcase/multi-model-benchmark/README.md
@@ -49,51 +49,41 @@ To run against a single target first:
 bun agentv eval examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml --target copilot
 ```
 
-### Saving Results for Comparison
-
-Save per-target results to separate files for the compare workflow:
-
-```bash
-# Run each target and save results
-bun agentv eval examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml \
-  --target copilot --out results-copilot.jsonl
-
-bun agentv eval examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml \
-  --target claude --out results-claude.jsonl
-
-bun agentv eval examples/showcase/multi-model-benchmark/evals/benchmark.eval.yaml \
-  --target gemini_base --out results-gemini.jsonl
-```
-
 ## Comparing Models
 
-Use `agentv compare` to see score deltas between any two runs:
+The eval produces a combined results file with a `target` field per record. Use `agentv compare` to see all models side by side:
 
 ```bash
-# Compare copilot vs claude
-bun agentv compare results-copilot.jsonl results-claude.jsonl
+# N-way matrix — see all models at once
+agentv compare results.jsonl
+
+# Designate a baseline for CI regression gating
+agentv compare results.jsonl --baseline copilot
 
-# Compare copilot vs gemini
-bun agentv compare results-copilot.jsonl results-gemini.jsonl
+# Pairwise: compare two specific targets
+agentv compare results.jsonl --baseline copilot --candidate claude
 
 # JSON output for CI integration
-bun agentv compare results-copilot.jsonl results-claude.jsonl --json
+agentv compare results.jsonl --json
 ```
 
 ### Expected Output
 
 ```
-Comparing: results-copilot.jsonl → results-claude.jsonl
+Score Matrix
 
-  Test ID                Baseline  Candidate     Delta  Result
-  ─────────────────────  ────────  ─────────  ────────  ────────
-  factual-geography          0.92       0.95     +0.03  = tie
-  factual-science            0.88       0.91     +0.03  = tie
-  analytical-comparison      0.78       0.85     +0.07  = tie
-  creative-explanation       0.82       0.80     -0.02  = tie
-  structured-list            0.90       0.88     -0.02  = tie
+  Test ID                copilot  claude  gemini_base
+  ─────────────────────  ───────  ──────  ───────────
+  factual-geography         0.92    0.95         0.87
+  factual-science           0.88    0.91         0.85
+  analytical-comparison     0.78    0.85         0.80
+  creative-explanation      0.82    0.80         0.83
+  structured-list           0.90    0.88         0.86
 
-Summary: 0 wins, 0 losses, 5 ties | Mean Δ: +0.018 | Status: no change
+Pairwise Summary:
+  claude → copilot: 0 wins, 0 losses, 5 ties (Δ -0.018)
+  claude → gemini_base: 0 wins, 0 losses, 5 ties (Δ -0.036)
+  copilot → gemini_base: 0 wins, 0 losses, 5 ties (Δ -0.018)
 ```
 
 > **Note:** Actual scores will vary — LLM outputs are non-deterministic. The trials configuration helps surface this variability. Scores above are illustrative.
@@ -144,12 +134,14 @@ This surfaces non-determinism — if a model passes on trial 1 but fails on tria
 
 ### 4. Compare
 
-The `agentv compare` command reads two JSONL result files and computes per-test score deltas:
+The `agentv compare` command reads a combined JSONL (with `target` field) and shows an N-way matrix with pairwise summaries. Each pair classifies per-test deltas:
 
 - **Win**: candidate score exceeds baseline by threshold (default 0.10)
 - **Loss**: baseline score exceeds candidate by threshold
 - **Tie**: scores within threshold
 
+With `--baseline`, exit code 1 signals regression (CI-friendly).
+
 ## Evaluation Flow
 
 ```
@@ -161,19 +153,14 @@ benchmark.eval.yaml
 │  (per target × trials)  │
 └────────┬────────────────┘
          │
-    ┌────┼────────┐
-    ▼    ▼        ▼
-copilot claude  gemini
-    │    │        │
-    ▼    ▼        ▼
-  .jsonl .jsonl .jsonl
-    │    │        │
-    └────┼────────┘
+         ▼
+   combined results.jsonl
+ (all targets in one file)
          │
          ▼
 ┌─────────────────────────┐
 │     agentv compare      │
-│    (pairwise deltas)    │
+│  (N-way matrix + deltas)│
 └─────────────────────────┘
 ```
 
diff --git a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md
index 04171100a..4b622255d 100644
--- a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md
+++ b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md
@@ -355,8 +355,11 @@ agentv prompt eval judge --test-id --answer-file f # judge prom
 # Validate eval file
 agentv validate
 
-# Compare results between runs
-agentv compare
+# Compare results — N-way matrix from combined JSONL
+agentv compare
+agentv compare --baseline # CI regression gate
+agentv compare --baseline --candidate # pairwise
+agentv compare # two-file pairwise
 
 # Generate rubrics from criteria
 agentv generate rubrics [--target ]

From 065007899054b7b21f6eed3c6fea215b275905fd Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Thu, 26 Feb 2026 02:52:45 +0000
Subject: [PATCH 2/2] docs(comparison): rename execution.evaluators to assert

The evaluators block was renamed to assert in the eval YAML schema.
Update both code examples in COMPARISON.md to use the current syntax.
---
 docs/COMPARISON.md | 38 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/docs/COMPARISON.md b/docs/COMPARISON.md
index 6fb0347e9..9b2617839 100644
--- a/docs/COMPARISON.md
+++ b/docs/COMPARISON.md
@@ -23,15 +23,14 @@
 **1. Hybrid Judge System (Code + LLM with Custom Prompts)**
 
 ```yaml
-execution:
-  evaluators:
-    - name: format_check
-      type: code_judge # Deterministic: checks concrete outputs
-      script: ./validators/check_format.py
-
-    - name: correctness
-      type: llm_judge # Subjective: uses customizable judge prompt
-      prompt: ./judges/correctness.md # Edit the prompt, not the code
+assert:
+  - name: format_check
+    type: code_judge # Deterministic: checks concrete outputs
+    script: ./validators/check_format.py
+
+  - name: correctness
+    type: llm_judge # Subjective: uses customizable judge prompt
+    prompt: ./judges/correctness.md # Edit the prompt, not the code
 ```
 
 This is more powerful than:
@@ -119,17 +118,16 @@ Alternative approaches:
 ### Scenario: Deterministic + Subjective Evaluation
 
 ```yaml
-execution:
-  evaluators:
-    - name: syntax_check
-      type: code_judge
-      script: ["python", "check_syntax.py"]
-    - name: logic_check
-      type: code_judge
-      script: ["python", "check_logic.py"]
-    - name: explanation_quality
-      type: llm_judge
-      prompt: judges/explanation.md
+assert:
+  - name: syntax_check
+    type: code_judge
+    script: ["python", "check_syntax.py"]
+  - name: logic_check
+    type: code_judge
+    script: ["python", "check_logic.py"]
+  - name: explanation_quality
+    type: llm_judge
+    prompt: judges/explanation.md
 ```
 
 Single eval run scores all three dimensions. Other approaches: