30 changes: 24 additions & 6 deletions README.md
@@ -278,16 +278,34 @@ agentv create eval my-eval # → evals/my-eval.eval.yaml + .cases.jsonl

### Compare Evaluation Results

Run two evaluations and compare them:
Compare a combined results file across all targets (N-way matrix):

```bash
agentv eval evals/my-eval.yaml --out before.jsonl
# ... make changes to your agent ...
agentv eval evals/my-eval.yaml --out after.jsonl
agentv compare before.jsonl after.jsonl --threshold 0.1
agentv compare results.jsonl
```

Output shows wins, losses, ties, and mean delta to identify improvements.
```
Score Matrix

Test ID gemini-3-flash-preview gpt-4.1 gpt-5-mini
─────────────── ────────────────────── ─────── ──────────
code-generation 0.70 0.80 0.75
greeting 0.90 0.85 0.95
summarization 0.85 0.90 0.80

Pairwise Summary:
gemini-3-flash-preview → gpt-4.1: 1 win, 0 losses, 2 ties (Δ +0.033)
gemini-3-flash-preview → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ +0.017)
gpt-4.1 → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ -0.017)
```

Designate a baseline for CI regression gating, or compare two specific targets:

```bash
agentv compare results.jsonl --baseline gpt-4.1 # exit 1 on regression
agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini # pairwise
agentv compare before.jsonl after.jsonl # two-file pairwise
```

## Targets Configuration

30 changes: 24 additions & 6 deletions apps/cli/README.md
@@ -278,16 +278,34 @@ agentv create eval my-eval # → evals/my-eval.eval.yaml + .cases.jsonl

### Compare Evaluation Results

Run two evaluations and compare them:
Compare a combined results file across all targets (N-way matrix):

```bash
agentv eval evals/my-eval.yaml --out before.jsonl
# ... make changes to your agent ...
agentv eval evals/my-eval.yaml --out after.jsonl
agentv compare before.jsonl after.jsonl --threshold 0.1
agentv compare results.jsonl
```

Output shows wins, losses, ties, and mean delta to identify improvements.
```
Score Matrix

Test ID gemini-3-flash-preview gpt-4.1 gpt-5-mini
─────────────── ────────────────────── ─────── ──────────
code-generation 0.70 0.80 0.75
greeting 0.90 0.85 0.95
summarization 0.85 0.90 0.80

Pairwise Summary:
gemini-3-flash-preview → gpt-4.1: 1 win, 0 losses, 2 ties (Δ +0.033)
gemini-3-flash-preview → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ +0.017)
gpt-4.1 → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ -0.017)
```

Designate a baseline for CI regression gating, or compare two specific targets:

```bash
agentv compare results.jsonl --baseline gpt-4.1 # exit 1 on regression
agentv compare results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini # pairwise
agentv compare before.jsonl after.jsonl # two-file pairwise
```

## Targets Configuration

46 changes: 24 additions & 22 deletions docs/COMPARISON.md
@@ -23,15 +23,14 @@

**1. Hybrid Judge System (Code + LLM with Custom Prompts)**
```yaml
execution:
evaluators:
- name: format_check
type: code_judge # Deterministic: checks concrete outputs
script: ./validators/check_format.py

- name: correctness
type: llm_judge # Subjective: uses customizable judge prompt
prompt: ./judges/correctness.md # Edit the prompt, not the code
assert:
- name: format_check
type: code_judge # Deterministic: checks concrete outputs
script: ./validators/check_format.py

- name: correctness
type: llm_judge # Subjective: uses customizable judge prompt
prompt: ./judges/correctness.md # Edit the prompt, not the code
```

This is more powerful than:
@@ -57,7 +56,9 @@ No network round-trips, no waiting for managed infrastructure:
# AgentV workflow
agentv eval evals/my-eval.yaml
agentv eval evals/**/*.yaml --workers 10 # Parallel
agentv compare before.jsonl after.jsonl # A/B testing
agentv compare results.jsonl # N-way matrix comparison
agentv compare results.jsonl --baseline gpt-4.1 # CI regression gate
agentv compare before.jsonl after.jsonl # Two-file pairwise A/B testing
```

```bash
@@ -117,17 +118,16 @@ Alternative approaches:
### Scenario: Deterministic + Subjective Evaluation

```yaml
execution:
evaluators:
- name: syntax_check
type: code_judge
script: ["python", "check_syntax.py"]
- name: logic_check
type: code_judge
script: ["python", "check_logic.py"]
- name: explanation_quality
type: llm_judge
prompt: judges/explanation.md
assert:
- name: syntax_check
type: code_judge
script: ["python", "check_syntax.py"]
- name: logic_check
type: code_judge
script: ["python", "check_logic.py"]
- name: explanation_quality
type: llm_judge
prompt: judges/explanation.md
```

Single eval run scores all three dimensions. Other approaches:
@@ -140,8 +140,10 @@ Single eval run scores all three dimensions. Other approaches:
```yaml
# .github/workflows/eval.yml
- run: agentv eval evals/**/*.yaml --out results.jsonl
- run: agentv compare results.jsonl --baseline gpt-4.1
# Exit 1 if any target regresses vs baseline (N-way matrix)
- run: agentv compare baseline.jsonl results.jsonl --threshold 0.05
# Fail if performance drops > 5%
# Or two-file pairwise: fail if performance drops > 5%
```

Other tools face challenges here:
31 changes: 24 additions & 7 deletions examples/features/compare/README.md
@@ -1,30 +1,47 @@
# Baseline vs Candidate Comparison

Demonstrates comparing evaluation results between baseline and candidate versions using the `agentv compare` command.
Demonstrates comparing evaluation results using the `agentv compare` command.

## What This Shows

- Comparing two evaluation result files
- N-way matrix comparison from a combined JSONL file
- Two-file pairwise comparison (baseline vs candidate)
- Score delta calculation and win/loss classification
- Regression detection via exit codes
- Baseline regression detection via exit codes
- Human-readable and JSON output formats

## Running

```bash
# From repository root
# Compare baseline vs candidate results
bun agentv compare examples/features/compare/evals/baseline-results.jsonl examples/features/compare/evals/candidate-results.jsonl

# N-way matrix from a combined results file (see ../benchmark-tooling/ for fixture)
agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl

# Pairwise from combined file
agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl \
--baseline gpt-4.1 --candidate gpt-5-mini

# CI regression gate: exit 1 if any target regresses vs baseline
agentv compare examples/features/benchmark-tooling/fixtures/combined-results.jsonl \
--baseline gpt-4.1

# Two-file pairwise comparison (legacy)
agentv compare examples/features/compare/evals/baseline-results.jsonl \
examples/features/compare/evals/candidate-results.jsonl

# With custom threshold for win/loss classification
bun agentv compare examples/features/compare/evals/baseline-results.jsonl examples/features/compare/evals/candidate-results.jsonl --threshold 0.05
agentv compare examples/features/compare/evals/baseline-results.jsonl \
examples/features/compare/evals/candidate-results.jsonl --threshold 0.05

# JSON output for CI pipelines
bun agentv compare examples/features/compare/evals/baseline-results.jsonl examples/features/compare/evals/candidate-results.jsonl --json
agentv compare examples/features/compare/evals/baseline-results.jsonl \
examples/features/compare/evals/candidate-results.jsonl --json
```

## Key Files

- `evals/baseline-results.jsonl` - Results from baseline configuration
- `evals/candidate-results.jsonl` - Results from candidate configuration
- `evals/README.md` - Detailed usage documentation
- `../benchmark-tooling/fixtures/combined-results.jsonl` - Combined multi-target fixture for N-way matrix
90 changes: 70 additions & 20 deletions examples/features/compare/evals/README.md
@@ -1,22 +1,71 @@
# Compare Command Example

This example demonstrates the `agentv compare` command for comparing evaluation results between two runs.
The `agentv compare` command supports three modes: N-way matrix from a combined JSONL, pairwise from a combined JSONL, and two-file pairwise.

## Use Case

Compare model performance across different configurations:
- Baseline vs. candidate prompts
- Different model versions (e.g., GPT-4.1 vs. GPT-5)
- Before/after optimization runs
- N-way matrix comparison across 3+ models from a single combined results file
- Baseline regression gating in CI (exit 1 if any target regresses)
- Head-to-head pairwise between two specific targets
- Before/after optimization runs (two-file pairwise)

## Sample Files

- `baseline-results.jsonl` - Results from baseline configuration (GPT-4.1)
- `candidate-results.jsonl` - Results from candidate configuration (GPT-5)
- `../../benchmark-tooling/fixtures/combined-results.jsonl` - Combined multi-target results (3 tests x 3 targets)

## Usage

### Basic Comparison
### N-Way Matrix (combined JSONL)

```bash
agentv compare combined-results.jsonl
```

Output:
```
Score Matrix

Test ID gemini-3-flash-preview gpt-4.1 gpt-5-mini
─────────────── ────────────────────── ─────── ──────────
code-generation 0.70 0.80 0.75
greeting 0.90 0.85 0.95
summarization 0.85 0.90 0.80

Pairwise Summary:
gemini-3-flash-preview → gpt-4.1: 1 win, 0 losses, 2 ties (Δ +0.033)
gemini-3-flash-preview → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ +0.017)
gpt-4.1 → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ -0.017)
```
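
The pairwise summary lines above can be reproduced from the score matrix in a few lines of Python. This is a minimal sketch, not AgentV's actual implementation: it assumes a per-test delta counts as a win or loss when its magnitude is at least a threshold of 0.1 (an assumed default), and the function name `pairwise_summary` is hypothetical.

```python
from itertools import combinations

# Scores copied from the Score Matrix above: test ID -> target -> score.
SCORES = {
    "code-generation": {"gemini-3-flash-preview": 0.70, "gpt-4.1": 0.80, "gpt-5-mini": 0.75},
    "greeting": {"gemini-3-flash-preview": 0.90, "gpt-4.1": 0.85, "gpt-5-mini": 0.95},
    "summarization": {"gemini-3-flash-preview": 0.85, "gpt-4.1": 0.90, "gpt-5-mini": 0.80},
}


def pairwise_summary(scores, baseline, candidate, threshold=0.1):
    """Count wins, losses, and ties for candidate vs baseline and average the deltas.

    Assumption: |delta| >= threshold is a win or loss; anything smaller is a tie.
    """
    deltas = [row[candidate] - row[baseline] for row in scores.values()]
    wins = sum(1 for d in deltas if d >= threshold)
    losses = sum(1 for d in deltas if d <= -threshold)
    return {
        "wins": wins,
        "losses": losses,
        "ties": len(deltas) - wins - losses,
        "mean_delta": round(sum(deltas) / len(deltas), 3),
    }


# Enumerate every baseline/candidate pair in sorted target order.
targets = sorted(next(iter(SCORES.values())))
for baseline, candidate in combinations(targets, 2):
    s = pairwise_summary(SCORES, baseline, candidate)
    print(f"{baseline} -> {candidate}: wins={s['wins']} losses={s['losses']} "
          f"ties={s['ties']} mean_delta={s['mean_delta']:+.3f}")
```

With plain floating-point subtraction, `0.80 - 0.70` lands slightly above 0.1 while `0.95 - 0.85` lands slightly below it. That is one plausible reason the sample output reports the first as a win and the second as a tie even though both print as +0.10.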

### Baseline Regression Check

```bash
agentv compare combined-results.jsonl --baseline gpt-4.1
# Exits 1 if any target regresses vs gpt-4.1
```

### Pairwise from Combined JSONL

```bash
agentv compare combined-results.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
```

```
Comparing: gpt-4.1 → gpt-5-mini

Test ID Baseline Candidate Delta Result
─────────────── ──────── ───────── ──────── ────────
greeting 0.85 0.95 +0.10 = tie
code-generation 0.80 0.75 -0.05 = tie
summarization 0.90 0.80 -0.10 = tie

Summary: 0 wins, 0 losses, 3 ties | Mean Δ: -0.017 | Status: regressed
```

### Two-File Pairwise (legacy)

```bash
agentv compare baseline-results.jsonl candidate-results.jsonl
```
@@ -50,38 +99,39 @@ agentv compare baseline-results.jsonl candidate-results.jsonl --threshold 0.05
For machine-readable output (CI pipelines, scripts):

```bash
agentv compare baseline-results.jsonl candidate-results.jsonl --json
agentv compare combined-results.jsonl --json
```

Output uses snake_case for Python ecosystem compatibility:

```json
{
"matched": [
{"test_id": "code-review-001", "score1": 0.72, "score2": 0.88, "delta": 0.16, "outcome": "win"}
"matrix": [
{"test_id": "code-generation", "scores": {"gemini-3-flash-preview": 0.7, "gpt-4.1": 0.8, "gpt-5-mini": 0.75}}
],
"pairwise": [
{"baseline": "gemini-3-flash-preview", "candidate": "gpt-4.1", "summary": {"wins": 1, "losses": 0, "ties": 2, "mean_delta": 0.033}}
],
"unmatched": {"file1": 0, "file2": 0},
"summary": {
"total": 10,
"matched": 5,
"wins": 1,
"losses": 0,
"ties": 4,
"mean_delta": 0.054
}
"targets": ["gemini-3-flash-preview", "gpt-4.1", "gpt-5-mini"]
}
```
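
A script consuming the combined `--json` output might look like the sketch below. It assumes the `matrix` / `pairwise` / `targets` shape from the example, with values copied from it; the helper names `best_target_per_test` and `negative_pairs` are hypothetical and not part of AgentV.

```python
import json

# Sample payload in the combined-mode shape; values copied from the example.
SAMPLE = """
{
  "matrix": [
    {"test_id": "code-generation",
     "scores": {"gemini-3-flash-preview": 0.7, "gpt-4.1": 0.8, "gpt-5-mini": 0.75}}
  ],
  "pairwise": [
    {"baseline": "gemini-3-flash-preview", "candidate": "gpt-4.1",
     "summary": {"wins": 1, "losses": 0, "ties": 2, "mean_delta": 0.033}}
  ],
  "targets": ["gemini-3-flash-preview", "gpt-4.1", "gpt-5-mini"]
}
"""


def best_target_per_test(report):
    """Map each test_id to the target with the highest score in the matrix."""
    return {
        row["test_id"]: max(row["scores"], key=row["scores"].get)
        for row in report["matrix"]
    }


def negative_pairs(report):
    """List (baseline, candidate, mean_delta) for pairs whose mean delta is below zero."""
    return [
        (p["baseline"], p["candidate"], p["summary"]["mean_delta"])
        for p in report["pairwise"]
        if p["summary"]["mean_delta"] < 0
    ]


report = json.loads(SAMPLE)
print(best_target_per_test(report))
print(negative_pairs(report))
```

In a pipeline, the same functions could read the report from standard input with `json.load(sys.stdin)` after piping in the output of `agentv compare combined-results.jsonl --json`.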

## Exit Codes

- `0` - Candidate is equal or better (meanDelta >= 0)
- `1` - Baseline is better (regression detected)
| Mode | Exit Code |
|---|---|
| Two-file pairwise | Exit 1 on regression (meanDelta < 0) |
| Combined with `--baseline` | Exit 1 if any target regresses vs baseline |
| Combined without `--baseline` | Exit 0 (informational) |
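
The table can be read as a small decision rule. The sketch below is illustrative only: the mode labels and the function `compare_exit_code` are hypothetical, and it assumes that "regresses" in combined mode means a negative mean delta against the baseline, mirroring the two-file rule.

```python
def compare_exit_code(mode: str, mean_deltas: list[float]) -> int:
    """Return the exit code the table describes for a comparison run.

    mode: "two-file", "combined-baseline", or "combined" (hypothetical labels).
    mean_deltas: one mean delta per candidate, measured against the baseline.
    """
    if mode == "combined":
        # No baseline designated: informational output, never fails.
        return 0
    if mode in ("two-file", "combined-baseline"):
        # Assumption: any negative mean delta counts as a regression.
        return 1 if any(d < 0 for d in mean_deltas) else 0
    raise ValueError(f"unknown mode: {mode}")


print(compare_exit_code("two-file", [-0.017]))
print(compare_exit_code("combined-baseline", [0.033, 0.0]))
print(compare_exit_code("combined", [-0.017]))
```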

## CI Integration

Use exit codes for automated quality gates:

```bash
# Fail CI if candidate regresses
# N-way: fail if any target regresses vs baseline
agentv compare results.jsonl --baseline gpt-4.1 || { echo "Regression detected!"; exit 1; }

# Two-file: fail if candidate regresses
agentv compare baseline.jsonl candidate.jsonl || { echo "Regression detected!"; exit 1; }
```