Overhaul judge and criteria for E2E testing with CLI agent reviewers

Major changes:

Judge: Replaced CodebuffClient SDK-based LLM judges with real CLI coding
agents (Claude Code, Codex, Gemini) that run IN the repo. Reviewer agents
can build, run tests, start the dev server, use browser tools, curl
endpoints, and check logs: actual E2E verification, not just diff reading.
Structured output via result file (evalbuff-review-result.json) with
fallback to stdout JSON extraction.

Criteria: Shifted from code style (correctness, completeness, pattern
consistency, fluency) to E2E verification levels:
- L1: Builds, existing tests pass, basic completeness
- L2: Feature works E2E (browser/curl/client), logs clean
- L3: Edge cases & error states tested E2E, UI verification
- L4: Cross-component integration, performance, no regressions
- L5: Production readiness (migrations, env vars, error recovery)

Orchestrator: Judge now runs inside withTestRepo callback so reviewer
agents have access to the live repo. CodebuffClient is now used only for
the doc writer (analyzeFailure). Added --reviewers CLI flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
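
The "result file with fallback to stdout JSON extraction" behaviour described above could look roughly like the sketch below. This is an illustration only, not the code from this commit: the `ReviewResult` shape and the names `resolveReviewResult` and `extractFromStdout` are hypothetical, and the caller is assumed to have already read evalbuff-review-result.json into a string (or `null` if the reviewer never wrote it).

```typescript
// Hypothetical result shape; the real schema is not shown in this commit.
interface ReviewResult {
  overallScore: number
  notes: string
}

function isReviewResult(value: unknown): value is ReviewResult {
  if (typeof value !== 'object' || value === null) return false
  const v = value as Record<string, unknown>
  return typeof v.overallScore === 'number' && typeof v.notes === 'string'
}

// Parse text as JSON and accept it only if it has the expected shape.
function tryParse(text: string): ReviewResult | null {
  try {
    const parsed: unknown = JSON.parse(text)
    return isReviewResult(parsed) ? parsed : null
  } catch {
    return null
  }
}

// Fallback: try every '{' ... '}' span in stdout, widest first for each
// starting brace, and return the first one that parses as a valid result.
function extractFromStdout(stdout: string): ReviewResult | null {
  for (
    let start = stdout.indexOf('{');
    start !== -1;
    start = stdout.indexOf('{', start + 1)
  ) {
    for (
      let end = stdout.lastIndexOf('}');
      end > start;
      end = stdout.lastIndexOf('}', end - 1)
    ) {
      const result = tryParse(stdout.slice(start, end + 1))
      if (result) return result
    }
  }
  return null
}

// Prefer the result file; fall back to stdout when it is missing or invalid.
function resolveReviewResult(
  fileContents: string | null,
  stdout: string,
): ReviewResult | null {
  if (fileContents !== null) {
    const fromFile = tryParse(fileContents)
    if (fromFile) return fromFile
  }
  return extractFromStdout(stdout)
}
```

Validating the shape before accepting a candidate matters here because agent stdout often contains other brace-delimited text (log lines, code snippets) that should not be mistaken for the review.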
-        'The code compiles, runs without errors, and produces the expected behavior.',
+        'The code compiles, builds, and the project starts without errors. Run the build command and verify it succeeds.',
     },
     {
-      name: 'Completeness',
+      name: 'Existing Tests Pass',
       weight: 3,
       description:
-        'All aspects of the prompt are addressed. No partial implementations or TODO comments.',
+        'All pre-existing tests still pass. Run the test suite and confirm no regressions were introduced.',
     },
     {
-      name: 'Basic Style',
-      weight: 1,
+      name: 'Basic Completeness',
+      weight: 2,
       description:
-        'Code follows basic formatting conventions and is readable.',
+        'All aspects of the prompt are addressed. No partial implementations or TODO comments left behind.',
     },
   ],
   2: [
     {
-      name: 'Pattern Consistency',
-      weight: 2,
+      name: 'Feature Works E2E',
+      weight: 4,
       description:
-        'New code follows the same patterns, naming conventions, and architectural style as existing code in the codebase.',
+        'The new feature or bug fix actually works when you use the application. Start the app, navigate to the relevant page or endpoint, and exercise the feature. Use browser tools, curl, or the appropriate client to verify the happy path end-to-end.',
+    },
+    {
+      name: 'Logs & Observability',
+      weight: 1,
+      description:
+        'Check application logs for errors, warnings, or stack traces during E2E testing. Verify no unexpected errors appear when exercising the feature.',
     },
   ],
   3: [
     {
-      name: 'Test Quality',
+      name: 'Edge Cases & Error States',
+      weight: 3,
+      description:
+        'Test error states and edge cases E2E. Submit invalid inputs, trigger error conditions, test boundary values. Verify the app handles them gracefully without crashing.',
+    },
+    {
+      name: 'UI/UX Verification',
       weight: 2,
       description:
-        'Tests are meaningful, cover edge cases, and test behavior rather than implementation details.',
+        'For UI changes: visually verify the rendered output. Check layout, responsiveness, and that the UI matches expectations. Take screenshots to document.',
     },
   ],
   4: [
     {
-      name: 'Optimal Design',
+      name: 'Cross-Component Integration',
       weight: 2,
       description:
-        'Code is DRY, uses the right abstractions, and the diff is minimal — no unnecessary changes.',
+        'Verify the change works correctly with related features. Test flows that cross component boundaries. If a backend change was made, verify the frontend still works. If a DB migration was added, verify queries work.',
+    },
+    {
+      name: 'Performance & No Regressions',
+      weight: 2,
+      description:
+        'Verify no performance regressions. Check page load times, API response times, or resource usage. Ensure the change does not break unrelated features.',
     },
   ],
   5: [
     {
-      name: 'Fluency',
-      weight: 1,
+      name: 'Production Readiness',
+      weight: 2,
       description:
-        'Code reads like a senior engineer wrote it. Idiomatic usage of the language and framework. No over-engineering.',
+        'Full production readiness check. Verify migrations, environment variable handling, error recovery, and graceful degradation. The change should be safe to deploy.',
     },
   ],
 }
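
To make the level table in this hunk easier to picture in use, here is a minimal sketch of selecting the criteria that apply at a given level. It assumes that levels are cumulative (a level-2 review also checks the level-1 criteria), which the diff suggests but does not show. `CRITERIA_BY_LEVEL` and `criteriaUpToLevel` are hypothetical names, the descriptions are shortened, and the first level-1 entry is omitted because its name and weight do not appear in the diff.

```typescript
interface Criterion {
  name: string
  weight: number
  description: string
}

// Abbreviated, hypothetical copy of the new table (levels 1 and 2 only).
const CRITERIA_BY_LEVEL: Record<number, Criterion[]> = {
  1: [
    { name: 'Existing Tests Pass', weight: 3, description: 'All pre-existing tests still pass.' },
    { name: 'Basic Completeness', weight: 2, description: 'All aspects of the prompt are addressed.' },
  ],
  2: [
    { name: 'Feature Works E2E', weight: 4, description: 'The feature works when you use the application.' },
    { name: 'Logs & Observability', weight: 1, description: 'No unexpected errors appear in the logs.' },
  ],
}

// Assumption: criteria accumulate, so level N includes every level <= N.
function criteriaUpToLevel(level: number): Criterion[] {
  return Object.keys(CRITERIA_BY_LEVEL)
    .map(Number)
    .filter((l) => l <= level)
    .sort((a, b) => a - b)
    .reduce<Criterion[]>((acc, l) => acc.concat(CRITERIA_BY_LEVEL[l] ?? []), [])
}

console.log(criteriaUpToLevel(2).map((c) => `${c.name} (weight ${c.weight})`))
```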
@@ -122,13 +140,13 @@ export function maybePromoteCriteria(
 }

 /**
- * Format criteria as text for injection into judge prompts.
+ * Format criteria as text for injection into reviewer agent prompts.
-    'Apply these additional quality criteria when scoring. Higher levels add stricter standards:',
+    'You MUST verify each of these criteria. Higher levels require deeper E2E testing:',
     '',
   ]

@@ -138,7 +156,9 @@ export function formatCriteriaForPrompt(criteria: QualityCriteria): string {

   lines.push(
     '',
-    'Weight these criteria proportionally when computing scores. A violation of a high-weight criterion should have a bigger impact on the score than a low-weight one.',
+    'For each criterion, describe what you tested and what you observed. If you cannot test a criterion (e.g., no UI for a backend change), note that and explain why.',
+    '',
+    'Weight these criteria proportionally when computing scores. A failure on a high-weight criterion should have a bigger impact on the score than a low-weight one.',
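
The prompt text above asks reviewers to weight criteria proportionally. One plausible reading is a weighted average, sketched below. This is an assumption about how the instruction might be applied, not the scoring code from this repository: `CriterionScore`, `weightedScore`, and the 0 to 10 scale are all hypothetical. The weights (4 and 1) are the level-2 weights from the diff.

```typescript
interface CriterionScore {
  name: string
  weight: number
  score: number // hypothetical 0-10 scale
}

// Weighted average: each criterion contributes in proportion to its weight.
function weightedScore(results: CriterionScore[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0)
  if (totalWeight === 0) return 0
  const weightedSum = results.reduce((sum, r) => sum + r.weight * r.score, 0)
  return weightedSum / totalWeight
}

// The same pair of raw scores (2 and 10), assigned to different criteria.
const featureBroken = weightedScore([
  { name: 'Feature Works E2E', weight: 4, score: 2 },
  { name: 'Logs & Observability', weight: 1, score: 10 },
])
const logsNoisy = weightedScore([
  { name: 'Feature Works E2E', weight: 4, score: 10 },
  { name: 'Logs & Observability', weight: 1, score: 2 },
])

console.log(featureBroken.toFixed(1), logsNoisy.toFixed(1)) // prints "3.6 8.4"
```

Failing the weight-4 criterion pulls the result down to 3.6, while failing the weight-1 criterion leaves it at 8.4, which is the proportional effect the prompt describes.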