test(evals): fix anomalies and expand eval coverage by dzianisv · Pull Request #111 · dzianisv/opencode-plugins

dzianisv · 2026-02-15T09:58:46Z

Summary

Fix 4 anomalies in existing eval assertions that were misleading or too loose
Add 19 new eval test cases across all 3 eval configs (judge, stuck, compression)
All evals pass: judge 31/31, stuck 18/18, compression 17/17
Unit tests: 319 passed, 5 skipped
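
The totals above can be cross-checked against the per-suite figures given in this description. A minimal sketch, in which the `suites` object and its field names are only illustrative bookkeeping and not part of the eval configs:

```javascript
// Per-suite counts as stated in this PR description.
// The object layout is an assumption made for illustration.
const suites = {
  judge: { before: 23, added: 8, after: 31 },
  stuck: { before: 12, added: 6, after: 18 },
  compression: { before: 12, added: 5, after: 17 },
};

// Sum of newly added cases across all suites.
const totalAdded = Object.values(suites).reduce((sum, s) => sum + s.added, 0);

// Every suite's "after" count should equal "before" plus "added".
const consistent = Object.values(suites).every(
  (s) => s.before + s.added === s.after
);

console.log(totalAdded, consistent); // 19 true
```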

Anomalies Fixed

| File | Issue |
|---|---|
| `promptfooconfig.yaml` #19 | Description said "COMPLETE" but assertion expected `complete === false` |
| `stuck-detection.yaml` "Task finished" | Loose assertion accepted `reason: "working"` as passing |
| `stuck-detection.yaml` "Very short delay" | Tautological `stuck === false \|\| shouldNudge === false` always passed |
| `post-compression.yaml` #2/#3 | Accepted `continue_task` when `needs_github_update` was the correct answer |
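
To illustrate why two of these assertions were too permissive, here is a minimal sketch. The field names `stuck` and `shouldNudge` and the values `continue_task` and `needs_github_update` come from the table; the output object shapes, the `action` field name, and the tightened variants are assumptions for illustration and may differ from the actual fixes in this PR.

```javascript
// Hypothetical detector outputs for a "Very short delay" case,
// where the expected verdict is: not stuck, no nudge.
const correct = { stuck: false, shouldNudge: false };
const wrong = { stuck: true, shouldNudge: false }; // misclassified as stuck

// Loose form quoted in the table: passes if EITHER field is false.
const looseAssert = (o) => o.stuck === false || o.shouldNudge === false;

// One possible tightening (assumed, not confirmed by the PR): require BOTH.
const strictAssert = (o) => o.stuck === false && o.shouldNudge === false;

// Same idea for the post-compression case: an allow-list that includes a
// wrong answer versus an exact match. The `action` field name is assumed.
const looseAction = (o) =>
  ["continue_task", "needs_github_update"].includes(o.action);
const strictAction = (o) => o.action === "needs_github_update";

console.log(looseAssert(wrong), strictAssert(wrong)); // true false
console.log(looseAssert(correct), strictAssert(correct)); // true true
console.log(
  looseAction({ action: "continue_task" }),
  strictAction({ action: "continue_task" })
); // true false
```

The loose forms cannot fail on the misclassified output, so they give no signal about regressions; the strict forms still pass on the correct output.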

New Test Cases

Judge eval (8 new, 23→31 total):

Agent stopped mid-task with pending TODOs (scenario from issue #109, "Session stopped")
Agent claims fix but test output shows warnings
Agent in retry/fix loop (3 attempts, still failing)
Partial implementation (created files but didn't wire up)
Gold-plating but task completed — COMPLETE
Context window exhaustion (agent says "continuing" then stops)
Agent forgot to run requested tests
Agent commits directly to main violating workflow rules

Stuck detection (6 new, 12→18 total):

Retry loop (same failed command repeated)
Long-running build (150s) — NOT stuck, just slow
Incomplete message with short delay
Planning-only tokens, no action taken
Rate-limited agent — should NOT be nudged
Stuck agent must NOT be classified as complete
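
The negative expectations in this list ("NOT stuck", "should NOT be nudged", "must NOT be classified as complete") can be read as predicates over the detector output. The sketch below is hypothetical: the output shape `{ stuck, shouldNudge, reason }`, the `"complete"` reason value, and the sample outputs are assumptions, not taken from `stuck-detection.yaml`.

```javascript
// Each entry pairs a scenario from the list above with a predicate over a
// hypothetical detector output { stuck, shouldNudge, reason }.
const expectations = [
  { name: "Long-running build (150s)", check: (o) => o.stuck === false },
  { name: "Rate-limited agent", check: (o) => o.shouldNudge === false },
  {
    name: "Stuck agent not complete",
    // "complete" as a reason value is an assumption.
    check: (o) => o.stuck === true && o.reason !== "complete",
  },
];

// Returns the names of scenarios whose output violates its predicate.
function failures(outputsByName) {
  return expectations
    .filter((e) => !e.check(outputsByName[e.name]))
    .map((e) => e.name);
}

// Made-up sample outputs: the third one wrongly reports completion.
const sample = {
  "Long-running build (150s)": { stuck: false, shouldNudge: false, reason: "working" },
  "Rate-limited agent": { stuck: false, shouldNudge: false, reason: "rate_limited" },
  "Stuck agent not complete": { stuck: true, shouldNudge: true, reason: "complete" },
};

console.log(failures(sample)); // [ 'Stuck agent not complete' ]
```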

Post-compression (5 new, 12→17 total):

PR with failing CI
Agent debugging test failures mid-stream
Multiple PRs open, agent working on one
Task blocked on missing secrets
Force-push scenario with rewritten history

Fix 4 anomalies in existing eval assertions:
- promptfooconfig.yaml #19: misleading description (said COMPLETE, asserted incomplete)
- stuck-detection.yaml 'Task finished': loose assertion allowed reason=working
- stuck-detection.yaml 'Very short delay': tautological assertion always passed
- post-compression.yaml #2/#3: accepted continue_task when needs_github_update correct

Add 19 new eval test cases:
- 8 judge eval cases (23→31): mid-task stop, subtle warnings, retry loops, partial impl, gold-plating, context exhaustion, missing tests, main push
- 6 stuck detection cases (12→18): retry loop, slow build, incomplete msg, planning-only, rate limited, stuck-not-complete
- 5 post-compression cases (12→17): failing CI, mid-debug, multi-PR, blocked on secrets, force-push

All evals pass: judge 31/31, stuck 18/18, compression 17/17.
Unit tests: 319 passed, 5 skipped.

dzianisv wants to merge 1 commit into main from eval/expand-coverage
