test(evals): fix anomalies and expand eval coverage#111

Open
dzianisv wants to merge 1 commit into main from eval/expand-coverage
Summary

  • Fix 4 anomalies in existing eval assertions that were misleading or too loose
  • Add 19 new eval test cases across all 3 eval configs (judge, stuck, compression)
  • All evals pass: judge 31/31, stuck 18/18, compression 17/17
  • Unit tests: 319 passed, 5 skipped

Anomalies Fixed

File                                     Issue
promptfooconfig.yaml #19                 Description said "COMPLETE" but the assertion expected complete === false
stuck-detection.yaml "Task finished"     Loose assertion accepted reason: "working" as passing
stuck-detection.yaml "Very short delay"  Tautological assertion (stuck === false || shouldNudge === false) always passed
post-compression.yaml #2/#3              Accepted continue_task when needs_github_update was the correct answer
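
The "Very short delay" row can be illustrated with a small sketch. The PR does not show the replacement assertion, so the conjunction below is an assumption, and the `StuckOutput` shape is hypothetical (only the field names `stuck`, `shouldNudge`, and `reason` appear above). The sketch shows how the disjunction lets a self-contradictory output through, while the conjunction rejects it.

```typescript
// Hypothetical output shape; only the field names come from the PR text.
interface StuckOutput {
  stuck: boolean;
  shouldNudge: boolean;
  reason: string;
}

// Loose form quoted in the table: passes when either field is false.
const looseAssert = (o: StuckOutput): boolean =>
  o.stuck === false || o.shouldNudge === false;

// Assumed tightened form (not shown in the PR): both fields must be false.
const strictAssert = (o: StuckOutput): boolean =>
  o.stuck === false && o.shouldNudge === false;

// A self-contradictory output: flagged as stuck, yet no nudge requested.
const contradictory: StuckOutput = {
  stuck: true,
  shouldNudge: false,
  reason: "working",
};

console.log(looseAssert(contradictory));  // true
console.log(strictAssert(contradictory)); // false
```

The same pattern would apply to the other rows: an assertion that accepts more than one value for a field can pass on an answer the test description rules out.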

New Test Cases

Judge eval (8 new, 23→31 total):

  • Agent stopped mid-task with pending TODOs (scenario from issue #109, "Session stopped")
  • Agent claims fix but test output shows warnings
  • Agent in retry/fix loop (3 attempts, still failing)
  • Partial implementation (created files but didn't wire up)
  • Gold-plating but task completed — COMPLETE
  • Context window exhaustion (agent says "continuing" then stops)
  • Agent forgot to run requested tests
  • Agent commits directly to main violating workflow rules

Stuck detection (6 new, 12→18 total):

  • Retry loop (same failed command repeated)
  • Long-running build (150s) — NOT stuck, just slow
  • Incomplete message with short delay
  • Planning-only tokens, no action taken
  • Rate-limited agent — should NOT be nudged
  • Stuck agent must NOT be classified as complete
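
Three of the expectations in this list are stated explicitly and can be written as per-case checks. This is a minimal sketch, not the contents of stuck-detection.yaml: the `StuckVerdict` shape, the case names used as keys, and the reason strings (`"slow_build"`, `"rate_limited"`, `"no_progress"`, `"complete"`) are all hypothetical.

```typescript
// Hypothetical verdict shape; field names follow the assertions quoted in the PR.
interface StuckVerdict {
  stuck: boolean;
  shouldNudge: boolean;
  reason: string;
}

type Check = (v: StuckVerdict) => boolean;

// One check per explicitly stated expectation in the list above.
const expectations: Record<string, Check> = {
  "long-running build": (v) => v.stuck === false,
  "rate-limited agent": (v) => v.shouldNudge === false,
  "stuck agent": (v) => v.stuck === true && v.reason !== "complete",
};

// Returns the names of cases whose output is missing or fails its check.
function failedChecks(outputs: Record<string, StuckVerdict>): string[] {
  return Object.entries(expectations)
    .filter(([name, check]) => {
      const out = outputs[name];
      return out === undefined || !check(out);
    })
    .map(([name]) => name);
}

// Illustrative outputs only; these are not real eval results.
const sample: Record<string, StuckVerdict> = {
  "long-running build": { stuck: false, shouldNudge: false, reason: "slow_build" },
  "rate-limited agent": { stuck: false, shouldNudge: false, reason: "rate_limited" },
  "stuck agent": { stuck: true, shouldNudge: true, reason: "no_progress" },
};

console.log(failedChecks(sample)); // []
```

Keeping each check to a single conjunction of required values avoids the loose and tautological forms described under Anomalies Fixed.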

Post-compression (5 new, 12→17 total):

  • PR with failing CI
  • Agent debugging test failures mid-stream
  • Multiple PRs open, agent working on one
  • Task blocked on missing secrets
  • Force-push scenario with rewritten history
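
The counts quoted in this section can be cross-checked with a few lines of arithmetic. The numbers are taken from the headings above; the `suites` structure itself is only an illustrative layout.

```typescript
// Before/added/after counts as stated in the three headings above.
const suites = [
  { name: "judge", before: 23, added: 8, after: 31 },
  { name: "stuck", before: 12, added: 6, after: 18 },
  { name: "compression", before: 12, added: 5, after: 17 },
];

// Total number of new cases across the three configs.
const totalAdded = suites.reduce((sum, s) => sum + s.added, 0);

// Each suite's stated total should equal its old count plus the new cases.
const consistent = suites.every((s) => s.before + s.added === s.after);

console.log(totalAdded, consistent); // 19 true
```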

Fix 4 anomalies in existing eval assertions:
- promptfooconfig.yaml #19: misleading description (said COMPLETE, asserted incomplete)
- stuck-detection.yaml 'Task finished': loose assertion allowed reason=working
- stuck-detection.yaml 'Very short delay': tautological assertion always passed
- post-compression.yaml #2/#3: accepted continue_task when needs_github_update was correct

Add 19 new eval test cases:
- 8 judge eval cases (23→31): mid-task stop, subtle warnings, retry loops,
  partial impl, gold-plating, context exhaustion, missing tests, main push
- 6 stuck detection cases (12→18): retry loop, slow build, incomplete msg,
  planning-only, rate limited, stuck-not-complete
- 5 post-compression cases (12→17): failing CI, mid-debug, multi-PR,
  blocked on secrets, force-push

All evals pass: judge 31/31, stuck 18/18, compression 17/17.
Unit tests: 319 passed, 5 skipped.