Goal
Add tasks where the obvious approach is wrong, testing genuine reasoning over pattern matching.
Task Types
Red Herring Tasks
- Provide irrelevant but distracting information
- Include an "obvious" solution that fails on edge cases
- Frame the context so it nudges toward the wrong approach
Edge Case Gauntlets
- Off-by-one scenarios in dates/times/counting
- Boundary conditions (empty lists, single items, max values)
- Unicode/encoding edge cases
- Timezone handling across DST boundaries
Inherited Mess Tasks (Recovery-Bench style)
- Workspace containing prior failed attempts that need cleanup
- Broken state that agent must diagnose before fixing
- Conflicting partial solutions left behind
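One way to build such a workspace is a fixture generator; this is a hypothetical sketch (all file names and contents are assumptions, not a fixed spec):

```python
from pathlib import Path

def make_inherited_mess(root: Path) -> None:
    """Seed a workspace with debris a prior failed attempt might leave:
    a buggy solution, an abandoned rewrite, and a conflicting note."""
    (root / "solution.py").write_text(
        "def solve(xs):\n    return len(xs) - 1  # off-by-one, never fixed\n"
    )
    (root / "solution_v2.py").write_text(
        "def solve(xs):\n    raise NotImplementedError  # abandoned rewrite\n"
    )
    (root / "NOTES.txt").write_text(
        "TODO: v2 conflicts with solution.py -- pick one and delete the other\n"
    )
```

The agent must diagnose which artifact is canonical before fixing anything, which is the Recovery-Bench-style behavior being tested.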
Specific Task Ideas
- The Misleading Log: Error message points to wrong root cause
- Off-by-One Gauntlet: 5 date/time operations where edges matter
- The Cleanup Job: Previous agent left half-done work with bugs
- The Obvious Trap: Task where copy-paste solution from docs fails
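A concrete instance of the Off-by-One Gauntlet, sketched in Python (the inclusive/exclusive framing is one assumed variant of the task):

```python
from datetime import date

a = date(2024, 1, 1)
b = date(2024, 12, 31)

# Obvious answer: (b - a).days. But that is the exclusive difference,
# while "how many days does 2024 span, inclusive?" needs one more --
# and 2024 is a leap year, so the right answer is 366, not 365.
exclusive = (b - a).days       # 365
inclusive = (b - a).days + 1   # 366
```

The copy-paste `(b - a).days` idiom from docs and tutorials silently answers the exclusive question, so a grader can use the 365/366 split as the binary trap signal.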
Grading
- Binary: did they avoid the trap?
- Bonus: did they explain why the obvious approach fails?
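The two-tier grading could be sketched as below; this is a hypothetical rubric function (the string-matching check and field names are assumptions, not a committed design):

```python
def grade(answer: str, trap_value: str, correct_value: str,
          explained: bool) -> dict:
    """Binary pass if the answer contains the correct value and avoids
    the trap value; bonus flag if the model also explained the failure
    mode of the obvious approach."""
    passed = correct_value in answer and trap_value not in answer
    return {"passed": passed, "bonus": passed and explained}
```

Example: `grade("366, because 2024 is a leap year", "365", "366", explained=True)` passes with bonus, while an answer of "365" fails outright.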
Success Criteria
- Traps should catch >30% of models
- Tasks should differentiate reasoning vs. pattern matching
References
- ARC-AGI design philosophy
- Recovery-Bench: evaluating error recovery