Goal
Add tasks that require composing three or more tools and following multi-step reasoning chains. Current tasks often involve only a single tool invocation.
Task Ideas
Information → Analysis → Action Chains
- Research competitor pricing → analyze trends → draft summary email → create calendar follow-up
- Read meeting notes → extract action items → create tasks → send assignments
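A chain like the ones above could be written down as an ordered list of steps, each tagged with the tool it needs, so the "3+ tool compositions" goal can be checked mechanically. This is only a sketch: the `Step` type, the tool names, and the reading of "3+" as "at least three distinct tools" are assumptions, not part of any existing task format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    description: str
    tool: str  # hypothetical tool name, for illustration only


def distinct_tools(steps):
    """Return the set of tool names a chain touches."""
    return {step.tool for step in steps}


def is_multi_tool(steps, minimum=3):
    """True if the chain needs at least `minimum` distinct tools (assumed threshold)."""
    return len(distinct_tools(steps)) >= minimum


# The meeting-notes chain from the list above, with made-up tool names.
meeting_task = [
    Step("Read meeting notes", "file_reader"),
    Step("Extract action items", "text_analysis"),
    Step("Create tasks", "task_tracker"),
    Step("Send assignments", "email"),
]
```

Here `is_multi_tool(meeting_task)` holds, while a chain cut down to its first step does not meet the threshold.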
Error Propagation Sensitivity
- Tasks where a mistake in step 2 requires recognizing and correcting it in step 5
- Multi-file refactoring where changes must stay consistent
Cross-Artifact Consistency
- Update codebase + documentation + changelog + tests consistently
- Modify config in multiple places that must stay in sync
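For the config case, a grader might compare the value of each setting across every place it appears and flag any that disagree. A minimal sketch, assuming the settings have already been parsed into a nested dict; the file names and keys are hypothetical:

```python
def find_out_of_sync(settings):
    """Return the settings whose values differ between locations.

    `settings` maps a setting name to {location: value}. Values are
    assumed to be hashable (strings, numbers, booleans).
    """
    return {
        name: locations
        for name, locations in settings.items()
        if len(set(locations.values())) > 1
    }


# Hypothetical parsed state after an agent edited only one of two files.
observed = {
    "timeout_seconds": {"app.yaml": 30, "deploy.env": 30},
    "log_level": {"app.yaml": "info", "deploy.env": "debug"},
}
out_of_sync = find_out_of_sync(observed)
```

An empty result would mean every copy of every setting agrees; in this example only `log_level` is reported.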
Specific Task Proposals
- Dependency Resolution: Given a broken project with circular dependencies and version conflicts, resolve both so the project works again
- Cross-Artifact Update: Change a function signature and update all callers, tests, and docs
- Iterative Refinement: Write code → run tests → fix failures → run linter → fix style → verify no regressions
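For the Dependency Resolution proposal, one possible check is that the dependency graph no longer contains a cycle after the fix. The sketch below uses a standard depth-first search; the graph format (package name mapped to the packages it depends on) and the package names are assumptions, and version conflicts are not covered.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []  # nodes on the current DFS path, in visit order

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in graph.get(node, ()):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical project before and after the fix.
broken = {"app": ["core"], "core": ["utils"], "utils": ["core"]}
fixed = {"app": ["core"], "core": ["utils"], "utils": []}
```

`find_cycle(broken)` reports the `core` → `utils` → `core` loop, and `find_cycle(fixed)` returns `None`. The recursion is fine for small test projects; a very deep graph would need an iterative version.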
Success Criteria
- The average task should require 5+ distinct tool uses
- Multi-step tasks should have lower pass rates than single-step tasks
- Tasks should expose planning and sequencing failures
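The first two criteria could be measured from run logs roughly as follows. This is a sketch under assumptions: the record fields (`tools_used`, `multi_step`, `passed`) are hypothetical, "distinct tool uses" is read as the number of unique tool names per task, and the sample records are made up for illustration, not real results.

```python
from statistics import mean


def mean_distinct_tools(results):
    """Average number of unique tool names used per task."""
    return mean(len(set(r["tools_used"])) for r in results)


def pass_rate(results, multi_step):
    """Fraction of passing tasks in the multi-step or single-step group."""
    group = [r for r in results if r["multi_step"] == multi_step]
    if not group:
        return None  # no tasks of this kind were run
    return sum(r["passed"] for r in group) / len(group)


# Illustrative records only.
results = [
    {"tools_used": ["search"], "multi_step": False, "passed": True},
    {"tools_used": ["search", "email"], "multi_step": False, "passed": True},
    {"tools_used": ["read", "analyze", "tasks", "email", "calendar"],
     "multi_step": True, "passed": True},
    {"tools_used": ["edit", "test", "lint", "edit", "test", "docs"],
     "multi_step": True, "passed": False},
]
```

Comparing `pass_rate(results, True)` against `pass_rate(results, False)` gives the gap between multi-step and single-step tasks, and `mean_distinct_tools(results)` can be compared against the 5+ target.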
References
- METR research: success drops sharply for hour-long tasks
- SWE-bench Pro: 4+ file changes are significantly harder