Goal
Test the ability to manage, navigate, and selectively retrieve information from large contexts. Many current tasks provide all of their information upfront in digestible chunks.
Capabilities to Test
- Selective retrieval: Find relevant info in large documents
- State maintenance: Track information across many turns
- Multi-source synthesis: Combine info from multiple documents
Task Ideas
Long Document Tasks
- 50-page meeting transcript → extract action items for ONE specific attendee
- Large codebase (10+ files) → find and fix a bug described only by its symptoms
- Legal document → answer specific questions requiring cross-referencing sections
Multi-Document Synthesis
- Combine info from 5+ related documents into coherent analysis
- Reconcile conflicting information across sources
- Build timeline from scattered references
State Tracking
- Multi-turn conversation requiring recall of earlier details
- Incremental updates to a complex data structure
- Long debugging session requiring memory of what was tried
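One way to make the "incremental updates" idea gradable: generate a sequence of update operations, drip them into the conversation over many turns, and compare the model's final answer against a reference state computed programmatically. A minimal sketch (the `apply_updates` helper and its dotted-path update format are hypothetical, not part of any existing harness):

```python
def apply_updates(state: dict, updates: list[tuple[str, str, object]]) -> dict:
    """Apply (op, dotted_path, value) updates in order to a nested dict.

    The grader runs this to produce the ground-truth final state, then
    compares it against the state the model reports after all turns.
    """
    for op, path, value in updates:
        *parents, key = path.split(".")
        node = state
        for p in parents:
            node = node.setdefault(p, {})  # create intermediate levels as needed
        if op == "set":
            node[key] = value
        elif op == "delete":
            node.pop(key, None)
    return state
```

Each turn of the conversation would deliver a few of these updates in prose; only a model that tracked every earlier turn can reproduce the final structure.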
Implementation Notes
- Use workspace_files to provide large documents
- Ensure the relevant info is buried mid-document, never at the start
- Include plausible distractors
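The burying and distractor requirements can be enforced at generation time. A minimal sketch, assuming a simple paragraph-based document builder (`build_haystack` and its parameters are illustrative names, not an existing tool):

```python
import random

def build_haystack(needle: str, distractors: list[str], filler: list[str],
                   n_paragraphs: int = 200, seed: int = 0) -> str:
    """Assemble a long document with the needle buried mid-document,
    distractors scattered elsewhere, and filler everywhere else."""
    rng = random.Random(seed)  # seeded for reproducible task generation
    paragraphs = [rng.choice(filler) for _ in range(n_paragraphs)]
    # Place the needle in the middle 60% of the document, never near the start.
    needle_pos = rng.randrange(int(n_paragraphs * 0.2), int(n_paragraphs * 0.8))
    paragraphs[needle_pos] = needle
    # Scatter plausible distractors at distinct positions away from the needle.
    open_slots = [i for i in range(n_paragraphs) if i != needle_pos]
    for d, pos in zip(distractors, rng.sample(open_slots, len(distractors))):
        paragraphs[pos] = d
    return "\n\n".join(paragraphs)
```

The resulting text can then be written out via workspace_files; the seed makes each task instance reproducible for grading.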
Success Criteria
- Tasks should require >10K tokens of context navigation
- Models should fail if they can't selectively attend
- Synthesis tasks should differentiate summarization quality
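The >10K-token floor can be checked automatically when assembling a task. A minimal sketch using a rough chars-per-token heuristic (the ~4-characters-per-token ratio is a common rule of thumb for English prose, not exact; a real tokenizer should replace it in production):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def meets_context_floor(documents: list[str], floor: int = 10_000) -> bool:
    """Check that the combined task context clears the token floor."""
    return sum(estimate_tokens(d) for d in documents) > floor
```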
References
- Long-context benchmarks (RULER, LongBench)
- Real-world analogue: analysts routinely work with large document sets