Add long-context tasks requiring selective retrieval #336

@ScuttleBot

Description

Goal

Test the ability to manage, navigate, and selectively retrieve information from large contexts. Many current tasks hand the model all relevant information upfront in digestible chunks.

Capabilities to Test

  1. Selective retrieval: Find relevant info in large documents
  2. State maintenance: Track information across many turns
  3. Multi-source synthesis: Combine info from multiple documents

Task Ideas

Long Document Tasks

  • 50-page meeting transcript → extract action items for ONE specific attendee
  • Large codebase (10+ files) → find and fix bug described only by symptoms
  • Legal document → answer specific questions requiring cross-referencing sections

Multi-Document Synthesis

  • Combine info from 5+ related documents into coherent analysis
  • Reconcile conflicting information across sources
  • Build timeline from scattered references
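The "build timeline from scattered references" idea could be generated programmatically. A minimal sketch, with invented events and a simple round-robin scatter (the real task generator would presumably use richer prose):

```python
import random

# Hypothetical generator: each source document mentions a few dated events
# in prose, and the grading key is the chronologically ordered event list.
# All events below are invented for illustration.
EVENTS = [
    ("2019-03-12", "Initial prototype demoed"),
    ("2020-01-08", "Series A closed"),
    ("2020-11-30", "v1.0 shipped"),
    ("2021-06-15", "EU launch"),
    ("2022-02-01", "Acquisition announced"),
]

def make_documents(events, n_docs=3, seed=0):
    """Scatter the events across n_docs documents in shuffled order."""
    rng = random.Random(seed)
    shuffled = events[:]
    rng.shuffle(shuffled)
    docs = [[] for _ in range(n_docs)]
    for i, (date, desc) in enumerate(shuffled):
        docs[i % n_docs].append(f"As noted in internal memos, on {date}: {desc}.")
    return ["\n".join(lines) for lines in docs]

def expected_timeline(events):
    # Grading key: events sorted chronologically (ISO dates sort lexically).
    return sorted(events)
```

Because no single document contains the full timeline, the model must synthesize across all of them rather than summarize one.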

State Tracking

  • Multi-turn conversation requiring recall of earlier details
  • Incremental updates to a complex data structure
  • Long debugging session requiring memory of what was tried
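For the "incremental updates to a complex data structure" idea, the grading key can be computed by replaying the updates. A minimal sketch with invented field names and a deliberately simple op set:

```python
# Hypothetical state-tracking task: the conversation applies one small
# update per turn to a shared record; the grader compares the model's
# answer against the final replayed state. Ops and fields are invented.
def apply_update(state, update):
    """Apply one turn's update: ('set', key, value) or ('delete', key)."""
    op = update[0]
    if op == "set":
        _, key, value = update
        state[key] = value
    elif op == "delete":
        _, key = update
        state.pop(key, None)
    else:
        raise ValueError(f"unknown op: {op!r}")
    return state

def final_state(updates):
    """Replay all updates from an empty record to get the grading key."""
    state = {}
    for u in updates:
        apply_update(state, u)
    return state

updates = [
    ("set", "owner", "alice"),
    ("set", "status", "open"),
    ("set", "owner", "bob"),   # a later turn overrides an earlier value
    ("delete", "status"),
]
# final_state(updates) == {"owner": "bob"}
```

Overrides and deletions are what make this discriminative: a model that only remembers the first mention of each field gets the wrong final state.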

Implementation Notes

  • Use workspace_files to provide large documents
  • Ensure relevant info is buried, not at the start
  • Include plausible distractors
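The burying-plus-distractors notes above could be implemented roughly as follows. This is a sketch only: it builds the document text (it does not show workspace_files itself), and the needle, distractors, and placement fraction are all invented:

```python
import random

# Hypothetical "bury the needle" builder: one relevant line is placed deep
# in a long transcript, surrounded by plausible distractor lines that
# mention other attendees. All names and content are invented.
def build_transcript(needle, distractors, n_lines=400, needle_pos=0.7, seed=1):
    """Fill the document with distractor lines, inserting the needle
    at roughly needle_pos of the way through (never at the start)."""
    rng = random.Random(seed)
    lines = [rng.choice(distractors) for _ in range(n_lines)]
    lines.insert(int(n_lines * needle_pos), needle)
    return "\n".join(lines)

needle = "ACTION (Dana): send the revised budget to finance by Friday."
distractors = [
    "ACTION (Sam): book the offsite venue.",
    "ACTION (Priya): update the onboarding doc.",
    "Discussion of quarterly metrics continued.",
]
transcript = build_transcript(needle, distractors)
```

Using other attendees' action items as distractors is the important part: string matching on "ACTION" alone finds hundreds of hits, so the model has to filter by attendee.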

Success Criteria

  • Tasks should require navigating >10K tokens of context
  • Models should fail if they can't selectively attend
  • Synthesis tasks should differentiate summarization quality
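The >10K-token criterion could be checked automatically when a task is registered. A real harness would use the target model's tokenizer; the sketch below approximates tokens from word count (tokens ≈ words / 0.75, a common rough heuristic) purely for illustration:

```python
# Rough gate for the ">10K tokens" success criterion. Swap in a real
# tokenizer for production use; the 0.75 words-per-token ratio is only
# an assumed approximation.
def approx_token_count(text):
    return int(len(text.split()) / 0.75)

def meets_context_requirement(documents, min_tokens=10_000):
    """True if the combined workspace documents exceed the token floor."""
    total = sum(approx_token_count(doc) for doc in documents)
    return total >= min_tokens

small = ["word " * 100]
large = ["word " * 5000, "word " * 5000]
# meets_context_requirement(small) -> False
# meets_context_requirement(large) -> True
```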

References

  • Long-context benchmarks (RULER, LongBench)
  • Real-world analog: analysts routinely work with large document sets
