Establish target metrics for task difficulty/disparity #340

@ScuttleBot

Description

Goal

Define quantitative targets for benchmark difficulty and implement monitoring to track them.

Target Metrics

Score Dispersion

  • Target: Standard deviation across models σ > 0.15
  • Current: Many tasks have σ < 0.10 (insufficient spread)

Ceiling Effects

  • Target: No task should have >85% pass rate across all tested models
  • Rationale: Tasks where everyone succeeds don't differentiate

Floor Effects

  • Target: No task should have <10% pass rate
  • Rationale: Tasks where everyone fails are uninformative

Discrimination

  • Target: Each task should have >0.3 point spread between best and median model
  • Rationale: Tasks should actually separate capabilities

Item Response Theory Integration

Consider IRT-based analysis:

  • Difficulty parameter (b): the point on the capability scale where the success probability reaches 50%
  • Discrimination parameter (a): how sharply the task separates ability levels around b

Ideal tasks:

  • b values spread across capability range
  • a values > 0.5 (tasks actually discriminate)
  • No negative discrimination
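For reference, the standard two-parameter logistic (2PL) IRT model that defines a and b can be written as a one-line function. This is the textbook formula, not project code:

```python
import math

def p_correct(theta, a, b):
    """2PL IRT model: probability that a model with ability theta passes a
    task with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the success probability is exactly 0.5 (the definition of b).
# Larger a makes the curve steeper around b; a < 0 (negative discrimination)
# means stronger models do *worse*, which is the pathology flagged above.
```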

Implementation

  1. Add metrics calculation to benchmark reporting
  2. Create dashboard/report showing per-task metrics
  3. Flag tasks that violate targets for review
  4. Track metrics over time as new models are added

Success Criteria

  • All target metrics should be tracked automatically
  • Tasks violating thresholds should be flagged for revision
  • Overall benchmark score dispersion should be measurable and improving over time

References

  • Item Response Theory for AI benchmarks
  • AllenAI fluid benchmarking approach
