Goal
Define quantitative targets for benchmark difficulty and implement monitoring to track them.
Target Metrics
Score Dispersion
- Target: Standard deviation across models σ > 0.15
- Current: Many tasks have σ < 0.10 (insufficient spread)
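The dispersion target can be computed directly from a per-task score matrix. A minimal sketch, assuming each task maps to a list of per-model scores in [0, 1] (the data shape and task names here are illustrative):

```python
import statistics

def score_dispersion(scores_by_task):
    """Population standard deviation of model scores for each task."""
    return {task: statistics.pstdev(scores)
            for task, scores in scores_by_task.items()}

# Illustrative scores in [0, 1], one entry per model.
scores = {
    "task_a": [0.90, 0.85, 0.95, 0.88],  # tight cluster -> low sigma
    "task_b": [0.20, 0.50, 0.80, 0.95],  # wide spread  -> high sigma
}
dispersion = score_dispersion(scores)
low_spread = [t for t, s in dispersion.items() if s < 0.15]  # flags "task_a"
```

Population (rather than sample) standard deviation is used here since the tested models are the whole population of interest; switch to `statistics.stdev` if the model set is treated as a sample.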
Ceiling Effects
- Target: No task should have >85% pass rate across all tested models
- Rationale: Tasks where everyone succeeds don't differentiate
Floor Effects
- Target: No task should have <10% pass rate
- Rationale: Tasks where everyone fails are uninformative
Discrimination
- Target: Each task should have >0.3 point spread between best and median model
- Rationale: Tasks should actually separate capabilities
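The ceiling, floor, and discrimination targets above can all be checked with simple threshold rules. A sketch, assuming "pass rate across all tested models" means the mean per-model pass rate, and "median model" means the sorted midpoint (both interpretations are assumptions, not fixed by the targets):

```python
def flag_task(pass_rates):
    """Return the list of target violations for one task.

    pass_rates: per-model pass rates in [0, 1] (hypothetical input shape).
    """
    flags = []
    mean_rate = sum(pass_rates) / len(pass_rates)
    if mean_rate > 0.85:
        flags.append("ceiling")   # nearly every model passes
    if mean_rate < 0.10:
        flags.append("floor")     # nearly every model fails
    ranked = sorted(pass_rates)
    median = ranked[len(ranked) // 2]
    if max(pass_rates) - median <= 0.3:
        flags.append("low-discrimination")  # best-vs-median spread too small
    return flags

flag_task([0.90, 0.92, 0.95, 0.97])  # ceiling + low-discrimination
flag_task([0.20, 0.50, 0.85, 0.40])  # no violations
```

Any task returning a non-empty flag list would be queued for the review step described under Implementation.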
Item Response Theory Integration
Consider IRT-based analysis:
- Difficulty parameter (b): Where on capability scale is 50% success?
- Discrimination parameter (a): How sharply does task separate levels?
Ideal tasks:
- b values spread across capability range
- a values > 0.5 (tasks actually discriminate)
- No negative discrimination
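The two-parameter logistic (2PL) model behind these parameters gives the probability that a model at ability θ solves a task with difficulty b and discrimination a. A minimal sketch (variable names follow the standard IRT notation above):

```python
import math

def p_correct(theta, a, b):
    """2PL item response curve: P(success) for ability theta on task (a, b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the success probability is exactly 50%, by construction.
# Larger a makes the curve steeper, i.e. the task separates ability
# levels more sharply; a < 0 would mean weaker models do *better*.
```

Fitting a and b from observed pass/fail data requires an estimation step (e.g. maximum likelihood over the model-by-task response matrix), which is out of scope for this sketch.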
Implementation
- Add metrics calculation to benchmark reporting
- Create dashboard/report showing per-task metrics
- Flag tasks that violate targets for review
- Track metrics over time as new models are added
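The "track over time" step could start as simply as appending each run's per-task metrics to a history and counting violations per run; the in-memory list and record layout below are placeholders for whatever store the reporting pipeline actually uses:

```python
import time

def record_run(history, task_metrics):
    """Append one benchmark run's per-task metrics to a history list."""
    history.append({"timestamp": time.time(), "metrics": task_metrics})

def dispersion_violations(history, threshold=0.15):
    """For each recorded run, count tasks below the dispersion target."""
    return [sum(1 for m in run["metrics"].values() if m["sigma"] < threshold)
            for run in history]

runs = []
record_run(runs, {"task_a": {"sigma": 0.04}, "task_b": {"sigma": 0.29}})
record_run(runs, {"task_a": {"sigma": 0.18}, "task_b": {"sigma": 0.31}})
# dispersion_violations(runs) should trend toward zero as tasks are revised
```

The same per-run counting extends to the ceiling, floor, and discrimination flags, giving the dashboard a single trend line per target.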
Success Criteria
- All target metrics should be tracked automatically
- Tasks violating thresholds should be flagged for revision
- Overall benchmark dispersion should be measurable and improving
References
- Item Response Theory for AI benchmarks
- AllenAI fluid benchmarking approach