Goal
Define quantitative targets for benchmark difficulty and implement monitoring to track them.
Target Metrics
Score Dispersion
- Target: Standard deviation across models σ > 0.15
- Current: Many tasks have σ < 0.10 (insufficient spread)
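The dispersion target can be computed directly from a per-task score matrix. A minimal sketch, assuming each task maps to a list of per-model scores in [0, 1] (the data shape and task names here are illustrative):

```python
import statistics

def score_dispersion(scores_by_task):
    """Population standard deviation of model scores for each task."""
    return {task: statistics.pstdev(scores)
            for task, scores in scores_by_task.items()}

# Illustrative scores in [0, 1], one entry per model.
scores = {
    "task_a": [0.90, 0.85, 0.95, 0.88],  # tight cluster -> low sigma
    "task_b": [0.20, 0.50, 0.80, 0.95],  # wide spread  -> high sigma
}
dispersion = score_dispersion(scores)
low_spread = [t for t, s in dispersion.items() if s < 0.15]  # flags "task_a"
```

Population (rather than sample) standard deviation is used here since the tested models are the whole population of interest; switch to `statistics.stdev` if the model set is treated as a sample.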
Ceiling Effects
- Target: No task should have >85% pass rate across all tested models
- Rationale: Tasks where everyone succeeds don't differentiate
Floor Effects
- Target: No task should have <10% pass rate
- Rationale: Tasks where everyone fails are uninformative
Discrimination
- Target: Each task should have >0.3 point spread between best and median model
- Rationale: Tasks should actually separate capabilities
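The ceiling, floor, and discrimination targets above can all be checked with simple threshold rules. A sketch, assuming "pass rate across all tested models" means the mean per-model pass rate, and "median model" means the sorted midpoint (both interpretations are assumptions, not fixed by the targets):

```python
def flag_task(pass_rates):
    """Return the list of target violations for one task.

    pass_rates: per-model pass rates in [0, 1] (hypothetical input shape).
    """
    flags = []
    mean_rate = sum(pass_rates) / len(pass_rates)
    if mean_rate > 0.85:
        flags.append("ceiling")   # nearly every model passes
    if mean_rate < 0.10:
        flags.append("floor")     # nearly every model fails
    ranked = sorted(pass_rates)
    median = ranked[len(ranked) // 2]
    if max(pass_rates) - median <= 0.3:
        flags.append("low-discrimination")  # best-vs-median spread too small
    return flags

flag_task([0.90, 0.92, 0.95, 0.97])  # ceiling + low-discrimination
flag_task([0.20, 0.50, 0.85, 0.40])  # no violations
```

Any task returning a non-empty flag list would be queued for the review step described under Implementation.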
Item Response Theory Integration
Consider IRT-based analysis:
- Difficulty parameter (b): Where on capability scale is 50% success?
- Discrimination parameter (a): How sharply does task separate levels?
Ideal tasks:
- b values spread across capability range
- a values > 0.5 (tasks actually discriminate)
- No negative discrimination
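The two-parameter logistic (2PL) model behind these parameters gives the probability that a model at ability θ solves a task with difficulty b and discrimination a. A minimal sketch (variable names follow the standard IRT notation above):

```python
import math

def p_correct(theta, a, b):
    """2PL item response curve: P(success) for ability theta on task (a, b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the success probability is exactly 50%, by construction.
# Larger a makes the curve steeper, i.e. the task separates ability
# levels more sharply; a < 0 would mean weaker models do *better*.
```

Fitting a and b from observed pass/fail data requires an estimation step (e.g. maximum likelihood over the model-by-task response matrix), which is out of scope for this sketch.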
Implementation
- Add metrics calculation to benchmark reporting
- Create dashboard/report showing per-task metrics
- Flag tasks that violate targets for review
- Track metrics over time as new models are added
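The "track over time" step could start as simply as appending each run's per-task metrics to a history and counting violations per run; the in-memory list and record layout below are placeholders for whatever store the reporting pipeline actually uses:

```python
import time

def record_run(history, task_metrics):
    """Append one benchmark run's per-task metrics to a history list."""
    history.append({"timestamp": time.time(), "metrics": task_metrics})

def dispersion_violations(history, threshold=0.15):
    """For each recorded run, count tasks below the dispersion target."""
    return [sum(1 for m in run["metrics"].values() if m["sigma"] < threshold)
            for run in history]

runs = []
record_run(runs, {"task_a": {"sigma": 0.04}, "task_b": {"sigma": 0.29}})
record_run(runs, {"task_a": {"sigma": 0.18}, "task_b": {"sigma": 0.31}})
# dispersion_violations(runs) should trend toward zero as tasks are revised
```

The same per-run counting extends to the ceiling, floor, and discrimination flags, giving the dashboard a single trend line per target.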
Success Criteria
- All target metrics should be tracked automatically
- Tasks violating thresholds should be flagged for revision
- Overall benchmark dispersion should be measurable and improving
References
- Item Response Theory for AI benchmarks
- AllenAI fluid benchmarking approach