Goal
Convert binary 0/1 criteria to graduated 0.0 / 0.25 / 0.5 / 0.75 / 1.0 scales to better differentiate model quality.
Background
Most current tasks score each criterion as 0 or 1, which hides differences in solution quality. For example, a model that produces a correct but inefficient solution receives the same score as one that produces an elegant solution.
Implementation
- Update grading infrastructure to support graduated scores
- For each criterion, define quality levels:
  - 0.0: Missing/wrong
  - 0.25: Partially correct, major issues
  - 0.5: Mostly correct, some issues
  - 0.75: Correct with minor issues
  - 1.0: Fully correct/excellent
- Update existing tasks to use graduated criteria where appropriate
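As a rough illustration of the grading change above, the sketch below validates per-criterion scores against the five levels and aggregates them into a task score. The names `ALLOWED_LEVELS`, `validate_score`, and `task_score` are hypothetical, and the unweighted mean is an assumption; this ticket does not specify how criteria are combined.

```python
# Minimal sketch of graduated scoring. All names are hypothetical,
# not part of the existing grading infrastructure.

# The five quality levels defined in this ticket.
ALLOWED_LEVELS = (0.0, 0.25, 0.5, 0.75, 1.0)


def validate_score(score: float) -> float:
    """Return the score as a float, or raise if it is not one of the five levels."""
    if score not in ALLOWED_LEVELS:
        raise ValueError(f"score {score!r} is not one of {ALLOWED_LEVELS}")
    return float(score)


def task_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores into one task score.

    Assumption: an unweighted mean. Weighted aggregation may be preferable.
    """
    if not criterion_scores:
        raise ValueError("at least one criterion is required")
    validated = [validate_score(s) for s in criterion_scores.values()]
    return sum(validated) / len(validated)
```

Because 0 and 1 are among the allowed levels, existing binary criteria pass validation unchanged, so tasks could be migrated one at a time.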
Success Criteria
- Score distribution should show more variance (not clustering at 0 and 1)
- Model rankings should be more stable (less sensitivity to binary cutoffs)
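One possible way to check these criteria is sketched below: the share of scores sitting at 0 or 1, the population variance of the scores, and a Kendall tau between model rankings computed on two subsets of tasks. The function names and the choice of Kendall tau as the stability measure are assumptions, not something this ticket prescribes.

```python
# Sketch of success-criteria checks. Function names and the choice of
# metrics are assumptions made for illustration.
from itertools import combinations
from statistics import pvariance


def extreme_share(scores: list[float]) -> float:
    """Fraction of scores equal to 0 or 1 (lower means less clustering)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s in (0.0, 1.0)) / len(scores)


def score_variance(scores: list[float]) -> float:
    """Population variance of the scores."""
    return pvariance(scores)


def kendall_tau(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Kendall tau-a between two model-score mappings over the same models.

    No tie correction is applied; tied pairs count as neither
    concordant nor discordant.
    """
    models = sorted(scores_a)
    if sorted(scores_b) != models or len(models) < 2:
        raise ValueError("need the same models in both mappings, at least two")
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        product = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs
```

Comparing these numbers before and after the migration, for instance with rankings from two random halves of the task set, would give a concrete reading of "more variance" and "more stable".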
References
- RACE Benchmark multi-dimensional scoring
- Item Response Theory discrimination parameters