Goal
Track HOW models solve tasks, not just WHETHER they succeed, so that two models that both succeed can still be differentiated by how efficiently they got there.
Metrics to Add
- Token efficiency: Total tokens used per task (input + output)
- Step efficiency: Minimum necessary tool calls / actual tool calls (capped at 1.0)
- Error recovery rate: Successful corrections after initial failures
- Time to completion: Wall-clock time elapsed
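The four metrics above could be collected into a single per-task record; this is a minimal sketch with assumed field names (no such schema exists yet in the source):

```python
from dataclasses import dataclass

# Hypothetical per-task metrics container; all field names are assumptions.
@dataclass
class TaskMetrics:
    total_tokens: int          # input + output tokens consumed
    actual_steps: int          # tool calls actually made
    optimal_steps: int         # minimum necessary tool calls (from task metadata)
    failures: int              # initial failures encountered
    recoveries: int            # failures later corrected successfully
    wall_clock_seconds: float  # time to completion

    @property
    def step_efficiency(self) -> float:
        # Capped at 1.0 so undercutting the "optimal" count is not rewarded.
        return min(1.0, self.optimal_steps / self.actual_steps)

    @property
    def error_recovery_rate(self) -> float:
        # Fraction of initial failures that were later corrected.
        return self.recoveries / self.failures if self.failures else 1.0
```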
Implementation
- Add token counting to transcript analysis
- Define "optimal step count" for each task in metadata
- Track retry/correction patterns in transcripts
- Record timestamps in results
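The implementation steps above could be combined into one pass over a transcript. The event schema here is an assumption for illustration: each event is a dict with a "type", optional "usage" token counts, a "timestamp", and a "retry_of" field linking a correction back to a failed call.

```python
# Sketch of transcript analysis under an assumed event schema.
def analyze_transcript(events):
    total_tokens = 0
    tool_calls = 0
    retries = 0
    for ev in events:
        # Token counting: sum input + output usage wherever it is reported.
        usage = ev.get("usage", {})
        total_tokens += usage.get("input_tokens", 0) + usage.get("output_tokens", 0)
        if ev["type"] == "tool_call":
            tool_calls += 1
            # Retry/correction pattern: a call that re-attempts an earlier one.
            if ev.get("retry_of") is not None:
                retries += 1
    # Wall-clock time from the recorded timestamps.
    elapsed = events[-1]["timestamp"] - events[0]["timestamp"]
    return {"total_tokens": total_tokens, "tool_calls": tool_calls,
            "retries": retries, "wall_clock_seconds": elapsed}
```

The actual-vs-optimal step comparison then only needs the per-task "optimal step count" from metadata alongside the returned "tool_calls".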
Scoring Formula (proposed)
final_score = correctness * 0.7 + efficiency_factor * 0.2 + style_factor * 0.1
efficiency_factor = min(1.0, optimal_steps / actual_steps)
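The proposed formula translates directly to code; this is a sketch of the two lines above, assuming correctness and style_factor are already normalized to [0, 1]:

```python
def efficiency_factor(optimal_steps: int, actual_steps: int) -> float:
    # Capped at 1.0: taking fewer than the "optimal" steps is not rewarded.
    return min(1.0, optimal_steps / actual_steps)

def final_score(correctness: float, optimal_steps: int,
                actual_steps: int, style_factor: float) -> float:
    # Weighted blend from the proposed formula: 70% correctness,
    # 20% efficiency, 10% style.
    return (correctness * 0.7
            + efficiency_factor(optimal_steps, actual_steps) * 0.2
            + style_factor * 0.1)
```

With these weights, a model that takes twice the optimal step count loses at most 0.1 of final score, so correctness still dominates.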
Success Criteria
- Two models with identical correctness scores should be differentiated by their efficiency scores
- Verbose/wasteful approaches should score lower than elegant ones
References
- METR time horizon research on task efficiency
- SWE-bench Pro process metrics