Goal
Track HOW models solve tasks, not just WHETHER they succeed, so that two models that both succeed can still be differentiated by how efficiently they got there.
Metrics to Add
- Token efficiency: Total tokens used per task (input + output)
- Step efficiency: Minimum necessary tool calls / actual tool calls (capped at 1.0)
- Error recovery rate: Successful corrections after initial failures
- Time to completion: Wall-clock time elapsed
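The four metrics above could be collected into a single per-task record; this is a minimal sketch with assumed field names (no such schema exists yet in the source):

```python
from dataclasses import dataclass

# Hypothetical per-task metrics container; all field names are assumptions.
@dataclass
class TaskMetrics:
    total_tokens: int          # input + output tokens consumed
    actual_steps: int          # tool calls actually made
    optimal_steps: int         # minimum necessary tool calls (from task metadata)
    failures: int              # initial failures encountered
    recoveries: int            # failures later corrected successfully
    wall_clock_seconds: float  # time to completion

    @property
    def step_efficiency(self) -> float:
        # Capped at 1.0 so undercutting the "optimal" count is not rewarded.
        return min(1.0, self.optimal_steps / self.actual_steps)

    @property
    def error_recovery_rate(self) -> float:
        # Fraction of initial failures that were later corrected.
        return self.recoveries / self.failures if self.failures else 1.0
```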
Implementation
- Add token counting to transcript analysis
- Define "optimal step count" for each task in metadata
- Track retry/correction patterns in transcripts
- Record timestamps in results
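The implementation steps above could be combined into one pass over a transcript. The event schema here is an assumption for illustration: each event is a dict with a "type", optional "usage" token counts, a "timestamp", and a "retry_of" field linking a correction back to a failed call.

```python
# Sketch of transcript analysis under an assumed event schema.
def analyze_transcript(events):
    total_tokens = 0
    tool_calls = 0
    retries = 0
    for ev in events:
        # Token counting: sum input + output usage wherever it is reported.
        usage = ev.get("usage", {})
        total_tokens += usage.get("input_tokens", 0) + usage.get("output_tokens", 0)
        if ev["type"] == "tool_call":
            tool_calls += 1
            # Retry/correction pattern: a call that re-attempts an earlier one.
            if ev.get("retry_of") is not None:
                retries += 1
    # Wall-clock time from the recorded timestamps.
    elapsed = events[-1]["timestamp"] - events[0]["timestamp"]
    return {"total_tokens": total_tokens, "tool_calls": tool_calls,
            "retries": retries, "wall_clock_seconds": elapsed}
```

The actual-vs-optimal step comparison then only needs the per-task "optimal step count" from metadata alongside the returned "tool_calls".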
Scoring Formula (proposed)
final_score = correctness * 0.7 + efficiency_factor * 0.2 + style_factor * 0.1
efficiency_factor = min(1.0, optimal_steps / actual_steps)
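The proposed formula translates directly to code; this is a sketch of the two lines above, assuming correctness and style_factor are already normalized to [0, 1]:

```python
def efficiency_factor(optimal_steps: int, actual_steps: int) -> float:
    # Capped at 1.0: taking fewer than the "optimal" steps is not rewarded.
    return min(1.0, optimal_steps / actual_steps)

def final_score(correctness: float, optimal_steps: int,
                actual_steps: int, style_factor: float) -> float:
    # Weighted blend from the proposed formula: 70% correctness,
    # 20% efficiency, 10% style.
    return (correctness * 0.7
            + efficiency_factor(optimal_steps, actual_steps) * 0.2
            + style_factor * 0.1)
```

With these weights, a model that takes twice the optimal step count loses at most 0.1 of final score, so correctness still dominates.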
Success Criteria
- Two models with identical correctness scores should be differentiated by their efficiency scores
- Verbose/wasteful approaches should score lower than elegant ones
References
- METR time horizon research on task efficiency
- SWE-bench Pro process metrics