
Add process/efficiency metrics alongside correctness #332

@ScuttleBot

Description


Goal

Track HOW models solve tasks, not just WHETHER they succeed. This differentiates between models that both succeed on a task but do so with different efficiency.

Metrics to Add

  1. Token efficiency: Total tokens used per task (input + output)
  2. Step efficiency: Minimum necessary tool calls / actual tool calls (so the proposed formula below caps it at 1.0)
  3. Error recovery rate: Successful corrections after initial failures
  4. Time to completion: Wall-clock time elapsed
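A minimal sketch of how the four metrics above could be computed from a transcript. The `Transcript` structure and its field names are hypothetical placeholders, not an existing type in this repo:

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    # Hypothetical transcript record; field names are illustrative only.
    input_tokens: int
    output_tokens: int
    tool_calls: int
    failed_attempts: int
    recovered_attempts: int
    start_time: float  # Unix timestamp
    end_time: float


def compute_metrics(t: Transcript, optimal_steps: int) -> dict:
    """Derive the four proposed process metrics from one transcript."""
    return {
        # 1. Token efficiency: total tokens used (input + output)
        "token_total": t.input_tokens + t.output_tokens,
        # 2. Step efficiency: capped at 1.0, matching the proposed formula
        "step_efficiency": min(1.0, optimal_steps / t.tool_calls),
        # 3. Error recovery rate: corrections that succeeded after a failure
        "error_recovery_rate": (
            t.recovered_attempts / t.failed_attempts if t.failed_attempts else 1.0
        ),
        # 4. Time to completion: wall-clock seconds
        "wall_clock_seconds": t.end_time - t.start_time,
    }
```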

Implementation

  1. Add token counting to transcript analysis
  2. Define "optimal step count" for each task in metadata
  3. Track retry/correction patterns in transcripts
  4. Record timestamps in results
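For step 2, the per-task "optimal step count" could live in task metadata. A hypothetical entry (task name and keys are illustrative, not existing config):

```python
# Hypothetical per-task metadata; the task id and keys are illustrative only.
TASK_METADATA = {
    "fix-failing-test": {
        # Minimum tool calls a competent solution needs for this task
        "optimal_steps": 4,
        # Wall-clock budget, supporting the time-to-completion metric
        "timeout_seconds": 600,
    },
}
```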

Scoring Formula (proposed)

final_score = correctness * 0.7 + efficiency_factor * 0.2 + style_factor * 0.1
efficiency_factor = min(1.0, optimal_steps / actual_steps)
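The proposed formula, expressed directly as a sketch (`style_factor` would come from a separate grader, not defined in this issue):

```python
def efficiency_factor(optimal_steps: int, actual_steps: int) -> float:
    # Capped at 1.0 so a model is never rewarded for taking fewer
    # steps than the task's defined minimum.
    return min(1.0, optimal_steps / actual_steps)


def final_score(correctness: float, efficiency: float, style: float) -> float:
    # Proposed weighting: correctness dominates at 70%.
    return correctness * 0.7 + efficiency * 0.2 + style * 0.1
```

For example, two fully correct models with style 0.5 but 4 vs. 8 tool calls (optimal = 4) would score 0.95 vs. 0.85, satisfying the first success criterion below.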

Success Criteria

  • Two models with identical correctness scores should differentiate on efficiency
  • Verbose/wasteful approaches should score lower than elegant ones

References

  • METR time horizon research on task efficiency
  • SWE-bench Pro process metrics
