Add performance evaluators#9

Merged
krisztianfekete merged 2 commits into main from feature/performance-evaluators
Apr 14, 2026

krisztianfekete (Collaborator) commented Apr 14, 2026

Example config for testing:

evaluators:
  - name: token_efficiency
    type: code
    path: ../evaluators/evaluators/token_efficiency/token_efficiency.py
    threshold: 0.3
    config:
      max_input_tokens: 150000
      max_output_tokens: 50000

  - name: time_efficiency
    type: code
    path: ../evaluators/evaluators/time_efficiency/time_efficiency.py
    threshold: 0.3
    config:
      max_duration_s: 0.5
      latency_percentile: p95
      latency_source: overall

  - name: tool_efficiency
    type: code
    path: ../evaluators/evaluators/tool_efficiency/tool_efficiency.py
    threshold: 0.5
    config:
      max_tool_calls: 3
      min_tool_calls: 2
      penalize_duplicates: true
      penalize_errors: true
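
To make the `threshold` and `max_*_tokens` fields above more concrete, here is a minimal sketch of how a budget-style score could be derived and compared against the threshold. This is not the code in `token_efficiency.py`: the linear scoring formula, the `budget_score` and `token_efficiency` function names, and the `EfficiencyResult` type are all assumptions made for illustration. Only the parameter names and default values come from the config.

```python
from dataclasses import dataclass


@dataclass
class EfficiencyResult:
    score: float   # 0.0 (budget exhausted) .. 1.0 (nothing used)
    passed: bool   # True when score >= threshold


def budget_score(used: float, budget: float) -> float:
    """Hypothetical linear score: 1.0 at zero usage, 0.0 at or beyond the budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return min(1.0, max(0.0, 1.0 - used / budget))


def token_efficiency(
    input_tokens: int,
    output_tokens: int,
    max_input_tokens: int = 150_000,   # from the example config
    max_output_tokens: int = 50_000,   # from the example config
    threshold: float = 0.3,            # from the example config
) -> EfficiencyResult:
    # Assumption: the overall score is the worse of the two budget scores.
    score = min(
        budget_score(input_tokens, max_input_tokens),
        budget_score(output_tokens, max_output_tokens),
    )
    return EfficiencyResult(score=score, passed=score >= threshold)


if __name__ == "__main__":
    print(token_efficiency(input_tokens=75_000, output_tokens=10_000))
```

Under these assumptions, a run that uses half of its input budget and a fifth of its output budget scores 0.5 and passes the 0.3 threshold, while a run that exhausts either budget scores 0.0 and fails.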

Supersedes #7
Fixes #6

krisztianfekete requested a review from peterj April 14, 2026 12:33
Comment thread evaluators/token_efficiency/token_efficiency.py Outdated
Comment thread evaluators/tool_efficiency/tool_efficiency.py Outdated
peterj (Collaborator) left a comment

minor nits

krisztianfekete marked this pull request as ready for review April 14, 2026 14:18
krisztianfekete merged commit d262bae into main Apr 14, 2026
1 check passed
krisztianfekete deleted the feature/performance-evaluators branch April 14, 2026 14:27

Development

Successfully merging this pull request may close these issues.

Feature: performance evaluators for token efficiency, tool efficiency, and time-to-resolution

2 participants