Add openai_eval type to delegate evals to OpenAI APIs #73

Merged
EItanya merged 1 commit into main from feature/add-openai-eval-type on Mar 30, 2026

krisztianfekete commented on Mar 30, 2026

This PR adds support for OpenAI's Evals API as an evaluator backend in agentevals. Instead of running grading logic locally, it delegates to OpenAI's hosted evaluation infrastructure. The first supported grader is TextSimilarityGrader (fuzzy_match, BLEU, ROUGE, cosine, etc.), but the design is set up so adding more grader types later is straightforward.

We introduce a new evaluator type, openai_eval.

You can configure it like this:

evaluators:
  - name: response_similarity
    type: openai_eval
    threshold: 0.8
    grader:
      type: text_similarity
      evaluation_metric: bleu

Under the hood, the backend runs a create-eval, create-run, poll, collect-results, cleanup cycle against the OpenAI API. Because we only want grading for pre-existing agent traces, it puts both the actual and expected text into the item namespace and sets include_sample_schema: False, so OpenAI never tries to generate model outputs.
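
For readers unfamiliar with the Evals API, the cycle can be sketched roughly as below. This is an illustrative sketch, not the code in this PR: the function names (`build_eval_request`, `build_run_data_source`, `grade`), the `actual`/`expected` field names, and the polling defaults are all hypothetical. The request shapes and status values follow my understanding of the OpenAI Python SDK's `client.evals` interface and should be checked against the current API reference. The client is passed in, so the same function works with `openai.OpenAI()` or a stub.

```python
import time
from typing import Any, Iterable


def build_eval_request(name: str, metric: str, threshold: float) -> dict[str, Any]:
    """Keyword arguments for evals.create; items carry both texts to compare."""
    return {
        "name": name,
        "data_source_config": {
            "type": "custom",
            "item_schema": {
                "type": "object",
                "properties": {
                    "actual": {"type": "string"},    # hypothetical field name
                    "expected": {"type": "string"},  # hypothetical field name
                },
                "required": ["actual", "expected"],
            },
            # No sample namespace: nothing is generated, items are only graded.
            "include_sample_schema": False,
        },
        "testing_criteria": [
            {
                "type": "text_similarity",
                "name": name,
                "input": "{{item.actual}}",
                "reference": "{{item.expected}}",
                "evaluation_metric": metric,
                "pass_threshold": threshold,
            }
        ],
    }


def build_run_data_source(pairs: Iterable[tuple[str, str]]) -> dict[str, Any]:
    """Inline JSONL-style data source holding (actual, expected) pairs."""
    content = [{"item": {"actual": a, "expected": e}} for a, e in pairs]
    return {"type": "jsonl", "source": {"type": "file_content", "content": content}}


def _score(result: Any) -> Any:
    # Results may be plain dicts or typed objects depending on SDK version.
    return result["score"] if isinstance(result, dict) else result.score


def grade(client, name, metric, threshold, pairs,
          poll_seconds: float = 2.0, timeout_seconds: float = 300.0) -> list:
    """Create eval, create run, poll, collect scores, then always clean up."""
    eval_obj = client.evals.create(**build_eval_request(name, metric, threshold))
    try:
        run = client.evals.runs.create(
            eval_obj.id, name=name, data_source=build_run_data_source(pairs)
        )
        deadline = time.monotonic() + timeout_seconds
        while run.status in ("queued", "in_progress"):
            if time.monotonic() > deadline:
                raise TimeoutError(f"eval run {run.id} did not finish in time")
            time.sleep(poll_seconds)
            run = client.evals.runs.retrieve(run.id, eval_id=eval_obj.id)
        if run.status != "completed":
            raise RuntimeError(f"eval run ended with status {run.status!r}")
        items = client.evals.runs.output_items.list(run.id, eval_id=eval_obj.id)
        return [_score(r) for item in items for r in item.results]
    finally:
        # Cleanup: remove the temporary eval even if the run failed.
        client.evals.delete(eval_obj.id)
```

With a real client this would be called as `grade(OpenAI(), "response_similarity", "bleu", 0.8, [(actual_text, expected_text)])`. Doing the cleanup in `finally` is one way to avoid leaving temporary evals behind when a run fails or times out.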

krisztianfekete requested a review from peterj on March 30, 2026 at 10:14
krisztianfekete force-pushed the feature/add-openai-eval-type branch from a74e734 to 3899891 on March 30, 2026 at 10:16
krisztianfekete requested a review from EItanya on March 30, 2026 at 13:07
krisztianfekete changed the title from "Add openai_eval type to delegate eval to OpenAI APIs" to "Add openai_eval type to delegate evals to OpenAI APIs" on Mar 30, 2026
EItanya merged commit d7ef558 into main on Mar 30, 2026
4 checks passed