<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="docs/assets/logo-color-on-transparent.svg">
<source media="(prefers-color-scheme: light)" srcset="docs/assets/logo-dark-on-transparent.svg">
<img src="docs/assets/logo-color-on-transparent.svg" alt="agentevals" width="420" />
</picture>
</p>

<h1 align="center">Ship Agents Reliably</h1>

<p align="center">
Benchmark your agents before they hit production.<br>
agentevals scores performance and inference quality from OpenTelemetry traces — no re-runs, no guesswork.
</p>

<p align="center">
<a href="https://github.com/agentevals-dev/agentevals/stargazers"><img src="https://img.shields.io/github/stars/agentevals-dev/agentevals?style=social" alt="GitHub Stars"></a>
&nbsp;
<a href="https://discord.gg/cpveEn8Ah2"><img src="https://img.shields.io/discord/1435836734666707190?label=Discord&logo=discord&logoColor=white&color=5865F2" alt="Discord"></a>
&nbsp;
<a href="https://github.com/agentevals-dev/agentevals/releases"><img src="https://img.shields.io/github/v/release/agentevals-dev/agentevals?label=Release" alt="Release"></a>
&nbsp;
<a href="https://github.com/agentevals-dev/agentevals/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg" alt="License"></a>
&nbsp;
<a href="https://pypi.org/project/agentevals-cli/"><img src="https://img.shields.io/pypi/v/agentevals-cli?label=PyPI&color=blue" alt="PyPI"></a>
</p>

<p align="center">
<a href="#installation">Install</a> · <a href="#quick-start">Quick Start</a> · <a href="https://github.com/agentevals-dev/agentevals/releases">Releases</a> · <a href="CONTRIBUTING.md">Contributing</a> · <a href="https://discord.gg/cpveEn8Ah2">Discord</a>
</p>

---

## What is agentevals?

agentevals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want — no re-runs, no guesswork.

It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others), supports Jaeger JSON and OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.
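
For a sense of the Jaeger JSON side, an export wraps traces in a top-level `data` array, and each span carries an `operationName` plus a `duration` in microseconds. A few lines of Python are enough to list what an agent did (the file name is illustrative, and the exact span names depend on your framework's instrumentation):

```python
import json

# Jaeger JSON layout: {"data": [{"traceID": ..., "spans": [...]}, ...]}
with open("trace.json") as f:  # illustrative file name
    export = json.load(f)

for trace in export["data"]:
    print(f"trace {trace['traceID']}:")
    for span in trace["spans"]:
        # operationName is the span's name; duration is in microseconds
        print(f"  {span['operationName']} ({span['duration']} µs)")
```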

- **CLI** for scripting and CI pipelines
- **Web UI** for visual trace inspection during local development
- **MCP server** so MCP clients can run evaluations from a conversation

## Why agentevals?

Most evaluation tools require you to **re-execute your agent** for every test — burning tokens, time, and money on duplicate LLM calls. agentevals takes a different approach:

- **No re-execution** — score agents from existing traces without replaying expensive LLM calls
- **Framework-agnostic** — works with any agent framework that emits OpenTelemetry spans
- **Golden eval sets** — compare actual behavior against defined expected behaviors for deterministic pass/fail gating
- **Custom evaluators** — write scoring logic in Python, JavaScript, or any language; a sketch follows this list
- **CI/CD ready** — gate deployments on quality thresholds directly in your pipeline
- **Local-first** — no cloud dependency required; everything runs on your machine
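
To make the custom-evaluator bullet concrete, here is a minimal sketch of a deterministic tool-order check. The function shape, span fields, and score format are illustrative assumptions, not agentevals' actual plugin interface:

```python
def tool_order_evaluator(spans, expected_tools):
    """Pass only if the expected tools were called in the expected order
    (unrelated spans in between are ignored)."""
    called = [s["name"] for s in spans if s.get("kind") == "tool_call"]
    calls = iter(called)
    # `tool in calls` advances the iterator, so this checks that
    # expected_tools is an in-order subsequence of the actual calls.
    in_order = all(tool in calls for tool in expected_tools)
    return {"score": 1.0 if in_order else 0.0, "passed": in_order}


# Hypothetical usage with already-parsed spans:
spans = [
    {"name": "search_flights", "kind": "tool_call"},
    {"name": "book_flight", "kind": "tool_call"},
]
print(tool_order_evaluator(spans, ["search_flights", "book_flight"]))
# -> {'score': 1.0, 'passed': True}
```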

## How It Works

agentevals follows three simple steps:

1. **Collect traces** — Instrument your agent with OpenTelemetry (or export Jaeger JSON). Point the OTLP exporter at the agentevals receiver, or load trace files directly; a minimal setup is sketched right after this list.
2. **Define eval sets** — Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like; an illustrative structure follows the diagram below.
3. **Run evaluations** — Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns.
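
A minimal sketch of step 1 in Python, using the standard OpenTelemetry SDK. The receiver endpoint is an assumption (4318 is the default OTLP/HTTP port); point it at wherever your agentevals receiver actually listens:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Send spans to a local OTLP/HTTP receiver; the endpoint is illustrative.
exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Any OTel-instrumented framework (LangChain, Strands, Google ADK, ...)
# now routes its spans through this provider.
```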

```
┌─────────────────┐     ┌───────────────┐     ┌─────────────────┐
│    Your Agent   │────▶│  OTel Traces  │────▶│    agentevals   │
│ (any framework) │     │ (OTLP/Jaeger) │     │  CLI · UI · MCP │
└─────────────────┘     └───────────────┘     └────────┬────────┘
                                                ┌──────┴─────────┐
                                                │   Eval Sets    │
                                                │ (golden refs)  │
                                                └────────────────┘
```
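
agentevals' on-disk eval-set schema isn't shown in this section, so the structure below is purely illustrative: conceptually, a golden case pairs an input with the tools the agent is expected to call and a check on its final output.

```python
# Purely illustrative -- not agentevals' actual eval-set schema.
golden_case = {
    "name": "flight-booking-happy-path",
    "input": "Book the cheapest flight from SFO to JFK tomorrow",
    "expected_tools": ["search_flights", "book_flight"],  # required call order
    "expected_output": {"contains": "confirmation"},      # substring check
}
```

At evaluation time, the recorded spans are compared against each golden case, yielding the per-metric scores and pass/fail results described above.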

> [!IMPORTANT]
> This project is under active development. Expect breaking changes.

## Contents

- [What is agentevals?](#what-is-agentevals)
- [Why agentevals?](#why-agentevals)
- [How It Works](#how-it-works)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Integration](#integration)