diff --git a/README.md b/README.md
index 39be771..3262c75 100644
--- a/README.md
+++ b/README.md
@@ -1,20 +1,85 @@
-
+
+
+Benchmark your agents before they hit production.
+agentevals scores performance and inference quality from OpenTelemetry traces — no re-runs, no guesswork.
+
+Install · Quick Start · Releases · Contributing · Discord
+
+---
+
+## What is agentevals?
+
+agentevals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want — no re-runs, no guesswork.
+
+It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others), supports Jaeger JSON and OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.
 
 - **CLI** for scripting and CI pipelines
 - **Web UI** for visual inspection and local developer experience
 - **MCP server** so MCP clients can run evaluations from a conversation
 
+## Why agentevals?
+
+Most evaluation tools require you to **re-execute your agent** for every test — burning tokens, time, and money on duplicate LLM calls. agentevals takes a different approach:
+
+- **No re-execution** — score agents from existing traces without replaying expensive LLM calls
+- **Framework-agnostic** — works with any agent framework that emits OpenTelemetry spans
+- **Golden eval sets** — compare actual behavior against defined expected behaviors for deterministic pass/fail gating
+- **Custom evaluators** — write scoring logic in Python, JavaScript, or any language
+- **CI/CD ready** — gate deployments on quality thresholds directly in your pipeline
+- **Local-first** — no cloud dependency required; everything runs on your machine
+
+## How It Works
+
+agentevals follows three simple steps:
+
+1. **Collect traces** — Instrument your agent with OpenTelemetry (or export Jaeger JSON). Point the OTLP exporter at the agentevals receiver, or load trace files directly.
+2. **Define eval sets** — Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like.
+3. **Run evaluations** — Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns.
+
+```
+┌─────────────────┐     ┌───────────────┐     ┌──────────────────┐
+│   Your Agent    │────▶│  OTel Traces  │────▶│    agentevals    │
+│ (any framework) │     │ (OTLP/Jaeger) │     │  CLI · UI · MCP  │
+└─────────────────┘     └───────────────┘     └──────────────────┘
+                                                       │
+                                              ┌────────┴───────┐
+                                              │   Eval Sets    │
+                                              │ (golden refs)  │
+                                              └────────────────┘
+```
+
 > [!IMPORTANT]
 > This project is under active development. Expect breaking changes.
 
 ## Contents
 
+- [What is agentevals?](#what-is-agentevals)
+- [Why agentevals?](#why-agentevals)
+- [How It Works](#how-it-works)
 - [Installation](#installation)
 - [Quick Start](#quick-start)
 - [Integration](#integration)
diff --git a/docs/assets/logo-color-on-transparent.svg b/docs/assets/logo-color-on-transparent.svg
new file mode 100644
index 0000000..3c093a3
--- /dev/null
+++ b/docs/assets/logo-color-on-transparent.svg
@@ -0,0 +1,13 @@
+
diff --git a/docs/assets/logo-dark-on-transparent.svg b/docs/assets/logo-dark-on-transparent.svg
new file mode 100644
index 0000000..8c69ff8
--- /dev/null
+++ b/docs/assets/logo-dark-on-transparent.svg
@@ -0,0 +1,13 @@
+
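To make step 1 of "How It Works" concrete, here is a minimal sketch of instrumenting a Python agent with the standard OpenTelemetry SDK. The receiver endpoint and the span names are assumptions for illustration; point the exporter at whatever address the agentevals receiver actually listens on.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Export spans over OTLP/HTTP; the endpoint below is a hypothetical
# placeholder for the agentevals receiver address.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-agent")
with tracer.start_as_current_span("agent.run") as span:
    span.set_attribute("agent.input", "What's the weather in Berlin?")
    # ... run your agent here; an OTel-instrumented framework
    # (LangChain, Strands, Google ADK, ...) emits child spans for
    # each LLM call and tool invocation ...
```

Frameworks that ship OTel instrumentation only need the exporter configured; the spans themselves come for free.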
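Step 2's golden eval sets could describe expected behavior along these lines. This is purely illustrative: every field name below is a hypothetical stand-in, and the real schema is whatever the agentevals docs define.

```python
# Hypothetical golden eval set: the tools the agent should call,
# their order, and what the final output should look like.
golden_eval_set = {
    "name": "weather-agent-happy-path",
    "input": "What's the weather in Berlin?",
    "expected_tool_calls": [
        # Tools the agent should invoke, in this order.
        {"tool": "geocode_city", "args_contain": {"city": "Berlin"}},
        {"tool": "get_forecast"},
    ],
    "expected_output": {
        "must_contain": ["Berlin"],  # substrings the final answer must include
    },
}
```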
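And for the "CI/CD ready" bullet, a pipeline gate can stay simple: run the evaluation, then fail the build when any metric misses its threshold. The results-file layout below is an assumption with hypothetical field names; adapt it to whatever output format the CLI actually produces.

```python
# ci_gate.py: exit nonzero when any metric falls below its threshold,
# assuming evaluation results were written to a JSON file beforehand
# (the file name and structure here are hypothetical).
import json
import sys

with open("eval-results.json") as f:
    results = json.load(f)

failed = [m for m in results["metrics"] if m["score"] < m["threshold"]]
for m in failed:
    print(f"FAIL {m['name']}: {m['score']:.2f} < {m['threshold']:.2f}")

sys.exit(1 if failed else 0)
```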
+ +--- + +## What is agentevals? + +agentevals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want — no re-runs, no guesswork. + +It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others), supports Jaeger JSON and OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges. - **CLI** for scripting and CI pipelines - **Web UI** for visual inspection and local developer experience - **MCP server** so MCP clients can run evaluations from a conversation +## Why agentevals? + +Most evaluation tools require you to **re-execute your agent** for every test — burning tokens, time, and money on duplicate LLM calls. agentevals takes a different approach: + +- **No re-execution** — score agents from existing traces without replaying expensive LLM calls +- **Framework-agnostic** — works with any agent framework that emits OpenTelemetry spans +- **Golden eval sets** — compare actual behavior against defined expected behaviors for deterministic pass/fail gating +- **Custom evaluators** — write scoring logic in Python, JavaScript, or any language +- **CI/CD ready** — gate deployments on quality thresholds directly in your pipeline +- **Local-first** — no cloud dependency required; everything runs on your machine + +## How It Works + +agentevals follows three simple steps: + +1. **Collect traces** — Instrument your agent with OpenTelemetry (or export Jaeger JSON). Point the OTLP exporter at the agentevals receiver, or load trace files directly. +2. **Define eval sets** — Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like. +3. **Run evaluations** — Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns. + +``` +┌─────────────┐ ┌──────────────┐ ┌──────────────────┐ +│ Your Agent │────▶│ OTel Traces │────▶│ agentevals │ +│ (any framework) │ (OTLP/Jaeger) │ CLI · UI · MCP │ +└─────────────┘ └──────────────┘ └──────────────────┘ + │ + ┌───────┴────────┐ + │ Eval Sets │ + │ (golden refs) │ + └────────────────┘ +``` + > [!IMPORTANT] > This project is under active development. Expect breaking changes. ## Contents +- [What is agentevals?](#what-is-agentevals) +- [Why agentevals?](#why-agentevals) +- [How It Works](#how-it-works) - [Installation](#installation) - [Quick Start](#quick-start) - [Integration](#integration) diff --git a/docs/assets/logo-color-on-transparent.svg b/docs/assets/logo-color-on-transparent.svg new file mode 100644 index 0000000..3c093a3 --- /dev/null +++ b/docs/assets/logo-color-on-transparent.svg @@ -0,0 +1,13 @@ + diff --git a/docs/assets/logo-dark-on-transparent.svg b/docs/assets/logo-dark-on-transparent.svg new file mode 100644 index 0000000..8c69ff8 --- /dev/null +++ b/docs/assets/logo-dark-on-transparent.svg @@ -0,0 +1,13 @@ +