From db512f5bc14647a203a354ee906c679920ed6067 Mon Sep 17 00:00:00 2001
From: Sebastian Maniak
Date: Mon, 23 Mar 2026 18:57:56 -0400
Subject: [PATCH 1/9] Redesign README hero section with theme-aware logo and
badges
Add SVG logos that swap based on GitHub dark/light theme, slogan and
tagline, shields.io badges (stars, Discord, release, license, PyPI),
and a quick-nav link row.
Co-Authored-By: Claude Opus 4.6
---
README.md | 31 +++++++++++++++++++++--
docs/assets/logo-color-on-transparent.svg | 13 ++++++++++
docs/assets/logo-dark-on-transparent.svg | 13 ++++++++++
3 files changed, 55 insertions(+), 2 deletions(-)
create mode 100644 docs/assets/logo-color-on-transparent.svg
create mode 100644 docs/assets/logo-dark-on-transparent.svg
diff --git a/README.md b/README.md
index 39be771..3c74179 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,35 @@
-
+
+
+
+
+
-`agentevals` evaluates AI agent behavior from OpenTelemetry traces, without re-running the agent. Record once, score as many times as you want.
+
Ship Agents Reliably
+
+
+Benchmark your agents before they hit production.
+AgentEvals scores performance and inference quality from OpenTelemetry traces — no re-runs, no guesswork.
+
+
+---
Works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others). Supports Jaeger JSON and OTLP trace formats, built-in and custom evaluators, and LLM-based judges.
diff --git a/docs/assets/logo-color-on-transparent.svg b/docs/assets/logo-color-on-transparent.svg
new file mode 100644
index 0000000..3c093a3
--- /dev/null
+++ b/docs/assets/logo-color-on-transparent.svg
@@ -0,0 +1,13 @@
+
diff --git a/docs/assets/logo-dark-on-transparent.svg b/docs/assets/logo-dark-on-transparent.svg
new file mode 100644
index 0000000..8c69ff8
--- /dev/null
+++ b/docs/assets/logo-dark-on-transparent.svg
@@ -0,0 +1,13 @@
+
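On GitHub, the theme-aware logo swap this patch's message describes is conventionally done with a `<picture>` element whose `<source>` matches `prefers-color-scheme`. A minimal sketch using the asset paths from the diffstat above — which SVG variant serves which scheme is an assumption, since the markup itself is not shown here:

```html
<!-- Sketch only: GitHub swaps README images per theme via <picture>.
     The mapping of each SVG to dark/light mode is an assumption. -->
<picture>
  <source media="(prefers-color-scheme: dark)"
          srcset="docs/assets/logo-dark-on-transparent.svg">
  <img src="docs/assets/logo-color-on-transparent.svg" alt="agentevals logo">
</picture>
```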
From 0d34fab811c69254e852a23fe0c9d7fea1c58227 Mon Sep 17 00:00:00 2001
From: Sebastian Maniak
Date: Mon, 23 Mar 2026 18:58:05 -0400
Subject: [PATCH 2/9] Make slogan heading larger (h1)
Co-Authored-By: Claude Opus 4.6
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 3c74179..2fd8794 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@
-
Ship Agents Reliably
+
Ship Agents Reliably
Benchmark your agents before they hit production.
From ae20e63f40981ea3ef184b4e2beadba6fa304818 Mon Sep 17 00:00:00 2001
From: Sebastian Maniak
Date: Mon, 23 Mar 2026 19:04:08 -0400
Subject: [PATCH 3/9] Add What is, Why, and How It Works sections to README
Adds three overview sections based on aevals.ai site content to help
new visitors quickly understand the project's purpose and workflow.
Co-Authored-By: Claude Opus 4.6
---
README.md | 40 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 39 insertions(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 2fd8794..f4aa739 100644
--- a/README.md
+++ b/README.md
@@ -31,17 +31,55 @@ AgentEvals scores performance and inference quality from OpenTelemetry traces
---
-Works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others). Supports Jaeger JSON and OTLP trace formats, built-in and custom evaluators, and LLM-based judges.
+## What is AgentEvals?
+
+AgentEvals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want — no re-runs, no guesswork.
+
+It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others), supports Jaeger JSON and OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.
- **CLI** for scripting and CI pipelines
- **Web UI** for visual inspection and local developer experience
- **MCP server** so MCP clients can run evaluations from a conversation
+## Why AgentEvals?
+
+Most evaluation tools require you to **re-execute your agent** for every test — burning tokens, time, and money on duplicate LLM calls. AgentEvals takes a different approach:
+
+- **No re-execution** — score agents from existing traces without replaying expensive LLM calls
+- **Framework-agnostic** — works with any agent framework that emits OpenTelemetry spans
+- **Golden eval sets** — compare actual behavior against defined expected behaviors for deterministic pass/fail gating
+- **Custom evaluators** — write scoring logic in Python, JavaScript, or any language
+- **CI/CD ready** — gate deployments on quality thresholds directly in your pipeline
+- **Local-first** — no cloud dependency required; everything runs on your machine
+
+## How It Works
+
+AgentEvals follows three simple steps:
+
+1. **Collect traces** — Instrument your agent with OpenTelemetry (or export Jaeger JSON). Point the OTLP exporter at the AgentEvals receiver, or load trace files directly.
+2. **Define eval sets** — Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like.
+3. **Run evaluations** — Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns.
+
+```
+┌─────────────────┐     ┌───────────────┐     ┌────────────────┐
+│   Your Agent    │────▶│  OTel Traces  │────▶│   AgentEvals   │
+│ (any framework) │     │ (OTLP/Jaeger) │     │ CLI · UI · MCP │
+└─────────────────┘     └───────────────┘     └────────────────┘
+                                                      │
+                                             ┌────────┴───────┐
+                                             │   Eval Sets    │
+                                             │ (golden refs)  │
+                                             └────────────────┘
+```
+
> [!IMPORTANT]
> This project is under active development. Expect breaking changes.
## Contents
+- [What is AgentEvals?](#what-is-agentevals)
+- [Why AgentEvals?](#why-agentevals)
+- [How It Works](#how-it-works)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Integration](#integration)
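The golden-set comparison that step 2 of this patch describes (which tools should be called, in what order) reduces to an order-sensitive subsequence check. The sketch below is a hypothetical illustration of that check, not agentevals' actual schema or API; every name in it is invented:

```python
# Hypothetical sketch of the "golden eval set" idea: check the tool calls
# recorded in a trace against an expected (golden) sequence and gate
# deterministically on the result. Not agentevals' real API.
from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool          # True iff every expected tool matched in order
    score: float          # fraction of expected calls matched
    missing: list[str]    # expected tools never matched in order

def score_tool_order(actual: list[str], expected: list[str]) -> EvalResult:
    """Order-sensitive subsequence check of expected tools within actual calls."""
    matched, missing, pos = 0, [], 0
    for tool in expected:
        try:
            pos = actual.index(tool, pos) + 1  # search only past the last match
            matched += 1
        except ValueError:
            missing.append(tool)
    score = matched / len(expected) if expected else 1.0
    return EvalResult(passed=not missing, score=score, missing=missing)
```

A CI gate would then fail the build whenever `passed` is false, which is the deterministic pass/fail behavior the "Why" bullet list promises.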
From 3fb58ad672756892a39c6dda66acd174de0cc087 Mon Sep 17 00:00:00 2001
From: Sebastian Maniak
Date: Mon, 23 Mar 2026 19:05:45 -0400
Subject: [PATCH 4/9] Use lowercase agentevals throughout README
Co-Authored-By: Claude Opus 4.6
---
README.md | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/README.md b/README.md
index f4aa739..3262c75 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@
Benchmark your agents before they hit production.
-AgentEvals scores performance and inference quality from OpenTelemetry traces — no re-runs, no guesswork.
+agentevals scores performance and inference quality from OpenTelemetry traces — no re-runs, no guesswork.
@@ -31,9 +31,9 @@ AgentEvals scores performance and inference quality from OpenTelemetry traces
---
-## What is AgentEvals?
+## What is agentevals?
-AgentEvals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want — no re-runs, no guesswork.
+agentevals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want — no re-runs, no guesswork.
It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others), supports Jaeger JSON and OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.
@@ -41,9 +41,9 @@ It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, a
- **Web UI** for visual inspection and local developer experience
- **MCP server** so MCP clients can run evaluations from a conversation
-## Why AgentEvals?
+## Why agentevals?
-Most evaluation tools require you to **re-execute your agent** for every test — burning tokens, time, and money on duplicate LLM calls. AgentEvals takes a different approach:
+Most evaluation tools require you to **re-execute your agent** for every test — burning tokens, time, and money on duplicate LLM calls. agentevals takes a different approach:
- **No re-execution** — score agents from existing traces without replaying expensive LLM calls
- **Framework-agnostic** — works with any agent framework that emits OpenTelemetry spans
@@ -54,15 +54,15 @@ Most evaluation tools require you to **re-execute your agent** for every test
## How It Works
-AgentEvals follows three simple steps:
+agentevals follows three simple steps:
-1. **Collect traces** — Instrument your agent with OpenTelemetry (or export Jaeger JSON). Point the OTLP exporter at the AgentEvals receiver, or load trace files directly.
+1. **Collect traces** — Instrument your agent with OpenTelemetry (or export Jaeger JSON). Point the OTLP exporter at the agentevals receiver, or load trace files directly.
2. **Define eval sets** — Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like.
3. **Run evaluations** — Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns.
 ```
 ┌─────────────────┐     ┌───────────────┐     ┌────────────────┐
-│   Your Agent    │────▶│  OTel Traces  │────▶│   AgentEvals   │
+│   Your Agent    │────▶│  OTel Traces  │────▶│   agentevals   │
 │ (any framework) │     │ (OTLP/Jaeger) │     │ CLI · UI · MCP │
 └─────────────────┘     └───────────────┘     └────────────────┘
                                                       │
@@ -77,8 +77,8 @@ AgentEvals follows three simple steps:
## Contents
-- [What is AgentEvals?](#what-is-agentevals)
-- [Why AgentEvals?](#why-agentevals)
+- [What is agentevals?](#what-is-agentevals)
+- [Why agentevals?](#why-agentevals)
- [How It Works](#how-it-works)
- [Installation](#installation)
- [Quick Start](#quick-start)
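The "Collect traces" step in the How It Works text this hunk renames can be wired up without touching agent code, using OpenTelemetry's standard SDK environment variables. The variable names below are real OTel spec names; the endpoint value is an assumption, not a documented agentevals address:

```shell
# Standard OTel SDK environment variables; only the endpoint value is an
# assumption -- use whatever the agentevals receiver actually listens on.
export OTEL_TRACES_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
python my_agent.py   # any OTel-instrumented entry point; spans stream out via OTLP
```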
From 7dd99f4c481749dd1f975b91aedfc06b1016e0a8 Mon Sep 17 00:00:00 2001
From: Sebastian Maniak
Date: Mon, 23 Mar 2026 19:22:13 -0400
Subject: [PATCH 5/9] Remove overview sections from table of contents
They live at the top of the README and don't need TOC entries.
Co-Authored-By: Claude Opus 4.6
---
README.md | 3 ---
1 file changed, 3 deletions(-)
diff --git a/README.md b/README.md
index 3262c75..3dbf78e 100644
--- a/README.md
+++ b/README.md
@@ -77,9 +77,6 @@ agentevals follows three simple steps:
## Contents
-- [What is agentevals?](#what-is-agentevals)
-- [Why agentevals?](#why-agentevals)
-- [How It Works](#how-it-works)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Integration](#integration)
From 5920b4e61aef490bc7d3dce03a944310e0fed042 Mon Sep 17 00:00:00 2001
From: Sebastian Maniak
Date: Mon, 23 Mar 2026 19:40:52 -0400
Subject: [PATCH 6/9] Replace ASCII diagram with animated gif in How It Works
section
Co-Authored-By: Claude Opus 4.6
---
README.md | 14 +++-----------
docs/assets/how-it-works.gif | Bin 0 -> 246826 bytes
2 files changed, 3 insertions(+), 11 deletions(-)
create mode 100644 docs/assets/how-it-works.gif
diff --git a/README.md b/README.md
index 3dbf78e..a887d0a 100644
--- a/README.md
+++ b/README.md
@@ -60,17 +60,9 @@ agentevals follows three simple steps:
2. **Define eval sets** — Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like.
3. **Run evaluations** — Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns.
-```
-┌─────────────────┐     ┌───────────────┐     ┌────────────────┐
-│   Your Agent    │────▶│  OTel Traces  │────▶│   agentevals   │
-│ (any framework) │     │ (OTLP/Jaeger) │     │ CLI · UI · MCP │
-└─────────────────┘     └───────────────┘     └────────────────┘
-                                                      │
-                                             ┌────────┴───────┐
-                                             │   Eval Sets    │
-                                             │ (golden refs)  │
-                                             └────────────────┘
-```
+
+
+
> [!IMPORTANT]
> This project is under active development. Expect breaking changes.
diff --git a/docs/assets/how-it-works.gif b/docs/assets/how-it-works.gif
new file mode 100644
index 0000000000000000000000000000000000000000..878e5a938d8e157a6796c7cc4642f0677a8ef820
GIT binary patch
literal 246826