agenteval

CI · Python 3.9+ · License: MIT

Lightweight evaluation framework for AI agents. Measure accuracy, cost, latency, and safety across any agent architecture.

Works with any agent that accepts a string and returns a string — LangChain, CrewAI, AutoGen, OpenAI Assistants, or plain functions.
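For frameworks whose agents expose a richer interface than plain strings, a thin adapter is usually all that's needed. A minimal sketch (the `as_string_agent` helper is hypothetical, not part of agenteval; the `.invoke()` call and `.content` attribute are assumptions about the wrapped framework):

```python
def as_string_agent(agent):
    """Wrap an object exposing .invoke() (e.g. a LangChain runnable)
    into the plain callable(str) -> str that agenteval expects.
    Illustrative helper -- not shipped with agenteval."""
    def call(prompt: str) -> str:
        result = agent.invoke(prompt)
        # Some frameworks return message objects rather than strings;
        # fall back to a .content attribute, then to str().
        return result if isinstance(result, str) else getattr(result, "content", str(result))
    return call
```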

Installation

pip install agenteval

Quick Start

from agenteval import AgentEvaluator, TaskSuite

# Define tasks
suite = TaskSuite.from_list([
    {"name": "math", "prompt": "What is 2+2?", "expected": "4", "category": "math"},
    {"name": "capital", "prompt": "Capital of France?", "expected": "Paris", "category": "geo"},
    {"name": "code", "prompt": "Write hello world in Python", "expected": "print", "category": "code"},
])

# Evaluate agents
evaluator = AgentEvaluator(
    agents={
        "agent_a": my_agent_a,  # any callable(str) -> str
        "agent_b": my_agent_b,
    },
    runs_per_task=3,  # run each task 3x for reliability measurement
)

results = evaluator.run(suite)

# Check metrics
print(results["agent_a"].metrics.accuracy)
print(results["agent_a"].metrics.latency_p95)
print(results["agent_a"].safety.safety_score)

# Compare side-by-side
evaluator.compare_results(results).print_table()
# agent   | accuracy | success_rate | latency_mean | latency_p95 | tokens_mean | cost_per_run
# agent_a | 91.1%    | 100.0%       | 2800ms       | 3200ms      | 450         | $0.0135
# agent_b | 87.3%    | 97.5%        | 3100ms       | 4500ms      | 520         | $0.0156
# Winner: agent_a

What It Measures

| Module      | Metrics |
| ----------- | ------- |
| Accuracy    | Exact match, containment match, custom judge functions |
| Latency     | Mean, p50, p95, p99 (per run, in ms) |
| Cost        | Token-based cost estimation (configurable per-model pricing) |
| Reliability | Success rate across runs, error categorization |
| Safety      | PII leakage (email, phone, SSN, credit card), prompt injection detection, custom forbidden patterns |
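The accuracy row mentions custom judge functions; conceptually, a judge is just a callable that compares an agent's output to the expected value and returns pass/fail. A minimal sketch (the exact signature agenteval expects may differ from this):

```python
def keyword_judge(output: str, expected: str) -> bool:
    """Pass if every whitespace-separated keyword in `expected`
    appears somewhere in the output, case-insensitively."""
    out = output.lower()
    return all(kw.lower() in out for kw in expected.split())
```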

Task Suites

Define tasks in code, JSON, or YAML:

# tasks.yaml
name: customer_support
tasks:
  - name: greeting
    prompt: "Hi, I need help with my order"
    expected: "help"
    category: greeting
  - name: refund
    prompt: "I want a refund for order #1234"
    expected: "refund"
    category: transactions

suite = TaskSuite.from_yaml("tasks.yaml")
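A JSON suite can be routed through the same `TaskSuite.from_list` constructor shown in the Quick Start. A sketch using only the standard library to parse the file contents (the commented line assumes agenteval is installed; the top-level `"tasks"` key mirrors the YAML layout and is an assumption):

```python
import json

# Contents of a hypothetical tasks.json, mirroring the YAML example above.
tasks = json.loads('''
{"tasks": [
  {"name": "greeting", "prompt": "Hi, I need help with my order",
   "expected": "help", "category": "greeting"},
  {"name": "refund", "prompt": "I want a refund for order #1234",
   "expected": "refund", "category": "transactions"}
]}
''')["tasks"]

# Feed into the documented code-based constructor:
# suite = TaskSuite.from_list(tasks)
```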

Safety Checks

from agenteval import SafetyChecker

checker = SafetyChecker(
    check_pii=True,           # emails, phones, SSNs, credit cards, IPs
    check_injection=True,      # prompt injection leak detection
    forbidden_patterns=[       # custom regex patterns
        r"SECRET_KEY",
        r"password\s*[:=]",
    ],
)

report = checker.check(run_results)
print(report.safety_score)  # 0.0 - 1.0
print(report.violations)    # list of SafetyViolation
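Under the hood, PII detection of this kind typically reduces to regex scanning. A simplified standalone sketch of an email/SSN scan, illustrative only and not agenteval's actual implementation (real detectors use more robust patterns and validation):

```python
import re

# Deliberately simple patterns for illustration.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_pii(text: str) -> list:
    """Return (kind, matched_text) pairs for each PII hit in `text`."""
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        hits.extend((kind, match) for match in pattern.findall(text))
    return hits
```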

Judge Functions

Built-in judges for different matching strategies:

from agenteval.judges import exact_match, contains_match, numeric_match

exact_match("Paris", "Paris")           # True
contains_match("The answer is 42", "42") # True
numeric_match("3.14159", "3.14", tolerance=0.01)  # True
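The built-in judges behave roughly like these standalone equivalents (illustrative re-implementations to show the matching semantics, not agenteval's source):

```python
def exact(output: str, expected: str) -> bool:
    """Strict equality after trimming whitespace."""
    return output.strip() == expected.strip()

def contains(output: str, expected: str) -> bool:
    """Case-insensitive substring containment."""
    return expected.lower() in output.lower()

def numeric(output: str, expected: str, tolerance: float = 0.0) -> bool:
    """Parse both sides as floats and compare within `tolerance`."""
    try:
        return abs(float(output) - float(expected)) <= tolerance
    except ValueError:
        return False
```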

LLM-as-Judge

For semantic evaluation using OpenAI or Anthropic:

from agenteval.judges import llm_judge, anthropic_judge

# OpenAI
evaluator = AgentEvaluator(
    agents={"my_agent": agent_fn},
    judge_fn=llm_judge(model="gpt-4o-mini"),
)

# Anthropic
evaluator = AgentEvaluator(
    agents={"my_agent": agent_fn},
    judge_fn=anthropic_judge(model="claude-sonnet-4-20250514"),
)

Install LLM support: pip install agenteval[openai] or pip install agenteval[anthropic]
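An LLM judge typically wraps a chat completion that asks the model whether the answer satisfies the expectation, then parses a yes/no. A hedged sketch with an injectable `complete` callable so it works against any client, or a stub in tests; the prompt wording and YES/NO parsing are assumptions, not agenteval's internals:

```python
def make_llm_judge(complete):
    """Build a judge from `complete`, any callable(prompt: str) -> str
    backed by an LLM (OpenAI, Anthropic, or a local model)."""
    def judge(output: str, expected: str) -> bool:
        prompt = (
            "Does the ANSWER satisfy the EXPECTATION? Reply YES or NO.\n"
            f"EXPECTATION: {expected}\n"
            f"ANSWER: {output}"
        )
        # Treat any reply beginning with YES (case-insensitive) as a pass.
        return complete(prompt).strip().upper().startswith("YES")
    return judge
```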

CLI

# Validate a task suite
agenteval validate tasks.yaml

# Show task suite info
agenteval info tasks.yaml

# Version
agenteval version

Development

git clone https://github.com/atharvajoshi01/agenteval.git
cd agenteval
pip install -e ".[dev]"
pytest

License

MIT
