🔬 AgentProbe

Playwright for AI Agents — Test, Record, and Replay Agent Behaviors

Your agent decides which tools to call, what data to trust, and how to respond.
AgentProbe makes sure it does it right.

Quick Start · Why AgentProbe? · Comparison · Docs · Contributing

Why AgentProbe?

Your UI has Playwright. Your API has Postman. Your AI agent has... console.log?

Agents pick tools, handle failures, process user data — all autonomously. One bad prompt → PII leak. One missed tool call → silent workflow failure. And you're testing this with vibes?

AgentProbe lets you write tests in YAML, assert on tool calls (not just text output), inject chaos, and catch regressions before your users do.

tests:
  - input: "Book a flight NYC → London, next Friday"
    expect:
      tool_called: search_flights
      tool_called_with: { origin: "NYC", dest: "LDN" }
      output_contains: "flight"
      no_pii_leak: true
      max_steps: 5

4 assertions. 1 YAML file. Zero boilerplate. Works with any LLM.

Quick Start

npm install @neuzhou/agentprobe

# Scaffold a test project
npx agentprobe init

# Run your first test (no API key needed!)
npx agentprobe run examples/quickstart/test-mock.yaml

Programmatic API

import { AgentProbe } from '@neuzhou/agentprobe';

const probe = new AgentProbe({ adapter: 'openai', model: 'gpt-4o' });
const result = await probe.test({
  input: 'What is the capital of France?',
  expect: { output_contains: 'Paris', no_hallucination: true, latency_ms: { max: 3000 } },
});

📺 See it in action (click to expand)

$ agentprobe init
✨ Example test file created: tests/example.test.yaml
   Edit it to match your agent, then run:
   agentprobe run tests/example.test.yaml

$ agentprobe run examples/quickstart/test-mock.yaml

  🔬 Mock Agent Test
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ❌ Agent greets user (2ms)
     ↳ output_contains: "Hello": Output does not contain "Hello"
  ❌ Agent answers factual question (0ms)
     ↳ output_contains: "Paris": Output does not contain "Paris"
  ✅ Agent rejects prompt injection (0ms)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  1/3 passed (33%) in 2ms

  📋 Total assertions: 4
  🏆 Most assertions: Agent answers factual question (2)

The mock adapter returns empty output (no LLM), so text assertions fail as expected — no_prompt_injection passes because the mock doesn't leak. Connect a real adapter to see full green.

How AgentProbe Compares

Feature	AgentProbe	Promptfoo	DeepEval
Tool call assertions	✅ 6 types	❌	❌
Chaos & fault injection	✅	❌	❌
Contract testing	✅	❌	❌
Multi-agent orchestration	✅	❌	❌
Trace record & replay	✅	❌	❌
Security scanning	✅ PII, injection, system leak	✅ Red teaming	⚠️ Basic
LLM-as-Judge	✅ Any model	✅	✅ G-Eval
YAML test definitions	✅	✅	❌ Python only
9 LLM adapters	✅	✅ Many	✅ Many
CI/CD integration	✅ JUnit, GH Actions	✅	✅

TL;DR: Promptfoo tests prompts. DeepEval tests LLM outputs. AgentProbe tests agent behavior.

Key Features

Feature	Description
🎯 Tool Call Assertions	6 types — `tool_called`, `tool_called_with`, `no_tool_called`, `tool_call_order`
💥 Chaos Testing	Tool timeouts, malformed responses, rate limits, fault injection
📜 Contract Testing	Enforce behavioral invariants across agent versions
🤝 Multi-Agent Testing	Test handoff sequences in multi-agent orchestration
🔴 Record & Replay	Record live sessions, generate tests, replay deterministically
🛡️ Security Scanning	PII leak, prompt injection, system prompt exposure detection
🧑‍⚖️ LLM-as-Judge	Use a stronger model to evaluate nuanced quality
📊 HTML Reports	Self-contained dashboards with SVG charts
🔄 Regression Detection	Compare against saved baselines, CI-friendly
🤖 GitHub Action	Built-in reusable action for CI/CD pipelines

📖 Full Documentation — 17+ assertion types, 9 adapters, 80+ CLI commands, examples, architecture

Roadmap

YAML behavioral testing · 17+ assertions · 9 adapters
Tool mocking · Chaos testing · Contract testing · Multi-agent
Record & replay · Security scanning · HTML reports · CI/CD
AWS Bedrock / Azure OpenAI adapters
VS Code extension · Web report portal

🌐 Ecosystem

Project	Description
FinClaw	AI-native quantitative finance engine
ClawGuard	AI Agent Immune System — 285+ threat patterns, zero dependencies
AgentProbe	Playwright for AI Agents — test, record, replay agent behaviors

Contributing

git clone https://github.com/NeuZhou/agentprobe.git
cd agentprobe && npm install && npm test

See CONTRIBUTING.md for guidelines.

License

MIT © NeuZhou

If your agents touch production data, they need tests. Not just prompts — behavior tests.

⭐ Star on GitHub · 📦 npm · 🐛 Report Bug

Name		Name	Last commit message	Last commit date
Latest commit History 146 Commits
.github		.github
assets		assets
benchmarks		benchmarks
docs		docs
examples		examples
references		references
skill		skill
src		src
tests		tests
.eslintrc.json		.eslintrc.json
.gitignore		.gitignore
.npmignore		.npmignore
.prettierrc		.prettierrc
.secret-patterns		.secret-patterns
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.ja.md		README.ja.md
README.ko.md		README.ko.md
README.md		README.md
README.zh-CN.md		README.zh-CN.md
SECURITY.md		SECURITY.md
SKILL.md		SKILL.md
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json
vitest.config.ts		vitest.config.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🔬 AgentProbe

Playwright for AI Agents — Test, Record, and Replay Agent Behaviors

Why AgentProbe?

Quick Start

Programmatic API

How AgentProbe Compares

Key Features

Roadmap

🌐 Ecosystem

Contributing

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🔬 AgentProbe

Playwright for AI Agents — Test, Record, and Replay Agent Behaviors

Why AgentProbe?

Quick Start

Programmatic API

How AgentProbe Compares

Key Features

Roadmap

🌐 Ecosystem

Contributing

License

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages