English | ζ₯ζ¬θͺ | νκ΅μ΄ | δΈζ
Your agent decides which tools to call, what data to trust, and how to respond.
AgentProbe makes sure it does it right.
Quick Start Β· Why AgentProbe? Β· Comparison Β· Docs Β· Contributing
Your UI has Playwright. Your API has Postman. Your AI agent has... console.log?
Agents pick tools, handle failures, process user data β all autonomously. One bad prompt β PII leak. One missed tool call β silent workflow failure. And you're testing this with vibes?
AgentProbe lets you write tests in YAML, assert on tool calls (not just text output), inject chaos, and catch regressions before your users do.
tests:
- input: "Book a flight NYC β London, next Friday"
expect:
tool_called: search_flights
tool_called_with: { origin: "NYC", dest: "LDN" }
output_contains: "flight"
no_pii_leak: true
max_steps: 54 assertions. 1 YAML file. Zero boilerplate. Works with any LLM.
npm install @neuzhou/agentprobe
# Scaffold a test project
npx agentprobe init
# Run your first test (no API key needed!)
npx agentprobe run examples/quickstart/test-mock.yamlimport { AgentProbe } from '@neuzhou/agentprobe';
const probe = new AgentProbe({ adapter: 'openai', model: 'gpt-4o' });
const result = await probe.test({
input: 'What is the capital of France?',
expect: { output_contains: 'Paris', no_hallucination: true, latency_ms: { max: 3000 } },
});πΊ See it in action (click to expand)
$ agentprobe init
β¨ Example test file created: tests/example.test.yaml
Edit it to match your agent, then run:
agentprobe run tests/example.test.yaml
$ agentprobe run examples/quickstart/test-mock.yaml
π¬ Mock Agent Test
ββββββββββββββββββββββββββββββββββββββββββββββββββ
β Agent greets user (2ms)
β³ output_contains: "Hello": Output does not contain "Hello"
β Agent answers factual question (0ms)
β³ output_contains: "Paris": Output does not contain "Paris"
β
Agent rejects prompt injection (0ms)
ββββββββββββββββββββββββββββββββββββββββββββββββββ
1/3 passed (33%) in 2ms
π Total assertions: 4
π Most assertions: Agent answers factual question (2)
The mock adapter returns empty output (no LLM), so text assertions fail as expected β no_prompt_injection passes because the mock doesn't leak. Connect a real adapter to see full green.
| Feature | AgentProbe | Promptfoo | DeepEval |
|---|---|---|---|
| Tool call assertions | β 6 types | β | β |
| Chaos & fault injection | β | β | β |
| Contract testing | β | β | β |
| Multi-agent orchestration | β | β | β |
| Trace record & replay | β | β | β |
| Security scanning | β PII, injection, system leak | β Red teaming | |
| LLM-as-Judge | β Any model | β | β G-Eval |
| YAML test definitions | β | β | β Python only |
| 9 LLM adapters | β | β Many | β Many |
| CI/CD integration | β JUnit, GH Actions | β | β |
TL;DR: Promptfoo tests prompts. DeepEval tests LLM outputs. AgentProbe tests agent behavior.
| Feature | Description |
|---|---|
| π― Tool Call Assertions | 6 types β tool_called, tool_called_with, no_tool_called, tool_call_order |
| π₯ Chaos Testing | Tool timeouts, malformed responses, rate limits, fault injection |
| π Contract Testing | Enforce behavioral invariants across agent versions |
| π€ Multi-Agent Testing | Test handoff sequences in multi-agent orchestration |
| π΄ Record & Replay | Record live sessions, generate tests, replay deterministically |
| π‘οΈ Security Scanning | PII leak, prompt injection, system prompt exposure detection |
| π§ββοΈ LLM-as-Judge | Use a stronger model to evaluate nuanced quality |
| π HTML Reports | Self-contained dashboards with SVG charts |
| π Regression Detection | Compare against saved baselines, CI-friendly |
| π€ GitHub Action | Built-in reusable action for CI/CD pipelines |
π Full Documentation β 17+ assertion types, 9 adapters, 80+ CLI commands, examples, architecture
- YAML behavioral testing Β· 17+ assertions Β· 9 adapters
- Tool mocking Β· Chaos testing Β· Contract testing Β· Multi-agent
- Record & replay Β· Security scanning Β· HTML reports Β· CI/CD
- AWS Bedrock / Azure OpenAI adapters
- VS Code extension Β· Web report portal
| Project | Description |
|---|---|
| FinClaw | AI-native quantitative finance engine |
| ClawGuard | AI Agent Immune System β 285+ threat patterns, zero dependencies |
| AgentProbe | Playwright for AI Agents β test, record, replay agent behaviors |
git clone https://github.com/NeuZhou/agentprobe.git
cd agentprobe && npm install && npm testSee CONTRIBUTING.md for guidelines.
If your agents touch production data, they need tests. Not just prompts β behavior tests.
