English | 日本語 | 한국어 | 中文

🔬 AgentProbe

Playwright for AI Agents — Test, Record, and Replay Agent Behaviors

AgentProbe — Test Every Decision Your Agent Makes

Your agent decides which tools to call, what data to trust, and how to respond.
AgentProbe makes sure it gets each of those decisions right.


Quick Start · Why AgentProbe? · Comparison · Docs · Contributing


Why AgentProbe?

Your UI has Playwright. Your API has Postman. Your AI agent has... console.log?

Agents pick tools, handle failures, process user data — all autonomously. One bad prompt → PII leak. One missed tool call → silent workflow failure. And you're testing this with vibes?

AgentProbe lets you write tests in YAML, assert on tool calls (not just text output), inject chaos, and catch regressions before your users do.

tests:
  - input: "Book a flight NYC → London, next Friday"
    expect:
      tool_called: search_flights
      tool_called_with: { origin: "NYC", dest: "LDN" }
      output_contains: "flight"
      no_pii_leak: true
      max_steps: 5

5 assertions. 1 YAML file. Zero boilerplate. Works with any LLM.
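Conceptually, each of these assertions is just a predicate over the agent's recorded trace. Below is a minimal TypeScript sketch of how tool_called and tool_called_with could be checked; all types and names here are hypothetical illustrations, not AgentProbe's actual internals.

```typescript
// Hypothetical sketch of tool-call assertion checks over a recorded trace.
// Not AgentProbe's real implementation; names are illustrative.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface AgentTrace {
  toolCalls: ToolCall[];
  output: string;
  steps: number;
}

// tool_called: the named tool appears somewhere in the trace.
function checkToolCalled(trace: AgentTrace, tool: string): boolean {
  return trace.toolCalls.some((c) => c.name === tool);
}

// tool_called_with: some call to the tool carries every expected
// key/value pair (extra args are ignored).
function checkToolCalledWith(
  trace: AgentTrace,
  tool: string,
  expected: Record<string, unknown>
): boolean {
  return trace.toolCalls.some(
    (c) =>
      c.name === tool &&
      Object.entries(expected).every(([k, v]) => c.args[k] === v)
  );
}

// A trace the flight-booking test above would accept.
const trace: AgentTrace = {
  toolCalls: [{ name: "search_flights", args: { origin: "NYC", dest: "LDN" } }],
  output: "Found 3 flights from NYC to London.",
  steps: 3,
};

console.log(checkToolCalled(trace, "search_flights"));                      // true
console.log(checkToolCalledWith(trace, "search_flights", { dest: "LDN" })); // true
console.log(trace.steps <= 5);                                              // max_steps: 5 passes
```

The key point: assertions run against structured trace data, not just the final text, which is what makes tool-level regressions catchable at all.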


Quick Start

npm install @neuzhou/agentprobe

# Scaffold a test project
npx agentprobe init

# Run your first test (no API key needed!)
npx agentprobe run examples/quickstart/test-mock.yaml

Programmatic API

import { AgentProbe } from '@neuzhou/agentprobe';

const probe = new AgentProbe({ adapter: 'openai', model: 'gpt-4o' });
const result = await probe.test({
  input: 'What is the capital of France?',
  expect: { output_contains: 'Paris', no_hallucination: true, latency_ms: { max: 3000 } },
});

📺 See it in action (click to expand)
$ agentprobe init
✨ Example test file created: tests/example.test.yaml
   Edit it to match your agent, then run:
   agentprobe run tests/example.test.yaml

$ agentprobe run examples/quickstart/test-mock.yaml

  🔬 Mock Agent Test
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ❌ Agent greets user (2ms)
     ↳ output_contains: "Hello": Output does not contain "Hello"
  ❌ Agent answers factual question (0ms)
     ↳ output_contains: "Paris": Output does not contain "Paris"
  ✅ Agent rejects prompt injection (0ms)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  1/3 passed (33%) in 2ms

  📋 Total assertions: 4
  🏆 Most assertions: Agent answers factual question (2)

The mock adapter returns empty output (no LLM), so text assertions fail as expected — no_prompt_injection passes because the mock doesn't leak. Connect a real adapter to see full green.
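Safety checks like no_pii_leak and no_prompt_injection conceptually boil down to scanning the agent's output for suspicious patterns. A rough TypeScript sketch of that idea follows; the patterns are deliberately simplistic illustrations, not AgentProbe's actual scanner, and a real detector needs far broader coverage.

```typescript
// Rough sketch of output scanning behind no_pii_leak / no_prompt_injection
// style checks. Patterns are illustrative only, not production-grade.
const PII_PATTERNS: RegExp[] = [
  /\b\d{3}-\d{2}-\d{4}\b/,        // US SSN-like number
  /\b\d{16}\b/,                   // bare 16-digit card number
  /\b[\w.+-]+@[\w-]+\.[\w.]+\b/,  // email address
];

const INJECTION_MARKERS: RegExp[] = [
  /ignore (all )?previous instructions/i,
  /reveal your system prompt/i,
];

function scanOutput(output: string): { piiLeak: boolean; injectionEcho: boolean } {
  return {
    piiLeak: PII_PATTERNS.some((p) => p.test(output)),
    injectionEcho: INJECTION_MARKERS.some((p) => p.test(output)),
  };
}

console.log(scanOutput("Your flight is booked."));
// { piiLeak: false, injectionEcho: false }
console.log(scanOutput("Card on file: 4111111111111111"));
// { piiLeak: true, injectionEcho: false }
```

This also explains the mock-adapter behavior above: an empty output trivially contains no PII and no injection echo, so the safety assertion passes even with no LLM attached.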


How AgentProbe Compares

| Feature | AgentProbe | Promptfoo | DeepEval |
| --- | --- | --- | --- |
| Tool call assertions | ✅ 6 types | ❌ | ❌ |
| Chaos & fault injection | ✅ | ❌ | ❌ |
| Contract testing | ✅ | ❌ | ❌ |
| Multi-agent orchestration | ✅ | ❌ | ❌ |
| Trace record & replay | ✅ | ❌ | ❌ |
| Security scanning | ✅ PII, injection, system leak | ✅ Red teaming | ⚠️ Basic |
| LLM-as-Judge | ✅ Any model | ✅ | ✅ G-Eval |
| YAML test definitions | ✅ | ✅ | ❌ Python only |
| LLM adapters | ✅ 9 | ✅ Many | ✅ Many |
| CI/CD integration | ✅ JUnit, GH Actions | ✅ | ✅ |

TL;DR: Promptfoo tests prompts. DeepEval tests LLM outputs. AgentProbe tests agent behavior.


Key Features

| Feature | Description |
| --- | --- |
| 🎯 Tool Call Assertions | 6 types — tool_called, tool_called_with, no_tool_called, tool_call_order |
| 💥 Chaos Testing | Tool timeouts, malformed responses, rate limits, fault injection |
| 📜 Contract Testing | Enforce behavioral invariants across agent versions |
| 🤝 Multi-Agent Testing | Test handoff sequences in multi-agent orchestration |
| 🔴 Record & Replay | Record live sessions, generate tests, replay deterministically |
| 🛡️ Security Scanning | PII leak, prompt injection, system prompt exposure detection |
| 🧑‍⚖️ LLM-as-Judge | Use a stronger model to evaluate nuanced quality |
| 📊 HTML Reports | Self-contained dashboards with SVG charts |
| 🔄 Regression Detection | Compare against saved baselines, CI-friendly |
| 🤖 GitHub Action | Built-in reusable action for CI/CD pipelines |

📖 Full Documentation — 17+ assertion types, 9 adapters, 80+ CLI commands, examples, architecture
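Chaos testing deserves a sketch of its own: the idea is to wrap a tool so that some calls fail or stall, then assert the agent recovers gracefully. The TypeScript below is a simplified, synchronous illustration of the concept with made-up names; it is not AgentProbe's API, and real tools would be async.

```typescript
// Hypothetical sketch of chaos/fault injection for agent tools.
// Sync functions keep the sketch small; names are illustrative.
type Tool = (args: Record<string, unknown>) => string;

interface ChaosOptions {
  failRate: number;    // probability in [0, 1] that a call throws
  rng?: () => number;  // injectable randomness for deterministic tests
}

function withChaos(tool: Tool, opts: ChaosOptions): Tool {
  const rng = opts.rng ?? Math.random;
  return (args) => {
    if (rng() < opts.failRate) {
      throw new Error("chaos: injected tool failure");
    }
    return tool(args);
  };
}

const searchFlights: Tool = () => "3 flights found";

// rng pinned above failRate: the call always succeeds.
const stable = withChaos(searchFlights, { failRate: 0.5, rng: () => 0.9 });
console.log(stable({})); // "3 flights found"

// rng pinned below failRate: the call always throws, so a test can
// assert the agent's retry/fallback behavior deterministically.
const broken = withChaos(searchFlights, { failRate: 1, rng: () => 0 });
```

The injectable rng is the important design choice: it makes "random" failures reproducible in CI, which is what separates chaos testing from flaky tests.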


Roadmap

  • YAML behavioral testing · 17+ assertions · 9 adapters
  • Tool mocking · Chaos testing · Contract testing · Multi-agent
  • Record & replay · Security scanning · HTML reports · CI/CD
  • AWS Bedrock / Azure OpenAI adapters
  • VS Code extension · Web report portal

🌐 Ecosystem

| Project | Description |
| --- | --- |
| FinClaw | AI-native quantitative finance engine |
| ClawGuard | AI Agent Immune System — 285+ threat patterns, zero dependencies |
| AgentProbe | Playwright for AI Agents — test, record, replay agent behaviors |

Contributing

git clone https://github.com/NeuZhou/agentprobe.git
cd agentprobe && npm install && npm test

See CONTRIBUTING.md for guidelines.


License

MIT © NeuZhou


If your agents touch production data, they need tests. Not just prompts — behavior tests.

⭐ Star on GitHub · 📦 npm · 🐛 Report Bug