Send a single prompt to multiple coding agents running in parallel and compare the results. Each agent works in its own git worktree on a separate branch so they never interfere with each other. Optionally, configure LLM evaluators to review each agent's diff and drive an iterative refinement loop.
uv pip install -e ".[dev]"
# List built-in agents
agent-tester agents
# Run two agents on the same prompt
agent-tester run "Add unit tests for the auth module" --agents claude,aider
# Give the run a descriptive name (used in branch and report filenames)
agent-tester run "Refactor auth module" --agents claude,aider --name auth-refactor
# Use a prompt file
agent-tester run --prompt-file task.md --agents claude,codex,aider
# Keep worktrees for manual inspection
agent-tester run "Refactor logging" --agents claude,aider --keep-worktrees
- You provide a prompt and select agents
- AgentTester creates a git worktree + branch for each agent from the current HEAD
- All agents run concurrently, each in its own worktree
- Agent output streams to the terminal with colored prefixes
- A markdown comparison report is generated with diff stats and timing
- Worktrees are cleaned up (branches are preserved for git diff)
Branches are named agenttester/<agent-name>/<run-name> so you can compare results:
git diff agenttester/claude/auth-refactor agenttester/aider/auth-refactor
When no --name is given, a slug is derived from the first six words of the prompt plus a short hash (e.g. add-unit-tests-for-the-auth-a3f2c1).
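The slug rule can be sketched as follows. This is a minimal illustration, not AgentTester's actual code: the function name `derive_run_name` is hypothetical, and SHA-1 over the full prompt is an assumption, so the hash suffix it produces will not necessarily match the one the tool generates.

```python
import hashlib
import re


def derive_run_name(prompt: str, max_words: int = 6, hash_len: int = 6) -> str:
    """Build a branch-safe slug: first words of the prompt plus a short hash."""
    # Lowercase the prompt and keep only alphanumeric runs as words.
    words = re.findall(r"[a-z0-9]+", prompt.lower())[:max_words]
    # Assumption: SHA-1 of the full prompt; the real hash function may differ.
    digest = hashlib.sha1(prompt.encode("utf-8")).hexdigest()[:hash_len]
    return "-".join(words + [digest])


slug = derive_run_name("Add unit tests for the auth module")
print(slug)  # starts with "add-unit-tests-for-the-auth-"
```

Hashing the whole prompt rather than just the first six words keeps two prompts with the same opening from colliding on the same branch name.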
Copy config.example.yaml to agent-tester.yaml (or agent-tester.yml) in your target repo to customize agents. Built-in presets are available for claude, aider, and codex.
Auto-detected local config files must use a .yml or .yaml extension. The following names are checked in order:
agent-tester.yaml
agent-tester.yml
.agent-tester.yaml
.agent-tester.yml
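A lookup that honors this order might look like the sketch below. The helper name `find_local_config` is hypothetical; only the four filenames and their order come from the list above.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Candidate names, in the order they are checked.
CANDIDATES = (
    "agent-tester.yaml",
    "agent-tester.yml",
    ".agent-tester.yaml",
    ".agent-tester.yml",
)


def find_local_config(repo: Path) -> Path | None:
    """Return the first candidate that exists in the repo root, or None."""
    for name in CANDIDATES:
        candidate = repo / name
        if candidate.is_file():
            return candidate
    return None


# Demo in a throwaway directory: two candidates exist, the earlier name wins.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / ".agent-tester.yaml").write_text("# placeholder\n", encoding="utf-8")
    (repo / "agent-tester.yml").write_text("# placeholder\n", encoding="utf-8")
    found = find_local_config(repo)
    found_name = found.name if found else None
    print(found_name)  # agent-tester.yml
```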
You can also pass a config file explicitly — no extension required:
agent-tester run "Fix the bug" --agents claude --config /path/to/myconfig
A global config at ~/.config/agenttester/config.yml or ~/.config/agenttester/config.yaml is merged automatically. Local project config takes precedence over global, which takes precedence over built-in presets.
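The precedence rule amounts to layering dictionaries from lowest to highest priority. The sketch below assumes a field-level merge per agent (a local entry overrides individual fields rather than replacing the whole agent definition); the source does not say which granularity the tool uses. `merge_layers`, the `example-agent` command, and `EXAMPLE_FLAG` are hypothetical.

```python
from typing import Any

Layer = dict[str, dict[str, Any]]


def merge_layers(builtin: Layer, global_cfg: Layer, local_cfg: Layer) -> Layer:
    """Merge agent definitions; later layers override earlier ones."""
    merged: Layer = {}
    for layer in (builtin, global_cfg, local_cfg):  # lowest to highest priority
        for agent_name, fields in layer.items():
            # Assumption: fields are merged one by one, not replaced wholesale.
            merged[agent_name] = {**merged.get(agent_name, {}), **fields}
    return merged


builtin = {"claude": {"command": "example-agent {prompt}", "timeout": 600}}
global_cfg = {"claude": {"timeout": 900}}
local_cfg = {"claude": {"timeout": 120, "env": {"EXAMPLE_FLAG": "1"}}}

merged = merge_layers(builtin, global_cfg, local_cfg)
print(merged["claude"]["timeout"])  # 120
```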
Reports are written to ~/.config/agenttester/projects/<repo-name>/ by default. You can override this per-project:
Local config (agent-tester.yaml in your repo):
reports_dir: ~/my-reports/myproject
Global config (~/.config/agenttester/config.yml), per named project:
projects:
  myproject:
    reports_dir: ~/my-reports/myproject
Local config takes priority over the global projects: setting.
- {prompt} is replaced with the shell-escaped prompt text
- {prompt_file} is replaced with a path to a temp file containing the prompt
- If neither placeholder is present, the prompt is piped to the agent via stdin
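A rough sketch of how the three placeholder rules could be applied to a command template is shown here. `render_command` and the `example-agent` command are hypothetical, and the `.md` suffix on the temp file is an assumption; only the placeholder names and the stdin fallback come from the rules above.

```python
from __future__ import annotations

import os
import shlex
import tempfile


def render_command(template: str, prompt: str) -> tuple[str, str | None, str | None]:
    """Return (command, stdin_text, temp_path) for a command template."""
    has_file = "{prompt_file}" in template
    has_inline = "{prompt}" in template
    command, temp_path = template, None
    if has_file:
        # Write the prompt to a temp file; the caller removes it afterwards.
        with tempfile.NamedTemporaryFile(
            "w", suffix=".md", delete=False, encoding="utf-8"
        ) as handle:
            handle.write(prompt)
        temp_path = handle.name
        command = command.replace("{prompt_file}", shlex.quote(temp_path))
    if has_inline:
        # shlex.quote makes the prompt safe to embed in a shell command line.
        command = command.replace("{prompt}", shlex.quote(prompt))
    # With no placeholder at all, the prompt is handed over on stdin instead.
    stdin_text = None if (has_file or has_inline) else prompt
    return command, stdin_text, temp_path


command, stdin_text, temp_path = render_command(
    "example-agent --message {prompt}", "Fix the user's login bug"
)
print(command)
```

Quoting matters because prompts routinely contain apostrophes, backticks, or `$`, any of which would otherwise be interpreted by the shell.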
| Field | Description | Default |
|---|---|---|
| command | Shell command template | (required) |
| commit_style | auto (agent commits) or manual (agenttester commits) | auto |
| timeout | Max seconds before the agent is killed | 600 |
| env | Extra environment variables (key-value map) | {} |
Skills are markdown instruction files prepended to every agent prompt. They tell agents what they are allowed to do and how to behave. AgentTester ships with four built-in skills:
| Skill | Description |
|---|---|
| editing.md | Permission to read and edit files freely; look for reusable code before writing new code; prioritise readability |
| testing.md | Run the test suite and linter after making changes; don't mark a task complete until tests pass |
| git.md | Permitted git operations (branch, commit, push, pull, rebase); never push to the default branch |
| bash.md | Permitted bash operations scoped to code editing and testing; no system-level changes outside the worktree |
You can override any built-in skill or add new ones at two levels:
Global (~/.config/agenttester/skills/): applies to all projects.
Local (.agent-tester/skills/ inside your repo): applies to this project only.
A skill file with the same name as a built-in replaces it entirely. New filenames add additional instructions. Skills are always output in priority order — built-ins first, global skills second, local skills last — so user-defined instructions appear closest to the prompt and carry the most weight with the model.
~/.config/agenttester/skills/testing.md # overrides built-in testing skill globally
your-repo/.agent-tester/skills/testing.md # overrides for this project only
your-repo/.agent-tester/skills/style.md # adds a new skill for this project
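The override and ordering rules can be sketched as below. Function names are hypothetical, and one detail is an assumption: when a global or local file replaces a built-in, this sketch places it at the position of the level that supplied it, which the description above does not spell out.

```python
import tempfile
from pathlib import Path


def collect_skills(builtin_dir: Path, global_dir: Path, local_dir: Path) -> list[Path]:
    """Pick one file per skill name (highest level wins), ordered by level."""
    winners: dict[str, tuple[int, Path]] = {}
    for level, directory in enumerate((builtin_dir, global_dir, local_dir)):
        if not directory.is_dir():
            continue
        for path in sorted(directory.glob("*.md")):
            winners[path.name] = (level, path)  # same name: later level replaces
    ordered = sorted(winners.values(), key=lambda item: (item[0], item[1].name))
    return [path for _level, path in ordered]


def build_prompt(skills: list[Path], prompt: str) -> str:
    """Prepend skill texts so the user's prompt comes last."""
    sections = [path.read_text(encoding="utf-8").strip() for path in skills]
    return "\n\n".join(sections + [prompt])


# Demo: a global testing.md overrides the built-in; a local style.md is added.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    layout = {
        "builtin": ["editing.md", "testing.md"],
        "global": ["testing.md"],
        "local": ["style.md"],
    }
    for level_name, files in layout.items():
        (root / level_name).mkdir()
        for filename in files:
            (root / level_name / filename).write_text(
                f"{level_name}:{filename}", encoding="utf-8"
            )
    skills = collect_skills(root / "builtin", root / "global", root / "local")
    skill_names = [path.name for path in skills]
    full_prompt = build_prompt(skills, "Refactor logging")

print(skill_names)  # ['editing.md', 'testing.md', 'style.md']
```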
Configure one or more LLM evaluators to review each agent's diff after it runs. Multiple independent reviewers reduce the risk of hallucinated assessments, and an aggregate report is synthesized from all of them.
Add an evaluators block to your agent-tester.yaml:
evaluators:
  - name: claude
    api: anthropic # uses ANTHROPIC_API_KEY
    model: claude-opus-4-7
  - name: llama3
    endpoint: http://localhost:8004 # any OpenAI-compatible endpoint
    model: meta-llama/Meta-Llama-3-70B-Instruct
evaluation:
  inject_raw_reports: false # true → send raw reports instead of aggregate
  max_aggregate_tokens: 2000 # aggregate is summarized before injection if too long
Define a providers block to share credentials across multiple evaluators or REPL model agents. Each provider entry requires a type field. Model-level fields override the provider defaults.
Provider types
| type | Description | Install |
|---|---|---|
| openai | Any OpenAI-compatible endpoint (vLLM, etc.) | built-in |
| anthropic | Direct Anthropic Messages API | built-in |
| bedrock | AWS Bedrock Converse API | built-in (pip install agenttester[aws] for boto3 modes) |
| azure | Azure AI Foundry / Azure OpenAI Service | built-in |
| vertex | GCP Vertex AI (OpenAI-compatible endpoint) | built-in |
Each provider type reads credentials from a standard environment variable automatically — no api_key_env required unless you want to override the default. Override by adding api_key_env: MY_CUSTOM_VAR to any provider or evaluator entry.
| type | Default env var |
|---|---|
| openai | OPENAI_API_KEY |
| anthropic | ANTHROPIC_API_KEY |
| azure | AZURE_OPENAI_API_KEY |
| vertex | GOOGLE_API_KEY |
| bedrock (api_key mode) | BEDROCK_API_KEY |
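One way to picture the lookup: merge the provider entry with the evaluator or model entry, then fall back to the per-type default when api_key_env is absent. This is a sketch under that reading; `resolve_api_key` is a hypothetical helper and the demo environment is fake.

```python
from __future__ import annotations

import os
from typing import Mapping

# Defaults copied from the table above.
DEFAULT_KEY_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "azure": "AZURE_OPENAI_API_KEY",
    "vertex": "GOOGLE_API_KEY",
    "bedrock": "BEDROCK_API_KEY",  # applies to auth_method: api_key only
}


def resolve_api_key(
    provider: Mapping[str, str],
    entry: Mapping[str, str] | None = None,
    environ: Mapping[str, str] = os.environ,
) -> str | None:
    """Return the API key for a provider, honoring an api_key_env override."""
    # Evaluator/model-level fields override the provider defaults.
    merged = {**provider, **(entry or {})}
    env_name = merged.get("api_key_env") or DEFAULT_KEY_ENV.get(merged.get("type", ""))
    return environ.get(env_name) if env_name else None


fake_env = {"OPENAI_API_KEY": "default-key", "MY_CUSTOM_VAR": "custom-key"}
provider = {"type": "openai", "endpoint": "http://localhost:8004"}

default_key = resolve_api_key(provider, environ=fake_env)
custom_key = resolve_api_key(provider, {"api_key_env": "MY_CUSTOM_VAR"}, environ=fake_env)
print(default_key, custom_key)  # default-key custom-key
```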
OpenAI-compatible providers (generic)
providers:
  my-openai:
    type: openai
    endpoint: http://localhost:8004
    # reads OPENAI_API_KEY automatically; set api_key_env to override
evaluators:
  - name: llama3
    provider: my-openai
    model: meta-llama/Meta-Llama-3-70B-Instruct
AWS Bedrock
Four auth modes via auth_method:
- auth_method: api_key reads BEDROCK_API_KEY (or the api_key_env override) and sends it as Authorization: Bearer. Use with AWS Bedrock API keys or Bedrock-compatible HTTP proxies. No boto3 required.
- auth_method: profile uses aws_profile (a named ~/.aws/config entry: SSO, assumed roles, etc.). Requires pip install agenttester[aws].
- auth_method: keys reads aws_access_key_id_env / aws_secret_access_key_env. Requires pip install agenttester[aws].
- auth_method: default (the default) uses the standard boto3 credential chain (env vars, ~/.aws/credentials, IAM instance role). Requires pip install agenttester[aws].
providers:
  # API key: reads BEDROCK_API_KEY; no boto3 required
  bedrock-apikey:
    type: bedrock
    region: us-east-1
    auth_method: api_key
  # Named AWS CLI profile (SSO, assumed roles, etc.)
  bedrock-sso:
    type: bedrock
    region: us-east-1
    auth_method: profile
    aws_profile: my-sso-profile
  # Explicit credentials from environment variables
  bedrock-keys:
    type: bedrock
    region: us-east-1
    auth_method: keys
    aws_access_key_id_env: MY_AWS_KEY_ID
    aws_secret_access_key_env: MY_AWS_SECRET
    aws_session_token_env: MY_AWS_TOKEN # optional
  # Default boto3 credential chain
  bedrock-default:
    type: bedrock
    region: us-east-1
evaluators:
  - name: claude-bedrock
    provider: bedrock-sso
    model: anthropic.claude-3-5-sonnet-20241022-v2:0
Azure AI Foundry
Two auth modes:
- auth_method: api_key (default) reads AZURE_OPENAI_API_KEY (or the api_key_env override) and sends it as an api-key header.
- auth_method: cli runs az account get-access-token to obtain an Entra ID Bearer token. Requires the Azure CLI and az login.
providers:
  my-azure:
    type: azure
    endpoint: https://my-resource.openai.azure.com
    # reads AZURE_OPENAI_API_KEY automatically; use auth_method: cli for Entra ID
evaluators:
  - name: gpt-4o
    provider: my-azure
    model: gpt-4o
GCP Vertex AI
Two auth modes:
- auth_method: api_key (default) reads GOOGLE_API_KEY (or the api_key_env override) and sends it as Authorization: Bearer.
- auth_method: cli runs gcloud auth print-access-token. Requires the Google Cloud SDK and gcloud auth login.
providers:
  my-vertex:
    type: vertex
    endpoint: https://us-central1-aiplatform.googleapis.com/v1beta1/projects/my-project/locations/us-central1/endpoints/openapi
    # reads GOOGLE_API_KEY automatically; use auth_method: cli for ADC
evaluators:
  - name: gemini
    provider: my-vertex
    model: google/gemini-2.0-flash-001
CLI tokens (Azure and GCP) are cached for 55 minutes to avoid extra subprocesses on every request.
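The caching behavior can be illustrated with a small time-to-live wrapper around the CLI call. This is a sketch, not the project's implementation: `CachedToken` and `gcloud_token` are hypothetical names, and the injectable clock exists only to make the example easy to test.

```python
import subprocess
import time
from typing import Callable, Optional


class CachedToken:
    """Cache a token and refetch it only after the TTL has elapsed."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 55 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            # Cache miss or expired entry: spawn the CLI once and remember the result.
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token


def gcloud_token() -> str:
    """Fetch a token via the gcloud CLI (requires a prior gcloud auth login)."""
    result = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Nothing is executed until .get() is called for the first time.
vertex_token = CachedToken(gcloud_token)
```

A 55-minute window presumably leaves a safety margin below the one-hour lifetime that such access tokens typically have, so a cached token should not expire mid-request.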
REPL models support any provider type through a models: section that accepts the same provider references as evaluators:
models:
  claude-bedrock:
    provider: bedrock-sso # references a named bedrock provider
    model: anthropic.claude-3-5-sonnet-20241022-v2:0
  azure-gpt4o:
    provider: my-azure # references a named azure provider
    model: gpt-4o
  gemini:
    provider: my-vertex # references a named vertex provider
    model: google/gemini-2.0-flash-001
  local-llm:
    endpoint: http://localhost:8001 # inline OpenAI-compatible endpoint
    model: meta-llama/Meta-Llama-3-8B-Instruct
    api_key_env: MY_KEY # optional; overrides the default OPENAI_API_KEY
Agent entries whose command matches agent-tester query <endpoint> <model> {prompt} are also discovered automatically for backward compatibility.
After each iteration, each evaluator independently critiques every agent's diff for:
- Accuracy — does the code implement what was asked?
- Readability — is it clear and well-named?
- Code smells — duplication, dead code, poor design
- Correctness — bugs, missed edge cases, unsafe patterns
An aggregate assessment is then synthesized across evaluators. The terminal shows the aggregate; raw per-evaluator reports are preserved in the markdown report.
When evaluators are configured, AgentTester enters a refinement loop:
- Agents run and commit their changes (iter-1 commit message)
- Evaluators review each agent's diff
- You select which agents to re-run (1–all, or press Enter to stop)
- Selected agents re-run with the aggregate feedback injected into their prompt
- New commits are appended to the same branch (iter-2, iter-3, …)
- New evaluator reports are generated for each iteration
All iterations land on the same branch — use git log to see the progression.
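The control flow of the loop can be sketched with plain callables standing in for the real pieces. Everything named here (`refinement_loop`, `run_agent`, `evaluate`, `choose`, the "Reviewer feedback" wording) is hypothetical, and the sketch runs agents sequentially for brevity even though the tool runs them concurrently.

```python
from typing import Callable


def refinement_loop(
    agents: list[str],
    prompt: str,
    run_agent: Callable[[str, str, str], None],
    evaluate: Callable[[str], str],
    choose: Callable[[list[str]], list[str]],
) -> int:
    """Run agents, collect feedback, and repeat for whichever agents are chosen."""
    feedback: dict[str, str] = {}
    selected = list(agents)
    iteration = 0
    while selected:
        iteration += 1
        for agent in selected:
            agent_prompt = prompt
            if agent in feedback:
                # Re-runs get the aggregate feedback appended to the prompt.
                agent_prompt = f"{prompt}\n\nReviewer feedback:\n{feedback[agent]}"
            run_agent(agent, agent_prompt, f"iter-{iteration}")
        for agent in selected:
            feedback[agent] = evaluate(agent)
        # An empty selection (pressing Enter) ends the loop.
        selected = choose(agents)
    return iteration


# Demo: re-run only "aider" once, then stop.
runs = []
picks = iter([["aider"], []])
iterations = refinement_loop(
    agents=["claude", "aider"],
    prompt="Add unit tests",
    run_agent=lambda agent, text, message: runs.append(
        (agent, message, "Reviewer feedback" in text)
    ),
    evaluate=lambda agent: f"aggregate review for {agent}",
    choose=lambda available: next(picks),
)
print(iterations)  # 2
```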
For querying and comparing multiple models interactively, with persistent conversation history and tool use:
agent-tester # open REPL (auto-discovers agent-tester.yaml)
agent-tester --resume <SESSION_ID> # resume a previous session
agent-tester repl --config custom.yaml # explicit config path
agent-tester repl --workdir /path/to/repo # enable tool use with a target repo
The REPL fans out each prompt to all configured models in parallel and maintains separate
conversation history per model. Tab-completes model names after @ and slash-commands.
Prompt history is persisted across invocations in ~/.config/agenttester/repl_history.
| Command | Description |
|---|---|
| /reset | Clear conversation history for all models |
| /status | Show which models are running or idle |
| /stop [@model …] | Cancel a running model. Without a tag, stops all busy models. |
| /interrupt [@model …] <message> | Cancel a running model and immediately re-dispatch with <message>. Without a tag, interrupts all busy models. |
| /report | Show each model's git commits, diff stats, and token usage |
| /evaluate [m1,m2,…] | Cross-evaluate: each model reviews the others' work. Optionally pass a comma-separated list to limit which models act as reviewers. Evaluation documents are saved as Markdown to .agenttester/evaluations/<session>/. |
| /iterate <prompt> | After /evaluate, inject each model's peer evaluations as context and send an iteration prompt. Shows a per-model plan and requires y confirmation before sending. |
Use @modelname message to address a single model. Use exit or Ctrl-C to quit.
A session is always created automatically. A UUID is generated when no --session is
passed. The session ID is printed at startup and in the exit message:
Session: 3f2a1b4c-8d9e-4f0a-b1c2-d3e4f5a6b7c8
...
bye — agent-tester --resume 3f2a1b4c-8d9e-4f0a-b1c2-d3e4f5a6b7c8
Each model's conversation history is saved on exit to
~/.config/agenttester/sessions/<session-id>.yaml and restored on resume.
List previous sessions:
agent-tester sessions # human-readable, newest first
agent-tester sessions --yaml # machine-readable YAML
Each session entry shows its date, start/end times, and associated branches with their
availability (local, remote, local,remote, or unknown).
The main REPL shows brief per-model status (✓ model: done, ✗ model: error). To see
the full context — every prompt, tool call, and response — open a second terminal:
agent-tester watch --session <SESSION_ID> --model <MODEL_NAME>
The watcher tail-follows the model's event log at
~/.config/agenttester/sessions/<session-id>/events/<model>.jsonl and renders each event
with Rich as it arrives. You can open one watcher per model and keep the main REPL for
sending prompts.
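Tail-following a JSONL log comes down to repeatedly reading whatever complete lines have been appended since the last read. The sketch below shows that core step only; `read_new_events` is a hypothetical helper, the `kind` field is an invented event shape (the real schema is not documented here), and rendering with Rich is left out.

```python
import json
import tempfile
from pathlib import Path
from typing import IO


def read_new_events(handle: IO[str]) -> list[dict]:
    """Parse every complete JSON line appended since the previous call."""
    events = []
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line:
            break  # reached the current end of the file
        if not line.endswith("\n"):
            # Partially written line: rewind and retry on the next call.
            handle.seek(position)
            break
        if line.strip():
            events.append(json.loads(line))
    return events


# Demo: the second event is only half written during the first read.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "model.jsonl"
    log.write_text('{"kind": "prompt"}\n{"kind": "tool', encoding="utf-8")
    with log.open(encoding="utf-8") as handle:
        first = read_new_events(handle)
        with log.open("a", encoding="utf-8") as writer:
            writer.write('_call"}\n')
        second = read_new_events(handle)

print(first, second)  # [{'kind': 'prompt'}] [{'kind': 'tool_call'}]
```

A watcher would call this in a loop with a short sleep between polls, rendering each returned event as it arrives.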
Pass --workdir <dir> to enable an agent loop for OpenAI-compatible and Anthropic models.
Each model gains access to bash, read_file, write_file, git_clone, git_commit,
and git_push tools. When --workdir is a git repo, each model works in its own clone
under .agenttester/worktrees/<session-id>/ on a dedicated branch.
Before the first prompt is dispatched, all models negotiate a branch name in up to two rounds (silent LLM calls that don't affect conversation history). The agreed name is combined with a short session hash:
agenttester/<model-name>/<8-char-session>-<feature-name>
The branch is created lazily on the first write and reused for all subsequent prompts in the same session. On session resume, previously negotiated branch names are restored from the session record so models continue on the same branches without re-negotiating.
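Composing a branch name in the format above might look like this. Two details are assumptions: the 8-character session component is taken here as the first eight hex characters of the session ID (the tool may hash the ID instead), and the feature name is slugified with a simple regex. `branch_name` is a hypothetical helper.

```python
import re


def branch_name(model: str, session_id: str, feature: str) -> str:
    """Compose agenttester/<model-name>/<8-char-session>-<feature-name>."""
    # Assumption: use the leading 8 characters of the session ID, hyphens removed.
    short_session = session_id.replace("-", "")[:8]
    # Reduce the negotiated feature name to lowercase words joined by hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", feature.lower()).strip("-")
    return f"agenttester/{model}/{short_session}-{slug}"


name = branch_name("claude", "3f2a1b4c-8d9e-4f0a-b1c2-d3e4f5a6b7c8", "Auth Refactor")
print(name)  # agenttester/claude/3f2a1b4c-auth-refactor
```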
If a model hits its output token limit mid-generation, the loop automatically sends
"Continue from where you left off." and appends the continuation to the same response.
Use --pem <path> to authenticate git operations over SSH. Combine flags for a full
multi-model coding workflow:
agent-tester repl \
  --session sprint-42 \
  --workdir ~/dev/my-project \
  --pem ~/.ssh/deploy_key
Remove branches from old sessions interactively:
agent-tester cleanup # scans CWD repo for agenttester/* branches
agent-tester cleanup --workdir /path/to/repo
The command walks you through two phases (select sessions to delete entirely, then pick individual model branches from the remaining sessions), then asks whether to delete locally, remotely, or both before executing. Session records (history, reports, eval results) are preserved unless you explicitly approve their deletion.
Config resolution follows the same priority as run: global config first, then local
(or explicit) config, with local taking precedence on conflicts.
See config.example.yaml for full configuration examples.
uv pip install -e ".[dev]"
ruff check src/ tests/
ruff format src/ tests/
pytest
Provider API keys are forwarded automatically from the host environment. Set any of ANTHROPIC_API_KEY, AZURE_OPENAI_KEY, VERTEX_TOKEN, BEDROCK_API_KEY, or the standard AWS_* variables before running.
# Open REPL against the current directory
docker compose run --rm agent-tester repl --workdir /repo
# Open REPL against a different repo
REPO_PATH=/path/to/repo docker compose run --rm agent-tester repl --workdir /repo
# Pass a custom config
REPO_PATH=/path/to/repo docker compose run --rm agent-tester repl \
  --workdir /repo --config /repo/agent-tester.yaml
import asyncio
from pathlib import Path

from rich.console import Console

from agenttester import Orchestrator, load_config
from agenttester.config import get_reports_dir


async def main():
    repo = Path(".").resolve()
    # Load agent definitions (built-in presets merged with global and local config).
    agents = load_config()
    selected = [agents["claude"], agents["aider"]]
    orch = Orchestrator(repo, Console(), get_reports_dir(repo))
    # Run the selected agents on the same prompt and collect their results.
    results = await orch.run("Add unit tests", selected, run_name="add-tests")
    for r in results:
        print(f"{r.agent_name}: exit={r.exit_code} duration={r.duration:.1f}s")


asyncio.run(main())