A lightweight browser automation agent powered by Google's Gemini AI and Playwright. Describe tasks in natural language, and the agent autonomously navigates websites, interacts with elements, and saves results.
- CLI + Library: Use as a command-line tool or Python library
- LLM-Powered: Gemini function calling for autonomous decision-making
- Task Generation: AI refines vague inputs into structured steps (--generate)
- Full Browser Control: Navigate, click, type, take snapshots, save PDFs
- Zero Configuration: Works out of the box with sensible defaults
# Clone and setup
git clone https://github.com/spate141/browser-agent.git
cd browser-agent
uv venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
uv pip install -e .
playwright install chromium
# Configure API key
cp .env.example .env
# Edit .env and add your GOOGLE_API_KEY

Get your Google API key from: https://aistudio.google.com/app/apikey
Environment Variables (only GOOGLE_API_KEY is required; the rest are optional and have defaults):
- GOOGLE_API_KEY - Required. Your Gemini API key
- MODEL - Model name (default: gemini-3-flash-preview)
- OUTPUT_DIR - PDF output directory (default: ./browser_output)
- MAX_STEPS - Max automation steps (default: 50)
- HEADLESS - Run invisibly (default: false)
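
To make the defaults concrete, here is a minimal sketch of how settings like these could be resolved from the environment. It is not the project's actual loader: `AgentConfig` and `load_config` are hypothetical names, the set of accepted truthy strings for HEADLESS is an assumption, and the sketch assumes the contents of .env have already been loaded into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class AgentConfig:
    """Hypothetical container for the settings listed above."""
    api_key: str
    model: str = "gemini-3-flash-preview"
    output_dir: str = "./browser_output"
    max_steps: int = 50
    headless: bool = False


def load_config(env: Optional[Mapping[str, str]] = None) -> AgentConfig:
    """Build a config from environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    api_key = env.get("GOOGLE_API_KEY", "").strip()
    if not api_key:
        # GOOGLE_API_KEY is the only setting without a default.
        raise ValueError("GOOGLE_API_KEY not found; add it to your .env file")
    return AgentConfig(
        api_key=api_key,
        model=env.get("MODEL", "gemini-3-flash-preview"),
        output_dir=env.get("OUTPUT_DIR", "./browser_output"),
        max_steps=int(env.get("MAX_STEPS", "50")),
        # Assumption: "true", "1", and "yes" (any case) count as enabled.
        headless=env.get("HEADLESS", "false").strip().lower() in {"true", "1", "yes"},
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to try in isolation, for example `load_config({"GOOGLE_API_KEY": "your-key", "MAX_STEPS": "20"})`.
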
# Basic usage
browser-agent "Go to example.com and save as example.pdf"
# Let AI structure your task
browser-agent "find python documentation" --generate
# Configuration overrides
browser-agent "search for AI news" --headless --max-steps 20 --output-dir ./pdfs
# Quiet mode (only show final result)
browser-agent "Go to github.com" --quiet

import asyncio
from browser_agent import run_browser_agent

async def main():
    result = await run_browser_agent(
        task="Go to example.com and save as example.pdf",
        max_steps=20,
        headless=True,
    )
    print(f"Success: {result['success']}")

# await only works inside a coroutine, so drive it with asyncio.run
asyncio.run(main())

Let the AI refine vague inputs into structured steps:
# Vague input
browser-agent "find python docs" --generate
# AI generates:
# 1. Navigate to https://docs.python.org/3/
# 2. Take snapshot to verify page loaded
# 3. Save as python_docs.pdf
# Then executes automatically

CLI:
browser-agent "Go to DuckDuckGo, search for 'Recursive Language Models', save first result as rlm.pdf" --max-steps 30

Library:
# Run inside an async function, or wrap the call with asyncio.run
result = await run_browser_agent(
    task="Go to DuckDuckGo, search for 'LLMs', save first result as llm.pdf",
    max_steps=30
)

browser-agent "Go to httpbin.org/forms/post, fill name field with 'John Doe', click submit, save result" --generate

# Collect from multiple sites concurrently
import asyncio
from browser_agent import run_browser_agent

async def main():
    sites = [
        ("HackerNews", "https://news.ycombinator.com"),
        ("ArXiv AI", "https://arxiv.org/list/cs.AI/recent"),
    ]
    tasks = [
        run_browser_agent(
            task=f"Go to {url}, save as {name}.pdf",
            output_dir=f"./data/{name}",
            verbose=False
        )
        for name, url in sites
    ]
    results = await asyncio.gather(*tasks)
    print(f"Completed: {sum(r['success'] for r in results)}/{len(sites)}")

asyncio.run(main())

browser-agent "task description" [OPTIONS]
Options:
  -g, --generate      Refine task with AI into structured steps
  --headless          Run browser invisibly
  --max-steps N       Max automation iterations (default: 50)
  --output-dir DIR    PDF output directory (default: ./browser_output)
  --model NAME        Gemini model (default: gemini-3-flash-preview)
  -v, --verbose       Show step-by-step progress (default)
  -q, --quiet         Only show final result
  -h, --help          Show help message

Examples:
browser-agent "task" # Basic usage
browser-agent "vague task" --generate # AI refines task
browser-agent "task" --headless --quiet # Silent headless mode
browser-agent "task" --max-steps 20 --output-dir ./pdfs

Parameters: task (str, required), output_dir, model, max_steps, headless, selector, verbose
Returns: {"success": bool, "message": str, "steps": int}
Example:
from browser_agent import run_browser_agent

# Run inside an async function, or wrap the call with asyncio.run
result = await run_browser_agent(
    task="Go to python.org and save as PDF",
    headless=True,
    max_steps=20
)

The agent uses Gemini's function calling to autonomously decide which browser actions to take:
| Action | Description |
|---|---|
| browser_navigate(url) | Navigate to a URL |
| browser_snapshot() | Get page state with interactive elements |
| browser_click(index) | Click an element by index |
| browser_type(index, text, submit) | Type into input field |
| browser_pdf(filename) | Save current page as PDF |
| browser_back() | Navigate back |
| browser_wait(seconds) | Wait for specified time |
| task_complete(summary) | Mark task as complete |
| task_failed(reason) | Mark task as failed |
The agent loops: Task → LLM decides action → Execute → Repeat until task completes or max steps reached.
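
The loop described above can be sketched in a few lines. This is an illustrative, synchronous simplification and not the project's implementation: `run_loop`, `scripted_decide`, and `fake_execute` are hypothetical names, the LLM is replaced by a scripted stand-in, no browser is involved, and the way steps are counted is an assumption. Only the action names and the shape of the returned dictionary come from this README.

```python
from typing import Any, Callable, Dict, List, Tuple

Action = Tuple[str, Dict[str, Any]]  # (action name, arguments)


def run_loop(
    task: str,
    decide: Callable[[str, List[str]], Action],
    execute: Callable[[str, Dict[str, Any]], str],
    max_steps: int = 50,
) -> Dict[str, Any]:
    """Ask `decide` for the next action, execute it, and repeat until done."""
    history: List[str] = []
    for step in range(1, max_steps + 1):
        name, args = decide(task, history)
        if name == "task_complete":
            return {"success": True, "message": args.get("summary", ""), "steps": step}
        if name == "task_failed":
            return {"success": False, "message": args.get("reason", ""), "steps": step}
        # Feed each observation back so the next decision can use it.
        history.append(f"{name} -> {execute(name, args)}")
    return {"success": False, "message": "max steps reached", "steps": max_steps}


# Scripted stand-in for the LLM: replays a fixed plan, one action per step.
plan: List[Action] = [
    ("browser_navigate", {"url": "https://example.com"}),
    ("browser_pdf", {"filename": "example.pdf"}),
    ("task_complete", {"summary": "Saved example.pdf"}),
]


def scripted_decide(task: str, history: List[str]) -> Action:
    return plan[len(history)]


def fake_execute(name: str, args: Dict[str, Any]) -> str:
    # A real executor would drive the browser; this one only reports success.
    return "ok"


result = run_loop("Go to example.com and save as example.pdf", scripted_decide, fake_execute)
print(result)  # {'success': True, 'message': 'Saved example.pdf', 'steps': 3}
```

Keeping the decision function and the executor as separate callables mirrors the split between the model choosing an action and the browser carrying it out, and it makes the stopping conditions (completion, failure, step limit) easy to see.
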
| Issue | Solution |
|---|---|
| GOOGLE_API_KEY not found | Create .env file with valid API key |
| Chromium not found | Run playwright install chromium |
| Task times out | Increase --max-steps or simplify task |
| Elements not found | Add explicit waits: "wait 3 seconds, then click" |
| API rate limits | Add delays between runs: await asyncio.sleep(5) |
| Bot detection / CAPTCHA | Use visible mode (default), try DuckDuckGo instead of Google, add human-like delays |
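
For the rate-limit row, one way to add delays between runs is a small helper that executes tasks one at a time and sleeps in between. This is a sketch, not part of the library: `run_sequentially` is a hypothetical name, and the `run` argument stands for any async callable that takes a task string, such as a thin wrapper around `run_browser_agent`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def run_sequentially(
    tasks: List[str],
    run: Callable[[str], Awaitable[Dict[str, Any]]],
    delay_seconds: float = 5.0,
) -> List[Dict[str, Any]]:
    """Run tasks one at a time, sleeping between runs to ease API rate limits."""
    results: List[Dict[str, Any]] = []
    for i, task in enumerate(tasks):
        if i > 0:
            await asyncio.sleep(delay_seconds)  # pause only between runs
        results.append(await run(task))
    return results
```

With the library installed, a call might look like `asyncio.run(run_sequentially(tasks, lambda t: run_browser_agent(task=t, headless=True)))`. Sequential execution trades throughput for fewer simultaneous API calls, which is the opposite choice from the concurrent `asyncio.gather` example.
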
Bot Detection Notes:
- Agent includes stealth measures (realistic user-agent, masked automation flags)
- Not guaranteed to bypass all detection (especially Google)
- For production: use official APIs or automation-friendly sites
Writing Good Tasks:
# ✅ Good - Specific, explicit URLs, clear steps
"Go to amazon.com, search for 'wireless mouse', save first result as mouse.pdf"
# ❌ Too vague
"Find me a mouse"
# ✅ Good - Use --generate to refine vague inputs
browser-agent "find wireless mouse on amazon" --generate

MIT License - feel free to use in your projects!
