Browser Agent

A lightweight browser automation agent powered by Google's Gemini AI and Playwright. Describe tasks in natural language, and the agent autonomously navigates websites, interacts with elements, and saves results.

Features

CLI + Library: Use as a command-line tool or Python library
LLM-Powered: Gemini function calling for autonomous decision-making
Task Generation: AI refines vague inputs into structured steps (--generate)
Full Browser Control: Navigate, click, type, take snapshots, save PDFs
Minimal Configuration: Only GOOGLE_API_KEY is required; every other setting has a sensible default

Installation

# Clone and setup
git clone https://github.com/spate141/browser-agent.git
cd browser-agent
uv venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate
uv pip install -e .
playwright install chromium

# Configure API key
cp .env.example .env
# Edit .env and add your GOOGLE_API_KEY

Get your Google API key from: https://aistudio.google.com/app/apikey

Environment Variables (all optional except GOOGLE_API_KEY):

GOOGLE_API_KEY - Required. Your Gemini API key
MODEL - Model name (default: gemini-3-flash-preview)
OUTPUT_DIR - PDF output directory (default: ./browser_output)
MAX_STEPS - Max automation steps (default: 50)
HEADLESS - Run invisibly (default: false)
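
The variables and defaults listed above could be read along these lines. This is only a minimal sketch: `AgentConfig` and `load_config` are hypothetical names, not part of the package, and the set of strings accepted as "true" for HEADLESS is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class AgentConfig:
    """Hypothetical container; the project's real config object may differ."""
    google_api_key: str
    model: str = "gemini-3-flash-preview"
    output_dir: str = "./browser_output"
    max_steps: int = 50
    headless: bool = False


def load_config(env: Optional[Mapping[str, str]] = None) -> AgentConfig:
    """Build a config from environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    api_key = env.get("GOOGLE_API_KEY", "").strip()
    if not api_key:
        # The only required variable; everything else falls back to a default.
        raise ValueError("GOOGLE_API_KEY not found")
    return AgentConfig(
        google_api_key=api_key,
        model=env.get("MODEL", "gemini-3-flash-preview"),
        output_dir=env.get("OUTPUT_DIR", "./browser_output"),
        max_steps=int(env.get("MAX_STEPS", "50")),
        # Assumption: "true", "1", or "yes" (case-insensitive) enable headless mode.
        headless=env.get("HEADLESS", "false").strip().lower() in {"true", "1", "yes"},
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to test without touching the real environment.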

Quick Start

Command Line (Recommended)

# Basic usage
browser-agent "Go to example.com and save as example.pdf"

# Let AI structure your task
browser-agent "find python documentation" --generate

# Configuration overrides
browser-agent "search for AI news" --headless --max-steps 20 --output-dir ./pdfs

# Quiet mode (only show final result)
browser-agent "Go to github.com" --quiet

Python Library

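import asyncio
from browser_agent import run_browser_agent

async def main():
    # run_browser_agent is a coroutine, so await it inside an async function
    result = await run_browser_agent(
        task="Go to example.com and save as example.pdf",
        max_steps=20,
        headless=True
    )
    print(f"Success: {result['success']}")

asyncio.run(main())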

Usage Examples

Task Generation with `--generate`

Let the AI refine vague inputs into structured steps:

# Vague input
browser-agent "find python docs" --generate

# AI generates:
# 1. Navigate to https://docs.python.org/3/
# 2. Take snapshot to verify page loaded
# 3. Save as python_docs.pdf

# Then executes automatically

Search and Save Results

CLI:

browser-agent "Go to DuckDuckGo, search for 'Recursive Language Models', save first result as rlm.pdf" --max-steps 30

Library:

# Run inside an async function (await is not valid at module top level)
result = await run_browser_agent(
    task="Go to DuckDuckGo, search for 'LLMs', save first result as llm.pdf",
    max_steps=30
)

Form Automation

browser-agent "Go to httpbin.org/forms/post, fill name field with 'John Doe', click submit, save result" --generate

Parallel Data Collection

# Collect from multiple sites concurrently
import asyncio
from browser_agent import run_browser_agent

async def main():
    sites = [
        ("HackerNews", "https://news.ycombinator.com"),
        ("ArXiv AI", "https://arxiv.org/list/cs.AI/recent"),
    ]

    tasks = [
        run_browser_agent(
            task=f"Go to {url}, save as {name}.pdf",
            output_dir=f"./data/{name}",
            verbose=False
        )
        for name, url in sites
    ]

    results = await asyncio.gather(*tasks)
    print(f"Completed: {sum(r['success'] for r in results)}/{len(sites)}")

asyncio.run(main())

CLI Reference

browser-agent "task description" [OPTIONS]

Options:
  -g, --generate           Refine task with AI into structured steps
  --headless               Run browser invisibly
  --max-steps N            Max automation iterations (default: 50)
  --output-dir DIR         PDF output directory (default: ./browser_output)
  --model NAME             Gemini model (default: gemini-3-flash-preview)
  -v, --verbose            Show step-by-step progress (default)
  -q, --quiet              Only show final result
  -h, --help               Show help message

Examples:

browser-agent "task"                              # Basic usage
browser-agent "vague task" --generate             # AI refines task
browser-agent "task" --headless --quiet           # Silent headless mode
browser-agent "task" --max-steps 20 --output-dir ./pdfs

Library API

`run_browser_agent(task, **kwargs)`

Parameters: task (str, required), output_dir, model, max_steps, headless, selector, verbose

Returns: {"success": bool, "message": str, "steps": int}

Example:

import asyncio
from browser_agent import run_browser_agent

async def main():
    result = await run_browser_agent(
        task="Go to python.org and save as PDF",
        headless=True,
        max_steps=20
    )
    print(result["message"])

asyncio.run(main())
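
To make the documented return shape easier to work with, a caller might type it and format it as below. This is a sketch only: `AgentResult` and `summarize` are hypothetical helpers, not exported by the library; the three keys are the ones listed under Returns.

```python
from typing import TypedDict


class AgentResult(TypedDict):
    """Mirrors the documented return value of run_browser_agent."""
    success: bool
    message: str
    steps: int


def summarize(result: AgentResult) -> str:
    """Render a one-line status string from a result dict."""
    status = "OK" if result["success"] else "FAILED"
    return f"[{status}] {result['message']} ({result['steps']} steps)"


# Example with a hand-written result dict (not real agent output)
print(summarize({"success": True, "message": "Saved example.pdf", "steps": 4}))
```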

How It Works

The agent uses Gemini's function calling to autonomously decide which browser actions to take:

Action	Description
`browser_navigate(url)`	Navigate to a URL
`browser_snapshot()`	Get page state with interactive elements
`browser_click(index)`	Click an element by index
`browser_type(index, text, submit)`	Type into input field
`browser_pdf(filename)`	Save current page as PDF
`browser_back()`	Navigate back
`browser_wait(seconds)`	Wait for specified time
`task_complete(summary)`	Mark task as complete
`task_failed(reason)`	Mark task as failed

The agent loops: Task → LLM decides action → Execute → Repeat until task completes or max steps reached.
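
The loop described above can be sketched without a browser or an LLM by passing in stand-ins for both. Everything here is illustrative: `run_loop`, `scripted_decide`, and `fake_tools` are hypothetical, the real agent's internals may differ, and counting the terminal action as a step is an assumption.

```python
from typing import Any, Callable, Dict, List, Tuple

Action = Tuple[str, Dict[str, Any]]  # (action name, keyword arguments)


def run_loop(
    task: str,
    decide: Callable[[str, List[str]], Action],
    tools: Dict[str, Callable[..., str]],
    max_steps: int = 50,
) -> Dict[str, Any]:
    """Ask `decide` for an action, execute it, and repeat until done or out of steps."""
    history: List[str] = []
    for step in range(1, max_steps + 1):
        name, args = decide(task, history)  # stands in for the LLM function call
        if name == "task_complete":
            return {"success": True, "message": args.get("summary", ""), "steps": step}
        if name == "task_failed":
            return {"success": False, "message": args.get("reason", ""), "steps": step}
        observation = tools[name](**args)  # stands in for the browser action
        history.append(f"{name}({args}) -> {observation}")
    return {"success": False, "message": "Max steps reached", "steps": max_steps}


def scripted_decide(task: str, history: List[str]) -> Action:
    """A fixed script in place of a model: navigate, save a PDF, then finish."""
    script: List[Action] = [
        ("browser_navigate", {"url": "https://example.com"}),
        ("browser_pdf", {"filename": "example.pdf"}),
        ("task_complete", {"summary": "Saved example.pdf"}),
    ]
    return script[len(history)]


fake_tools: Dict[str, Callable[..., str]] = {
    "browser_navigate": lambda url: f"loaded {url}",
    "browser_pdf": lambda filename: f"saved {filename}",
}

result = run_loop("Go to example.com and save as example.pdf", scripted_decide, fake_tools)
print(result)
```

Keeping the decision function and the tool table as parameters is what lets the same loop run against a scripted stub here and against a model with real browser actions elsewhere.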

Troubleshooting

Issue	Solution
`GOOGLE_API_KEY not found`	Create `.env` file with valid API key
`Chromium not found`	Run `playwright install chromium`
Task times out	Increase `--max-steps` or simplify task
Elements not found	Add explicit waits: "wait 3 seconds, then click"
API rate limits	Add delays between runs: `await asyncio.sleep(5)`
Bot detection / CAPTCHA	Use visible mode (default), try DuckDuckGo instead of Google, add human-like delays
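
For the rate-limit advice in the table above, one way to space out runs is a small sequential wrapper. This is a sketch under assumptions: `run_sequentially` is a hypothetical helper, and the runner is passed in as a parameter so the example does not depend on the package being installed.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def run_sequentially(
    tasks: List[str],
    runner: Callable[..., Awaitable[Dict[str, Any]]],
    delay_seconds: float = 5.0,
    **kwargs: Any,
) -> List[Dict[str, Any]]:
    """Run tasks one at a time, sleeping between runs to stay under API rate limits."""
    results: List[Dict[str, Any]] = []
    for i, task in enumerate(tasks):
        if i > 0:
            # Pause before every run except the first
            await asyncio.sleep(delay_seconds)
        results.append(await runner(task=task, **kwargs))
    return results
```

With the library installed, this would presumably be called as `asyncio.run(run_sequentially(["task one", "task two"], run_browser_agent, headless=True))`.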

Bot Detection Notes:

Agent includes stealth measures (realistic user-agent, masked automation flags)
Not guaranteed to bypass all detection (especially Google)
For production: use official APIs or automation-friendly sites

Writing Good Tasks:

# ✅ Good - Specific, explicit URLs, clear steps
"Go to amazon.com, search for 'wireless mouse', save first result as mouse.pdf"

# ❌ Too vague
"Find me a mouse"

# ✅ Good - Use --generate to refine vague inputs
browser-agent "find wireless mouse on amazon" --generate

License

MIT License - feel free to use in your projects!
