spate141/browser-agent

Browser Agent

A lightweight browser automation agent powered by Google's Gemini AI and Playwright. Describe tasks in natural language, and the agent autonomously navigates websites, interacts with elements, and saves results.

Features

  • CLI + Library: Use as a command-line tool or Python library
  • LLM-Powered: Gemini function calling for autonomous decision-making
  • Task Generation: AI refines vague inputs into structured steps (--generate)
  • Full Browser Control: Navigate, click, type, take snapshots, save PDFs
  • Minimal Configuration: Only GOOGLE_API_KEY is required; every other setting has a sensible default

Installation

# Clone and setup
git clone https://github.com/spate141/browser-agent.git
cd browser-agent
uv venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate
uv pip install -e .
playwright install chromium

# Configure API key
cp .env.example .env
# Edit .env and add your GOOGLE_API_KEY

Get your Google API key from: https://aistudio.google.com/app/apikey

Environment Variables (all optional with defaults, except GOOGLE_API_KEY):

  • GOOGLE_API_KEY - Required. Your Gemini API key
  • MODEL - Model name (default: gemini-3-flash-preview)
  • OUTPUT_DIR - PDF output directory (default: ./browser_output)
  • MAX_STEPS - Max automation steps (default: 50)
  • HEADLESS - Run invisibly (default: false)
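
To make the defaults above concrete, here is a minimal sketch of how these variables could be resolved into settings. The `load_settings` helper, the `DEFAULTS` table, and the set of accepted truthy strings for HEADLESS are assumptions made for illustration; they are not taken from the project's code.

```python
import os

# Defaults copied from the list above; the loader itself is a hypothetical illustration.
DEFAULTS = {
    "MODEL": "gemini-3-flash-preview",
    "OUTPUT_DIR": "./browser_output",
    "MAX_STEPS": "50",
    "HEADLESS": "false",
}


def load_settings(env=None):
    """Resolve settings from an environment-like mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("GOOGLE_API_KEY")
    if not api_key:
        # The only variable without a default.
        raise RuntimeError("GOOGLE_API_KEY not found")
    merged = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return {
        "api_key": api_key,
        "model": merged["MODEL"],
        "output_dir": merged["OUTPUT_DIR"],
        "max_steps": int(merged["MAX_STEPS"]),
        # Assumption: "1", "true", and "yes" (any case) count as true.
        "headless": merged["HEADLESS"].strip().lower() in ("1", "true", "yes"),
    }
```

Passing a plain dict instead of relying on `os.environ` keeps the sketch easy to try without touching a real `.env` file.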

Quick Start

Command Line (Recommended)

# Basic usage
browser-agent "Go to example.com and save as example.pdf"

# Let AI structure your task
browser-agent "find python documentation" --generate

# Configuration overrides
browser-agent "search for AI news" --headless --max-steps 20 --output-dir ./pdfs

# Quiet mode (only show final result)
browser-agent "Go to github.com" --quiet

Python Library

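import asyncio
from browser_agent import run_browser_agent

async def main():
    # run_browser_agent is awaited, so it must be called from inside an async function
    result = await run_browser_agent(
        task="Go to example.com and save as example.pdf",
        max_steps=20,
        headless=True
    )
    print(f"Success: {result['success']}")

asyncio.run(main())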

Usage Examples

Task Generation with --generate

Let the AI refine vague inputs into structured steps:

# Vague input
browser-agent "find python docs" --generate

# AI generates:
# 1. Navigate to https://docs.python.org/3/
# 2. Take snapshot to verify page loaded
# 3. Save as python_docs.pdf

# Then executes automatically

Search and Save Results

CLI:

browser-agent "Go to DuckDuckGo, search for 'Recursive Language Models', save first result as rlm.pdf" --max-steps 30

Library:

# Call from inside an async function
result = await run_browser_agent(
    task="Go to DuckDuckGo, search for 'LLMs', save first result as llm.pdf",
    max_steps=30
)

Form Automation

browser-agent "Go to httpbin.org/forms/post, fill name field with 'John Doe', click submit, save result" --generate

Parallel Data Collection

# Collect from multiple sites concurrently
import asyncio
from browser_agent import run_browser_agent

async def main():
    sites = [
        ("HackerNews", "https://news.ycombinator.com"),
        ("ArXiv AI", "https://arxiv.org/list/cs.AI/recent"),
    ]

    tasks = [
        run_browser_agent(
            task=f"Go to {url}, save as {name}.pdf",
            output_dir=f"./data/{name}",
            verbose=False
        )
        for name, url in sites
    ]

    results = await asyncio.gather(*tasks)
    print(f"Completed: {sum(r['success'] for r in results)}/{len(sites)}")

asyncio.run(main())

CLI Reference

browser-agent "task description" [OPTIONS]

Options:
  -g, --generate           Refine task with AI into structured steps
  --headless               Run browser invisibly
  --max-steps N            Max automation iterations (default: 50)
  --output-dir DIR         PDF output directory (default: ./browser_output)
  --model NAME             Gemini model (default: gemini-3-flash-preview)
  -v, --verbose            Show step-by-step progress (default)
  -q, --quiet              Only show final result
  -h, --help               Show help message

Examples:

browser-agent "task"                              # Basic usage
browser-agent "vague task" --generate             # AI refines task
browser-agent "task" --headless --quiet           # Silent headless mode
browser-agent "task" --max-steps 20 --output-dir ./pdfs

Library API

run_browser_agent(task, **kwargs)

Parameters: task (str, required), output_dir, model, max_steps, headless, selector, verbose

Returns: {"success": bool, "message": str, "steps": int}

Example:

import asyncio
from browser_agent import run_browser_agent

async def main():
    result = await run_browser_agent(
        task="Go to python.org and save as PDF",
        headless=True,
        max_steps=20
    )
    print(result["message"])

asyncio.run(main())
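
Since the documented return value is a dict with `success`, `message`, and `steps`, a small formatter can turn it into a one-line summary for logs. This is a sketch only: `describe_result` is a hypothetical helper, and the sample values in the usage note are made up.

```python
def describe_result(result):
    """Format a result dict shaped like {"success", "message", "steps"} as one line."""
    status = "OK" if result.get("success") else "FAILED"
    steps = result.get("steps", 0)
    message = result.get("message", "")
    return f"[{status}] {steps} steps: {message}"
```

For example, `describe_result({"success": True, "message": "Saved example.pdf", "steps": 4})` yields `[OK] 4 steps: Saved example.pdf`.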

How It Works

The agent uses Gemini's function calling to autonomously decide which browser actions to take:

Action                             Description
browser_navigate(url)              Navigate to a URL
browser_snapshot()                 Get page state with interactive elements
browser_click(index)               Click an element by index
browser_type(index, text, submit)  Type into input field
browser_pdf(filename)              Save current page as PDF
browser_back()                     Navigate back
browser_wait(seconds)              Wait for specified time
task_complete(summary)             Mark task as complete
task_failed(reason)                Mark task as failed

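The agent runs in a loop: given the task, the LLM decides on the next action, the agent executes it, and the cycle repeats until the task is marked complete or failed, or until the maximum number of steps is reached.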
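
The loop described above can be sketched in a few lines. This is not the project's implementation: `run_agent_loop`, `scripted_decide`, and `fake_execute` are hypothetical names, the LLM is replaced by a fixed script, no browser is involved, and counting the terminal action as a step is an assumption.

```python
def run_agent_loop(task, decide, execute, max_steps=50):
    """Ask `decide` for the next action, execute it, and stop on a terminal action."""
    history = []
    for step in range(1, max_steps + 1):
        action, args = decide(task, history)
        if action == "task_complete":
            return {"success": True, "message": args.get("summary", ""), "steps": step}
        if action == "task_failed":
            return {"success": False, "message": args.get("reason", ""), "steps": step}
        # Non-terminal action: run it and record the observation for the next decision.
        observation = execute(action, args)
        history.append((action, args, observation))
    return {"success": False, "message": "max steps reached", "steps": max_steps}


def scripted_decide(task, history):
    """Stand-in for the LLM: replay a fixed sequence of actions."""
    script = [
        ("browser_navigate", {"url": "https://example.com"}),
        ("browser_pdf", {"filename": "example.pdf"}),
        ("task_complete", {"summary": "Saved example.pdf"}),
    ]
    return script[len(history)]


def fake_execute(action, args):
    """Stand-in for the browser: report what would have been run."""
    return f"ran {action}"


result = run_agent_loop("Go to example.com and save as example.pdf",
                        scripted_decide, fake_execute)
```

In a real agent, `decide` would be backed by a model call and `execute` would drive the browser; keeping both as plain callables makes the control flow easy to test in isolation.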

Troubleshooting

Issue                     Solution
GOOGLE_API_KEY not found  Create .env file with valid API key
Chromium not found        Run playwright install chromium
Task times out            Increase --max-steps or simplify task
Elements not found        Add explicit waits: "wait 3 seconds, then click"
API rate limits           Add delays between runs: await asyncio.sleep(5)
Bot detection / CAPTCHA   Use visible mode (default), try DuckDuckGo instead of Google, add human-like delays
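
For the rate-limit advice in the table, one option is to run tasks one after another with a pause between them instead of launching them all at once. The sketch below assumes a hypothetical helper, `run_with_delay`, that takes zero-argument callables returning coroutines; it is an illustration, not part of the library.

```python
import asyncio


async def run_with_delay(factories, delay_seconds=5.0):
    """Await each coroutine factory in order, sleeping between consecutive runs."""
    results = []
    for index, factory in enumerate(factories):
        if index > 0:
            # Pause before every run except the first.
            await asyncio.sleep(delay_seconds)
        results.append(await factory())
    return results
```

With the library, each factory could be something like `lambda: run_browser_agent(task="Go to example.com and save as example.pdf")`, and the whole batch would be started with `asyncio.run(run_with_delay(factories))`.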

Bot Detection Notes:

  • Agent includes stealth measures (realistic user-agent, masked automation flags)
  • Not guaranteed to bypass all detection (especially Google)
  • For production: use official APIs or automation-friendly sites

Writing Good Tasks:

# ✅ Good - Specific, explicit URLs, clear steps
"Go to amazon.com, search for 'wireless mouse', save first result as mouse.pdf"

# ❌ Too vague
"Find me a mouse"

# ✅ Good - Use --generate to refine vague inputs
browser-agent "find wireless mouse on amazon" --generate

License

MIT License - feel free to use in your projects!
