vercel-agent-browser

Browser automation CLI for AI agents - fast native Rust tool for controlling Chrome with accessibility-first commands

Skill file

Preview skill file↓↑

---
name: vercel-agent-browser
description: Browser automation CLI for AI agents - fast native Rust tool for controlling Chrome with accessibility-first commands
triggers:
  - automate a browser task
  - control chrome with agent-browser
  - take a screenshot of a webpage
  - fill out a form automatically
  - extract data from a webpage
  - interact with web elements
  - navigate and click on a website
  - scrape website content with browser automation
---

# Vercel Agent Browser Skill

> Skill by [ara.so](https://ara.so) — AI Agent Skills collection.

## Overview

`agent-browser` is a fast native Rust CLI for browser automation designed specifically for AI agents. It provides an accessibility-first approach with semantic selectors and ref-based element targeting, making it ideal for LLM-driven web automation.

**Key Features:**
- Native Rust performance with npm/Homebrew distribution
- Accessibility tree snapshots with stable element refs (`@e1`, `@e2`, etc.)
- Semantic locators (role, text, label, placeholder)
- AI chat mode for natural language control
- Batch command execution
- Network interception and HAR recording
- Chrome DevTools Protocol (CDP) streaming

## Installation

### Global (Recommended)

```bash
npm install -g agent-browser
agent-browser install  # Downloads Chrome for Testing
```

### Project Local

```bash
npm install agent-browser
npx agent-browser install
```

### Alternative Methods

```bash
# Homebrew (macOS)
brew install agent-browser
agent-browser install

# Cargo (Rust)
cargo install agent-browser
agent-browser install

# Linux with system dependencies
agent-browser install --with-deps
```

### Upgrading

```bash
agent-browser upgrade  # Auto-detects installation method
```

## Core Concepts

### Element References (@refs)

The most powerful feature for AI agents is the accessibility snapshot with stable refs:

```bash
# Get accessibility tree with element refs
agent-browser open https://example.com
agent-browser snapshot

# Output includes refs like:
# @e1 heading "Example Domain"
# @e2 link "More information..."
# @e3 textbox "Search" (placeholder)

# Use refs directly in commands
agent-browser click @e2
agent-browser fill @e3 "search query"
agent-browser get text @e1
```

### Traditional Selectors

Standard CSS selectors also work:

```bash
agent-browser click "#submit-button"
agent-browser fill "input[name='email']" "user@example.com"
agent-browser get text ".header h1"
```

### Semantic Locators

Find elements by their semantic meaning:

```bash
# By ARIA role
agent-browser find role button click --name "Submit"
agent-browser find role textbox fill "user@example.com" --name "Email"

# By visible text
agent-browser find text "Sign In" click
agent-browser find text "Welcome" click --exact

# By label
agent-browser find label "Password" fill "secret123"

# By placeholder
agent-browser find placeholder "Search..." type "rust cli"

# By test ID
agent-browser find testid "login-form" click
```

## Common Workflows

### Basic Navigation and Interaction

```bash
# Start browser and navigate
agent-browser open https://github.com/login

# Get page structure
agent-browser snapshot

# Fill login form using refs from snapshot
agent-browser fill @e5 "username"
agent-browser fill @e6 "password"
agent-browser click @e7

# Or use semantic locators
agent-browser find label "Username" fill "username"
agent-browser find label "Password" fill "password"
agent-browser find role button click --name "Sign in"

# Wait for navigation
agent-browser wait --url "**/dashboard"

# Take screenshot
agent-browser screenshot success.png
```

### Form Automation

```bash
agent-browser open https://forms.example.com

# Fill text inputs
agent-browser fill "#name" "John Doe"
agent-browser fill "#email" "john@example.com"

# Select dropdown
agent-browser select "#country" "United States"

# Check checkboxes
agent-browser check "#newsletter"
agent-browser check "#terms"

# Upload files
agent-browser upload "#resume" "/path/to/resume.pdf"

# Submit
agent-browser find role button click --name "Submit"

# Wait for success message
agent-browser wait --text "Thank you"
```

### Data Extraction

```bash
agent-browser open https://news.ycombinator.com

# Get page snapshot for structure
agent-browser snapshot -i  # Interactive mode shows full tree

# Extract specific content
agent-browser get text ".titleline > a"

# Get multiple elements with JavaScript
agent-browser eval "Array.from(document.querySelectorAll('.titleline > a')).map(a => a.textContent)"

# Get structured data as JSON
agent-browser eval "JSON.stringify({
  title: document.querySelector('title').textContent,
  links: Array.from(document.querySelectorAll('.titleline > a')).map(a => ({
    text: a.textContent,
    url: a.href
  }))
})"
```

### Batch Execution

Reduce overhead by running multiple commands in one invocation:

```bash
# Argument mode
agent-browser batch \
  "open https://example.com" \
  "snapshot -i" \
  "click @e2" \
  "wait 1000" \
  "screenshot result.png"

# With --bail to stop on first error
agent-browser batch --bail \
  "open https://login.example.com" \
  "fill #username user@example.com" \
  "fill #password ${PASSWORD}" \
  "click #submit" \
  "wait --url '**/dashboard'"

# JSON mode (piped from script)
echo '[
  ["open", "https://example.com"],
  ["snapshot"],
  ["click", "@e1"],
  ["screenshot"]
]' | agent-browser batch --json
```

### AI Chat Mode

Natural language browser control:

```bash
# Single-shot command
agent-browser chat "go to github.com and search for rust browser automation"

# Interactive REPL
agent-browser chat
# > navigate to example.com
# > click the first link
# > take a screenshot
# > exit
```

### Network Interception

```bash
# Block specific resources
agent-browser network route "**/*.jpg" --abort
agent-browser network route "https://ads.example.com/*" --abort

# Block resource types
agent-browser network route '*' --abort --resource-type script
agent-browser network route '*' --abort --resource-type image,media

# Mock API responses
agent-browser network route "https://api.example.com/user" --body '{
  "status": "ok",
  "data": {"id": 1, "name": "Test User"}
}'

# View network requests
agent-browser network requests
agent-browser network requests --filter api
agent-browser network requests --type fetch,xhr
agent-browser network requests --method POST
agent-browser network requests --status 200

# View specific request detail
agent-browser network request req_123abc

# HAR recording
agent-browser network har start
# ... perform actions ...
agent-browser network har stop capture.har
```

### Screenshot Strategies

```bash
# Basic screenshot (saves to temp if no path)
agent-browser screenshot

# Save to specific path
agent-browser screenshot ./screenshots/page.png

# Full page screenshot
agent-browser screenshot --full full-page.png

# Annotated with element labels
agent-browser screenshot --annotate labeled.png

# Custom directory and format
agent-browser screenshot --screenshot-dir ./shots --screenshot-format jpeg --screenshot-quality 80

# Element screenshot
agent-browser eval "document.querySelector('#main').screenshot()" -b > element.png
```

### Multi-Tab Operations

```bash
# List tabs
agent-browser tab

# Open new tab with label
agent-browser tab new --label docs https://docs.example.com

# Open link in new tab
agent-browser click "#external-link" --new-tab

# Switch to tab by label
agent-browser tab docs

# Switch to tab by ID
agent-browser tab t2

# Close tab
agent-browser tab close docs
agent-browser tab close t2
```

### Wait Strategies

```bash
# Wait for element to appear
agent-browser wait "#result"

# Wait for time (milliseconds)
agent-browser wait 2000

# Wait for text (substring match)
agent-browser wait --text "Success"

# Wait for URL pattern
agent-browser wait --url "**/dashboard"

# Wait for network idle
agent-browser wait --load networkidle

# Wait for JavaScript condition
agent-browser wait --fn "document.readyState === 'complete'"
agent-browser wait --fn "window.dataLoaded === true"

# Wait for element to disappear
agent-browser wait --fn "!document.body.innerText.includes('Loading...')"
agent-browser wait "#spinner" --state hidden
```

### Keyboard and Mouse Control

```bash
# Keyboard shortcuts
agent-browser press "Control+a"
agent-browser press "Control+c"
agent-browser press "Enter"
agent-browser press "Tab"

# Type with real keystrokes (at current focus)
agent-browser keyboard type "Hello World"

# Insert text without key events
agent-browser keyboard inserttext "Pasted content"

# Hold and release keys
agent-browser keydown "Shift"
agent-browser press "a"
agent-browser keyup "Shift"

# Mouse control
agent-browser mouse move 100 200
agent-browser mouse down left
agent-browser mouse up left
agent-browser mouse wheel 100 0  # dy dx

# Drag and drop
agent-browser drag "#source" "#target"
```

### Cookies and Storage

```bash
# Cookies
agent-browser cookies
agent-browser cookies set "sessionId" "abc123"
agent-browser cookies clear

# Import from cURL
agent-browser cookies set --curl curl-cookies.txt

# LocalStorage
agent-browser storage local
agent-browser storage local get "token"
agent-browser storage local set "token" "eyJ..."
agent-browser storage local clear

# SessionStorage
agent-browser storage session
agent-browser storage session set "tempData" '{"key":"value"}'
```

### Browser Configuration

```bash
# Set viewport
agent-browser set viewport 1920 1080
agent-browser set viewport 1920 1080 2  # Retina (2x scale)

# Emulate device
agent-browser set device "iPhone 14"

# Geolocation
agent-browser set geo 37.7749 -122.4194  # San Francisco

# Offline mode
agent-browser set offline on
agent-browser set offline off

# Custom headers
agent-browser set headers '{"X-Custom-Header":"value"}'

# HTTP auth
agent-browser set credentials username password

# Color scheme
agent-browser set media dark
agent-browser set media light
```

### Advanced JavaScript Evaluation

```bash
# Simple evaluation
agent-browser eval "document.title"

# Complex extraction
agent-browser eval "
  JSON.stringify({
    url: window.location.href,
    links: Array.from(document.querySelectorAll('a')).map(a => ({
      text: a.textContent.trim(),
      href: a.href
    })).filter(l => l.text)
  })
"

# Binary output (base64)
agent-browser eval "document.querySelector('canvas').toDataURL()" -b > canvas.png

# From stdin
echo "document.body.innerHTML" | agent-browser eval --stdin
```

## Automation Script Example

Create a reusable automation script:

```bash
#!/bin/bash
# login-and-extract.sh

set -e

SESSION_FILE=".browser-session"

# Start browser
agent-browser open https://app.example.com/login

# Login
agent-browser find label "Email" fill "${USER_EMAIL}"
agent-browser find label "Password" fill "${USER_PASSWORD}"
agent-browser find role button click --name "Sign in"

# Wait for dashboard
agent-browser wait --url "**/dashboard"
agent-browser wait 1000

# Extract data
DATA=$(agent-browser eval "
  JSON.stringify({
    user: document.querySelector('.user-name').textContent,
    notifications: Array.from(document.querySelectorAll('.notification')).map(n => n.textContent)
  })
")

echo "$DATA" | jq .

# Screenshot
agent-browser screenshot dashboard.png

# Cleanup
agent-browser close
```

Usage:

```bash
export USER_EMAIL="user@example.com"
export USER_PASSWORD="secret"
chmod +x login-and-extract.sh
./login-and-extract.sh
```

## Batch Automation Pattern

For complex multi-step workflows, use batch mode with a script:

```javascript
#!/usr/bin/env node
// automation.js

const commands = [
  ["open", "https://example.com"],
  ["snapshot"],
  // Process snapshot, determine next steps...
  ["click", "@e5"],
  ["wait", "1000"],
  ["screenshot", "step1.png"],
  ["find", "role", "button", "click", "--name", "Next"],
  ["wait", "--url", "**/step2"],
  ["screenshot", "step2.png"]
];

console.log(JSON.stringify(commands));
```

Run with:

```bash
node automation.js | agent-browser batch --json --bail
```

## CDP Streaming for Real-Time Monitoring

Connect to Chrome DevTools Protocol for event streaming:

```bash
# Get CDP WebSocket URL
CDP_URL=$(agent-browser get cdp-url)
echo "CDP URL: $CDP_URL"

# Or connect directly to CDP port
agent-browser connect 9222

# Enable runtime streaming
agent-browser stream enable --port 9223
agent-browser stream status
# ... perform actions, stream events to ws://localhost:9223 ...
agent-browser stream disable
```

## Troubleshooting

### Browser Not Found

```bash
# Re-download Chrome for Testing
agent-browser install

# Or specify custom Chrome path
export CHROME_PATH=/path/to/chrome
agent-browser open
```

### Elements Not Found

```bash
# Use snapshot to see available refs
agent-browser snapshot -i

# Wait for element before interacting
agent-browser wait "#dynamic-element"
agent-browser click "#dynamic-element"

# Use semantic selectors for robustness
agent-browser find role button click --name "Submit"
```

### Session Management

```bash
# List all active sessions
agent-browser tab

# Close specific session
agent-browser close

# Close all sessions
agent-browser close --all
```

### Debugging

```bash
# Get CDP URL for DevTools connection
agent-browser get cdp-url

# View network activity
agent-browser network requests

# Enable HAR recording
agent-browser network har start
# ... actions ...
agent-browser network har stop debug.har
```

### Clipboard Access Issues

On Linux, clipboard requires `xclip` or `xsel`:

```bash
sudo apt-get install xclip
# or
sudo apt-get install xsel
```

## Environment Variables

```bash
# Custom Chrome binary
export CHROME_PATH=/Applications/Brave Browser.app/Contents/MacOS/Brave Browser

# Custom user data directory
export CHROME_USER_DATA_DIR=~/.config/agent-browser

# Headless mode (default)
export CHROME_HEADLESS=true

# Show browser UI
export CHROME_HEADLESS=false
```

## Best Practices for AI Agents

1. **Use snapshots first**: Always get `agent-browser snapshot` to understand page structure before acting
2. **Prefer refs over selectors**: Use `@e1` style refs from snapshots for stability
3. **Use semantic locators**: `find role button --name "Submit"` is more resilient than CSS selectors
4. **Wait strategically**: Add `wait` commands for dynamic content before interacting
5. **Batch when possible**: Use `batch` mode to reduce per-command overhead
6. **Handle errors**: Use `--bail` in batch mode to stop on first error
7. **Clean up**: Always `close` sessions when done
8. **Screenshot for verification**: Take screenshots at key steps for debugging
9. **Use labels for tabs**: Assign meaningful labels when opening multiple tabs

## Quick Reference

```bash
# Essential commands for AI agents
agent-browser open <url>              # Start
agent-browser snapshot                # Get structure with refs
agent-browser click @eN               # Interact with ref
agent-browser find role button click --name "Text"  # Semantic
agent-browser wait --text "Success"   # Wait for content
agent-browser screenshot result.png   # Verify
agent-browser get text @eN            # Extract
agent-browser eval "JS"               # Complex extraction
agent-browser batch "cmd1" "cmd2"     # Multi-step
agent-browser close                   # Cleanup
```

Source

Creator's repository · aradotso/ai-agent-skills

View on GitHub ↗

Security

Security checks in progress

Results will appear here once audits complete

Checked by 3 independent security firms

Does it try to trick the AI?Not yet checkedPending · Gen Agent Trust Hub

Does it sneak in hidden code?Not yet checkedPending · Socket

Does it have known bugs?Not yet checkedPending · Snyk