BrowserControl

BrowserControl is an MCP server that provides real browser automation for AI agents. It uses a vision-first approach: the agent clicks and types by referring to numbered elements, with no need for CSS selectors or XPath.


Rating: 2.5 points

Downloads: 5.7K

Updated: 2026-03-12


What is BrowserControl?

BrowserControl is an MCP (Model Context Protocol) server that gives AI agents full control of a browser. Unlike traditional text-based browser automation, BrowserControl takes a vision-first approach: the AI sees a screenshot of the web page in which interactive elements are marked with numbers, and it only has to say which number to click for the operation to be carried out. This is closer to how humans browse the web and makes it much simpler for an AI to interact with web pages.

How to use BrowserControl?

BrowserControl runs as an MCP server and can be integrated with any AI agent (such as Claude or Gemini) or IDE that supports MCP. After installation, the AI agent gains a set of browser control tools covering navigation, clicking, text input, form filling, tab management, and more. The AI identifies interactive elements by looking at the numbered screenshot and then calls the corresponding tool to perform the operation.

Applicable scenarios

BrowserControl is suitable for various scenarios that require AI to interact with web pages:

1. Web automation testing: Let AI automatically test website functions and processes.
2. Data collection and monitoring: Regularly visit websites to obtain the latest information.
3. Automated workflows: Automatically fill out forms, submit applications, and handle other repetitive tasks.
4. Web content analysis: Let AI browse and analyze web content.
5. User behavior simulation: Simulate the interaction process between real users and websites.

Main features

Vision-first approach (Set of Marks)

Each screenshot is automatically annotated with numbered markers on the interactive elements. The AI only needs to read the numbers and call the corresponding operation, without having to understand the page's HTML structure or CSS selectors.

Multi-tab management

Supports creating, switching, closing, and listing browser tabs, so the AI can move between multiple web pages and work across them.

Session and Cookie management

Provides tools for setting, getting, deleting, and clearing cookies, so login state can be maintained across sessions.

File upload support

Provides native file upload tools. AI can easily handle file upload forms on web pages without complex simulation operations.

Developer tool suite

Includes professional debugging tools such as console log viewing, network request monitoring, page error detection, and element inspection to help AI diagnose web page problems.

Session recording function

Supports recording a complete browser session and generating a replayable recording file, which makes it easier to debug and review what the AI did.

Dynamic viewport control

The browser window size can be adjusted at any time to simulate the display effects of different devices (such as mobile phones, tablets, desktops).

Persistent sessions

Automatically saves the browser state (cookies, localStorage, etc.), so the AI keeps its previous login state and browsing history after a restart.

Advantages

Intuitive and easy to use: The vision-first approach makes it simpler and more intuitive for AI to operate web pages.

Comprehensive functions: Provides a complete set of tools from basic navigation to advanced debugging.

Persistent and stable: Automatically saves session states to avoid repeated logins.

Fully local: All operations run on your own machine, with no cloud services required.

Zero cost: Open source and free, no usage fees.

Good compatibility: Supports all MCP-compatible AI agents and IDEs.

Limitations

Requires installation: Needs a Python environment and the Chromium browser.

Resource consumption: Running browser instances requires a certain amount of memory and CPU resources.

Vision dependence: Accuracy depends on the AI's visual recognition ability, so complex layouts may not be recognized correctly.

Learning curve: The AI needs to learn how to use the numbered marking system effectively.

How to use

Install BrowserControl

Install the BrowserControl package using pip or uv.

Run the MCP server

Start BrowserControl as an MCP server.

Configure the AI agent

Add the BrowserControl server to the configuration file of the AI agent (such as Claude Desktop).

Start using

Restart the AI agent. The AI can now use the browser control tools.

Usage examples

Web automation testing

Let the AI test a website's login function and verify that the login process works.

Data collection task

Let the AI visit a news website at regular intervals and collect the latest headlines and links.

Multi-step form filling

Let the AI fill out and submit a complex multi-page form.

Web debugging and diagnosis

Let the AI diagnose page loading problems and report errors.

Related resources

GitHub repository

The source code and latest version of BrowserControl

MCP protocol documentation

The official documentation of the Model Context Protocol

Playwright documentation

The browser automation framework that BrowserControl is built on

FastMCP documentation

The MCP server framework used by BrowserControl

Feedback

Report bugs or request new features

🚀 BrowserControl

BrowserControl is an MCP server that empowers your AI agent with full browser access through a vision-first approach. It allows the AI to truly "see", "click", "type", and "interact" with any website, just like a human, without relying on CSS selectors, XPath, or guesswork.

🚀 Quick Start

Installation

# Using pip
pip install browsercontrol

# Or with uv (recommended for faster installs)
uv add browsercontrol

# Chromium is auto-installed on first run—no extra steps needed!

Run the Server

# Using the CLI
browsercontrol

# Or as a Python module
python -m browsercontrol

# Or with FastMCP
fastmcp run browsercontrol.server:mcp

Connect to Your AI Agent

BrowserControl works with any MCP-compatible AI agent or IDE. The sections below show how to connect it on different platforms:

Claude Desktop

Add the following to your Claude configuration file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "browsercontrol": {
      "command": "browsercontrol"
    }
  }
}
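
If you would rather script this edit than make it by hand, the following is a minimal sketch of merging the entry above into an existing configuration without touching other servers. The helper names `add_browsercontrol` and `update_config_file` are hypothetical and not part of BrowserControl; the sketch assumes the file is plain JSON with a top-level `mcpServers` object, as in the snippet above.

```python
import json
from pathlib import Path


def add_browsercontrol(config: dict) -> dict:
    """Return a copy of `config` with a browsercontrol entry under mcpServers."""
    updated = dict(config)
    servers = dict(updated.get("mcpServers", {}))
    # setdefault keeps an existing browsercontrol entry (for example one with "env") intact
    servers.setdefault("browsercontrol", {"command": "browsercontrol"})
    updated["mcpServers"] = servers
    return updated


def update_config_file(path: Path) -> None:
    """Read the JSON config at `path` (if present), add the entry, and write it back."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_browsercontrol(config), indent=2), encoding="utf-8")
```

Call `update_config_file` with the path for your platform from the list above. The merge is kept separate from the file I/O so it can be checked on plain dictionaries first.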

Restart Claude Desktop, then ask:

"Go to GitHub and star the browsercontrol repo"

✦ Gemini CLI / Google AI Studio

If using the Gemini CLI or Google AI Studio with MCP support:

# Set up MCP configuration
export MCP_SERVERS='{"browsercontrol": {"command": "browsercontrol"}}'

# Or add to your Gemini config file

For Google AI Studio, configure in the MCP settings panel.

🔧 Cline (VS Code Extension)

Install the Cline extension.
Open Cline settings (gear icon).
Navigate to "MCP Servers".
Add a new server:

{
  "browsercontrol": {
    "command": "browsercontrol"
  }
}

🤖 Continue.dev (VS Code/JetBrains)

Add to your Continue configuration (~/.continue/config.json):

{
  "mcpServers": [
    {
      "name": "browsercontrol",
      "command": "browsercontrol"
    }
  ]
}

🎯 Cursor IDE

Open Cursor Settings.
Navigate to "Features" → "Model Context Protocol".
Add server configuration:

{
  "browsercontrol": {
    "command": "browsercontrol"
  }
}

🔌 Zed Editor

Add to your Zed settings (~/.config/zed/settings.json):

{
  "context_servers": {
    "browsercontrol": {
      "command": {
        "path": "browsercontrol"
      }
    }
  }
}

🐍 Custom Python Integration

Use the MCP Python SDK to integrate BrowserControl into your own agent:

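import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch BrowserControl as a subprocess and talk to it over stdio
server_params = StdioServerParameters(
    command="browsercontrol",
    args=[],
)


async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            # Perform the MCP handshake before any other request
            await session.initialize()

            # List available tools
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Call a tool by name with its arguments
            result = await session.call_tool("navigate_to", {
                "url": "https://github.com"
            })
            print(result.content)


if __name__ == "__main__":
    asyncio.run(main())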
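
Once a session is open, a multi-step task is just a sequence of `call_tool` requests. The sketch below shows one way to describe such a sequence as data and replay it. The tool names come from the tool tables in this README, but the argument keys (`element_id`, `text`) are assumed to match the parameter names listed there, and the URL and element numbers are hypothetical. `RecordingSession` is a stand-in used for a dry run, not part of the MCP SDK.

```python
import asyncio
from typing import Any

# Hypothetical login script: element numbers depend on the page's annotated screenshot
LOGIN_STEPS: list[tuple[str, dict[str, Any]]] = [
    ("navigate_to", {"url": "https://example.com/login"}),
    ("type_text", {"element_id": 3, "text": "user@example.com"}),
    ("click", {"element_id": 5}),
]


async def run_steps(session: Any, steps: list[tuple[str, dict[str, Any]]]) -> list[Any]:
    """Call each tool in order on an object that exposes an async call_tool()."""
    results = []
    for name, arguments in steps:
        results.append(await session.call_tool(name, arguments))
    return results


class RecordingSession:
    """Dry-run stand-in for ClientSession that only records the calls it receives."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, dict[str, Any]]] = []

    async def call_tool(self, name: str, arguments: dict[str, Any]) -> str:
        self.calls.append((name, arguments))
        return f"called {name}"


dry_run = RecordingSession()
print(asyncio.run(run_steps(dry_run, LOGIN_STEPS)))
```

In a real agent the element numbers would be chosen after each screenshot rather than fixed in advance, since the numbering can change whenever the page changes.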

🚀 Using with uv or pipx

If you use uv or pipx and the browsercontrol command is not on your PATH, launch it through uvx or pipx run instead:

{
  "mcpServers": {
    "browsercontrol": {
      "command": "uvx",
      "args": ["browsercontrol"]
    }
  }
}

Or with pipx:

{
  "mcpServers": {
    "browsercontrol": {
      "command": "pipx",
      "args": ["run", "browsercontrol"]
    }
  }
}
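
Which of these entries applies depends on what is available on your PATH. As a rough sketch, the choice can be automated with `shutil.which`. The function name `pick_launcher` and the preference order are assumptions made for illustration; the three command forms mirror the JSON entries shown in this README.

```python
import shutil
from typing import Callable, Optional

# Launchers in order of preference (the order itself is an assumption)
CANDIDATES = [
    ("browsercontrol", []),
    ("uvx", ["browsercontrol"]),
    ("pipx", ["run", "browsercontrol"]),
]


def pick_launcher(which: Callable[[str], Optional[str]] = shutil.which) -> Optional[dict]:
    """Return an mcpServers-style entry for the first launcher found on PATH."""
    for command, args in CANDIDATES:
        if which(command):
            entry: dict = {"command": command}
            if args:
                entry["args"] = list(args)
            return entry
    return None  # nothing usable was found


print(pick_launcher())
```

The `which` parameter exists only so the lookup can be replaced in tests; by default the real PATH is searched.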

🔧 Advanced Configuration

You can pass environment variables to customize BrowserControl:

{
  "mcpServers": {
    "browsercontrol": {
      "command": "browsercontrol",
      "env": {
        "BROWSER_HEADLESS": "false",
        "BROWSER_VIEWPORT_WIDTH": "1920",
        "BROWSER_VIEWPORT_HEIGHT": "1080",
        "LOG_LEVEL": "DEBUG"
      }
    }
  }
}

See Configuration for all available options.

✨ Features

❌ Traditional Approach

"Find the button with class 'btn-primary'
that contains 'Submit' and is inside
form#contact-form..."

Parse complex DOM structures.
Guess at CSS selectors.
No JavaScript support.
No login persistence.
No debugging tools.

✅ BrowserControl

"click(7)"

See the rendered page with numbered elements.
Just say "click 5" or "type in 3".
Full dynamic JavaScript support.
Persistent sessions across restarts.
Complete DevTools access.

🎯 The Secret: Set of Marks (SoM)

Every screenshot comes annotated with numbered red boxes on interactive elements:

Found 15 interactive elements:
  [1] button - Sign In
  [2] input - Search...
  [3] a - Products
  [4] a - Pricing
  [5] button - Get Started

Your agent sees the numbers and simply calls click(1) to sign in. No CSS selectors. No XPath. No guessing.
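
For agents that work from the text listing rather than the image, the listing can be turned into a lookup table. The following is a minimal sketch that assumes each line has the form `[<number>] <tag> - <label>`, as in the sample above; the real output format may differ, and `parse_elements` and `find_element` are hypothetical helpers, not BrowserControl functions.

```python
import re
from typing import Optional

# Assumed line format, taken from the sample listing: "[<number>] <tag> - <label>"
ELEMENT_LINE = re.compile(r"^\s*\[(\d+)\]\s+(\S+)\s+-\s+(.*)$")


def parse_elements(listing: str) -> dict[int, tuple[str, str]]:
    """Map each element number to its (tag, label) pair."""
    elements: dict[int, tuple[str, str]] = {}
    for line in listing.splitlines():
        match = ELEMENT_LINE.match(line)
        if match:  # header lines such as "Found 15 ..." do not match and are skipped
            number, tag, label = match.groups()
            elements[int(number)] = (tag, label.strip())
    return elements


def find_element(elements: dict[int, tuple[str, str]], label: str,
                 tag: Optional[str] = None) -> Optional[int]:
    """Return the lowest element number whose label contains `label`, else None."""
    wanted = label.lower()
    for number, (element_tag, element_label) in sorted(elements.items()):
        if wanted in element_label.lower() and (tag is None or tag == element_tag):
            return number
    return None


SAMPLE = """Found 15 interactive elements:
  [1] button - Sign In
  [2] input - Search...
  [3] a - Products
  [4] a - Pricing
  [5] button - Get Started
"""

elements = parse_elements(SAMPLE)
print(find_element(elements, "sign in"))  # the number to pass to click()
```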

🥊 Head-to-Head Comparison

Feature	BrowserControl	Playwright MCP	Stagehand	Browser-Use	AgentQL
Vision-First (SoM)	✅ Numbered boxes	❌ Text tree	⚠️ AI vision	⚠️ AI vision	❌ Selectors
Multi-Tab Support	✅ Full control	⚠️ Implicit	⚠️ Implicit	⚠️ Basic	❌ None
Cookie Management	✅ Direct tools	⚠️ JS only	⚠️ JS only	⚠️ Basic	❌ None
File Uploads	✅ Native tool	⚠️ Manual	❌ No	❌ No	❌ No
Developer Tools	✅ 8 tools	❌ None	❌ None	❌ None	❌ None
Session Recording	✅ Built-in	⚠️ Manual	❌ None	❌ None	❌ None
Persistent Sessions	✅ Automatic	⚠️ Manual	❌ None	❌ None	❌ None
Token Efficiency	✅ Tiny IDs	⚠️ Large tree	❌ Full images	❌ Full images	⚠️ Query results
100% Local/Offline	✅ Yes	✅ Yes	❌ Needs LLM API	❌ Needs LLM API	❌ Cloud only
Monthly Cost (1k actions)	$0	$0	~$30 - 50	~$20 - 40	~$50+

💪 Key Advantages

1. Multi-Tab Orchestration

Unlike tools that get "lost" when a new window opens, BrowserControl exposes explicit tab tools:

list_tabs() — See every open page, title, and URL.
switch_tab(index) — Multitask between different sites.
create_tab(url) — Open references or parallel workflows.

2. Session & Cookie Management

Stop struggling with login forms. Inject or inspect session state directly:

set_cookie() — Log in instantly by injecting an auth token.
get_cookies() — Debug session issues or export state.
clear_cookies() — Start fresh without clearing the whole profile.

3. Reliable File Uploads

Most AI agents fail when they encounter an <input type="file">. BrowserControl uses native browser engine hooks:

upload_file(id, path) — Just point at the button and the local file.

4. Developer Tools Suite

Debug like a pro with tools no one else provides:

get_console_logs()      # See browser errors
get_network_requests()  # Monitor API calls
get_page_errors()       # Catch JS exceptions
run_in_console(code)    # Debug in real-time
inspect_element(5)      # Get computed styles
get_page_performance()  # Core Web Vitals

5. Session Recording

start_recording()  →  Browse around  →  stop_recording()
                                              ↓
                               session_20260202.zip
                         (View with Playwright trace viewer)

6. Dynamic Viewport Control

Test responsive designs or emulate mobile screens on the fly:

set_viewport(width, height) — Change resolution without restarting.
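
As an illustration, a responsive check can be planned as a list of `set_viewport` and `screenshot` calls. This is a sketch only: the preset names are made up, the tablet size is an assumption, and the boolean argument types for `screenshot(annotate, full_page)` are inferred from the parameter names in the tool table. The other sizes are the ones that appear in the configuration sections of this README.

```python
from typing import Any

VIEWPORT_PRESETS: dict[str, tuple[int, int]] = {
    "mobile": (375, 812),      # mobile example from the Configuration section
    "tablet": (768, 1024),     # assumption, not taken from this README
    "default": (1280, 720),    # documented default viewport
    "full_hd": (1920, 1080),   # values used in Advanced Configuration
}


def viewport_arguments(preset: str) -> dict[str, int]:
    """Build the arguments for set_viewport(width, height) from a preset name."""
    width, height = VIEWPORT_PRESETS[preset]
    return {"width": width, "height": height}


def responsive_sweep(presets: list[str]) -> list[tuple[str, dict[str, Any]]]:
    """Plan one resize followed by one full-page screenshot per preset."""
    plan: list[tuple[str, dict[str, Any]]] = []
    for preset in presets:
        plan.append(("set_viewport", viewport_arguments(preset)))
        plan.append(("screenshot", {"annotate": False, "full_page": True}))
    return plan


for tool, arguments in responsive_sweep(["mobile", "default"]):
    print(tool, arguments)
```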

7. True Persistence

What Persists	BrowserControl	Others
Cookies	✅	❌
localStorage	✅	❌
Session tokens	✅	❌
Login state	✅	❌
Browser history	✅	❌

Result: Log in once, stay logged in across sessions.

🛠️ Available Tools

Navigation

Tool	Description
`navigate_to(url)`	Go to a URL
`go_back()`	Navigate back
`go_forward()`	Navigate forward
`refresh_page()`	Reload the page
`scroll(direction, amount)`	Scroll up/down/left/right

Interaction

Tool	Description
`click(element_id)`	Click element by number
`click_at(x, y)`	Click at coordinates
`type_text(element_id, text)`	Type into input field
`press_key(key)`	Press keyboard key (Enter, Tab, etc.)
`hover(element_id)`	Hover over element
`scroll_to_element(element_id)`	Scroll element into view
`upload_file(element_id, path)`	Upload a file to an input
`wait(seconds)`	Wait for page loading

Tab Management

Tool	Description
`create_tab(url)`	Open a new browser tab
`switch_tab(index)`	Switch to a tab by its index
`close_tab(index)`	Close a specific tab
`list_tabs()`	List all open tabs and URLs

Forms

Tool	Description
`select_option(element_id, option)`	Select dropdown option
`check_checkbox(element_id)`	Toggle checkbox
`upload_file(element_id, file_path)`	Upload file to input

Content Extraction

Tool	Description
`get_page_content()`	Get page as markdown
`get_text(element_id)`	Get element text
`get_page_info()`	Get URL and title
`run_javascript(script)`	Execute JavaScript
`screenshot(annotate, full_page)`	Take screenshot

Developer Tools

Tool	Description
`get_console_logs()`	Browser console output
`get_network_requests()`	API calls and responses
`get_page_errors()`	JavaScript errors
`run_in_console(code)`	Execute JS in console
`inspect_element(id)`	Element styles/properties
`get_cookies()`	List browser cookies
`set_cookie(name, value, ...)`	Set a cookie
`delete_cookie(name)`	Remove a cookie
`clear_cookies()`	Clear all cookies
`set_viewport(width, height)`	Change window size
`get_page_performance()`	Load times, Web Vitals

Recording

Tool	Description
`start_recording()`	Begin session recording
`stop_recording()`	Save recording
`take_snapshot()`	Save screenshot + HTML
`list_recordings()`	View saved sessions

⚙️ Configuration

Configure via environment variables:

Variable	Default	Description
`BROWSER_HEADLESS`	`true`	Run without visible window
`BROWSER_VIEWPORT_WIDTH`	`1280`	Viewport width in pixels
`BROWSER_VIEWPORT_HEIGHT`	`720`	Viewport height in pixels
`BROWSER_TIMEOUT`	`30000`	Navigation timeout (ms)
`BROWSER_USER_DATA_DIR`	`~/.browsercontrol/user_data`	Browser profile path
`BROWSER_EXTENSION_PATH`	—	Path to browser extension
`LOG_LEVEL`	`INFO`	Logging verbosity

Examples:

# Run with visible browser (for debugging)
BROWSER_HEADLESS=false browsercontrol

# Mobile viewport emulation
BROWSER_VIEWPORT_WIDTH=375 BROWSER_VIEWPORT_HEIGHT=812 browsercontrol

# Verbose logging
LOG_LEVEL=DEBUG browsercontrol
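
To make the defaults in the table concrete, here is a sketch of how such variables could be read into a typed settings object. It is not the project's actual `config.py`: `BrowserSettings` and `load_settings` are hypothetical names, and treating only the string `true` (case-insensitive) as enabling headless mode is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BrowserSettings:
    headless: bool
    viewport_width: int
    viewport_height: int
    timeout_ms: int
    user_data_dir: str
    extension_path: Optional[str]
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> BrowserSettings:
    """Build settings from environment variables, falling back to the documented defaults."""
    return BrowserSettings(
        # Assumption: any value other than "true" disables headless mode
        headless=env.get("BROWSER_HEADLESS", "true").strip().lower() == "true",
        viewport_width=int(env.get("BROWSER_VIEWPORT_WIDTH", "1280")),
        viewport_height=int(env.get("BROWSER_VIEWPORT_HEIGHT", "720")),
        timeout_ms=int(env.get("BROWSER_TIMEOUT", "30000")),
        user_data_dir=os.path.expanduser(
            env.get("BROWSER_USER_DATA_DIR", "~/.browsercontrol/user_data")
        ),
        extension_path=env.get("BROWSER_EXTENSION_PATH") or None,
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )


# Passing a plain dict instead of os.environ makes the behaviour easy to inspect
print(load_settings({"BROWSER_HEADLESS": "false", "BROWSER_VIEWPORT_WIDTH": "375"}))
```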

💻 Usage Examples

Web Research

You: "Go to Wikipedia and find out when Python was created"

Claude: *navigates to wikipedia.org*
        *types "Python programming language" in search*
        *clicks search result*
        *reads the page*

        "Python was created by Guido van Rossum and first released in 1991."

Debug a Web App

You: "Go to my app at localhost:3000 and check for any errors"

Claude: *navigates to localhost:3000*
        *calls get_console_logs()*

        "I found 2 errors:
         [ERROR] Uncaught TypeError: Cannot read property 'map' of undefined
         [ERROR] Failed to load resource: 404 /api/users"

        *calls get_network_requests()*

        "The /api/users endpoint returned 404. Check your API route."

Automated Testing with Recording

You: "Test the login flow on my site. Start recording."

Claude: *calls start_recording()*
        *navigates to site*
        *types email in element 3*
        *types password in element 4*
        *clicks Login button (element 5)*
        *verifies successful redirect*
        *calls stop_recording()*

        "Login test passed! Recording saved to session_20260202.zip"

Form Automation

You: "Fill out the contact form on example.com"

Claude: *navigates to example.com/contact*
        *types name in element 2*
        *types email in element 3*
        *types message in element 4*
        *clicks Submit (element 5)*

        "Form submitted successfully!"

📚 Documentation

Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────┐
│   AI Agent      │────▶│  BrowserControl  │────▶│   Browser   │
│ (Claude/Gemini) │◀────│   MCP Server     │◀────│ (Chromium)  │
└─────────────────┘     └──────────────────┘     └─────────────┘
        │                        │                      │
        │   "click(5)"           │   mouse.click()      │
        │◀───────────────────────│◀─────────────────────│
        │   [annotated           │   [screenshot +      │
        │    screenshot]         │    element map]      │

How It Works

1. AI sends command: click(5).
2. Server finds element: Looks up element #5 from the last screenshot.
3. Browser acts: Clicks at the element's coordinates.
4. Capture state: Takes new screenshot, detects elements.
5. Annotate: Draws numbered boxes on interactive elements.
6. Return to AI: Sends annotated image + element list.
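
The lookup step in this loop can be sketched in a few lines. The code below is an illustration of the idea only, not BrowserControl's implementation: `Box`, `number_elements`, and `resolve_click` are hypothetical names, and clicking at the center of the bounding box is an assumption about how "the element's coordinates" are chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of one interactive element, in viewport pixels."""
    x: float
    y: float
    width: float
    height: float
    label: str

    def center(self) -> tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


def number_elements(boxes: list[Box]) -> dict[int, Box]:
    """Assign 1-based numbers in detection order, matching the drawn labels."""
    return {index: box for index, box in enumerate(boxes, start=1)}


def resolve_click(element_map: dict[int, Box], element_id: int) -> tuple[float, float]:
    """Translate click(element_id) into the point the browser should click."""
    if element_id not in element_map:
        raise KeyError(f"element {element_id} is not in the last screenshot")
    return element_map[element_id].center()


element_map = number_elements([
    Box(10, 10, 80, 30, "Sign In"),
    Box(100, 10, 200, 30, "Search..."),
])
print(resolve_click(element_map, 1))  # (50.0, 25.0)
```

Because the map is rebuilt after every action, a number is only meaningful for the most recent screenshot, which is why a stale number raises an error here instead of clicking somewhere unexpected.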

Project Structure

browsercontrol/
├── __init__.py          # Package exports
├── __main__.py          # CLI entry point
├── server.py            # MCP server setup
├── browser.py           # BrowserManager with SoM
├── config.py            # Environment configuration
└── tools/
    ├── navigation.py    # Navigation tools
    ├── interaction.py   # Click, type, hover tools
    ├── forms.py         # Form handling tools
    ├── content.py       # Content extraction tools
    ├── devtools.py      # Developer tools
    ├── recording.py     # Session recording tools
    └── tabs.py          # Tab management tools

🔧 Technical Details

Troubleshooting

"Missing X server" Error

Set BROWSER_HEADLESS=true or run with xvfb:

xvfb-run browsercontrol

Browser Not Starting

Chromium auto-installs on first run. If it fails, install manually:

python -m playwright install chromium

Session Not Persisting

Check that BROWSER_USER_DATA_DIR is writable:

ls -la ~/.browsercontrol/
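
The same check can be done from Python, which also works on systems without `ls`. This is a small sketch using only the standard library; `profile_dir_status` is a hypothetical helper, and it inspects only the directory itself and its immediate parent.

```python
import os
import tempfile
from pathlib import Path


def profile_dir_status(directory: Path) -> str:
    """Describe whether `directory` can hold a persistent browser profile."""
    directory = directory.expanduser()
    if not directory.exists():
        # A missing directory is fine as long as its parent allows creating it
        parent_ok = os.access(directory.parent, os.W_OK | os.X_OK)
        return "missing, can be created" if parent_ok else "missing, parent not writable"
    if not directory.is_dir():
        return "not a directory"
    return "writable" if os.access(directory, os.W_OK | os.X_OK) else "not writable"


# Demonstration on a throwaway directory rather than the real profile path
with tempfile.TemporaryDirectory() as tmp:
    print(profile_dir_status(Path(tmp)))
    print(profile_dir_status(Path(tmp) / "user_data"))
```

To check the real profile, pass `Path(os.environ.get("BROWSER_USER_DATA_DIR", "~/.browsercontrol/user_data"))`.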

Connection Refused

Ensure no other instance is running:

pkill -f browsercontrol
browsercontrol

View Session Recordings

Open recordings in the Playwright trace viewer:

npx playwright show-trace ~/.browsercontrol/recordings/session.zip
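
If you have several recordings, a short script can pick the newest one and build the viewer command for it. This sketch assumes recordings are `.zip` files stored in `~/.browsercontrol/recordings`, as the path above suggests; the helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Assumed location, based on the example path above
RECORDINGS_DIR = Path.home() / ".browsercontrol" / "recordings"


def newest_recording(directory: Path = RECORDINGS_DIR) -> Optional[Path]:
    """Return the most recently modified .zip in `directory`, or None if there is none."""
    if not directory.is_dir():
        return None
    archives = sorted(directory.glob("*.zip"), key=lambda p: p.stat().st_mtime)
    return archives[-1] if archives else None


def show_trace_command(recording: Path) -> list[str]:
    """Build the Playwright trace viewer command for one recording."""
    return ["npx", "playwright", "show-trace", str(recording)]


latest = newest_recording()
if latest is not None:
    print(" ".join(show_trace_command(latest)))
else:
    print("No recordings found")
```

The command is returned as a list so it can be passed directly to `subprocess.run` without shell quoting.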

🤝 Contributing

Contributions are welcome! Check out our Contributing Guide for details.

Ideas for contributions:

[ ] Firefox/WebKit support
[ ] DOM diffing (detect changes)
[ ] Accessibility audit tools
[ ] Mobile emulation presets
[ ] Cookie import/export files

# Clone and install
git clone https://github.com/adityasasidhar/browsercontrol
cd browsercontrol
uv sync

# Run tests
uv run pytest

# Run in development
uv run fastmcp dev browsercontrol/server.py