🚀 Search Stack
A web search and scraping middleware layer dedicated to AI agents.
Search Stack provides a unified web access API for AI agents such as OpenClaw, Claude Code, and Dify. It offers features like multi-engine search with automatic fallback, anti-scraping rendering using headless Chrome, cookie injection for login states, and precise extraction of page content. Once deployed, all agents can share this service.
🚀 Quick Start
Prerequisites
- Docker + Docker Compose
- (Optional) Tavily API Key - Free for 1000 requests per month
- (Optional) Serper API Key - Free for 2500 requests
You can still use the service without Tavily/Serper API keys; it will automatically fall back to SearXNG, which is completely free.
Step 1: Clone the Project
git clone https://github.com/pinkpills/search-stack.git
cd search-stack
Step 2: Configure Environment Variables
cp .env.example .env
Edit the .env file:
TAVILY_API_KEY=your_tavily_key
SERPER_API_KEY=your_serper_key
SEARXNG_SECRET=
PROXY_API_KEY=
BROWSERLESS_TOKEN=
REDIS_PASSWORD=
Generate random secrets with one command:
python3 -c "
import secrets
for name in ['SEARXNG_SECRET', 'PROXY_API_KEY', 'BROWSERLESS_TOKEN', 'REDIS_PASSWORD']:
    print(f'{name}={secrets.token_hex(16)}')
" >> .env
Step 3: Configure SearXNG
This step is required! Otherwise, SearXNG's JSON API will return a 403 error.
cp searxng/settings.yml.example searxng/settings.yml
Edit the searxng/settings.yml file to ensure it contains:
search:
  formats:
    - html
    - json
If you have already started SearXNG before (it will automatically generate settings.yml), you need to manually add the formats configuration and restart the container.
Step 4: Start the Service
docker compose -f search-stack.yml up -d
Wait for all containers to be healthy (about 30 seconds):
docker compose -f search-stack.yml ps
The deployment is complete when all containers show healthy.
Step 5: Verification
curl -s -H "X-API-Key: YOUR_PROXY_API_KEY" http://127.0.0.1:17080/health | python3 -m json.tool
curl -s -X POST http://127.0.0.1:17080/search \
  -H "X-API-Key: YOUR_PROXY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "hello world", "count": 3}' | python3 -m json.tool
curl -s -X POST http://127.0.0.1:17080/search \
  -H "X-API-Key: YOUR_PROXY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "hello world", "count": 3, "provider": "searxng"}' | python3 -m json.tool
curl -s -X POST http://127.0.0.1:17080/fetch \
  -H "X-API-Key: YOUR_PROXY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "render": false}' | python3 -m json.tool
Tip: If SearXNG returns a 502 error or empty results, it is likely that the formats: [html, json] configuration is missing. Refer to Step 3.
✨ Features
- Multi-engine Fallback — Automatically switches between Tavily, Serper, and SearXNG in priority order. The service stays available even if one engine fails.
- Search + Scraping Integration — Use /search for searching and /fetch for scraping page content. Set enrich=true to automatically scrape the full text after searching.
- Anti-scraping Resistance — Uses Browserless Stealth mode to bypass Cloudflare and JS challenges.
- Content Extraction — Employs three engines (trafilatura, BeautifulSoup, and readability) to accurately extract page content.
- Cookie Management — Dynamically add, delete, and update cookies via the API. Cookies are automatically injected into Chrome for rendering. Supports pasting browser cookies directly.
- Login/Anti-scraping Detection — Multi-dimensional heuristic detection: HTTP status codes (401/403), text keywords (Chinese, English, Japanese), page titles, HTML structure (password boxes, CAPTCHA embeds, meta-refresh redirects), and SPA login walls. Returns a needs_login flag to guide cookie updates.
- Cookie Catcher — Remote login in the browser. Uses WebSocket + CDP Screencast to control a remote Chrome instance from the Web UI, complete the login, and save cookies with one click.
- SSRF Protection — Blocks access to private IPs (127/10/172.16/192.168/169.254).
- URL Deduplication — Automatically strips tracking parameters (utm_*, fbclid, etc.) and limits the number of results per domain.
- Redis Caching — Caches results with a 15-minute TTL; repeated queries return cached results immediately.
- API Key Authentication + Rate Limiting — Sliding-window rate limiting.
- MCP Server — Runs an MCP Server in stdio mode (mcp-server.ts). Register it via mcporter for use by agents like OpenClaw.
- TikHub Social Media API — Optionally integrates the TikHub API, which covers 803 social media platforms (e.g., Douyin, TikTok, Weibo), with built-in automatic fallback.
- HTTP Proxy — Supports HTTP/SOCKS5 proxies for fixed-IP anti-scraping or accessing blocked websites (e.g., YouTube).
- Fully Asynchronous — Uses async Redis and a shared httpx connection pool for high concurrency and low latency.
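The URL-deduplication step above can be sketched as follows. This is a simplified illustration: utm_* and fbclid come from this README, while gclid and the helper names are assumptions, not search-stack's actual code.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# Tracking parameters stripped before deduplication (gclid is an assumed addition).
TRACKING_PARAMS = {"fbclid", "gclid"}

def canonicalize(url: str) -> str:
    """Drop tracking parameters and fragments so duplicate results compare equal."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if not k.startswith("utm_") and k not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(query), fragment=""))

def dedupe(urls: list[str], max_per_host: int = 2) -> list[str]:
    """Keep each canonical URL once and at most max_per_host results per domain."""
    seen, per_host, out = set(), {}, []
    for url in urls:
        canon = canonicalize(url)
        host = urlparse(canon).hostname or ""
        if canon in seen or per_host.get(host, 0) >= max_per_host:
            continue
        seen.add(canon)
        per_host[host] = per_host.get(host, 0) + 1
        out.append(canon)
    return out
```

With max_per_host=2 (the service default, per MAX_PER_HOST below), two links differing only in utm_* parameters collapse into one, and a third result from the same domain is dropped.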
📦 Installation
Comparison with Brave Search
The following data is based on actual tests on February 7, 2026, with the search keyword "claude opus 4.6 review".
| Dimension | Search Stack | Brave Search (built into OpenClaw) |
|---|---|---|
| Search speed | 0.8-1.5s (Tavily/Serper) | ~1-2s |
| Cache hit | 13ms (15-minute Redis cache) | No cache; every query is re-requested |
| Chinese search | Rich results (Juejin, Zhihu, Smzdm, etc.) | Fewer Chinese results, mostly English sources |
| English search | Excellent | Excellent |
| High availability | Three engines with automatic fallback | Single point of failure |
| Full-text scraping | Search + content extraction in one step with enrich=true | Returns only summaries; scraping requires an extra step |
| Anti-scraping sites | Rendered by Browserless Chrome | Unable to scrape |
| Sites requiring login | Cookie injection + automatic detection and guidance | Not supported |
| Free quota | Unlimited fallback via SearXNG | Strict limits on free keys |
Conclusion: Similar search speed, better Chinese search quality, and far more features than Brave Search.
Architecture
                     +-----------+
AI Agent ──────────> | search-   |──> Tavily API
(OpenClaw / Claude)  | proxy     |──> Serper API (Google)
POST /search         | (FastAPI) |──> SearXNG (self-hosted)
POST /fetch          +-----+-----+
                           |
             +-------------+-------------+
             |                           |
       +-----+------+           +--------+---------+
       |   Redis    |           |   Browserless    |
       | (cache +   |           | (headless Chrome |
       | rate-limit)|           | anti-bot render) |
       +------------+           +------------------+
Four containers, one-click start:
| Service | Function |
|---|---|
| search-proxy | Core FastAPI proxy providing the unified search/scraping interfaces |
| Redis | Result caching (15-minute TTL) + API rate limiting |
| SearXNG | Self-hosted meta-search engine (aggregates Google, DuckDuckGo, Brave, etc.; free and unlimited) |
| Browserless | Headless Chrome for rendering JS pages and bypassing anti-scraping in Stealth mode |
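The fallback chain through the three engines can be sketched like this. It is a simplified synchronous illustration: the real proxy is asynchronous, and ProviderError plus the provider callables are hypothetical stand-ins, not the project's actual symbols.

```python
from typing import Callable

class ProviderError(Exception):
    """Raised by a provider when it fails or returns an unusable response."""

def search_with_fallback(query: str, providers: list[tuple[str, Callable]]) -> dict:
    """Try each engine in priority order (cf. the ORDER variable); first success wins."""
    errors = {}
    for name, fn in providers:
        try:
            return {"provider": name, "results": fn(query)}
        except ProviderError as exc:
            errors[name] = str(exc)  # remember why this engine was skipped
    raise ProviderError(f"all engines failed: {errors}")
```

If Tavily hits its quota, the query silently moves on to Serper, then SearXNG; the caller only sees a failure when every engine fails.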
💻 Usage Examples
Basic Usage
curl -s -H "X-API-Key: YOUR_PROXY_API_KEY" http://127.0.0.1:17080/health | python3 -m json.tool
curl -s -X POST http://127.0.0.1:17080/search \
  -H "X-API-Key: YOUR_PROXY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "hello world", "count": 3}' | python3 -m json.tool
curl -s -X POST http://127.0.0.1:17080/search \
  -H "X-API-Key: YOUR_PROXY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "hello world", "count": 3, "provider": "searxng"}' | python3 -m json.tool
curl -s -X POST http://127.0.0.1:17080/fetch \
  -H "X-API-Key: YOUR_PROXY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "render": false}' | python3 -m json.tool
Advanced Usage
curl -s -X POST http://127.0.0.1:17080/search \
  -H "X-API-Key: KEY" -H "Content-Type: application/json" \
  -d '{"query": "Python asyncio", "count": 3, "enrich": true}'
📚 Documentation
Why Do We Need Search Stack?
| Pain point | Search Stack's solution |
|---|---|
| Brave/Google searches have free-quota limits | Multi-engine fallback (Tavily → Serper → SearXNG); SearXNG is completely free and unlimited |
| AI page scraping is blocked by Cloudflare/anti-scraping mechanisms | Built-in Browserless headless Chrome with Stealth mode to bypass detection |
| Unable to scrape sites that require login (e.g., Zhihu, Xiaohongshu) | Cookie management API + automatic injection and rendering + login detection and guidance |
| Agents are tricked into accessing internal networks (SSRF) | Built-in private-IP blacklist + DNS verification |
| Search results contain only summaries, not full text | enrich=true automatically scrapes the full text of each result |
Integrating with OpenClaw
Search Stack can serve as the default search/scraping tool for OpenClaw, replacing the built-in Brave search. There are two integration methods:
Method 1: Native Plugin (Recommended)
The native plugin directly registers the tool in the AI's tool list and runs within the OpenClaw process. Compared to the MCP + mcporter method:
- Controllable Timeout — Uses AbortSignal to surface a timeout as an exception, so the AI always sees an error message (instead of a SIGKILL with zero output).
- Lower Latency — No need to start a subprocess.
- More Reliable — Does not depend on the mcporter daemon.
Step 1: Install the Plugin
openclaw plugins install --link /opt/search-stack/plugin/
Step 2: Configuration
Edit ~/.openclaw/openclaw.json and add the following to plugins.entries:
{
  "plugins": {
    "entries": {
      "search-stack": {
        "enabled": true,
        "config": {
          "apiUrl": "http://127.0.0.1:17080",
          "apiKey": "your_proxy_api_key",
          "tikhubApiKey": "your_tikhub_key"
        }
      }
    }
  }
}
The value of apiKey is the PROXY_API_KEY in .env.
tikhubApiKey is optional. Fill in the TikHub API Key to enable the social media API.
Note: The configuration must be placed inside the config nested object, not directly under search-stack.
Step 3: Disable the Built-in Brave Search
Edit ~/.openclaw/openclaw.json:
{
  "tools": {
    "web": {
      "search": {
        "enabled": false
      }
    }
  }
}
Step 4: Create a Skill File
mkdir -p ~/.openclaw/workspace/skills/web-search/
cp /opt/search-stack/skill-template/SKILL.md ~/.openclaw/workspace/skills/web-search/SKILL.md
The Skill file guides the AI on when and how to use the search tool (two-step principle, cookie workflow, TikHub priority, etc.). The template file is located at skill-template/SKILL.md and can be modified as needed.
Step 5: Restart and Verify
sudo systemctl restart openclaw
openclaw plugins list
Important: If the AI is still using the old method, you need to archive the old session. OpenClaw's session context caches the previous tool usage pattern. Even if the configuration is updated, the old session will still use the old behavior. See "Common Issues → AI does not use search-stack" below for details.
Method 2: MCP + mcporter (Alternative)
This method is suitable for scenarios where you do not want to install the native plugin or need to use Search Stack in an environment other than OpenClaw.
Step 1: Install MCP Server Dependencies
The MCP Server runs using Bun + @modelcontextprotocol/sdk:
curl -fsSL https://bun.sh/install | bash
bun add -g @modelcontextprotocol/sdk zod
Step 2: Register with mcporter
Edit ~/.mcporter/mcporter.json and add Search Stack:
{
  "mcpServers": {
    "search-stack": {
      "command": "/home/your_user/.bun/bin/bun",
      "args": ["run", "/opt/search-stack/proxy/mcp-server.ts"],
      "keepAlive": true,
      "env": {
        "SEARCH_STACK_URL": "http://127.0.0.1:17080",
        "SEARCH_STACK_API_KEY": "your_proxy_api_key",
        "TIKHUB_API_KEY": "your_tikhub_key"
      }
    }
  }
}
Verify the registration:
mcporter daemon restart
mcporter list
Step 3: Create a Skill and Restart
Create ~/.openclaw/workspace/skills/web-search/SKILL.md (using the mcporter exec call format), disable Brave search, and restart OpenClaw.
Note: In the MCP + mcporter method, the AI uses the exec tool to execute the mcporter call search-stack.* command. A SIGKILL on timeout will result in zero output, and the AI may think "the search engine is down". It is recommended to use the native plugin method to avoid this issue.
Remote Deployment (OpenClaw and Search Stack on Different Machines)
This scenario is applicable when Search Stack runs on Server A (with a public domain name) and OpenClaw runs on Server B, and you need to remotely call the Search Stack API.
Server A (search-stack)              Server B (openclaw)
┌──────────────────────┐             ┌──────────────────────┐
│ 4 Docker containers  │    HTTPS    │ OpenClaw             │
│ search-proxy :17080  │◄────────────│ search-stack plugin  │
│ Redis / SearXNG      │             │                      │
│ Browserless          │             │ needs only plugin/   │
└──────────────────────┘             └──────────────────────┘
Step 1: Server A — Configure Reverse Proxy (HTTPS)
Search Stack only listens on 127.0.0.1:17080 by default. For remote access, you need to expose the HTTPS port through an Nginx reverse proxy.
Example Nginx configuration (assuming the domain name is search.example.com):
location /search-stack/ {
    proxy_pass http://127.0.0.1:17080/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_read_timeout 60s;
    proxy_send_timeout 60s;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}
Security Reminder: The API Key is transmitted in plain text in the request header. You must use HTTPS. It is recommended to use Certbot to automatically apply for a Let's Encrypt certificate.
Step 2: Server B — Obtain Plugin Code
You only need the plugin/ and skill-template/ directories. There is no need to install Docker or the entire project:
Full clone (simplest):

git clone https://github.com/pinkpills/search-stack.git /opt/search-stack
cd /opt/search-stack/plugin && npm install

Or fetch only the needed directories with a sparse checkout:

mkdir -p /opt/search-stack && cd /opt/search-stack
git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/pinkpills/search-stack.git .
git sparse-checkout set plugin skill-template
cd plugin && npm install
Step 3: Server B — Install the Plugin into OpenClaw
openclaw plugins install --link /opt/search-stack/plugin/
Step 4: Server B — Configure the Remote API Address
Edit ~/.openclaw/openclaw.json:
{
  "plugins": {
    "entries": {
      "search-stack": {
        "enabled": true,
        "config": {
          "apiUrl": "https://search.example.com/search-stack",
          "apiKey": "your_proxy_api_key",
          "publicUrl": "https://search.example.com/search-stack",
          "tikhubApiKey": "your_tikhub_key"
        }
      }
    }
  },
  "tools": {
    "web": {
      "search": {
        "enabled": false
      }
    }
  }
}
Configuration Explanation:
| Field | Description |
|---|---|
| apiUrl | The Search Stack API address on Server A (HTTPS URL behind the Nginx proxy) |
| apiKey | The PROXY_API_KEY from .env on Server A |
| publicUrl | Public URL used in Cookie Catcher links (must be reachable from the user's browser); usually the same as apiUrl |
| tikhubApiKey | (Optional) TikHub API Key |
Step 5: Server B — Create a Skill and Restart
mkdir -p ~/.openclaw/workspace/skills/web-search/
cp /opt/search-stack/skill-template/SKILL.md ~/.openclaw/workspace/skills/web-search/SKILL.md
sudo systemctl restart openclaw
Verification
Test in an OpenClaw conversation:
User: "Search for a review of Claude Opus 4.6"
AI: Call web_search → Return results (from the remote Search Stack)
User: "Open the first link to see the full text"
AI: Call page_fetch → Return the full text (rendered by the remote Browserless)
If the tool call fails, check:
- Whether Server B can reach apiUrl: curl -H "X-API-Key: KEY" https://search.example.com/search-stack/health
- Whether the plugin is loaded: openclaw plugins list
- Old-session cache: archive the old session and restart (see "Common Issues → AI does not use search-stack" for details)
Multi-machine Concurrency and Resource Control
When multiple machines share a Search Stack, Browserless Chrome is the main bottleneck — each concurrent rendering session consumes approximately 400 - 500MB of memory. The default configuration is optimized for 3 clients:
MAX_CONCURRENT_SESSIONS=10
MAX_QUEUE_LENGTH=30
CONNECTION_TIMEOUT=120000
deploy:
  resources:
    limits:
      memory: 4g
Suggested adjustments based on the number of clients:
| Number of clients | MAX_CONCURRENT_SESSIONS | memory | Recommended server memory |
|---|---|---|---|
| 1-2 | 5 | 2g | 4GB+ |
| 3-5 | 10 | 4g | 8GB+ |
| 5-10 | 20 | 8g | 16GB+ |
When the concurrency limit is exceeded, Browserless will queue the requests. If the queue is also full, it will return a 429 error. page_fetch will report an error, but it will not affect the search function (web_search does not depend on Chrome). The Cookie Catcher has a hard limit of 2 sessions, and multiple users need to queue for login.
Tool List
Regardless of whether you use the native plugin or the MCP method, the provided tools are the same:
| Tool | Description |
|---|---|
| web_search | Multi-engine search; supports enrich for full-text scraping |
| page_fetch | Scrape page content; supports cookie injection, Chrome rendering, and login detection |
| cookies_list | List the domains with configured cookies |
| cookies_update | Add/update cookies for a domain (supports pasting raw cookie strings) |
| cookies_delete | Delete a domain's cookies |
| cookie_catcher_link | Generate a remote-browser login link (Cookie Catcher) |
| tikhub_call | Call the TikHub social media API (requires a configured key; use as needed) |
Note: The scraping tool is named page_fetch instead of web_fetch to avoid conflicts with the built-in web_fetch tool in OpenClaw. The built-in web_fetch does not support cookie injection and Chrome rendering. Using the same name will cause the AI to call the wrong tool.
Cookie Workflow in Practice
There are two ways to obtain cookies:
Method 1: Manual Copy and Paste (Suitable for Desktop)
User: "Help me check this web page https://zhuanlan.zhihu.com/p/xxxx"
AI: Call page_fetch → Incomplete content (only title/summary)
AI: "This website has strict anti-scraping measures, and the full content was not retrieved.
If you need the full content, you can provide the cookie for this website:
1. Open the website in your browser and log in
2. Press F12 → Network tab → Refresh the page
3. Find the Cookie: line in the request header
4. Copy the entire value and send it to me"
User: "z_c0=xxx; _xsrf=yyy; d_c0=zzz ..."
AI: Automatically extract the domain zhihu.com → cookies_update → Save successfully
AI: Re-scrape with bypass_cache:true → Obtain the full article content
Method 2: Cookie Catcher Remote Login (Suitable for Complex Login Processes)
1. Open in the browser: http://YOUR_HOST:17080/cookie-catcher?key=API_KEY&url=https://threads.net
2. Complete the login in the remote Chrome screen (supports mouse/keyboard/touch screen operations)
3. Click "Save Cookies" → Automatically save to cookies.json
4. Subsequent /fetch requests will automatically inject the cookie
This method is suitable for complex login scenarios that require OAuth redirection, QR code scanning, mobile verification codes, etc.
🔧 Technical Details
Common Issues and Troubleshooting
Deployment-related
Q: SearXNG search returns 403 / 502 / empty results
This is the most common issue. SearXNG disables the JSON search API by default. When search-stack calls it with ?format=json, it will return a 403 Forbidden error.
Solution: Ensure that searxng/settings.yml contains:
search:
  formats:
    - html
    - json
Restart the container after modification:
docker compose -f search-stack.yml restart searxng
Why is this easy to miss? The search-stack proxy treats SearXNG's 403 as "unavailable" and silently skips it, falling back to Tavily/Serper. SearXNG can therefore appear to work while never actually being used. Force a test with provider: "searxng" to surface the issue.
Q: Permission issues when starting SearXNG for the first time
The SearXNG container runs with UID 977. If the mount directory permissions are incorrect, it will fail to start:
sudo chown -R 977:977 /opt/search-stack/searxng/
docker compose -f search-stack.yml restart searxng
Q: Browserless times out or crashes
Browserless allows a maximum of 5 concurrent Chrome sessions by default. If timeouts occur frequently, check the memory (at least 2GB):
docker stats browserless
You can adjust MAX_CONCURRENT_SESSIONS in search-stack.yml.
Q: Redis connection fails
Confirm that REDIS_PASSWORD is set and not empty in .env:
docker exec search-redis redis-cli -a YOUR_REDIS_PASSWORD ping
Q: search-proxy reports a redis.exceptions.ConnectionError when starting
search-proxy depends on Redis and SearXNG to start first. Although docker compose's depends_on + healthcheck usually handle this, if Redis starts slowly:
docker compose -f search-stack.yml restart search-proxy
MCP Server-related
Q: mcporter list shows that search-stack is unhealthy
Check step by step:
- Confirm the Docker containers are running: docker compose -f search-stack.yml ps
- Confirm the API is reachable: curl -H "X-API-Key: KEY" http://127.0.0.1:17080/health
- Confirm the Bun path is correct: which bun
- Run the server directly to check for errors: SEARCH_STACK_URL=http://127.0.0.1:17080 SEARCH_STACK_API_KEY=your_key bun run /opt/search-stack/proxy/mcp-server.ts
Q: z.record() / schema._zod error
This is a known issue with MCP SDK v1.26.0 + Zod v4. z.record() will report Cannot read properties of undefined (reading '_zod') when serializing tools/list.
Solution (handled in this project):
- Use z.string() instead of z.record() for parameters, and JSON.parse() them in the handler.
- Use z.any() for parameters that must accept both objects and strings.
Q: mcporter reports "expected string, received object" when passing JSON parameters
mcporter automatically parses JSON strings into objects and passes them to MCP tools. If the schema is defined as z.string(), the validation will fail.
Solution (handled in this project): Use z.any() and handle both types in the handler:
const rawArgs = params.arguments as unknown;
let args: Record<string, unknown>;
if (typeof rawArgs === "object" && rawArgs !== null) {
  args = rawArgs as Record<string, unknown>;
} else {
  args = JSON.parse((rawArgs as string) || "{}");
}
OpenClaw integration-related
Q: The AI does not use search-stack and still uses the built-in Brave search
Check three points:
- Confirm that Brave search is disabled: "search": { "enabled": false } in ~/.openclaw/openclaw.json.
- Confirm that SKILL.md exists: ls ~/.openclaw/workspace/skills/web-search/SKILL.md
- (Most critical) Archive the old session: OpenClaw's session context (possibly hundreds of thousands of tokens) caches the previous tool-usage pattern. Even if the configuration is updated, an old session keeps its old behavior. You must archive it:
ls -lt ~/.openclaw/agents/main/sessions/*.jsonl | head -3
mv ~/.openclaw/agents/main/sessions/SESSION_ID.jsonl \
  ~/.openclaw/agents/main/sessions/SESSION_ID.jsonl.archived
sudo systemctl restart openclaw
After starting a new session, the AI will read SKILL.md again and use the mcporter call command.
Q: The AI retrieves partial content but does not prompt the user to provide a cookie
SKILL.md must clearly list all conditions that trigger the cookie prompt:
- The response contains ** LOGIN REQUIRED **
- Incomplete content (only title/summary; content truncated or empty)
- Anti-scraping prompts ("Please log in", "Verification required", etc.)
- Content that does not match expectations (e.g., only the sidebar is retrieved on an article page)
At the same time, clearly tell the AI "what not to do" — do not use content interpretation to replace failed scraping, and do not skip the prompt. If only the LOGIN REQUIRED condition is listed, the AI will not trigger the prompt when retrieving partial content.
Q: Scraping fails for SPA websites like Threads/Instagram, with the prompt "JS SPA requiring Chrome"
This is usually not a problem of lacking Chrome (Browserless is running by default). The real reason is that the cookie has expired:
- Browserless uses Chrome to render and injects the cookie, but the session has expired.
- The React SPA does not render the actual content and returns the login page.
- The login page text is short, and the AI mistakenly thinks that the JS rendering has failed.
Solution: Re-export the cookie from the browser (ensure that you are logged in), update it via cookies_update, and retry with bypass_cache: true.
detect_needs_login supports multi-dimensional detection: HTTP status codes (401/403), text keywords (Chinese, English, Japanese), page titles, HTML password boxes/CAPTCHA/meta refresh, SPA login walls (Threads/Instagram/Facebook), and will return a clear needs_login: true prompt.
You can also use the Cookie Catcher for remote login: Open /cookie-catcher?key=API_KEY&url=TARGET_URL, complete the login in the remote Chrome, and save the cookie with one click.
Q: The AI uses exec + curl to call Brave instead of mcporter call
OpenClaw's AI uses the exec tool to execute shell commands to call MCP. SKILL.md must use the specific command format:
mcporter call search-stack.web_search query="keyword" --output json
Do not write it as the abstract search-stack.web_search(query="keyword"), as the AI will not translate it into a shell command on its own.
Q: The AI's behavior does not change after updating SKILL.md
This is one of the most common pitfalls. Reasons and solutions:
Background mechanism: OpenClaw has a Skills Watcher (enabled by default, skills.load.watch: true), which monitors changes to the SKILL.md file and bumps the version number. However, this only refreshes the skill list (which skills are available) and does not force the AI to re-read the content of SKILL.md.
Why does the change to SKILL.md not take effect?
- The AI does not read SKILL.md every round — it only reads it in the first round of the session or when it deems necessary.
- The read content is cached in the session context (possibly hundreds of thousands of tokens).
- More importantly, the AI mostly only looks at the description field of the tool (a short text written at registration) and does not look at SKILL.md.
Correct solutions (in order of recommendation):
- Modify the tool description (most effective) — the tool description is always visible to the AI. Edit the description field in plugin/index.ts → restart OpenClaw → new sessions take effect immediately. Key behavior constraints should be written in the description, not just in SKILL.md.
- Archive the old session — Force the new session to reload all context:
for f in ~/.openclaw/agents/main/sessions/*.jsonl; do
  mv "$f" "$f.archived"
done
echo '{"sessions":[]}' > ~/.openclaw/agents/main/sessions/sessions.json
sudo systemctl restart openclaw
- Wait for the Skills Watcher to take effect — If only the supplementary notes in SKILL.md are modified (not involving core behavior), you can wait for the AI to be triggered to re-read SKILL.md in the next round of conversation.
Best practice: Write core behavior constraints (e.g., "Use tool A first, then tool B") in the tool description, and write detailed processes and examples in SKILL.md. In this way, even if the AI does not read SKILL.md, the tool description can still provide a fallback.
Q: What to do after updating the plugin code (index.ts)
The plugin runs within the OpenClaw process. After changing the code, you need to:
sudo systemctl restart openclaw
If you also modify SKILL.md, it is recommended to archive the old session (see the previous item).
📄 API Documentation
All requests must include the X-API-Key header.
GET /health
Health check.
{
  "ok": true,
  "redis": true,
  "order": ["tavily", "serper", "searxng"],
  "browserless_configured": true,
  "dedupe": { "enabled": true, "max_per_host": 2 }
}
POST /search
Web search.
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | string | required | Search keywords |
| count | int | 5 | Number of results to return (1-10) |
| provider | string | automatic | Force a specific engine: tavily / serper / searxng |
| enrich | bool | false | Scrape the full text of each result |
| max_chars | int | 8000 | Maximum characters per page when enriching |
| render | bool | true | Render with a headless browser |
| concurrency | int | 3 | Number of concurrent scrapes when enriching |
curl -s -X POST http://127.0.0.1:17080/search \
  -H "X-API-Key: KEY" -H "Content-Type: application/json" \
  -d '{"query": "Docker best practices", "count": 5}'
curl -s -X POST http://127.0.0.1:17080/search \
  -H "X-API-Key: KEY" -H "Content-Type: application/json" \
  -d '{"query": "Python asyncio", "count": 3, "enrich": true}'
curl -s -X POST http://127.0.0.1:17080/search \
  -H "X-API-Key: KEY" -H "Content-Type: application/json" \
  -d '{"query": "AI news", "count": 5, "provider": "searxng"}'
Example Response:
{
  "query": "Docker best practices",
  "count": 5,
  "cached": false,
  "provider": "tavily",
  "results": [
    {
      "title": "Docker Best Practices",
      "url": "https://example.com/docker",
      "snippet": "Top 10 Docker best practices for production...",
      "source": "tavily"
    }
  ]
}
POST /fetch
Scrape web page content.
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | required | Target URL |
| render | bool | true | Render with a headless browser |
| max_chars | int | 20000 | Maximum characters to extract |
| timeout | float | 25 | Timeout in seconds |
| headers | object | null | Custom request headers |
| bypass_cache | bool | false | Skip the cache (use after updating cookies) |
Example Response:
{
  "cached": false,
  "url": "https://example.com/",
  "status_code": 200,
  "render": false,
  "title": "Example Domain",
  "text": "This domain is for use in illustrative examples..."
}
When the page requires login or is blocked by anti-scraping mechanisms, it returns needs_login: true:
{
  "needs_login": true,
  "has_cookies": false
}
has_cookies: true indicates that there are existing cookies but they have expired and need to be re-exported.
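A caller can branch on these two flags. An illustrative sketch (the helper name and return labels are mine, not part of the API):

```python
def next_action(resp: dict) -> str:
    """Map a /fetch response's login flags to a suggested follow-up step."""
    if not resp.get("needs_login"):
        return "use_content"       # page scraped normally
    if resp.get("has_cookies"):
        return "refresh_cookies"   # cookies exist but have expired: re-export them
    return "provide_cookies"       # no cookies stored yet: ask the user for some
```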
The detection rules cover the following signals (in priority order):

| Signal | Condition | Example |
|---|---|---|
| HTTP 401 | Immediate match | Unauthenticated API endpoint |
| HTTP 403 + short content | text < 2000 characters | Access-denied page |
| Login keywords in text | 1 hit + < 500 characters, or 2+ hits | "Please log in", "sign in to continue", "verify you are human" |
| Login words in page title | + text < 2000 characters | <title>Sign In - Example</title> |
| HTML password input box | + text < 3000 characters | <input type="password"> |
| Meta refresh → login URL | Immediate match | <meta http-equiv="refresh" content="0;url=/login"> |
| CAPTCHA embed | + text < 1000 characters | reCAPTCHA, hCaptcha, Cloudflare Turnstile |
| Empty-shell ICP filing page | 2+ hits + < 800 characters | Pages containing only an ICP filing number (e.g., Xiaohongshu) |
It supports Chinese, English, and Japanese login keywords, as well as OAuth prompts ("continue with Google"), paywalls ("subscribe to continue"), and Cloudflare verification ("checking your browser").
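The rules above reduce to a short heuristic. A condensed sketch with abbreviated keyword lists (the real detector also covers Chinese/Japanese keywords, CAPTCHA embeds, meta-refresh redirects, and SPA login walls):

```python
LOGIN_KEYWORDS = ("please log in", "sign in to continue", "verify you are human")
TITLE_WORDS = ("sign in", "log in", "login")

def detect_needs_login(status_code: int, text: str, title: str = "") -> bool:
    """Simplified version of the rules table above, checked in priority order."""
    body = text.lower()
    if status_code == 401:                       # HTTP 401: immediate match
        return True
    if status_code == 403 and len(text) < 2000:  # 403 + short content
        return True
    hits = sum(kw in body for kw in LOGIN_KEYWORDS)
    if hits >= 2 or (hits == 1 and len(text) < 500):
        return True
    if any(w in title.lower() for w in TITLE_WORDS) and len(text) < 2000:
        return True
    if 'type="password"' in body and len(text) < 3000:
        return True
    return False
```

Note how every text-based signal is gated on a length threshold: a long article that merely mentions "log in" should not trip the detector.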
Cookie Management
Dynamically manage domain cookies without restarting. Cookies are automatically injected into Browserless rendering requests.
GET /cookies
PUT /cookies/zhihu.com
{"raw": "z_c0=xxx; _xsrf=yyy; d_c0=zzz"}
PUT /cookies/zhihu.com
{"cookies": [{"name":"z_c0","value":"xxx"}, {"name":"_xsrf","value":"yyy"}]}
DELETE /cookies/zhihu.com
POST /cookies/reload
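A raw browser Cookie header maps to the structured form roughly like this (a sketch; parse_raw_cookies is an illustrative helper, not part of the API):

```python
def parse_raw_cookies(raw: str) -> list[dict]:
    """Split a browser 'Cookie:' header value into {name, value} pairs,
    the structured form accepted by PUT /cookies/<domain>."""
    cookies = []
    for part in raw.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:  # skip malformed fragments without '='
            cookies.append({"name": name, "value": value})
    return cookies
```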
Cookie Catcher (Remote Browser Login)
For scenarios where cookies cannot be directly copied (e.g., mobile devices, complex OAuth processes), the Cookie Catcher provides a Web UI to remotely control Chrome to complete the login:
Browser access: GET /cookie-catcher?key=YOUR_API_KEY[&url=https://target-site.com]
Workflow:
- Open /cookie-catcher?key=API_KEY in the browser to establish a WebSocket connection.
- Enter the target website URL in the address bar and click Go.
- The remote Chrome screen is streamed in real time via CDP Screencast (JPEG frames).
- Operate on the screen (mouse clicks, keyboard input, scrolling) to complete the login.
- Click "Save Cookies" to extract and save all cookies for the current domain in one click.
- Cookies are written to cookies.json automatically, and subsequent /fetch rendering requests inject them.
Technical Details:
| Parameter | Value |
|---|---|
| WebSocket endpoint | WS /cookie-catcher/ws?key=API_KEY |
| Maximum concurrent sessions | 2 |
| Session timeout | Closed automatically after 10 minutes |
| Screen resolution | 1280 x 800 |
| Frame format | JPEG, quality = 60 |
| Input support | Mouse (click/move/scroll), keyboard, touch screen |
WebSocket Message Protocol:
Client → Server:
{"type": "navigate", "url": "https://example.com"}
{"type": "mouse", "action": "mousePressed", "x": 100, "y": 200, "button": "left"}
{"type": "key", "action": "keyDown", "key": "a", "code": "KeyA", "text": "a"}
{"type": "scroll", "x": 640, "y": 400, "deltaX": 0, "deltaY": 100}
{"type": "save_cookies", "domain": "example.com"}
{"type": "close"}
Server → Client:
{"type": "frame", "data": "<base64 JPEG>"}
{"type": "url", "url": "https://example.com/dashboard"}
{"type": "title", "title": "Dashboard"}
{"type": "cookies_saved", "domain": "example.com", "count": 15, "names": ["session", "token", ...]}
{"type": "error", "message": "Too many active sessions (max 2)"}
{"type": "closed"}
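A minimal client-side handler for this protocol could look like the sketch below. The message shapes come from the lists above; the handler functions themselves are illustrative, not part of the shipped Web UI:

```python
import base64
import json

def encode_navigate(url: str) -> str:
    """Client → Server: ask the remote Chrome to load a URL."""
    return json.dumps({"type": "navigate", "url": url})

def handle_server_message(message: str, on_frame) -> dict:
    """Dispatch one Server → Client message; decode screen frames, log the rest."""
    msg = json.loads(message)
    if msg["type"] == "frame":
        on_frame(base64.b64decode(msg["data"]))  # raw JPEG bytes for display
    elif msg["type"] == "cookies_saved":
        print(f'saved {msg["count"]} cookies for {msg["domain"]}')
    elif msg["type"] == "error":
        print("server error:", msg["message"])
    return msg

# Example: decode one screencast frame into JPEG bytes.
frames = []
msg = handle_server_message(
    json.dumps({"type": "frame", "data": base64.b64encode(b"\xff\xd8jpeg").decode()}),
    frames.append,
)
```

In a real client the same dispatcher would run inside a WebSocket receive loop, with mouse/keyboard events encoded the same way as `encode_navigate`.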
Configuration Reference
Environment Variables
| Variable | Default Value | Description |
| --- | --- | --- |
| `TAVILY_API_KEY` | - | Tavily API Key |
| `SERPER_API_KEY` | - | Serper (Google) API Key |
| `ORDER` | `tavily,serper,searxng` | Search engine priority |
| `API_KEYS` | - | Proxy authentication keys (comma-separated, multiple keys supported) |
| `RATE_LIMIT_PER_MIN` | 60 | Maximum requests per minute |
| `CACHE_TTL` | 900 | Cache expiration time in seconds |
| `FETCH_TIMEOUT` | 25 | Scraping timeout in seconds |
| `MAX_FETCH_BYTES` | 2000000 | Maximum bytes to scrape per page |
| `FETCH_DEFAULT_RENDER` | true | Enable browser rendering by default |
| `ALLOW_DOMAINS` | - | Domain whitelist (comma-separated) |
| `BLOCK_DOMAINS` | - | Domain blacklist (comma-separated) |
| `DEDUPE` | true | URL deduplication |
| `MAX_PER_HOST` | 2 | Maximum results returned from the same domain |
| `PROXY_URL` | - | HTTP proxy; see HTTP Proxy (Optional) for details |
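The `ORDER` variable drives the engine fallback chain: each engine is tried in turn until one returns results. A sketch of that loop (illustrative only; the engine callables and error handling are assumptions, not the service's actual code):

```python
import os

def search_with_fallback(query: str, engines: dict) -> tuple[str, list]:
    """Try engines in the order given by ORDER; fall back to the next on failure.

    `engines` maps an engine name to a callable(query) -> list of results.
    """
    order = os.environ.get("ORDER", "tavily,serper,searxng").split(",")
    last_error = None
    for name in (n.strip() for n in order):
        fn = engines.get(name)
        if fn is None:
            continue  # engine not configured (e.g., missing API key)
        try:
            return name, fn(query)
        except Exception as exc:
            last_error = exc  # record the failure and try the next engine
    raise RuntimeError(f"all engines failed: {last_error}")

def _quota_exceeded(query):
    raise RuntimeError("quota exceeded")

# Example: Tavily fails on quota, so the request falls back to Serper.
engine_used, results = search_with_fallback(
    "ai news",
    {"tavily": _quota_exceeded, "serper": lambda q: [{"title": "result"}]},
)
```

Because SearXNG needs no API key, placing it last in `ORDER` guarantees a free engine always remains as the final fallback.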
MCP Server Environment Variables
Passed through the env configuration of mcporter:
| Variable | Description |
| --- | --- |
| `SEARCH_STACK_URL` | REST API address (default `http://127.0.0.1:17080`) |
| `SEARCH_STACK_API_KEY` | Same as `PROXY_API_KEY` |
| `TIKHUB_API_KEY` | (Optional) TikHub social media API Key |
SearXNG Configuration
The SearXNG configuration file is located at searxng/settings.yml. Key configurations:
use_default_settings: true
server:
  secret_key: "your_secret"
  limiter: false
search:
  safe_search: 0
  formats:
    - html
    - json
outgoing:
  request_timeout: 10.0
  max_request_timeout: 20.0
Refer to the SearXNG documentation for detailed configuration.
Project Structure
search-stack/
├── search-stack.yml # Docker Compose orchestration
├── .env # Environment variables (secrets, not included in Git)
├── .env.example # Environment variable template
├── plugin/
│ ├── openclaw.plugin.json # OpenClaw native plugin manifest
│ └── index.ts # Plugin entry point (recommended integration method)
├── skill-template/
│ └── SKILL.md # OpenClaw Skill template (copy to ~/.openclaw/workspace/skills/web-search/)
├── proxy/
│ ├── Dockerfile # Proxy service image
│ ├── app.py # FastAPI main program (REST API)
│ ├── cookie_catcher.py # Cookie Catcher (remote browser CDP session management)
│ ├── mcp-server.ts # MCP Server (stdio, alternative integration method)
│ ├── cookies.json # Cookie storage (automatically updated at runtime)
│ ├── cookies.json.example # Cookie format example
│ ├── requirements.txt # Python dependencies
│ └── static/
│ └── cookie-catcher.html # Cookie Catcher Web UI
└── searxng/
├── settings.yml # SearXNG configuration (automatically generated on first startup)
└── settings.yml.example # SearXNG configuration template
HTTP Proxy (Optional)
After configuring PROXY_URL, all outbound requests are sent through the proxy. This is suitable for:
- Anti-scraping with a fixed IP — The target website sees the proxy IP instead of the server's real IP.
- Accessing blocked websites — Servers in China can access blocked websites such as YouTube and Google.
Configuration
Set in .env:
PROXY_URL=http://host:port
PROXY_URL=http://user:pass@host:port
PROXY_URL=socks5://host:port
PROXY_URL=socks5://user:pass@host:port
If not configured or left blank, all requests will be sent directly, and the behavior will be the same as before.
Proxy Coverage
| Request Type | Uses Proxy? | Description |
| --- | --- | --- |
| Direct httpx scraping (`render=false`) | ✅ | `http_client` created with the proxy parameter |
| Tavily / Serper API calls | ✅ | Same as above |
| SearXNG → Google/DuckDuckGo, etc. | ✅ | Passed to the SearXNG container via the `HTTP_PROXY`/`HTTPS_PROXY` environment variables |
| Browserless Chrome rendering (`render=true`) | ⚠️ Conditional | Via the Chrome `--proxy-server` startup flag; proxies requiring authentication are not supported (see below) |
| Internal container communication (Redis, SearXNG API, Browserless API) | ❌ | Uses a separate `http_internal` client that never goes through the proxy |
Proxy Limitations for Chrome Rendering
The Chrome --proxy-server startup parameter only accepts the scheme://host:port format and has no mechanism for passing usernames and passwords. Therefore:
| Proxy Type | Direct httpx | Chrome Rendering |
| --- | --- | --- |
| No authentication (IP whitelist): `http://host:port` | ✅ | ✅ |
| With authentication: `http://user:pass@host:port` | ✅ | ❌ Skipped automatically; Chrome connects directly |
| SOCKS5 without authentication: `socks5://host:port` | ✅ | ✅ |
| SOCKS5 with authentication: `socks5://user:pass@host:port` | ✅ | ❌ Skipped automatically; Chrome connects directly |
Recommendation: If Chrome rendering also needs to go through the proxy, use a proxy authenticated by IP whitelist (add the server's IP to the whitelist in your proxy provider's dashboard and drop the username and password from the URL). Most fixed-IP proxy providers support this.
The code automatically detects whether PROXY_URL contains credentials (an `@`). If it does, Chrome proxy injection is skipped so that Browserless rendering does not fail on proxy authentication.
Security Notes
- All internal services (Redis, SearXNG, Browserless) do not expose host ports and only communicate through the Docker internal network.
- Redis enables password authentication.
- Built-in SSRF protection: Block access to private IPs (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16).
- Automatically normalize URLs and remove tracking parameters.
- API Key authentication + sliding window rate limiting per minute.
- The `.env` file contains sensitive secrets. Make sure it is listed in `.gitignore`.
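The SSRF guard mentioned above amounts to rejecting any fetch target that resolves into the listed private ranges. A simplified sketch using the standard library (the real service may resolve and pin addresses differently):

```python
import ipaddress
import socket
from urllib.parse import urlsplit

def is_ssrf_safe(url: str) -> bool:
    """Reject URLs whose host is (or resolves to) a private/link-local address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(host)  # literal IP in the URL
    except ValueError:
        try:
            addr = ipaddress.ip_address(socket.gethostbyname(host))
        except OSError:
            return False  # unresolvable hostname: refuse to fetch
    # Covers 127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16.
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)
```

Note that checking only the URL string is not enough; a public hostname can resolve to a private IP, which is why the sketch resolves first and checks the address.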
Reverse Proxy (Production Deployment)
The service only listens on 127.0.0.1:17080 by default. In a production environment, if you need external access, use an Nginx reverse proxy + HTTPS:
location /search-stack/ {
proxy_pass http://127.0.0.1:17080/;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 60s;
proxy_send_timeout 60s;
}
📄 License
MIT