# 🚀 LLM Researcher

LLM Researcher is a lightweight MCP (Model Context Protocol) server designed for LLM orchestration. It offers efficient web content search and extraction, allowing LLMs to search DuckDuckGo and extract clean, LLM-friendly content from web pages.

Built with TypeScript, tsup, and vitest for a modern development experience.
## ✨ Features
- MCP Server Support: Offers a Model Context Protocol server for seamless LLM integration.
- Free Operation: Utilizes the DuckDuckGo HTML endpoint, incurring no API costs.
- GitHub Code Search: Enables searching GitHub repositories for code examples and implementation patterns.
- Smart Content Extraction: Employs Playwright and @mozilla/readability to extract clean content.
- LLM-Optimized Output: Provides sanitized Markdown (h1-h3, bold, italic, and links only).
- Rate Limited: Respects DuckDuckGo by throttling requests to 1 req/sec.
- Cross-Platform: Works on macOS, Linux, and WSL.
- Multiple Modes: Supports CLI, MCP server, search, direct URL, and interactive modes.
- Type Safe: Fully implemented in TypeScript with strict typing.
- Modern Tooling: Built using the tsup bundler and vitest testing framework.
## 📦 Installation

### Prerequisites
- Node.js 20.0.0 or higher.
- No local Chrome installation is required as it uses Playwright's bundled Chromium.
### Setup

```bash
cd light-research-mcp
pnpm install
pnpm build
pnpm install-browsers
pnpm link --global
```
## 💻 Usage

### MCP Server Mode
Use it as a Model Context Protocol server to provide search and content extraction tools to LLMs:
```bash
llmresearcher --mcp
```
#### Setting up with Claude Code
```bash
# Add the server to Claude Code
claude mcp add light-research-mcp /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp

# Or add it with project scope
claude mcp add light-research-mcp -s project /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp

# Verify the configuration
claude mcp list
claude mcp get light-research-mcp
```
#### MCP Tool Usage Examples
Once configured, you can use these tools in Claude:
> Search for React hooks examples on GitHub

Tool: github_code_search
Query: "useState useEffect hooks language:javascript"

> Search for TypeScript best practices

Tool: duckduckgo_web_search
Query: "TypeScript best practices 2024"
Locale: us-en (or wt-wt for no region)

> Extract content from a search result

Tool: extract_content
URL: https://example.com/article-from-search-results
### Command Line Interface

```bash
# Search mode
llmresearcher "machine learning transformers"

# GitHub code search
llmresearcher -g "useState hooks language:typescript"

# Direct URL mode
llmresearcher -u https://example.com/article

# Interactive mode
llmresearcher

# Verbose mode
llmresearcher -v "search query"

# MCP server mode
llmresearcher --mcp
```
## 📚 Documentation

### Development

#### Scripts

```bash
pnpm build             # Build with tsup
pnpm dev               # Build in watch mode
pnpm test              # Run tests (watch mode)
pnpm test:run          # Run tests once
pnpm type-check        # TypeScript type checking
pnpm clean             # Remove build artifacts
pnpm install-browsers  # Install Playwright's bundled Chromium
```
#### Interactive Commands
When in search results view:
- 1-10: Select a result by number.
- b or back: Return to search results.
- open <n>: Open result #n in an external browser.
- q or quit: Exit the program.
When viewing content:
- b or back: Return to search results.
- /<term>: Search for a term within the extracted content.
- open: Open the current page in an external browser.
- q or quit: Exit the program.
### Configuration

#### Environment Variables

Create a `.env` file in the project root:

```bash
USER_AGENT=Mozilla/5.0 (compatible; LLMResearcher/1.0)
TIMEOUT=30000
MAX_RETRIES=3
RATE_LIMIT_DELAY=1000
CACHE_ENABLED=true
MAX_RESULTS=10
```
#### Configuration File

Create `~/.llmresearcherrc` in your home directory:

```json
{
  "userAgent": "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  "timeout": 30000,
  "maxRetries": 3,
  "rateLimitDelay": 1000,
  "cacheEnabled": true,
  "maxResults": 10
}
```
#### Configuration Options

| Property | Details |
| --- | --- |
| `userAgent` | User agent for HTTP requests. Default: `Mozilla/5.0 (compatible; LLMResearcher/1.0)` |
| `timeout` | Request timeout in milliseconds. Default: `30000` |
| `maxRetries` | Maximum retry attempts for failed requests. Default: `3` |
| `rateLimitDelay` | Delay between requests in milliseconds. Default: `1000` |
| `cacheEnabled` | Enable/disable local caching. Default: `true` |
| `maxResults` | Maximum search results to display. Default: `10` |
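The precedence between defaults, the RC file, and environment variables can be sketched as a simple merge where later sources win. `loadConfig` below is a hypothetical helper for illustration, not the actual `src/config.ts` implementation:

```typescript
// Sketch: merging defaults, ~/.llmresearcherrc values, and environment
// variables for the documented options. Hypothetical helper, not src/config.ts.
interface ResearcherConfig {
  userAgent: string;
  timeout: number;
  maxRetries: number;
  rateLimitDelay: number;
  cacheEnabled: boolean;
  maxResults: number;
}

const defaults: ResearcherConfig = {
  userAgent: "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  timeout: 30000,
  maxRetries: 3,
  rateLimitDelay: 1000,
  cacheEnabled: true,
  maxResults: 10,
};

// Later sources win: defaults < RC file < environment.
function loadConfig(
  rcFile: Partial<ResearcherConfig>,
  env: Record<string, string | undefined>
): ResearcherConfig {
  const fromEnv: Partial<ResearcherConfig> = {};
  if (env.USER_AGENT) fromEnv.userAgent = env.USER_AGENT;
  if (env.TIMEOUT) fromEnv.timeout = Number(env.TIMEOUT);
  if (env.MAX_RETRIES) fromEnv.maxRetries = Number(env.MAX_RETRIES);
  if (env.RATE_LIMIT_DELAY) fromEnv.rateLimitDelay = Number(env.RATE_LIMIT_DELAY);
  if (env.CACHE_ENABLED) fromEnv.cacheEnabled = env.CACHE_ENABLED === "true";
  if (env.MAX_RESULTS) fromEnv.maxResults = Number(env.MAX_RESULTS);
  return { ...defaults, ...rcFile, ...fromEnv };
}
```

With this ordering, an environment variable always beats the RC file, and any option set in neither place keeps its default.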
### Architecture

#### Core Components
- MCPResearchServer (`src/mcp-server.ts`)
  - Implements the Model Context Protocol server.
  - Offers three main tools: github_code_search, duckduckgo_web_search, and extract_content.
  - Provides JSON-based responses for LLM consumption.
- DuckDuckGoSearcher (`src/search.ts`)
  - Scrapes the DuckDuckGo HTML search endpoint, with locale support.
  - Decodes redirect URLs in the `/l/?uddg=` format.
  - Incorporates rate limiting and retry logic.
- GitHubCodeSearcher (`src/github-code-search.ts`)
  - Integrates with the GitHub Code Search API via the gh CLI.
  - Supports advanced queries with language, repo, and file filters.
  - Handles authentication and rate limiting.
- ContentExtractor (`src/extractor.ts`)
  - Renders pages using Playwright with resource blocking.
  - Extracts main content using @mozilla/readability.
  - Sanitizes content with DOMPurify and converts it to Markdown.
- CLIInterface (`src/cli.ts`)
  - Provides an interactive command-line interface.
  - Enables navigation of search results.
  - Allows content viewing and in-text search.
- Configuration (`src/config.ts`)
  - Loads configuration from environment variables and RC files.
  - Supports verbose logging.
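The `/l/?uddg=` decoding mentioned above can be sketched with the standard `URL` API: DuckDuckGo result links carry the real target as a percent-encoded query parameter. `decodeDdgRedirect` is a hypothetical helper, not the actual `src/search.ts` code:

```typescript
// Sketch: recovering the target URL from a DuckDuckGo /l/?uddg= redirect link.
// Hypothetical helper, not the actual src/search.ts implementation.
function decodeDdgRedirect(href: string): string {
  // Result links look like: //duckduckgo.com/l/?uddg=<percent-encoded URL>&rut=...
  const url = new URL(href, "https://duckduckgo.com");
  // searchParams.get() percent-decodes the value for us.
  const target = url.searchParams.get("uddg");
  // Fall back to the original href when it is not a redirect link.
  return target ?? href;
}
```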
#### Content Processing Pipeline

**MCP Server Mode**

- Search:
  - DuckDuckGo: HTML endpoint → Parse results → JSON response with pagination.
  - GitHub: Code Search API → Format results → JSON response with code snippets.
- Extract: URL from search results → Playwright navigation → Content extraction.
- Process: @mozilla/readability → DOMPurify sanitization → Clean JSON output.
- Output: Structured JSON for LLM consumption.
**CLI Mode**

- Search: DuckDuckGo HTML endpoint → Parse results → Display numbered list.
- Extract: Playwright navigation → Resource blocking → JS rendering.
- Process: @mozilla/readability → DOMPurify sanitization → Turndown Markdown.
- Output: Clean Markdown with h1-h3, bold, italic, and links only.
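The limited-Markdown output can be illustrated with a toy converter that keeps only the allowed elements. The real pipeline uses @mozilla/readability, DOMPurify, and Turndown; this regex sketch only demonstrates the h1-h3/bold/italic/link allowlist and does no sanitization:

```typescript
// Toy illustration of the Markdown allowlist (h1-h3, strong/em, links).
// Not the actual Turndown-based converter from src/extractor.ts.
function toLimitedMarkdown(html: string): string {
  return html
    .replace(/<h1[^>]*>(.*?)<\/h1>/gs, "# $1\n")
    .replace(/<h2[^>]*>(.*?)<\/h2>/gs, "## $1\n")
    .replace(/<h3[^>]*>(.*?)<\/h3>/gs, "### $1\n")
    .replace(/<(strong|b)>(.*?)<\/\1>/gs, "**$2**")
    .replace(/<(em|i)>(.*?)<\/\1>/gs, "*$2*")
    .replace(/<a[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/gs, "[$2]($1)")
    .replace(/<[^>]+>/g, "") // every other tag is stripped, keeping its text
    .trim();
}
```

In the real pipeline, DOMPurify removes dangerous markup before conversion, so only safe text reaches this final Markdown step.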
#### Security Features
- Resource Blocking: Prevents the loading of images, CSS, and fonts for speed and security.
- Content Sanitization: DOMPurify removes scripts, iframes, and dangerous elements.
- Limited Markdown: Allows only safe formatting elements (h1-h3, strong, em, a).
- Rate Limiting: Respects DuckDuckGo's rate limits with exponential backoff.
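The rate-limiting behaviour can be sketched as a base delay that doubles on each retry. `backoffDelay` and `withRetries` below are hypothetical helpers for illustration, not the project's actual retry logic:

```typescript
// Sketch: 1 req/sec baseline delay plus exponential backoff on retries.
// Hypothetical helpers, not the actual src/search.ts implementation.
const RATE_LIMIT_DELAY = 1000; // ms between requests (1 req/sec)

// attempt 0 = first try, 1 = first retry, ...
function backoffDelay(attempt: number, base = RATE_LIMIT_DELAY): number {
  // 1000ms, 2000ms, 4000ms, ... doubling per retry
  return base * 2 ** attempt;
}

async function withRetries<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // e.g. a 429 response: wait, then try again with a doubled delay
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
  throw lastError;
}
```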
### Examples

#### MCP Server Usage with Claude Code

**1. GitHub Code Search**
You: "Find React hook examples for state management"
Claude uses github_code_search tool:
```json
{
  "query": "useState useReducer state management language:javascript",
  "results": [
    {
      "title": "facebook/react/packages/react/src/ReactHooks.js",
      "url": "https://raw.githubusercontent.com/facebook/react/main/packages/react/src/ReactHooks.js",
      "snippet": "function useState(initialState) {\n  return dispatcher.useState(initialState);\n}"
    }
  ],
  "pagination": {
    "currentPage": 1,
    "hasNextPage": true,
    "nextPageToken": "2"
  }
}
```
**2. Web Search with Locale**
You: "Search for Vue.js tutorials in Japanese"
Claude uses duckduckgo_web_search tool:
```json
{
  "query": "Vue.js チュートリアル 入門",
  "locale": "jp-jp",
  "results": [
    {
      "title": "Vue.js入門ガイド",
      "url": "https://example.com/vue-tutorial",
      "snippet": "Vue.jsの基本的な使い方を学ぶチュートリアル..."
    }
  ]
}
```
**3. Content Extraction**
You: "Extract the full content from that Vue.js tutorial"
Claude uses extract_content tool:
```json
{
  "url": "https://example.com/vue-tutorial",
  "title": "Vue.js入門ガイド",
  "extractedAt": "2024-01-15T10:30:00.000Z",
  "content": "# Vue.js入門ガイド\n\nVue.jsは...\n\n## インストール\n\n..."
}
```
#### CLI Examples

**Basic Search**
```
$ llmresearcher "python web scraping"

🔍 Search Results:
══════════════════════════════════════════════════
1. Python Web Scraping Tutorial
   URL: https://realpython.com/python-web-scraping-practical-introduction/
   Complete guide to web scraping with Python using requests and Beautiful Soup...

2. Web Scraping with Python - BeautifulSoup and requests
   URL: https://www.dataquest.io/blog/web-scraping-python-tutorial/
   Learn how to scrape websites with Python, Beautiful Soup, and requests...
══════════════════════════════════════════════════
Commands: [1-10] select result | b) back | q) quit | open <n>) open in browser

> 1

📥 Extracting content from: Python Web Scraping Tutorial

📄 Content:
══════════════════════════════════════════════════
**Python Web Scraping Tutorial**
Source: https://realpython.com/python-web-scraping-practical-introduction/
Extracted: 2024-01-15T10:30:00.000Z
──────────────────────────────────────────────────
Web scraping is the process of collecting and parsing raw data from the web...
Web scraping is a technique to automatically access and extract large amounts...
══════════════════════════════════════════════════
Commands: b) back to results | /<term>) search in text | q) quit | open) open in browser

> /beautiful soup

🔍 Found 3 matches for "beautiful soup":
──────────────────────────────────────────────────
Line 15: Beautiful Soup is a Python library for parsing HTML and XML documents.
Line 42: from bs4 import BeautifulSoup
Line 67: soup = BeautifulSoup(html_content, 'html.parser')
```
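The `/<term>` lookup shown above amounts to a case-insensitive line scan over the extracted content. `searchInContent` is a hypothetical helper mirroring the CLI output, not the actual `src/cli.ts` code:

```typescript
// Sketch: the /<term> in-content search, reporting 1-based line numbers
// like the CLI does. Hypothetical helper, not src/cli.ts.
interface Match {
  line: number; // 1-based, as printed by the CLI
  text: string;
}

function searchInContent(content: string, term: string): Match[] {
  const needle = term.toLowerCase();
  return content
    .split("\n")
    .map((text, i) => ({ line: i + 1, text }))
    .filter(({ text }) => text.toLowerCase().includes(needle));
}
```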
**Direct URL Mode**

```
$ llmresearcher -u https://docs.python.org/3/tutorial/

📄 Content:
══════════════════════════════════════════════════
**The Python Tutorial**
Source: https://docs.python.org/3/tutorial/
Extracted: 2024-01-15T10:35:00.000Z
──────────────────────────────────────────────────
Python is an easy to learn, powerful programming language...
In the following examples, input and output are distinguished...
```
**Verbose Mode**

```
$ llmresearcher -v "nodejs tutorial"

[VERBOSE] Searching: https://duckduckgo.com/html/?q=nodejs%20tutorial&kl=us-en
[VERBOSE] Response: 200 in 847ms
[VERBOSE] Parsed 10 results
[VERBOSE] Launching browser...
[VERBOSE] Blocking resource: https://example.com/style.css
[VERBOSE] Blocking resource: https://example.com/image.png
[VERBOSE] Navigating to page...
[VERBOSE] Page loaded in 1243ms
[VERBOSE] Processing content with Readability...
[VERBOSE] Readability extraction successful
[VERBOSE] Closing browser...
```
### Testing

#### Running Tests

```bash
pnpm test
pnpm test:run
pnpm test -- --coverage
```
#### Test Coverage

The test suite includes:

- Unit Tests: Individual component testing
  - `search.test.ts`: Tests DuckDuckGo search functionality, URL decoding, and rate limiting.
  - `extractor.test.ts`: Tests content extraction, Markdown conversion, and resource management.
  - `config.test.ts`: Tests configuration validation and environment handling.
- Integration Tests: End-to-end workflow testing
  - `integration.test.ts`: Tests complete search-to-extraction workflows, error handling, and cleanup.
#### Test Features

- Fast: Powered by vitest for quick feedback.
- Type-safe: Fully supports TypeScript in tests.
- Isolated: Each test cleans up its resources.
- Comprehensive: Covers search, extraction, configuration, and integration scenarios.
### Troubleshooting

#### Common Issues

**"Browser not found" error**

```bash
pnpm install-browsers
```
**Rate Limiting Issues**

- The tool automatically spaces requests with a 1-second delay.
- If you encounter 429 errors, the tool retries automatically with exponential backoff.
**Content Extraction Failures**

- Some sites may block automated access.
- The tool includes fallback extraction methods (main → body content).
- Use verbose mode (`-v`) to see detailed error information.
**Permission Denied (Unix/Linux)**

```bash
chmod +x bin/llmresearcher.js
```
#### Performance Optimization
The tool is optimized for speed:
- Resource Blocking: Automatically blocks images, CSS, and fonts.
- Network Idle: Waits for JavaScript to complete rendering.
- Content Caching: Supports local caching to avoid repeated requests.
- Minimal Dependencies: Uses lightweight, focused libraries.
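The resource-blocking rule can be sketched as a predicate over Playwright resource types. `shouldBlock` is a hypothetical helper; in the real extractor it would be wired into Playwright's `page.route` request interception:

```typescript
// Sketch: which resource types get blocked for speed. Hypothetical helper,
// not the actual src/extractor.ts implementation. In Playwright it would be
// used roughly like:
//   await page.route("**/*", (route) =>
//     shouldBlock(route.request().resourceType()) ? route.abort() : route.continue()
//   );
const BLOCKED_RESOURCE_TYPES = new Set(["image", "stylesheet", "font"]);

function shouldBlock(resourceType: string): boolean {
  return BLOCKED_RESOURCE_TYPES.has(resourceType);
}
```

Blocking these types skips most of a page's bytes while leaving the document and scripts free to render, which is why JS-heavy pages still extract correctly.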
### Development

#### Project Structure
```
light-research-mcp/
├── dist/                      # Built JavaScript files (generated)
│   ├── bin/
│   │   └── llmresearcher.js   # CLI entry point (executable)
│   └── *.js                   # Compiled TypeScript modules
├── src/                       # TypeScript source files
│   ├── bin.ts                 # CLI entry point
│   ├── index.ts               # Main LLMResearcher class
│   ├── mcp-server.ts          # MCP server implementation
│   ├── search.ts              # DuckDuckGo search implementation
│   ├── github-code-search.ts  # GitHub Code Search implementation
│   ├── extractor.ts           # Content extraction with Playwright
│   ├── cli.ts                 # Interactive CLI interface
│   ├── config.ts              # Configuration management
│   └── types.ts               # TypeScript type definitions
├── test/                      # Test files (vitest)
│   ├── search.test.ts         # Search functionality tests
│   ├── extractor.test.ts      # Content extraction tests
│   ├── config.test.ts         # Configuration tests
│   ├── mcp-locale.test.ts     # MCP locale functionality tests
│   ├── mcp-content-extractor.test.ts  # MCP content extractor tests
│   └── integration.test.ts    # End-to-end integration tests
├── tsconfig.json              # TypeScript configuration
├── tsup.config.ts             # Build configuration
├── vitest.config.ts           # Test configuration
├── package.json
└── README.md
```
#### Dependencies

**Runtime Dependencies**
- @modelcontextprotocol/sdk: Model Context Protocol server implementation.
- @mozilla/readability: Content extraction from HTML.
- cheerio: HTML parsing for search results.
- commander: CLI argument parsing.
- dompurify: HTML sanitization.
- dotenv: Environment variable loading.
- jsdom: DOM manipulation for server-side processing.
- playwright: Browser automation for JS rendering.
- turndown: HTML to Markdown conversion.
**Development Dependencies**
- typescript: TypeScript compiler.
- tsup: Fast TypeScript bundler.
- vitest: Fast unit test framework.
- @types/*: TypeScript type definitions.
## 📄 License

This project is licensed under the MIT License. See the LICENSE file for details.
## 👥 Contributing

1. Fork the repository.
2. Create a feature branch.
3. Make your changes.
4. Add tests if applicable.
5. Submit a pull request.
## 🗺️ Roadmap

### Planned Features
- Enhanced MCP Tools: Additional specialized search tools for documentation, APIs, etc.
- Caching Layer: SQLite-based URL → Markdown caching with a 24-hour TTL.
- Search Engine Abstraction: Support for Brave Search, Bing, and other engines.
- Content Summarization: Optional AI-powered content summarization.
- Export Formats: JSON, plain text, and other output formats.
- Batch Processing: Process multiple URLs from file input.
- SSE Transport: Support for Server-Sent Events MCP transport.
### Performance Improvements
- Parallel Processing: Concurrent content extraction for multiple results.
- Smart Caching: Intelligent cache invalidation based on content freshness.
- Memory Optimization: Streaming content processing for large documents.