# 🚀 LLM Researcher

LLM Researcher is a lightweight MCP (Model Context Protocol) server designed for LLM orchestration. It offers efficient web content search and extraction, allowing LLMs to search DuckDuckGo and extract clean, LLM-friendly content from web pages.

Built with TypeScript, tsup, and vitest for a modern development experience.
## ✨ Features
- MCP Server Support: Offers a Model Context Protocol server for seamless LLM integration.
- Free Operation: Utilizes the DuckDuckGo HTML endpoint, incurring no API costs.
- GitHub Code Search: Enables searching GitHub repositories for code examples and implementation patterns.
- Smart Content Extraction: Employs Playwright and @mozilla/readability to extract clean content.
- LLM-Optimized Output: Provides sanitized Markdown (h1-h3, bold, italic, and links only).
- Rate Limited: Respects DuckDuckGo by throttling requests to 1 req/sec.
- Cross-Platform: Works on macOS, Linux, and WSL.
- Multiple Modes: Supports CLI, MCP server, search, direct URL, and interactive modes.
- Type Safe: Fully implemented in TypeScript with strict typing.
- Modern Tooling: Built using the tsup bundler and vitest testing framework.
## 📦 Installation

### Prerequisites
- Node.js 20.0.0 or higher.
- No local Chrome installation is required as it uses Playwright's bundled Chromium.
### Setup

```bash
cd light-research-mcp
pnpm install
pnpm build
pnpm install-browsers
pnpm link --global
```
## 💻 Usage

### MCP Server Mode
Use it as a Model Context Protocol server to provide search and content extraction tools to LLMs:
```bash
llmresearcher --mcp
```
#### Setting up with Claude Code
```bash
# Add the server to Claude Code
claude mcp add light-research-mcp /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp

# Or add it with project scope
claude mcp add light-research-mcp -s project /path/to/light-research-mcp/dist/bin/llmresearcher.js --mcp

# Verify the configuration
claude mcp list
claude mcp get light-research-mcp
```
#### MCP Tool Usage Examples
Once configured, you can use these tools in Claude:
> Search for React hooks examples on GitHub

Tool: github_code_search
Query: "useState useEffect hooks language:javascript"

> Search for TypeScript best practices

Tool: duckduckgo_web_search
Query: "TypeScript best practices 2024"
Locale: us-en (or wt-wt for no region)

> Extract content from a search result

Tool: extract_content
URL: https://example.com/article-from-search-results
### Command Line Interface

```bash
# Search mode
llmresearcher "machine learning transformers"

# GitHub code search
llmresearcher -g "useState hooks language:typescript"

# Direct URL mode
llmresearcher -u https://example.com/article

# Interactive mode
llmresearcher

# Verbose mode
llmresearcher -v "search query"

# MCP server mode
llmresearcher --mcp
```
## 📚 Documentation

### Development

#### Scripts

```bash
pnpm build             # Build with tsup
pnpm dev               # Build in watch mode
pnpm test              # Run tests (watch mode)
pnpm test:run          # Run tests once
pnpm type-check        # TypeScript type checking
pnpm clean             # Remove build artifacts
pnpm install-browsers  # Install Playwright's bundled Chromium
```
#### Interactive Commands
When in search results view:
- 1-10: Select a result by number.
- b or back: Return to search results.
- open <n>: Open result #n in an external browser.
- q or quit: Exit the program.
When viewing content:
- b or back: Return to search results.
- /<term>: Search for a term within the extracted content.
- open: Open the current page in an external browser.
- q or quit: Exit the program.
### Configuration

#### Environment Variables

Create a `.env` file in the project root:

```bash
USER_AGENT=Mozilla/5.0 (compatible; LLMResearcher/1.0)
TIMEOUT=30000
MAX_RETRIES=3
RATE_LIMIT_DELAY=1000
CACHE_ENABLED=true
MAX_RESULTS=10
```
#### Configuration File

Create `~/.llmresearcherrc` in your home directory:

```json
{
  "userAgent": "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  "timeout": 30000,
  "maxRetries": 3,
  "rateLimitDelay": 1000,
  "cacheEnabled": true,
  "maxResults": 10
}
```
#### Configuration Options

| Property | Details |
| --- | --- |
| `userAgent` | User agent for HTTP requests. Default: `Mozilla/5.0 (compatible; LLMResearcher/1.0)` |
| `timeout` | Request timeout in milliseconds. Default: `30000` |
| `maxRetries` | Maximum retry attempts for failed requests. Default: `3` |
| `rateLimitDelay` | Delay between requests in milliseconds. Default: `1000` |
| `cacheEnabled` | Enable/disable local caching. Default: `true` |
| `maxResults` | Maximum search results to display. Default: `10` |
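The precedence between defaults, the RC file, and environment variables can be sketched as a simple merge where later sources win. `loadConfig` below is a hypothetical helper for illustration, not the actual `src/config.ts` implementation:

```typescript
// Sketch: merging defaults, ~/.llmresearcherrc values, and environment
// variables for the documented options. Hypothetical helper, not src/config.ts.
interface ResearcherConfig {
  userAgent: string;
  timeout: number;
  maxRetries: number;
  rateLimitDelay: number;
  cacheEnabled: boolean;
  maxResults: number;
}

const defaults: ResearcherConfig = {
  userAgent: "Mozilla/5.0 (compatible; LLMResearcher/1.0)",
  timeout: 30000,
  maxRetries: 3,
  rateLimitDelay: 1000,
  cacheEnabled: true,
  maxResults: 10,
};

// Later sources win: defaults < RC file < environment.
function loadConfig(
  rcFile: Partial<ResearcherConfig>,
  env: Record<string, string | undefined>
): ResearcherConfig {
  const fromEnv: Partial<ResearcherConfig> = {};
  if (env.USER_AGENT) fromEnv.userAgent = env.USER_AGENT;
  if (env.TIMEOUT) fromEnv.timeout = Number(env.TIMEOUT);
  if (env.MAX_RETRIES) fromEnv.maxRetries = Number(env.MAX_RETRIES);
  if (env.RATE_LIMIT_DELAY) fromEnv.rateLimitDelay = Number(env.RATE_LIMIT_DELAY);
  if (env.CACHE_ENABLED) fromEnv.cacheEnabled = env.CACHE_ENABLED === "true";
  if (env.MAX_RESULTS) fromEnv.maxResults = Number(env.MAX_RESULTS);
  return { ...defaults, ...rcFile, ...fromEnv };
}
```

With this ordering, an environment variable always beats the RC file, and any option set in neither place keeps its default.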
### Architecture

#### Core Components
- MCPResearchServer (`src/mcp-server.ts`)
  - Implements the Model Context Protocol server.
  - Offers three main tools: github_code_search, duckduckgo_web_search, and extract_content.
  - Provides JSON-based responses for LLM consumption.
- DuckDuckGoSearcher (`src/search.ts`)
  - Scrapes the DuckDuckGo HTML search endpoint, with locale support.
  - Decodes redirect URLs in the `/l/?uddg=` format.
  - Incorporates rate limiting and retry logic.
- GitHubCodeSearcher (`src/github-code-search.ts`)
  - Integrates with the GitHub Code Search API via the gh CLI.
  - Supports advanced queries with language, repo, and file filters.
  - Handles authentication and rate limiting.
- ContentExtractor (`src/extractor.ts`)
  - Renders pages using Playwright with resource blocking.
  - Extracts main content using @mozilla/readability.
  - Sanitizes content with DOMPurify and converts it to Markdown.
- CLIInterface (`src/cli.ts`)
  - Provides an interactive command-line interface.
  - Enables navigation of search results.
  - Allows content viewing and in-text search.
- Configuration (`src/config.ts`)
  - Loads configuration from environment variables and RC files.
  - Supports verbose logging.
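The `/l/?uddg=` decoding mentioned above can be sketched with the standard `URL` API: DuckDuckGo result links carry the real target as a percent-encoded query parameter. `decodeDdgRedirect` is a hypothetical helper, not the actual `src/search.ts` code:

```typescript
// Sketch: recovering the target URL from a DuckDuckGo /l/?uddg= redirect link.
// Hypothetical helper, not the actual src/search.ts implementation.
function decodeDdgRedirect(href: string): string {
  // Result links look like: //duckduckgo.com/l/?uddg=<percent-encoded URL>&rut=...
  const url = new URL(href, "https://duckduckgo.com");
  // searchParams.get() percent-decodes the value for us.
  const target = url.searchParams.get("uddg");
  // Fall back to the original href when it is not a redirect link.
  return target ?? href;
}
```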
#### Content Processing Pipeline

**MCP Server Mode**

- Search:
  - DuckDuckGo: HTML endpoint → Parse results → JSON response with pagination.
  - GitHub: Code Search API → Format results → JSON response with code snippets.
- Extract: URL from search results → Playwright navigation → Content extraction.
- Process: @mozilla/readability → DOMPurify sanitization → Clean JSON output.
- Output: Structured JSON for LLM consumption.
**CLI Mode**

- Search: DuckDuckGo HTML endpoint → Parse results → Display numbered list.
- Extract: Playwright navigation → Resource blocking → JS rendering.
- Process: @mozilla/readability → DOMPurify sanitization → Turndown Markdown.
- Output: Clean Markdown with h1-h3, bold, italic, and links only.
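The limited-Markdown output can be illustrated with a toy converter that keeps only the allowed elements. The real pipeline uses @mozilla/readability, DOMPurify, and Turndown; this regex sketch only demonstrates the h1-h3/bold/italic/link allowlist and does no sanitization:

```typescript
// Toy illustration of the Markdown allowlist (h1-h3, strong/em, links).
// Not the actual Turndown-based converter from src/extractor.ts.
function toLimitedMarkdown(html: string): string {
  return html
    .replace(/<h1[^>]*>(.*?)<\/h1>/gs, "# $1\n")
    .replace(/<h2[^>]*>(.*?)<\/h2>/gs, "## $1\n")
    .replace(/<h3[^>]*>(.*?)<\/h3>/gs, "### $1\n")
    .replace(/<(strong|b)>(.*?)<\/\1>/gs, "**$2**")
    .replace(/<(em|i)>(.*?)<\/\1>/gs, "*$2*")
    .replace(/<a[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/gs, "[$2]($1)")
    .replace(/<[^>]+>/g, "") // every other tag is stripped, keeping its text
    .trim();
}
```

In the real pipeline, DOMPurify removes dangerous markup before conversion, so only safe text reaches this final Markdown step.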
#### Security Features
- Resource Blocking: Prevents the loading of images, CSS, and fonts for speed and security.
- Content Sanitization: DOMPurify removes scripts, iframes, and dangerous elements.
- Limited Markdown: Allows only safe formatting elements (h1-h3, strong, em, a).
- Rate Limiting: Respects DuckDuckGo's rate limits with exponential backoff.
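The rate-limiting behaviour can be sketched as a base delay that doubles on each retry. `backoffDelay` and `withRetries` below are hypothetical helpers for illustration, not the project's actual retry logic:

```typescript
// Sketch: 1 req/sec baseline delay plus exponential backoff on retries.
// Hypothetical helpers, not the actual src/search.ts implementation.
const RATE_LIMIT_DELAY = 1000; // ms between requests (1 req/sec)

// attempt 0 = first try, 1 = first retry, ...
function backoffDelay(attempt: number, base = RATE_LIMIT_DELAY): number {
  // 1000ms, 2000ms, 4000ms, ... doubling per retry
  return base * 2 ** attempt;
}

async function withRetries<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // e.g. a 429 response: wait, then try again with a doubled delay
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
  throw lastError;
}
```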
### Examples

#### MCP Server Usage with Claude Code

**1. GitHub Code Search**
You: "Find React hook examples for state management"
Claude uses github_code_search tool:
```json
{
  "query": "useState useReducer state management language:javascript",
  "results": [
    {
      "title": "facebook/react/packages/react/src/ReactHooks.js",
      "url": "https://raw.githubusercontent.com/facebook/react/main/packages/react/src/ReactHooks.js",
      "snippet": "function useState(initialState) {\n  return dispatcher.useState(initialState);\n}"
    }
  ],
  "pagination": {
    "currentPage": 1,
    "hasNextPage": true,
    "nextPageToken": "2"
  }
}
```
**2. Web Search with Locale**
You: "Search for Vue.js tutorials in Japanese"
Claude uses duckduckgo_web_search tool:
```json
{
  "query": "Vue.js チュートリアル 入門",
  "locale": "jp-jp",
  "results": [
    {
      "title": "Vue.js入門ガイド",
      "url": "https://example.com/vue-tutorial",
      "snippet": "Vue.jsの基本的な使い方を学ぶチュートリアル..."
    }
  ]
}
```
**3. Content Extraction**
You: "Extract the full content from that Vue.js tutorial"
Claude uses extract_content tool:
```json
{
  "url": "https://example.com/vue-tutorial",
  "title": "Vue.js入門ガイド",
  "extractedAt": "2024-01-15T10:30:00.000Z",
  "content": "# Vue.js入門ガイド\n\nVue.jsは...\n\n## インストール\n\n..."
}
```
#### CLI Examples

**Basic Search**
```
$ llmresearcher "python web scraping"

🔍 Search Results:
══════════════════════════════════════════════════
1. Python Web Scraping Tutorial
   URL: https://realpython.com/python-web-scraping-practical-introduction/
   Complete guide to web scraping with Python using requests and Beautiful Soup...

2. Web Scraping with Python - BeautifulSoup and requests
   URL: https://www.dataquest.io/blog/web-scraping-python-tutorial/
   Learn how to scrape websites with Python, Beautiful Soup, and requests...
══════════════════════════════════════════════════
Commands: [1-10] select result | b) back | q) quit | open <n>) open in browser

> 1

📥 Extracting content from: Python Web Scraping Tutorial

📄 Content:
══════════════════════════════════════════════════
**Python Web Scraping Tutorial**
Source: https://realpython.com/python-web-scraping-practical-introduction/
Extracted: 2024-01-15T10:30:00.000Z
──────────────────────────────────────────────────
Web scraping is the process of collecting and parsing raw data from the web...
Web scraping is a technique to automatically access and extract large amounts...
══════════════════════════════════════════════════
Commands: b) back to results | /<term>) search in text | q) quit | open) open in browser

> /beautiful soup

🔍 Found 3 matches for "beautiful soup":
──────────────────────────────────────────────────
Line 15: Beautiful Soup is a Python library for parsing HTML and XML documents.
Line 42: from bs4 import BeautifulSoup
Line 67: soup = BeautifulSoup(html_content, 'html.parser')
```
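The `/<term>` lookup shown above amounts to a case-insensitive line scan over the extracted content. `searchInContent` is a hypothetical helper mirroring the CLI output, not the actual `src/cli.ts` code:

```typescript
// Sketch: the /<term> in-content search, reporting 1-based line numbers
// like the CLI does. Hypothetical helper, not src/cli.ts.
interface Match {
  line: number; // 1-based, as printed by the CLI
  text: string;
}

function searchInContent(content: string, term: string): Match[] {
  const needle = term.toLowerCase();
  return content
    .split("\n")
    .map((text, i) => ({ line: i + 1, text }))
    .filter(({ text }) => text.toLowerCase().includes(needle));
}
```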
**Direct URL Mode**

```
$ llmresearcher -u https://docs.python.org/3/tutorial/

📄 Content:
══════════════════════════════════════════════════
**The Python Tutorial**
Source: https://docs.python.org/3/tutorial/
Extracted: 2024-01-15T10:35:00.000Z
──────────────────────────────────────────────────
Python is an easy to learn, powerful programming language...
In the following examples, input and output are distinguished...
```
**Verbose Mode**

```
$ llmresearcher -v "nodejs tutorial"

[VERBOSE] Searching: https://duckduckgo.com/html/?q=nodejs%20tutorial&kl=us-en
[VERBOSE] Response: 200 in 847ms
[VERBOSE] Parsed 10 results
[VERBOSE] Launching browser...
[VERBOSE] Blocking resource: https://example.com/style.css
[VERBOSE] Blocking resource: https://example.com/image.png
[VERBOSE] Navigating to page...
[VERBOSE] Page loaded in 1243ms
[VERBOSE] Processing content with Readability...
[VERBOSE] Readability extraction successful
[VERBOSE] Closing browser...
```
### Testing

#### Running Tests

```bash
pnpm test
pnpm test:run
pnpm test -- --coverage
```
#### Test Coverage

The test suite includes:

- Unit Tests: Individual component testing
  - `search.test.ts`: Tests DuckDuckGo search functionality, URL decoding, and rate limiting.
  - `extractor.test.ts`: Tests content extraction, Markdown conversion, and resource management.
  - `config.test.ts`: Tests configuration validation and environment handling.
- Integration Tests: End-to-end workflow testing
  - `integration.test.ts`: Tests complete search-to-extraction workflows, error handling, and cleanup.
#### Test Features

- Fast: Powered by vitest for quick feedback.
- Type-safe: Fully supports TypeScript in tests.
- Isolated: Each test cleans up its resources.
- Comprehensive: Covers search, extraction, configuration, and integration scenarios.
### Troubleshooting

#### Common Issues

**"Browser not found" error**

```bash
pnpm install-browsers
```
**Rate Limiting Issues**

- The tool automatically spaces requests with a 1-second delay.
- If you encounter 429 errors, the tool retries automatically with exponential backoff.
**Content Extraction Failures**

- Some sites may block automated access.
- The tool includes fallback extraction methods (main → body content).
- Use verbose mode (`-v`) to see detailed error information.
**Permission Denied (Unix/Linux)**

```bash
chmod +x bin/llmresearcher.js
```
#### Performance Optimization
The tool is optimized for speed:
- Resource Blocking: Automatically blocks images, CSS, and fonts.
- Network Idle: Waits for JavaScript to complete rendering.
- Content Caching: Supports local caching to avoid repeated requests.
- Minimal Dependencies: Uses lightweight, focused libraries.
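The resource-blocking rule can be sketched as a predicate over Playwright resource types. `shouldBlock` is a hypothetical helper; in the real extractor it would be wired into Playwright's `page.route` request interception:

```typescript
// Sketch: which resource types get blocked for speed. Hypothetical helper,
// not the actual src/extractor.ts implementation. In Playwright it would be
// used roughly like:
//   await page.route("**/*", (route) =>
//     shouldBlock(route.request().resourceType()) ? route.abort() : route.continue()
//   );
const BLOCKED_RESOURCE_TYPES = new Set(["image", "stylesheet", "font"]);

function shouldBlock(resourceType: string): boolean {
  return BLOCKED_RESOURCE_TYPES.has(resourceType);
}
```

Blocking these types skips most of a page's bytes while leaving the document and scripts free to render, which is why JS-heavy pages still extract correctly.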
### Development

#### Project Structure
```
light-research-mcp/
├── dist/                      # Built JavaScript files (generated)
│   ├── bin/
│   │   └── llmresearcher.js   # CLI entry point (executable)
│   └── *.js                   # Compiled TypeScript modules
├── src/                       # TypeScript source files
│   ├── bin.ts                 # CLI entry point
│   ├── index.ts               # Main LLMResearcher class
│   ├── mcp-server.ts          # MCP server implementation
│   ├── search.ts              # DuckDuckGo search implementation
│   ├── github-code-search.ts  # GitHub Code Search implementation
│   ├── extractor.ts           # Content extraction with Playwright
│   ├── cli.ts                 # Interactive CLI interface
│   ├── config.ts              # Configuration management
│   └── types.ts               # TypeScript type definitions
├── test/                      # Test files (vitest)
│   ├── search.test.ts         # Search functionality tests
│   ├── extractor.test.ts      # Content extraction tests
│   ├── config.test.ts         # Configuration tests
│   ├── mcp-locale.test.ts     # MCP locale functionality tests
│   ├── mcp-content-extractor.test.ts  # MCP content extractor tests
│   └── integration.test.ts    # End-to-end integration tests
├── tsconfig.json              # TypeScript configuration
├── tsup.config.ts             # Build configuration
├── vitest.config.ts           # Test configuration
├── package.json
└── README.md
```
#### Dependencies

**Runtime Dependencies**
- @modelcontextprotocol/sdk: Model Context Protocol server implementation.
- @mozilla/readability: Content extraction from HTML.
- cheerio: HTML parsing for search results.
- commander: CLI argument parsing.
- dompurify: HTML sanitization.
- dotenv: Environment variable loading.
- jsdom: DOM manipulation for server-side processing.
- playwright: Browser automation for JS rendering.
- turndown: HTML to Markdown conversion.
**Development Dependencies**
- typescript: TypeScript compiler.
- tsup: Fast TypeScript bundler.
- vitest: Fast unit test framework.
- @types/*: TypeScript type definitions.
## 📄 License

This project is licensed under the MIT License. See the LICENSE file for details.
## 👥 Contributing

1. Fork the repository.
2. Create a feature branch.
3. Make your changes.
4. Add tests if applicable.
5. Submit a pull request.
## 🗺️ Roadmap

### Planned Features
- Enhanced MCP Tools: Additional specialized search tools for documentation, APIs, etc.
- Caching Layer: SQLite-based URL → Markdown caching with a 24-hour TTL.
- Search Engine Abstraction: Support for Brave Search, Bing, and other engines.
- Content Summarization: Optional AI-powered content summarization.
- Export Formats: JSON, plain text, and other output formats.
- Batch Processing: Process multiple URLs from file input.
- SSE Transport: Support for Server-Sent Events MCP transport.
### Performance Improvements
- Parallel Processing: Concurrent content extraction for multiple results.
- Smart Caching: Intelligent cache invalidation based on content freshness.
- Memory Optimization: Streaming content processing for large documents.