SuperFetch MCP Server: Smart Web Content Extractor with Safe Output

The SuperFetch MCP server is a web content extraction tool based on the Model Context Protocol. It can intelligently crawl web pages, extract readable content using Mozilla Readability, and return it in an AI-friendly JSONL or Markdown format. It supports structured content blocks, built-in caching, and security protection.

rating : 2 points

downloads : 5.7K

update time : 2025-12-29

What is SuperFetch MCP Server?

SuperFetch is a Model Context Protocol (MCP) server for fetching web pages and extracting their readable content. Rather than returning raw HTML like a basic crawler, it uses Mozilla Readability to identify the main content of a page, strips irrelevant elements such as advertisements, navigation bars, and footers, and converts the result into a format that AI assistants can process easily. In short, it lets an AI read a web page the way a person would, focusing on the content that matters rather than the clutter around it.

How to use SuperFetch?

SuperFetch needs no separate installation step: you add a configuration entry to the AI assistant tool you already use (such as Claude Desktop, VS Code, or Cursor), and the tool launches the server on demand through npx, which requires Node.js. Many clients also support one-click installation, so setup usually takes a few seconds. There are two main output modes:

1. **JSONL format**: Returns structured content blocks, suited to AI analysis and processing.
2. **Markdown format**: Returns a clean document, suited to human reading and document generation.

Applicable scenarios

SuperFetch is particularly suitable for the following scenarios:

- **AI assistants that need to read web pages**: Let assistants such as Claude and ChatGPT access and understand web page information.
- **Document generation and summarization**: Automatically extract the main content of articles to generate summaries or documents.
- **RAG (Retrieval Augmented Generation) systems**: Provide a high-quality web content source for the knowledge base.
- **Content analysis and research**: Extract structured information from multiple web pages in batches for analysis.
- **Technical document processing**: Extract technical content such as code examples and API documents.

Main features

Intelligent content extraction

Use Mozilla Readability technology to automatically identify and extract the core content of web pages, intelligently removing irrelevant elements such as advertisements, navigation bars, and footers, and only retaining truly valuable information.

Multiple output formats

Supports two output formats: JSONL (structured data) and Markdown (document format), meeting the needs of different scenarios. JSONL is suitable for AI processing, and Markdown is suitable for human reading.

Structured content blocks

Decompose web page content into semantic blocks: headings, paragraphs, lists, code blocks, tables, images, quotes, etc., allowing AI to better understand and process the content structure.

Built-in cache system

Automatically cache the extracted content, reducing duplicate requests and improving response speed. The cache time and maximum number of cache entries can be configured to optimize performance.

Robust crawling mechanism

Supports retry mechanism, timeout control, and redirection handling to ensure successful content acquisition even in unstable network conditions.

Security-first design

Built-in security mechanisms such as SSRF protection, URL verification, and header cleaning to prevent malicious requests and attacks and ensure safe use.

Easy integration

Supports one-click installation and is compatible with all mainstream MCP clients (Claude Desktop, VS Code, Cursor, etc.) without complex configuration.

Advantages

Intelligent content extraction: Automatically identify and extract core content, removing irrelevant elements

AI-friendly format: Structured output optimized specifically for AI processing

Easy to use: One-click installation, no technical background required

Highly configurable: Supports multiple configuration options to adapt to different needs

Safe and reliable: Built-in multiple security protection mechanisms

Excellent performance: Cache and retry mechanisms ensure stable and efficient operation

Limitations

Dynamic content limitation: Limited support for content dynamically loaded by JavaScript

Anti-crawler websites: May be blocked by the anti-crawler mechanisms of some websites

Large file processing: Extra processing may be required for extremely large web page content

Format conversion: Format conversion of some complex web pages may not be perfect

How to use

Select your AI tool

Select the corresponding configuration method according to the AI assistant tool you are using. It supports mainstream tools such as Claude Desktop, VS Code, Cursor, Codex IDE, and Windsurf.

Add configuration

Add the SuperFetch configuration to the tool's configuration file. Most tools support one-click installation, and you just need to copy the configuration code.

Restart the tool

After saving the configuration file, restart your AI tool for the configuration to take effect.

Start using

Now you can use the SuperFetch tool in your AI assistant. You can request to extract web page content and choose the JSONL or Markdown format.

Usage examples

Technical blog content analysis

You are researching a certain technical topic and need to analyze multiple relevant blog articles. Using SuperFetch, you can quickly extract the core content of these articles, remove advertisements and navigation, and only retain the technical discussion part.

Product documentation conversion

You need to convert the online documentation of a certain product into a local Markdown file for offline reading or localization processing.

News summary generation

You need to quickly understand the headline news content of multiple news websites and generate a daily summary.

Academic paper collection

Researchers need to collect abstracts and key findings of academic papers in related fields.

Related resources

Official GitHub repository

View source code, submit issues, and participate in contributions

MCP Protocol official website

Learn detailed information and specifications of the Model Context Protocol

NPM package page

View package versions, download statistics, and update logs

MCP Registry

Discover more MCP servers and tools

Problem feedback

Report bugs or propose feature suggestions

🚀 superFetch MCP Server

superFetch MCP Server is a Model Context Protocol (MCP) server. It fetches web pages, extracts readable content using Mozilla Readability, and returns AI-friendly JSONL or Markdown for downstream AI processing.

Published to MCP Registry - Search for io.github.j0hanz/superfetch

🚀 Quick Start

You can add superFetch to your MCP client configuration without installation.

Claude Desktop

Add the following to your claude_desktop_config.json:

{
  "mcpServers": {
    "superFetch": {
      "command": "npx",
      "args": ["-y", "@j0hanz/superfetch@latest", "--stdio"]
    }
  }
}

VS Code

Add the following to .vscode/mcp.json in your workspace:

{
  "servers": {
    "superFetch": {
      "command": "npx",
      "args": ["-y", "@j0hanz/superfetch@latest", "--stdio"]
    }
  }
}

With Custom Configuration

Configure SuperFetch behavior by adding environment variables to the env property:

{
  "servers": {
    "superFetch": {
      "command": "npx",
      "args": ["-y", "@j0hanz/superfetch@latest", "--stdio"],
      "env": {
        "CACHE_TTL": "7200",
        "LOG_LEVEL": "debug",
        "FETCH_TIMEOUT": "60000"
      }
    }
  }
}

See Configuration for all available options.

Cursor

1. Open Cursor Settings.
2. Go to Features > MCP Servers.
3. Click "+ Add new global MCP server".
4. Add the following configuration:

{
  "mcpServers": {
    "superFetch": {
      "command": "npx",
      "args": ["-y", "@j0hanz/superfetch@latest", "--stdio"]
    }
  }
}

Tip (Windows): If you encounter issues, try: cmd /c "npx -y @j0hanz/superfetch@latest --stdio"

Codex IDE

Add the following to your ~/.codex/config.toml file:

Basic Configuration:

[mcp_servers.superfetch]
command = "npx"
args = ["-y", "@j0hanz/superfetch@latest", "--stdio"]

With Environment Variables:

[mcp_servers.superfetch]
command = "npx"
args = ["-y", "@j0hanz/superfetch@latest", "--stdio"]
env = { CACHE_TTL = "7200", LOG_LEVEL = "debug", FETCH_TIMEOUT = "60000" }

Access config file: Click the gear icon -> "Codex Settings > Open config.toml"

Documentation: Codex MCP Guide

Cline (VS Code Extension)

Open the Cline MCP settings file:

macOS:

code ~/Library/Application\ Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json

Windows:

code %APPDATA%\Code\User\globalStorage\saoudrizwan.claude-dev\settings\cline_mcp_settings.json

Add the following configuration:

{
  "mcpServers": {
    "superFetch": {
      "command": "npx",
      "args": ["-y", "@j0hanz/superfetch@latest", "--stdio"],
      "disabled": false,
      "autoApprove": []
    }
  }
}

Windsurf

Add the following to ~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "superFetch": {
      "command": "npx",
      "args": ["-y", "@j0hanz/superfetch@latest", "--stdio"]
    }
  }
}

Claude Desktop (Config File Locations)

macOS:

# Open config file
open -e "$HOME/Library/Application Support/Claude/claude_desktop_config.json"

# Or with VS Code
code "$HOME/Library/Application Support/Claude/claude_desktop_config.json"

Windows:

code %APPDATA%\Claude\claude_desktop_config.json

✨ Features

Feature	Description
Smart extraction	Mozilla Readability removes ads, navigation, and boilerplate when enabled
JSONL + Markdown	JSONL semantic blocks or clean Markdown with frontmatter
Structured blocks	Headings, paragraphs, lists, code, tables, images, blockquotes
Built-in caching	In-memory cache with TTL, max keys, and resource subscriptions
Resilient fetching	Redirect handling plus retry with exponential backoff + jitter
Security first	URL validation, SSRF/DNS/IP blocklists, header sanitization
HTTP mode	API key auth, session management, rate limiting, CORS
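
The "Resilient fetching" row mentions retry with exponential backoff plus jitter. The sketch below shows one common way to implement that idea ("full jitter" over an exponentially growing cap). The function names, the 250 ms base, and the 10 s cap are illustrative assumptions, not values taken from the superFetch source.

```typescript
// Hypothetical helpers; the real server's retry logic may differ.

/** Delay before retry number `attempt` (0-based): a random value below an exponentially growing cap. */
function backoffDelay(
  attempt: number,
  baseMs = 250, // assumed base delay
  maxMs = 10_000, // assumed upper bound
  random: () => number = Math.random,
): number {
  const cap = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * cap);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

/** Runs `task` up to `attempts` times, sleeping between failed attempts. */
async function withRetry<T>(task: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // No point sleeping after the final failed attempt.
      if (attempt < attempts - 1) await sleep(backoffDelay(attempt));
    }
  }
  throw lastError;
}
```

Randomizing the delay spreads out retries from many clients so they do not hit a recovering site at the same instant.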

📦 Installation

Alternative Installation

Global Installation

npm install -g @j0hanz/superfetch

# Run in stdio mode
superfetch --stdio

# Run HTTP server (requires API_KEY)
superfetch

From Source

git clone https://github.com/j0hanz/super-fetch-mcp-server.git
cd super-fetch-mcp-server
npm install
npm run build

Running the Server

stdio Mode (direct MCP integration)

node dist/index.js --stdio

HTTP Mode (default)

HTTP mode requires API_KEY and only binds to loopback addresses unless ALLOW_REMOTE=true.

API_KEY=supersecret npx -y @j0hanz/superfetch@latest
# Server runs at http://127.0.0.1:3000

Windows (PowerShell):

$env:API_KEY = "supersecret"
npx -y @j0hanz/superfetch@latest

Endpoints (all require Authorization: Bearer <API_KEY> or X-API-Key: <API_KEY>):

GET /health
POST /mcp
GET /mcp (SSE stream)
DELETE /mcp
GET /mcp/downloads/:namespace/:hash

Sessions are managed via the mcp-session-id header (see HTTP Mode Details).

💻 How to Choose a Tool

Use this guide to select the right tool for your web content extraction needs.

Decision Tree

Need web content for AI?
- Want structured JSONL blocks -> fetch-url (format: jsonl)
- Want clean Markdown -> fetch-markdown
- Want Markdown but also need contentBlocks count -> fetch-url (format: markdown)
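
The decision tree can be written as a small helper. This is only an illustrative sketch: `chooseTool` and the `ToolChoice` type are hypothetical names, while the tool names and the `format` argument come from this document.

```typescript
// Hypothetical helper that mirrors the decision tree above.
type ToolChoice =
  | { tool: "fetch-url"; args: { url: string; format: "jsonl" | "markdown" } }
  | { tool: "fetch-markdown"; args: { url: string } };

function chooseTool(
  url: string,
  want: "jsonl" | "markdown",
  needBlockCount = false,
): ToolChoice {
  // Structured blocks always come from fetch-url.
  if (want === "jsonl") return { tool: "fetch-url", args: { url, format: "jsonl" } };
  // Markdown plus a contentBlocks count also needs fetch-url.
  if (needBlockCount) return { tool: "fetch-url", args: { url, format: "markdown" } };
  return { tool: "fetch-markdown", args: { url } };
}
```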

Quick Reference Table

Tool	Best For	Output Format	Use When
`fetch-url`	Single page with structured blocks	JSONL (or Markdown via `format`)	RAG pipelines, content parsing, analytics
`fetch-markdown`	Single page in readable format	Markdown + frontmatter	Documentation, summaries, human review

Common Use Cases

Task	Recommended Tool	Why
Parse a blog post for AI	`fetch-url`	Returns semantic blocks (headings, paragraphs, code)
Generate documentation	`fetch-markdown`	Clean markdown with frontmatter
Extract article for RAG	`fetch-url` + `extractMainContent: true`	Removes ads/nav, keeps main content

💻 Available Tools

Tool Response Notes

Both tools return:

- structuredContent for machine-readable fields
- content blocks that include:
  - a text block containing JSON of structuredContent
  - a resource block with a file:///... URI containing the full content (stdio-friendly)
  - a resource_link block when content exceeds MAX_INLINE_CONTENT_CHARS and cache is enabled

If content is too large and cache is disabled, the server truncates output and appends ...[truncated].

`fetch-url`

Fetches a webpage and converts it to AI-readable JSONL format with semantic content blocks. You can also request Markdown with format: "markdown".

Parameter	Type	Default	Description
`url`	string	required	URL to fetch
`format`	"jsonl" | "markdown"	`"jsonl"`	Output format
`extractMainContent`	boolean	`true`	Use Readability to extract main content
`includeMetadata`	boolean	`true`	Include page metadata
`maxContentLength`	number	-	Maximum content length in characters
`customHeaders`	object	-	Custom HTTP headers (sanitized)
`timeout`	number	`30000`	Request timeout in milliseconds (1000-120000)
`retries`	number	`3`	Number of retry attempts (1-10)

Example structuredContent:

{
  "url": "https://example.com/article",
  "title": "Example Article",
  "contentBlocks": 42,
  "fetchedAt": "2025-12-11T10:30:00.000Z",
  "format": "jsonl",
  "contentSize": 12345,
  "cached": false,
  "content": "{\"type\":\"metadata\",\"title\":\"Example Article\",\"url\":\"https://example.com/article\"}\n{\"type\":\"heading\",\"level\":1,\"text\":\"Introduction\"}"
}

`fetch-markdown`

Fetches a webpage and converts it to clean Markdown with optional frontmatter.

Parameter	Type	Default	Description
`url`	string	required	URL to fetch
`extractMainContent`	boolean	`true`	Extract main content only
`includeMetadata`	boolean	`true`	Include YAML frontmatter
`maxContentLength`	number	-	Maximum content length in characters
`customHeaders`	object	-	Custom HTTP headers (sanitized)
`timeout`	number	`30000`	Request timeout in milliseconds (1000-120000)
`retries`	number	`3`	Number of retry attempts (1-10)

Example structuredContent:

{
  "url": "https://example.com/docs",
  "title": "Documentation",
  "fetchedAt": "2025-12-11T10:30:00.000Z",
  "markdown": "---\ntitle: Documentation\nsource: \"https://example.com/docs\"\n---\n\n# Getting Started\n\nWelcome...",
  "contentSize": 9876,
  "cached": false,
  "truncated": false,
  "file": {
    "downloadUrl": "/mcp/downloads/markdown/abc123def456",
    "fileName": "documentation.md",
    "expiresAt": "2025-12-11T11:30:00.000Z"
  }
}

file is included only in HTTP mode when content is cached and too large to inline.
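
A client that receives the `markdown` field may want to separate the frontmatter from the body. The sketch below is a minimal, assumption-laden example: `splitFrontmatter` is a hypothetical helper, and it only understands flat `key: value` lines like those in the example above, not full YAML.

```typescript
// Hypothetical helper; handles only simple `key: value` frontmatter lines.
function splitFrontmatter(markdown: string): {
  meta: Record<string, string>;
  body: string;
} {
  const match = /^---\n([\s\S]*?)\n---\n?/.exec(markdown);
  if (!match) return { meta: {}, body: markdown };

  const meta: Record<string, string> = {};
  const block = match[1] ?? "";
  for (const line of block.split("\n")) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim();
    const raw = line.slice(idx + 1).trim();
    // Strip one pair of surrounding double quotes, if present.
    meta[key] = raw.replace(/^"(.*)"$/, "$1");
  }

  const body = markdown.slice(match[0].length).replace(/^\n+/, "");
  return { meta, body };
}
```

For anything beyond flat string values, a real YAML parser would be the safer choice.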

Large Content Handling

Inline limit: MAX_INLINE_CONTENT_CHARS (default 20000).
If content exceeds the limit and cache is enabled, responses include resourceUri and a resource_link block.
If cache is disabled, content is truncated with ...[truncated].
Use maxContentLength per request to enforce a lower limit.
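
The rules above amount to a three-way decision. The following sketch illustrates it under stated assumptions: `planResponse` and its result type are hypothetical, and the exact cut point for truncation (slice to the limit, then append the marker) is a guess rather than confirmed server behavior.

```typescript
// Hypothetical model of the inline / resource_link / truncate decision.
type ResponsePlan =
  | { kind: "inline"; text: string }
  | { kind: "resource_link" }
  | { kind: "truncated"; text: string };

const TRUNCATION_MARKER = "...[truncated]";

function planResponse(
  content: string,
  maxInlineChars = 20000, // MAX_INLINE_CONTENT_CHARS default
  cacheEnabled = true, // CACHE_ENABLED default
): ResponsePlan {
  if (content.length <= maxInlineChars) return { kind: "inline", text: content };
  // Oversized content is served from the cache via a link when possible.
  if (cacheEnabled) return { kind: "resource_link" };
  // Assumed behavior: cut at the limit and append the marker.
  return {
    kind: "truncated",
    text: content.slice(0, maxInlineChars) + TRUNCATION_MARKER,
  };
}
```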

📚 Resources

URI	Description
`superfetch://health`	Real-time server health and memory checks
`superfetch://stats`	Server stats and cache metrics
`superfetch://cache/list`	List cached entries and their resource URIs
`superfetch://cache/{namespace}/{urlHash}`	Cached content entry (`namespace`: `url`, `markdown`)

Resource subscriptions notify clients when cache entries update.

📚 Download Endpoint (HTTP Mode)

When running in HTTP mode, cached content can be downloaded directly.

Endpoint

GET /mcp/downloads/:namespace/:hash

namespace: markdown or url
Auth required (Authorization: Bearer <API_KEY> or X-API-Key: <API_KEY>)

Response Headers

Header	Value
`Content-Type`	`text/markdown; charset=utf-8` or `application/x-ndjson; charset=utf-8`
`Content-Disposition`	`attachment; filename="<name>"`
`Cache-Control`	`private, max-age=<CACHE_TTL>`

Example Usage

curl -H "Authorization: Bearer $API_KEY" \
  http://localhost:3000/mcp/downloads/markdown/abc123.def456 \
  -o article.md

Error Responses

Status	Code	Description
400	`BAD_REQUEST`	Invalid namespace or hash format
404	`NOT_FOUND`	Content not found or expired
503	`SERVICE_UNAVAILABLE`	Download service disabled

📚 Configuration

Configure SuperFetch behavior by adding environment variables to your MCP client configuration's env property.

Fetcher Settings

Variable	Default	Valid Values	Description
`FETCH_TIMEOUT`	`30000`	`5000`-`120000`	Request timeout in milliseconds
`USER_AGENT`	`superFetch-MCP/1.0`	Any valid user agent	Custom user agent

Cache Settings

Variable	Default	Valid Values	Description
`CACHE_ENABLED`	`true`	`true` / `false`	Enable response caching
`CACHE_TTL`	`3600`	`60`-`86400`	Cache lifetime in seconds
`CACHE_MAX_KEYS`	`100`	`10`-`1000`	Maximum cached entries
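
To make the interaction between `CACHE_TTL` and `CACHE_MAX_KEYS` concrete, here is a minimal in-memory cache sketch. `TtlCache` is a hypothetical class, and evicting the oldest inserted key when the cache is full is an assumption; the server's actual eviction policy is not described in this document.

```typescript
// Hypothetical TTL cache; the real implementation may evict differently.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly maxKeys: number;
  private readonly now: () => number;

  constructor(ttlSeconds = 3600, maxKeys = 100, now: () => number = Date.now) {
    this.ttlMs = ttlSeconds * 1000;
    this.maxKeys = maxKeys;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazily drop expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // Deleting first moves a re-inserted key to the end of insertion order.
    this.entries.delete(key);
    if (this.entries.size >= this.maxKeys) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get size(): number {
    return this.entries.size;
  }
}
```

The injectable `now` function exists only to make expiry testable without waiting.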

Output Settings

Variable	Default	Valid Values	Description
`MAX_INLINE_CONTENT_CHARS`	`20000`	`1000`-`200000`	Inline content limit before resource_link

Logging Settings

Variable	Default	Valid Values	Description
`LOG_LEVEL`	`info`	`debug` / `info` / `warn` / `error`	Logging verbosity
`ENABLE_LOGGING`	`true`	`true` / `false`	Enable/disable logging

Extraction Settings

Variable	Default	Valid Values	Description
`EXTRACT_MAIN_CONTENT`	`true`	`true` / `false`	Use Readability to extract main content
`INCLUDE_METADATA`	`true`	`true` / `false`	Include metadata/frontmatter

HTTP Server Settings

Variable	Default	Description
`API_KEY`	-	Required for HTTP mode
`HOST`	`127.0.0.1`	HTTP server host
`PORT`	`3000`	HTTP server port
`ALLOW_REMOTE`	`false`	Allow binding to non-loopback interfaces
`TRUST_PROXY`	`false`	Trust proxy headers for client IP resolution
`SESSION_TTL_MS`	`1800000`	Session TTL in milliseconds (30 min)
`SESSION_INIT_TIMEOUT_MS`	`10000`	Time allowed for session initialization
`MAX_SESSIONS`	`200`	Maximum active sessions

CORS Settings

Variable	Default	Description
`ALLOWED_ORIGINS`	`[]`	Comma-separated list of allowed origins
`CORS_ALLOW_ALL`	`false`	Allow all origins (dev only)

Rate Limiting

Variable	Default	Valid Values	Description
`RATE_LIMIT_ENABLED`	`true`	`true` / `false`	Enable/disable HTTP rate limiting
`RATE_LIMIT_MAX`	`100`	`1`-`10000`	Max requests per window per IP
`RATE_LIMIT_WINDOW_MS`	`60000`	`1000`-`3600000`	Rate limit window in milliseconds
`RATE_LIMIT_CLEANUP_MS`	`60000`	`10000`-`3600000`	Cleanup interval for limiter entries
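
The "Valid Values" columns imply that numeric settings are range-checked. The sketch below shows one way such a check could work. `readIntSetting` is a hypothetical helper, and falling back to the default for missing or out-of-range values is an assumption; the server might instead clamp the value or reject the configuration.

```typescript
// Hypothetical reader for ranged integer settings.
function readIntSetting(
  env: Record<string, string | undefined>,
  name: string,
  fallback: number,
  min: number,
  max: number,
): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  // Assumed policy: anything invalid silently falls back to the default.
  if (!Number.isInteger(value) || value < min || value > max) return fallback;
  return value;
}
```

For example, `readIntSetting(process.env, "CACHE_TTL", 3600, 60, 86400)` would apply the documented default and range for the cache lifetime in a Node.js process.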

Configuration Presets

Default (Recommended) - No configuration needed

{
  "servers": {
    "superFetch": {
      "command": "npx",
      "args": ["-y", "@j0hanz/superfetch@latest", "--stdio"]
    }
  }
}

Debug Mode - Verbose logging and no cache

{
  "servers": {
    "superFetch": {
      "command": "npx",
      "args": ["-y", "@j0hanz/superfetch@latest", "--stdio"],
      "env": {
        "LOG_LEVEL": "debug",
        "CACHE_ENABLED": "false"
      }
    }
  }
}

Performance Mode - Aggressive caching for speed

{
  "servers": {
    "superFetch": {
      "command": "npx",
      "args": ["-y", "@j0hanz/superfetch@latest", "--stdio"],
      "env": {
        "CACHE_TTL": "7200",
        "CACHE_MAX_KEYS": "500",
        "LOG_LEVEL": "warn"
      }
    }
  }
}

Custom User Agent - For sites that block bots

{
  "servers": {
    "superFetch": {
      "command": "npx",
      "args": ["-y", "@j0hanz/superfetch@latest", "--stdio"],
      "env": {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
      }
    }
  }
}

Slow Networks / CI - Extended timeouts

{
  "servers": {
    "superFetch": {
      "command": "npx",
      "args": ["-y", "@j0hanz/superfetch@latest", "--stdio"],
      "env": {
        "FETCH_TIMEOUT": "60000",
        "CACHE_ENABLED": "false",
        "LOG_LEVEL": "warn"
      }
    }
  }
}

📚 HTTP Mode Details

HTTP mode uses the MCP Streamable HTTP transport. The workflow is:

1. POST /mcp with an initialize request and no mcp-session-id header.
2. The server returns mcp-session-id in the response headers.
3. Use that header for subsequent POST /mcp, GET /mcp, and DELETE /mcp requests.

If MAX_SESSIONS is reached, the server evicts the oldest session when possible, otherwise returns a 503.
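
The session workflow can be sketched from the client side as plain request descriptions. The helper names, the `protocolVersion` value, and the `clientInfo` fields below are illustrative assumptions; only the `/mcp` path, the `Authorization` header, and the `mcp-session-id` header come from this document.

```typescript
// Hypothetical request builders for the HTTP mode workflow.
interface RequestPlan {
  method: "POST" | "GET" | "DELETE";
  url: string;
  headers: Record<string, string>;
  body?: string;
}

function baseHeaders(apiKey: string): Record<string, string> {
  return {
    Authorization: `Bearer ${apiKey}`,
    // Streamable HTTP clients accept both JSON and SSE responses.
    Accept: "application/json, text/event-stream",
  };
}

/** Step 1: initialize, deliberately without an mcp-session-id header. */
function planInitialize(baseUrl: string, apiKey: string): RequestPlan {
  return {
    method: "POST",
    url: `${baseUrl}/mcp`,
    headers: { ...baseHeaders(apiKey), "Content-Type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "initialize",
      params: {
        protocolVersion: "2025-03-26", // assumed; use the version your client speaks
        capabilities: {},
        clientInfo: { name: "example-client", version: "0.0.1" }, // placeholders
      },
    }),
  };
}

/** Step 3: later requests carry the session id returned by the server. */
function planWithSession(
  baseUrl: string,
  apiKey: string,
  sessionId: string,
  method: RequestPlan["method"],
  payload?: unknown,
): RequestPlan {
  const headers = { ...baseHeaders(apiKey), "mcp-session-id": sessionId };
  if (payload === undefined) return { method, url: `${baseUrl}/mcp`, headers };
  return {
    method,
    url: `${baseUrl}/mcp`,
    headers: { ...headers, "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}
```

Each plan maps directly onto `fetch(plan.url, { method: plan.method, headers: plan.headers, body: plan.body })`, and the session id would be read from the initialize response with `response.headers.get("mcp-session-id")`.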

📚 Content Block Types

JSONL output includes semantic content blocks:

Type	Description
`metadata`	Minimal page metadata (type, title, url)
`heading`	Headings (h1-h6) with level indicator
`paragraph`	Text paragraphs
`list`	Ordered/unordered lists
`code`	Code blocks with optional language
`table`	Tables with headers and rows
`image`	Images with src and alt text
`blockquote`	Block quote text
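
For consumers of the JSONL output, the block types can be modeled as a discriminated union. This is a sketch under assumptions: only the `metadata` and `heading` fields appear in this document's example, so the other block types are left loosely typed, and `parseJsonl` and `headingOutline` are hypothetical helper names.

```typescript
// Fields for metadata and heading follow the structuredContent example;
// the remaining block shapes are not documented here, so they stay open.
type KnownBlock =
  | { type: "metadata"; title?: string; url?: string }
  | { type: "heading"; level: number; text: string };

type OtherBlock = {
  type: "paragraph" | "list" | "code" | "table" | "image" | "blockquote";
  [key: string]: unknown;
};

type ContentBlock = KnownBlock | OtherBlock;

/** Parses one JSON object per non-empty line. */
function parseJsonl(jsonl: string): ContentBlock[] {
  const blocks: ContentBlock[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    blocks.push(JSON.parse(line) as ContentBlock);
  }
  return blocks;
}

/** Renders heading blocks as a Markdown-style outline. */
function headingOutline(blocks: ContentBlock[]): string[] {
  const lines: string[] = [];
  for (const block of blocks) {
    if (block.type === "heading") {
      lines.push(`${"#".repeat(block.level)} ${block.text}`);
    }
  }
  return lines;
}
```

The cast in `parseJsonl` trusts the server's output; validating each line with a schema library would be safer for untrusted input.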

🔧 Security

SSRF Protection

Blocked destinations include:

Localhost and loopback addresses
Private IP ranges (10.x.x.x, 172.16-31.x.x, 192.168.x.x)
Cloud metadata endpoints (AWS, GCP, Azure)
IPv6 link-local and unique local addresses
Internal suffixes such as .local and .internal
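
A simplified version of such a blocklist check is sketched below. It is not the server's implementation: `isBlockedHost` is a hypothetical function that covers only literal IPv4 addresses and hostname suffixes. A complete check would also handle IPv6 and validate the addresses a hostname resolves to through DNS.

```typescript
// Hypothetical, IPv4-only illustration of the blocklist rules above.
type Ipv4 = [number, number, number, number];

function parseIpv4(host: string): Ipv4 | null {
  const parts = host.split(".");
  if (parts.length !== 4) return null;
  const octets = parts.map((p) => (/^\d{1,3}$/.test(p) ? Number(p) : NaN));
  // NaN fails both comparisons, so malformed octets are rejected.
  return octets.every((o) => o >= 0 && o <= 255) ? (octets as Ipv4) : null;
}

function isBlockedHost(host: string): boolean {
  const h = host.toLowerCase();
  if (h === "localhost" || h.endsWith(".local") || h.endsWith(".internal")) {
    return true;
  }
  const ip = parseIpv4(h);
  if (ip === null) return false; // other hostnames need a DNS-level check
  const [a, b] = ip;
  return (
    a === 127 || // loopback
    a === 10 || // 10.x.x.x
    (a === 172 && b >= 16 && b <= 31) || // 172.16-31.x.x
    (a === 192 && b === 168) || // 192.168.x.x
    (a === 169 && b === 254) // link-local, includes 169.254.169.254 metadata
  );
}
```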

URL Validation

Only http and https URLs
No embedded credentials in URLs
Max URL length: 2048 characters
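
These three rules translate naturally into a check built on the standard WHATWG `URL` class. The sketch below is illustrative only; `validateUrl` and its result shape are hypothetical, not the server's API.

```typescript
// Hypothetical validator for the three rules listed above.
type UrlCheck = { ok: true; url: URL } | { ok: false; reason: string };

function validateUrl(input: string): UrlCheck {
  if (input.length > 2048) {
    return { ok: false, reason: "URL exceeds 2048 characters" };
  }
  let url: URL;
  try {
    url = new URL(input);
  } catch {
    return { ok: false, reason: "not a valid absolute URL" };
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { ok: false, reason: "only http and https are allowed" };
  }
  if (url.username !== "" || url.password !== "") {
    return { ok: false, reason: "embedded credentials are not allowed" };
  }
  return { ok: true, url };
}
```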

Header Sanitization

Blocked headers: host, authorization, cookie, x-forwarded-for, x-real-ip, proxy-authorization
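
A minimal sketch of this filtering step, assuming a case-insensitive match on header names (`sanitizeHeaders` is a hypothetical name):

```typescript
// Header names taken from the blocked list above.
const BLOCKED_HEADERS = new Set([
  "host",
  "authorization",
  "cookie",
  "x-forwarded-for",
  "x-real-ip",
  "proxy-authorization",
]);

/** Returns a copy of `headers` without any blocked entries. */
function sanitizeHeaders(headers: Record<string, string>): Record<string, string> {
  const safe: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    // HTTP header names are case-insensitive, so compare in lower case.
    if (!BLOCKED_HEADERS.has(name.toLowerCase())) safe[name] = value;
  }
  return safe;
}
```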

Rate Limiting

Default: 100 requests/minute per IP (HTTP mode only). Configure with RATE_LIMIT_MAX and RATE_LIMIT_WINDOW_MS.
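
One simple way to enforce a per-IP limit like this is a fixed-window counter, sketched below. Whether superFetch uses a fixed window, a sliding window, or another algorithm is not stated here, so treat `FixedWindowLimiter` as a hypothetical illustration of how `RATE_LIMIT_MAX`, `RATE_LIMIT_WINDOW_MS`, and the periodic cleanup could fit together.

```typescript
// Hypothetical fixed-window limiter; the server's algorithm may differ.
class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly max: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(max = 100, windowMs = 60000, now: () => number = Date.now) {
    this.max = max;
    this.windowMs = windowMs;
    this.now = now;
  }

  /** Returns true when the request from `ip` is within the limit. */
  allow(ip: string): boolean {
    const t = this.now();
    const current = this.windows.get(ip);
    if (!current || t - current.start >= this.windowMs) {
      this.windows.set(ip, { start: t, count: 1 }); // start a new window
      return true;
    }
    if (current.count >= this.max) return false;
    current.count += 1;
    return true;
  }

  /** Drops expired windows; intended to run every RATE_LIMIT_CLEANUP_MS. */
  cleanup(): void {
    const t = this.now();
    for (const [ip, w] of this.windows) {
      if (t - w.start >= this.windowMs) this.windows.delete(ip);
    }
  }

  get trackedClients(): number {
    return this.windows.size;
  }
}
```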

🔧 Development

Scripts

Command	Description
`npm run dev`	Development server with hot reload
`npm run build`	Compile TypeScript
`npm start`	Production server
`npm run lint`	Run ESLint
`npm run type-check`	TypeScript type checking
`npm run format`	Format with Prettier
`npm test`	Run Vitest tests
`npm run test:coverage`	Run tests with coverage
`npm run bench`	Run minimal performance benchmark
`npm run release`	Create new release
`npm run knip`	Find unused exports/dependencies
`npm run knip:fix`	Auto-fix unused code

Tech Stack

Category	Technology
Runtime	Node.js >=20.12
Language	TypeScript 5.9
MCP SDK	@modelcontextprotocol/sdk ^1.25.1
Content Extraction	@mozilla/readability ^0.6.0
HTML Parsing	Cheerio ^1.1.2, LinkeDOM ^0.18.12
Markdown	Turndown ^7.2.2
HTTP	Express ^5.2.1, undici ^6.22.0
Validation	Zod ^3.24.1

🤝 Contributing

1. Fork the repository
2. Create a feature branch: git checkout -b feature/amazing-feature
3. Ensure linting passes: npm run lint
4. Run tests: npm test
5. Commit changes: git commit -m 'Add amazing feature'
6. Push: git push origin feature/amazing-feature
7. Open a Pull Request

For examples of other MCP servers, see: github.com/modelcontextprotocol/servers

⚠️ Important Note

This server can access URLs on behalf of AI assistants. Built-in SSRF protection blocks private IP ranges and cloud metadata endpoints, but exercise caution when deploying in sensitive environments.