Scrapling MCP Server - An adaptive web scraping library with multiple fetching modes and AI integration for developers

Scrapling

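Scrapling is an adaptive web scraping library. It learns from website changes and automatically relocates elements, supports several fetching modes and AI integration, and offers high-performance parsing with a developer-friendly experience.

Research and Data, Developer Tools #WebScraping #AdaptiveLearning #AIIntegration #HighPerformanceParsing Python

Rating: 5

Downloads: 12.4K

Last updated: 2025-09-18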

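What is Scrapling MCP Server?

Scrapling MCP Server is a Model Context Protocol (MCP) server built for AI assistants. It uses Scrapling's scraping capabilities so that an assistant can extract web content selectively: the server locates the relevant content before anything is passed to the AI, which reduces token usage and speeds up processing.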
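
To make the token saving concrete, here is a minimal, standard-library-only sketch of the "extract first, then hand to the model" idea. It does not use Scrapling or the MCP server; the `TargetTextExtractor` class, the sample HTML, and the four-characters-per-token estimate are all illustrative assumptions.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TargetTextExtractor(HTMLParser):
    """Collect text only from elements carrying a given class (hypothetical helper)."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0          # > 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def estimate_tokens(text):
    # Rough assumption: about four characters per token.
    return max(1, len(text) // 4)


PAGE = """
<html><head><title>Shop</title><style>.nav{color:red}</style></head>
<body>
  <nav><a href="/">Home</a><a href="/deals">Deals</a><a href="/help">Help</a></nav>
  <div class="product"><h2>Widget</h2><span class="price">$9.99</span></div>
  <footer>Copyright, terms, privacy, cookie settings, newsletter signup</footer>
</body></html>
"""

extractor = TargetTextExtractor("price")
extractor.feed(PAGE)
targeted = " ".join(extractor.chunks)

full_tokens = estimate_tokens(PAGE)
targeted_tokens = estimate_tokens(targeted)
print(targeted, full_tokens, targeted_tokens)
```

Only the price text reaches the model in this sketch, so the estimated token count drops from the size of the whole page to a single short string.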

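How do you use Scrapling MCP Server?

Configure an AI tool (such as Claude Desktop or Cursor) to connect to the MCP server, and the assistant can call Scrapling's scraping functions directly. You provide a URL and a description of the content to extract; the server handles the fetching and data extraction.

When to use it

It suits any scenario that needs structured data from web pages, including market research, price monitoring, content aggregation, competitive analysis, and academic research. It is particularly useful when an AI assists with analysing and processing the data.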

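Main functions

Smart content extraction

Locates and extracts page content from a natural-language description, with no need to write complex selectors.

Token usage optimization

Pre-processes content before passing it to the AI and extracts only the relevant parts, which substantially reduces token consumption.

Adaptive scraping

Adapts automatically to changes in site structure, keeping data extraction stable over the long term.

Multiple output formats

Supports JSON, text, Markdown, and other output formats to suit different AI models.

Stealth mode

Built-in anti-detection techniques that can bypass anti-bot systems such as Cloudflare.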

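Advantages

Substantially lowers the token cost of AI processing.

Simplifies web data extraction so that it can be used without a technical background.

Strong adaptive capability, which reduces extraction failures caused by site updates.

Supports complex JavaScript-rendered pages and anti-bot protection.

Integrates with mainstream AI tools and works out of the box.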

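Limitations

The MCP server environment must be installed and configured.

Extremely complex dynamic content may still require manual intervention.

A network connection is required for scraping.

Some websites may enforce strict access restrictions.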

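How to use

Install the MCP server

Make sure Scrapling's AI extra is installed.

Configure your AI tool

Configure the server connection in Claude Desktop, Cursor, or another AI tool that supports MCP.

Start using it

Ask the AI assistant, in natural language, to scrape web pages with Scrapling.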
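
As a sketch of the configuration step, the snippet below builds the kind of `mcpServers` entry that MCP clients such as Claude Desktop read from their JSON config file. The server name `ScraplingServer` and the `scrapling mcp` launch command are assumptions that should be checked against the Scrapling documentation for your installed version, and `build_mcp_config` is a hypothetical helper.

```python
import json


def build_mcp_config(server_name="ScraplingServer", command="scrapling", args=("mcp",)):
    """Return an `mcpServers` config dict (server name and command are assumptions)."""
    return {"mcpServers": {server_name: {"command": command, "args": list(args)}}}


config = build_mcp_config()
# Paste the printed JSON into your AI tool's MCP configuration file.
print(json.dumps(config, indent=2))
```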

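Use cases

E-commerce price monitoring

Monitor product price changes on competitors' websites.

News aggregation

Collect the latest headlines from multiple news sites.

Academic research data collection

Collect research paper information from academic websites.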

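Frequently asked questions

What is an MCP server?

Do I need programming knowledge to use it?

Which AI tools are supported?

How fast is scraping?

How are websites that require login handled?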

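🚀 Scrapling

Scrapling is the first adaptive web scraping library: it learns from website changes and evolves with them. When other libraries break because a site's structure was updated, Scrapling automatically relocates your elements and keeps your scrapers running, so you can stop fighting anti-bot systems and stop rewriting selectors after every site update.

🚀 Quick start

Basic usage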

```
from scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher
from scrapling.fetchers import FetcherSession, StealthySession, DynamicSession

# HTTP requests with session support
with FetcherSession(impersonate='chrome') as session:  # Use latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text')

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text')

# Advanced stealth mode (Keep the browser open until you finish)
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a')

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a')

# Full browser automation (Keep the browser open until you finish)
with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()')  # XPath selector if you prefer it

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text')
```

Advanced parsing and navigation

```
from scrapling.fetchers import Fetcher

# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')

# Get quotes with multiple selection methods
quotes = page.css('.quote')  # CSS selector
quotes = page.xpath('//div[@class="quote"]')  # XPath
quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup-style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...
# Find element by text content
quotes = page.find_by_text('quote', tag='div')

# Advanced navigation
first_quote = page.css_first('.quote')
quote_text = first_quote.css('.text::text')
quote_text = page.css('.quote').css_first('.text::text')  # Chained selectors
quote_text = page.css_first('.quote .text').text  # Using `css_first` is faster than `css` if you want the first element
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent

# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()
```

You can also use the parser directly, without fetching a website, as shown below:

```
from scrapling.parser import Selector

page = Selector("<html>...</html>")
```

It works exactly the same way!

Async session management example

```
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession


async def main():
    # `FetcherSession` is context-aware and can work in both sync/async patterns
    async with FetcherSession(http3=True) as session:
        # Inside `async with`, each request is awaited
        page1 = await session.get('https://quotes.toscrape.com/')
        page2 = await session.get('https://quotes.toscrape.com/', impersonate='firefox135')

    # Async session usage
    async with AsyncStealthySession(max_pages=2) as session:
        urls = ['https://example.com/page1', 'https://example.com/page2']
        tasks = [session.fetch(url) for url in urls]

        print(session.get_pool_stats())  # Optional - The status of the browser tabs pool (busy/free/error)
        results = await asyncio.gather(*tasks)
        print(session.get_pool_stats())


# `async with` is only valid inside a coroutine, so run everything through `main()`
asyncio.run(main())
```
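
The `max_pages=2` argument above caps how many browser tabs work at the same time. The following sketch imitates that idea with a plain `asyncio.Semaphore`. It is not Scrapling's implementation: `fake_fetch`, the short sleep, and the example URLs are stand-ins, and no network access takes place.

```python
import asyncio


async def fake_fetch(url, pool, stats):
    """Stand-in for a page fetch; no network access happens here."""
    async with pool:                      # wait for a free "tab"
        stats["busy"] += 1
        stats["peak"] = max(stats["peak"], stats["busy"])
        await asyncio.sleep(0.01)         # simulate page load time
        stats["busy"] -= 1
        return f"fetched {url}"


async def crawl(urls, max_pages=2):
    pool = asyncio.Semaphore(max_pages)   # at most `max_pages` concurrent fetches
    stats = {"busy": 0, "peak": 0}
    results = await asyncio.gather(*(fake_fetch(u, pool, stats) for u in urls))
    return results, stats


urls = [f"https://example.com/page{i}" for i in range(1, 6)]
results, stats = asyncio.run(crawl(urls))
print(results[0], stats)
```

Five URLs are queued, but the recorded peak never exceeds two, which is the behaviour a bounded tab pool is meant to guarantee.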

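✨ Key features

Advanced website fetching with session support

HTTP requests: fast, stealthy HTTP requests with the Fetcher class. It can impersonate browsers' TLS fingerprints and headers, and can use HTTP3.
Dynamic loading: fetch dynamic websites with full browser automation through the DynamicFetcher class, which supports Playwright's Chromium, real Chrome, and a custom stealth mode.
Anti-bot bypass: StealthyFetcher provides advanced stealth capabilities using a modified Firefox and fingerprint spoofing. It can bypass all levels of Cloudflare's Turnstile through automation.
Session management: the FetcherSession, StealthySession, and DynamicSession classes support persistent sessions for cookie and state management across requests.
Async support: all fetchers offer complete async support, with dedicated async session classes.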

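Adaptive scraping and AI integration

🔄 Smart element tracking: relocates elements after website changes using intelligent similarity algorithms.
🎯 Smart, flexible selection: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
🔍 Find similar elements: automatically locates elements similar to ones already found.
🤖 MCP server for use with AI: a built-in MCP server for AI-assisted web scraping and data extraction. The MCP server has custom capabilities that use Scrapling to extract the targeted content before passing it to the AI (Claude, Cursor, and others), which speeds up operations and lowers cost by reducing token usage.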
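
To illustrate what relocating an element by similarity can look like, here is a toy sketch: it stores a small signature of an element and, after the markup changes, scores every candidate against that signature and picks the best match. This is not Scrapling's actual algorithm; the signature fields, the equal weighting, and the sample elements are illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a, b).ratio()


def score(saved, candidate):
    """Average similarity across tag, class list, text, and parent tag."""
    fields = ("tag", "classes", "text", "parent")
    return sum(similarity(saved[f], candidate[f]) for f in fields) / len(fields)


def relocate(saved, candidates):
    """Return the candidate that most resembles the saved signature."""
    return max(candidates, key=lambda c: score(saved, c))


# Signature saved while the original selector (`.product-price`) still matched.
saved = {"tag": "span", "classes": "product-price", "text": "$9.99", "parent": "div"}

# Elements found after a redesign renamed the class and changed the price.
candidates = [
    {"tag": "a", "classes": "nav-link", "text": "Home", "parent": "nav"},
    {"tag": "span", "classes": "item-price", "text": "$10.49", "parent": "div"},
    {"tag": "p", "classes": "footer-note", "text": "Terms apply", "parent": "footer"},
]

best = relocate(saved, candidates)
print(best["classes"])
```

Even though the old class name no longer exists, the renamed price element still scores highest because its tag, parent, and text shape remain close to the saved signature.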

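High-performance, battle-tested architecture

🚀 Lightning fast: optimized performance that outperforms most Python scraping libraries.
🔋 Memory efficient: optimized data structures and lazy loading for a minimal memory footprint.
⚡ Fast JSON serialization: 10x faster than the standard library.
🏗️ Battle-tested: Scrapling has 92% test coverage and full type-hint coverage, and it has been used daily by hundreds of web scrapers over the past year.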

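A friendly experience for developers and web scrapers

🎯 Interactive web scraping shell: an optional built-in IPython shell with Scrapling integration, shortcuts, and new tools that speed up scraping-script development, such as converting curl requests into Scrapling requests and viewing request results in your browser.
🚀 Use it directly from the terminal: optionally, you can use Scrapling to scrape a URL without writing any code.
🛠️ Rich navigation API: advanced DOM traversal with parent, sibling, and child navigation methods.
🧬 Enhanced text processing: built-in regex, cleaning methods, and optimized string operations.
📝 Automatic selector generation: generate robust CSS/XPath selectors for any element.
🔌 Familiar API: similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy/Parsel.
📘 Complete type coverage: full type hints for excellent IDE support and code completion.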

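New session architecture

Scrapling 0.3 introduces a completely new session system:

Persistent sessions: keep cookies, headers, and authentication across multiple requests.
Automatic session management: handles the session lifecycle, including proper cleanup.
Session inheritance: all fetchers support both one-off requests and persistent-session usage.
Concurrent session support: run multiple isolated sessions at the same time.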

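📦 Installation

Scrapling requires Python 3.10 or higher:

```
pip install scrapling
```

Starting with v0.3.2, this installation includes only the parser engine and its dependencies, without any fetchers or command-line dependencies.

Optional dependencies

If you plan to use any of the extra features below, the fetchers, or their classes, install the fetchers' dependencies and then install the browser dependencies with the following commands: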
```
pip install "scrapling[fetchers]"

scrapling install
```
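This downloads all browsers along with their system dependencies and fingerprint-manipulation dependencies.
Extra features:
- Install the MCP server feature:
```
pip install "scrapling[ai]"
```
- Install the shell features (the web scraping shell and the `extract` command):
```
pip install "scrapling[shell]"
```
- Install everything:
```
pip install "scrapling[all]"
```
After installing any of these extras, remember to install the browser dependencies with `scrapling install` if you haven't already.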
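
If you want to confirm which optional pieces are importable before running a scraper, a small standard-library check like the one below can help. The module probed for each extra is an assumption (for example, that the `fetchers` extra pulls in `playwright` and the `ai` extra pulls in `mcp`); adjust the mapping to whatever your installed version actually depends on.

```python
import sys
from importlib.util import find_spec

MIN_PYTHON = (3, 10)  # from the requirement stated above

# Assumed mapping of extras to one representative module each; verify locally.
EXTRA_PROBES = {
    "core": "scrapling",
    "fetchers": "playwright",
    "ai": "mcp",
    "shell": "IPython",
}


def python_ok(version=sys.version_info):
    """True when the interpreter meets the minimum Python version."""
    return tuple(version[:2]) >= MIN_PYTHON


def installed_extras(probes=EXTRA_PROBES):
    """Map each extra to True/False depending on whether its probe module is importable."""
    return {extra: find_spec(module) is not None for extra, module in probes.items()}


print("Python OK:", python_ok())
for extra, present in installed_extras().items():
    print(f"{extra:9s} {'installed' if present else 'missing'}")
```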

💻 Usage examples

Basic usage

```
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher

# Enable adaptive mode
StealthyFetcher.adaptive = True

# Fetch a website's source while staying under the radar
page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
print(page.status)  # e.g. 200

# Scrape data in a way that survives website design changes
products = page.css('.product', auto_save=True)

# Later, if the site's structure changes, pass `adaptive=True`
products = page.css('.product', adaptive=True)
# Scrapling can still find them
```

📚 Documentation

Scrapling v0.3 includes a powerful command-line interface:

```
# Launch the interactive web scraping shell
scrapling shell

# Extract pages to a file directly, without writing code (extracts the content inside the `body` tag by default).
# If the output file ends with `.txt`, the text content of the target is extracted.
# If it ends with `.md`, the output is a Markdown representation of the HTML content; `.html` gives the HTML content itself.
scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'  # All elements matching the CSS selector '#fromSkipToProducts'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
```
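
The extension rule described in the comments above can be summarised as a small lookup. The sketch below only models that rule for illustration and is not Scrapling's code; `output_mode` is a hypothetical helper, and rejecting any other extension is an assumption.

```python
from pathlib import Path

# Mapping taken from the rule above: extension -> what gets written.
MODES = {".txt": "text", ".md": "markdown", ".html": "html"}


def output_mode(filename):
    """Return the content type implied by the output file's extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in MODES:
        # Assumption: other extensions are treated as an error.
        raise ValueError(f"unsupported output extension: {suffix!r}")
    return MODES[suffix]


for name in ("content.md", "content.txt", "captchas.html"):
    print(name, "->", output_mode(name))
```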

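⚠️ Important note

There are many more features, such as the MCP server and the interactive web scraping shell, but this page is kept concise. See the full documentation for details.

🔧 Technical details

Text extraction speed test (5000 nested elements)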

#	Library	Time (ms)	vs Scrapling
1	Scrapling	1.92	1.0x
2	Parsel/Scrapy	1.99	1.036x
3	Raw Lxml	2.33	1.214x
4	PyQuery	20.61	~11x
5	Selectolax	80.65	~42x
6	BS4 with Lxml	1283.21	~668x
7	MechanicalSoup	1304.57	~679x
8	BS4 with html5lib	3331.96	~1735x
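
The last column is each library's time divided by Scrapling's time. A quick way to recompute it from the timings is sketched below; the numbers are copied from the table, and the helper name `relative_to` is illustrative.

```python
# Timings in milliseconds, copied from the table above.
TIMES_MS = {
    "Scrapling": 1.92,
    "Parsel/Scrapy": 1.99,
    "Raw Lxml": 2.33,
    "PyQuery": 20.61,
    "Selectolax": 80.65,
    "BS4 with Lxml": 1283.21,
    "MechanicalSoup": 1304.57,
    "BS4 with html5lib": 3331.96,
}


def relative_to(times, baseline="Scrapling"):
    """Return each time divided by the baseline's time."""
    base = times[baseline]
    return {name: ms / base for name, ms in times.items()}


ratios = relative_to(TIMES_MS)
for name, ratio in ratios.items():
    print(f"{name:18s} {TIMES_MS[name]:8.2f} ms  {ratio:8.1f}x")
```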