Content Core MCP Server: AI Development Tool for Multi-source Content Extraction and Integration

Content Core

Content Core is an AI-driven content extraction and processing platform. It extracts content from web pages, documents, audio, and video, and provides intelligent cleaning and summary generation. It can be used from the command line, as a Python library, or as an MCP server.

Categories: Artificial intelligence chatbots, Research and data
Tags: Content extraction, AI processing, Multi-format support, Intelligent summarization
Language: Python

rating : 2.5 points

downloads : 8.8K

update time : 2025-07-31

What is the Content Core MCP Server?

The Content Core MCP Server is a content processing service based on the Model Context Protocol. It extracts content from sources such as web pages, documents, video, and audio, then structures the result and generates AI summaries.

How to use the MCP Server?

You can call the powerful content extraction and processing capabilities through simple HTTP requests or by directly integrating it into applications such as Claude Desktop. It supports both zero-configuration quick start and in-depth customization modes.

Use cases

It is suitable for scenarios such as AI applications that need to extract content from multiple formats, research data collection, content summary generation, and knowledge base construction. It is particularly suitable for use with AI assistants like Claude.

Main Features

Multi-source content extraction

Supports intelligent extraction from more than 20 types of content sources, including web pages, documents (PDF, Word, etc.), video, audio, and images.

AI content processing

Built-in capabilities for content cleaning, structured processing, and summary generation in various styles.

Intelligent engine selection

Automatically selects the best extraction engine (Firecrawl/Jina/OCR, etc.) based on the content type.

Deep integration with Claude

You can directly call the content extraction function in Claude conversations with one-click configuration.

Advantages

Ready to use without complex configuration

Supports a wide range of content types, from web pages to professional documents

Fast processing speed, using an asynchronous architecture

Seamless integration with AI assistants like Claude

Limitations

Some professional document formats require additional OCR support

Video processing requires local computing resources

Advanced features require an API key

How to Use

Install the service

Install via pip or run using the uvx zero-install method.

Claude integration configuration

Add MCP server settings to the Claude Desktop configuration file.

Call the service

Use specific syntax in conversations to trigger content extraction or summary functions.

Usage Examples

Web page content extraction

Extract the main text from a news website and generate an executive summary.

Academic PDF processing

Extract the abstract and figure descriptions from a research paper.

Meeting audio transcription

Convert an audio recording file into a text transcript and extract action items.

Frequently Asked Questions

Do I need to deploy my own OCR service?

Will processing large files time out?

How can I view the extracted raw content?

Does it support Chinese content processing?

Related Resources

Official GitHub repository

Source code and latest version releases

MCP protocol documentation

Technical specifications of the Model Context Protocol

Installation and configuration video tutorial

A 10-minute quick start guide

🚀 Content Core

Content Core is a powerful, AI-powered content extraction and processing platform. It can transform any source into clean, structured content. You can extract text from websites, transcribe videos, process documents, and generate AI summaries through a unified interface with multiple integration options.

🚀 Quick Start

You can quickly integrate content-core into your Python projects to extract, clean, and summarize content from various sources.

import asyncio
import content_core as cc

async def main():
    # Extract content from a URL, file, or text
    result = await cc.extract("https://example.com/article")

    # Clean messy content
    cleaned_text = await cc.clean("...messy text with [brackets] and extra spaces...")

    # Summarize content with optional context
    summary = await cc.summarize_content("long article text", context="explain to a child")

asyncio.run(main())

✨ Features

Extract content from anywhere

📄 Documents: PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
🎥 Media: Videos (MP4, AVI, MOV) with automatic transcription
🎵 Audio: MP3, WAV, M4A with speech-to-text conversion
🌐 Web: Any URL with intelligent content extraction
🖼️ Images: JPG, PNG, TIFF with OCR text recognition
📦 Archives: ZIP, TAR, GZ with content analysis
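
The format lists above can be read as a lookup table from file extension to content category. The following is a minimal sketch of that idea, not Content Core's actual detection code: `CATEGORY_BY_EXTENSION` and `categorize` are hypothetical names, and the extension spellings (for example `.docx` for Word) are assumptions.

```python
from pathlib import Path

# Hypothetical lookup table built from the format lists above.
CATEGORY_BY_EXTENSION = {
    ".pdf": "document", ".docx": "document", ".pptx": "document",
    ".xlsx": "document", ".md": "document", ".html": "document",
    ".epub": "document",
    ".mp4": "video", ".avi": "video", ".mov": "video",
    ".mp3": "audio", ".wav": "audio", ".m4a": "audio",
    ".jpg": "image", ".png": "image", ".tiff": "image",
    ".zip": "archive", ".tar": "archive", ".gz": "archive",
}


def categorize(filename: str) -> str:
    """Return the category for a file name, or 'unknown' if the extension is not listed."""
    # Only the last suffix is considered, so "backup.tar.gz" is looked up as ".gz".
    suffix = Path(filename).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "unknown")
```

A lookup like this only covers the file-based sources; URLs would need a separate check.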

Process with AI

✨ Clean & format extracted content automatically
📝 Generate summaries with customizable styles (bullet points, executive summary, etc.)
🎯 Context-aware processing: explain to a child, technical summary, action items
🔄 Smart engine selection: automatically chooses the best extraction method

Key Features

🎯 Intelligent Auto-Detection: Automatically selects the best extraction method based on content type and available services
🔧 Smart Engine Selection:
- URLs: Firecrawl → Jina → BeautifulSoup fallback chain
- Documents: Docling → Enhanced PyMuPDF → Simple extraction fallback
- Media: OpenAI Whisper transcription
- Images: OCR with multiple engine support
📊 Enhanced PDF Processing: Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
🌍 Multiple Integrations: CLI, Python library, MCP server, Raycast extension, macOS Services
⚡ Zero-Install Options: Use uvx for instant access without installation
🧠 AI-Powered Processing: LLM integration for content cleaning and summarization
🔄 Asynchronous: Built with asyncio for efficient processing
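
The fallback chains listed under Smart Engine Selection follow a common pattern: try each engine in order and return the first result that succeeds. The sketch below illustrates that pattern only. It is not Content Core's implementation; `extract_with_fallback` and the three `fake_*` engines are hypothetical stand-ins that do not call the real services.

```python
from typing import Callable, Sequence, Tuple

Engine = Tuple[str, Callable[[str], str]]


def extract_with_fallback(source: str, engines: Sequence[Engine]) -> Tuple[str, str]:
    """Try each (name, engine) pair in order; return (name, text) from the first success."""
    errors = []
    for name, engine in engines:
        try:
            return name, engine(source)
        except Exception as exc:  # a real implementation would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all engines failed: " + "; ".join(errors))


# Stand-in engines for illustration only.
def fake_firecrawl(url: str) -> str:
    raise ConnectionError("no API key configured")


def fake_jina(url: str) -> str:
    return f"text of {url}"


def fake_beautifulsoup(url: str) -> str:
    return f"basic text of {url}"


URL_CHAIN = [
    ("firecrawl", fake_firecrawl),
    ("jina", fake_jina),
    ("beautifulsoup", fake_beautifulsoup),
]
```

Calling `extract_with_fallback("https://example.com", URL_CHAIN)` skips the failing first engine and returns the result of the second, which mirrors the "Firecrawl → Jina → BeautifulSoup" ordering described above.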

📦 Installation

Install Content Core using pip:

# Basic installation (PyMuPDF + BeautifulSoup/Jina extraction; MCP server support is included by default)
pip install content-core

# Full installation with enhanced document processing (adds Docling)
# The quotes keep shells such as zsh from interpreting the square brackets
pip install "content-core[docling]"

Alternatively, if you’re developing locally:

# Clone the repository
git clone https://github.com/lfnovo/content-core
cd content-core

# Install with uv
uv sync

💻 Usage Examples

🖥️ Command Line (Zero Install)

# Extract content from any source
uvx --from "content-core" ccore https://example.com
uvx --from "content-core" ccore document.pdf

# Generate AI summaries  
uvx --from "content-core" csum video.mp4 --context "bullet points"
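
If you want to drive the zero-install commands from a Python script, one option is to assemble the same argument list and hand it to `subprocess`. This is a sketch under assumptions: `build_ccore_command` and `run_ccore` are hypothetical helper names, and the examples above only show `--context` used with `csum`, so whether `ccore` accepts it is not confirmed here.

```python
import shutil
import subprocess
from typing import List, Optional


def build_ccore_command(tool: str, source: str, context: Optional[str] = None) -> List[str]:
    """Build the uvx argument list for the `ccore` or `csum` commands shown above."""
    if tool not in ("ccore", "csum"):
        raise ValueError(f"unsupported tool: {tool}")
    cmd = ["uvx", "--from", "content-core", tool, source]
    if context is not None:
        # The examples above only demonstrate --context with csum.
        cmd += ["--context", context]
    return cmd


def run_ccore(tool: str, source: str, context: Optional[str] = None) -> str:
    """Run the command and return its standard output; requires uvx on PATH."""
    if shutil.which("uvx") is None:
        raise FileNotFoundError("uvx not found on PATH")
    result = subprocess.run(
        build_ccore_command(tool, source, context),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with values such as "bullet points".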

🤖 Claude Desktop Integration

One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.

🔍 Raycast Extension

Smart auto-detection commands:

Extract Content: Full interface with format options
Summarize Content: 9 summary styles available
Quick Extract: Instant clipboard extraction

🖱️ macOS Right-Click Integration

Right-click any file in Finder → Services → Extract or Summarize content instantly.

🐍 Python Library

import asyncio
import content_core as cc

async def main():
    # Extract from any source, then summarize the result
    result = await cc.extract("https://example.com/article")
    summary = await cc.summarize_content(result, context="explain to a child")

asyncio.run(main())

📚 Documentation

For more information on how to use the Content Core library, including details on AI model configuration and customization, refer to our Usage Documentation.

🔧 Technical Details

MCP Server Integration

Content Core includes a Model Context Protocol (MCP) server that enables seamless integration with Claude Desktop and other MCP-compatible applications. The MCP server exposes Content Core's powerful extraction capabilities through a standardized protocol.

Quick Setup with Claude Desktop

# Install Content Core (MCP server included)
pip install content-core

# Or use directly with uvx (no installation required)
uvx --from "content-core" content-core-mcp

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "content-core": {
      "command": "uvx",
      "args": [
        "--from",
        "content-core",
        "content-core-mcp"
      ]
    }
  }
}
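
If claude_desktop_config.json already lists other MCP servers, the entry above has to be merged in rather than pasted over the file. The following sketch shows one way to do that with the standard library. `add_content_core_server` and `update_config_file` are hypothetical helpers, and the location of the config file is left to the caller because it is not stated here.

```python
import json
from pathlib import Path

# The server entry exactly as shown in the JSON snippet above.
CONTENT_CORE_ENTRY = {
    "command": "uvx",
    "args": ["--from", "content-core", "content-core-mcp"],
}


def add_content_core_server(config: dict) -> dict:
    """Return a copy of the config with the content-core MCP server entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # copy so the input is not mutated
    servers["content-core"] = CONTENT_CORE_ENTRY
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read the JSON config at `path` (if it exists), add the entry, and write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(add_content_core_server(config), indent=2))
```

Existing servers and unrelated settings are preserved; only the "content-core" key is added or replaced.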

For detailed setup instructions, configuration options, and usage examples, see our MCP Documentation.

Enhanced PDF Processing

Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.

Key Improvements

🔬 Mathematical Formula Extraction: Enhanced quality flags eliminate placeholder markers left in place of undecoded formulas
📊 Automatic Table Detection: Tables converted to markdown format for LLM consumption
🔧 Quality Text Rendering: Better ligature, whitespace, and image-text integration
⚡ Optional OCR Enhancement: Selective OCR for formula-heavy pages (requires Tesseract)

Configuration for Scientific Documents

For documents with heavy mathematical content, enable OCR enhancement:

# In cc_config.yaml
extraction:
  pymupdf:
    enable_formula_ocr: true      # Enable OCR for formula-heavy pages
    formula_threshold: 3          # Min formulas per page to trigger OCR
    ocr_fallback: true           # Graceful fallback if OCR fails

# Runtime configuration
from content_core.config import set_pymupdf_ocr_enabled
set_pymupdf_ocr_enabled(True)
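
To make the effect of `formula_threshold` concrete, the sketch below models the selection rule described in the config comments: a page is sent to OCR only when formula OCR is enabled and the page contains at least the threshold number of formulas. This is an illustration under assumptions, not the library's code. `PyMuPDFOptions` and `pages_needing_ocr` are hypothetical names, the default for `enable_formula_ocr` is assumed, and how formulas are counted is outside the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PyMuPDFOptions:
    """Hypothetical mirror of the extraction.pymupdf keys shown above."""
    enable_formula_ocr: bool = False  # assumed default; not confirmed by the source
    formula_threshold: int = 3
    ocr_fallback: bool = True


def pages_needing_ocr(formula_counts: List[int], options: PyMuPDFOptions) -> List[int]:
    """Return zero-based indexes of pages whose formula count meets the threshold."""
    if not options.enable_formula_ocr:
        return []
    return [
        index
        for index, count in enumerate(formula_counts)
        if count >= options.formula_threshold
    ]
```

With per-page formula counts of 0, 3, 5, and 2 and the threshold of 3, only the second and third pages would be selected.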

Requirements for OCR Enhancement

# Install Tesseract OCR (optional, for formula enhancement)
# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

Note: OCR is optional - you get improved PDF extraction automatically without any additional setup.

macOS Services Integration

Content Core provides powerful right-click integration with macOS Finder, allowing you to extract and summarize content from any file without installation. Choose between clipboard or TextEdit output for maximum flexibility.

Available Services

Create 4 convenient services for different workflows:

Extract Content → Clipboard: Quick copy for immediate pasting
Extract Content → TextEdit: Review before using
Summarize Content → Clipboard: Quick summary copying
Summarize Content → TextEdit: Formatted summary with headers

Quick Setup

Install uv (if not already installed):

curl -LsSf https://astral.sh/uv/install.sh | sh

Create the services manually using Automator (about 5 minutes of setup)

Usage

Right-click any supported file in Finder → Services → Choose your option:

PDFs, Word docs: Instant text extraction
Videos, audio files: Automatic transcription
Images: OCR text recognition
Web content: Clean text extraction
Multiple files: Batch processing support

Features

Zero-install processing: Uses uvx for isolated execution
Multiple output options: Clipboard or TextEdit display
System notifications: Visual feedback on completion
Wide format support: 20+ file types supported
Batch processing: Handle multiple files at once
Keyboard shortcuts: Assignable hotkeys for power users

For complete setup instructions with copy-paste scripts, see macOS Services Documentation.

Raycast Extension

Content Core provides a powerful Raycast extension with smart auto-detection that handles both URLs and file paths seamlessly. Extract and summarize content directly from your Raycast interface without switching applications.

Quick Setup

From Raycast Store (coming soon):

Open Raycast and search for "Content Core"
Install the extension by luis_novo
Configure API keys in preferences

Manual Installation:

Download the extension from the repository
Open Raycast → "Import Extension"
Select the raycast-content-core folder

Commands

🔍 Extract Content: Smart URL/file detection with full interface
- Auto-detects URLs vs file paths in real-time
- Multiple output formats (Text, JSON, XML)
- Drag & drop support for files
- Rich results view with metadata
📝 Summarize Content: AI-powered summaries with customizable styles
- 9 different summary styles (bullet points, executive summary, etc.)
- Auto-detects source type with visual feedback
- One-click snippet creation and quicklinks
⚡ Quick Extract: Instant extraction to clipboard
- Type → Tab → Paste source → Enter
- No UI, works directly from command bar
- Perfect for quick workflows

Features

Smart Auto-Detection: Instantly recognizes URLs vs file paths
Zero Installation: Uses uvx for Content Core execution
Rich Integration: Keyboard shortcuts, clipboard actions, Raycast snippets
All File Types: Documents, videos, audio, images, archives
Visual Feedback: Real-time type detection with icons
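
Distinguishing URLs from file paths, as the Smart Auto-Detection feature does, can be approximated with the standard library. The sketch below is an assumption about one reasonable approach, not the extension's actual logic (which is not described here); `detect_source_type` is a hypothetical name.

```python
from urllib.parse import urlparse


def detect_source_type(source: str) -> str:
    """Classify input as 'url' or 'file' based on its scheme and host."""
    parsed = urlparse(source.strip())
    # Require both an http(s) scheme and a host; anything else is treated as a path.
    # Note that a bare domain such as "example.com" is therefore classified as a file.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "file"
```

The strict scheme check keeps Windows-style paths such as `C:\docs\a.pdf`, which `urlparse` reads as having the scheme `c`, from being mistaken for URLs.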

For detailed setup, configuration, and usage examples, see Raycast Extension Documentation.

Using with LangChain

For users integrating with the LangChain framework, content-core exposes a set of compatible tools. These tools, located in the src/content_core/tools directory, let you use content-core's extraction, cleaning, and summarization capabilities directly within your LangChain agents and chains.

You can import and use these tools like any other LangChain tool. For example:

from content_core.tools import extract_content_tool, cleanup_content_tool, summarize_content_tool
from langchain.agents import initialize_agent, AgentType

# `llm` must be a LangChain-compatible model instance that you have already created
tools = [extract_content_tool, cleanup_content_tool, summarize_content_tool]
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("Extract the content from https://example.com and then summarize it.")

Refer to the source code in src/content_core/tools for specific tool implementations and usage details.

Basic Usage

The core functionality revolves around the extract_content function.

import asyncio
from content_core.extraction import extract_content

async def main():
    # Extract from raw text
    text_data = await extract_content({"content": "This is my sample text content."})
    print(text_data)

    # Extract from a URL (uses 'auto' engine by default)
    url_data = await extract_content({"url": "https://www.example.com"})
    print(url_data)

    # Extract from a local video file (gets transcript, engine='auto' by default)
    video_data = await extract_content({"file_path": "path/to/your/video.mp4"})
    print(video_data)

    # Extract from a local markdown file (engine='auto' by default)
    md_data = await extract_content({"file_path": "path/to/your/document.md"})
    print(md_data)

    # Per-execution override with Docling for documents
    doc_data = await extract_content({
        "file_path": "path/to/your/document.pdf",
        "document_engine": "docling",
        "output_format": "html"
    })
    print(doc_data)

    # Per-execution override with Firecrawl for URLs
    url_data = await extract_content({
        "url": "https://www.example.com",
        "url_engine": "firecrawl"
    })
    print(url_data)

if __name__ == "__main__":
    asyncio.run(main())

(See src/content_core/notebooks/run.ipynb for more detailed examples.)
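
Because `extract_content` is a coroutine, several sources can be processed concurrently instead of one after another. The sketch below shows one way to do this with `asyncio.gather` and a semaphore. `extract_many` and `max_concurrency` are hypothetical names introduced for illustration; the extractor is passed in as a parameter so the helper does not depend on any particular Content Core signature beyond the dictionary input shown above.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def extract_many(
    sources: List[dict],
    extract: Callable[[dict], Awaitable[Any]],
    max_concurrency: int = 4,
) -> List[Any]:
    """Run `extract` over all sources concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(source: dict) -> Any:
        async with semaphore:
            return await extract(source)

    # return_exceptions=True keeps one failed source from discarding the other results;
    # failures appear in the result list as exception objects, in input order.
    return await asyncio.gather(*(worker(s) for s in sources), return_exceptions=True)
```

Inside an async function you could call it as `results = await extract_many([{"url": "https://www.example.com"}, {"file_path": "path/to/your/document.md"}], extract_content)`, then check each result with `isinstance(result, Exception)` before using it.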

Docling Integration

Content Core supports an optional Docling-based extraction engine for rich document formats (PDF, DOCX, PPTX, XLSX, Markdown, AsciiDoc, HTML, CSV, Images).

Enabling Docling

Docling is not the default document engine; the default setting is "auto". To use Docling explicitly, set the document engine to "docling". To avoid it entirely, set the engine to "simple".

Via configuration file

In your cc_config.yaml or custom config, set:

extraction:
  document_engine: docling  # 'auto' (default), 'simple', or 'docling'
  url_engine: auto          # 'auto' (default), 'simple', 'firecrawl', or 'jina'
  docling:
    output_format: markdown  # markdown | html | json

Programmatically in Python

import content_core as cc
from content_core.config import set_document_engine, set_url_engine, set_docling_output_format

# switch document engine to Docling
set_document_engine("docling")

# switch URL engine to Firecrawl
set_url_engine("firecrawl")

# choose output format: 'markdown', 'html', or 'json'
set_docling_output_format("html")

# subsequent extractions use these settings (call from inside an async function)
result = await cc.extract("document.pdf")

Configuration

Configuration settings (like API keys for external services, logging levels) can be managed through environment variables or .env files, loaded automatically via python-dotenv.

Example .env:

OPENAI_API_KEY=your-key-here
GOOGLE_API_KEY=your-key-here

# Engine Selection (optional)
CCORE_DOCUMENT_ENGINE=auto  # auto, simple, docling
CCORE_URL_ENGINE=auto       # auto, simple, firecrawl, jina

Engine Selection via Environment Variables

For deployment scenarios like MCP servers or Raycast extensions, you can override the extraction engines using environment variables:

CCORE_DOCUMENT_ENGINE: Force document engine (auto, simple, docling)
CCORE_URL_ENGINE: Force URL engine (auto, simple, firecrawl, jina)

These variables take precedence over config file settings and provide explicit control for different deployment scenarios.
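
The precedence rule above (environment variable first, then config file, then the default) can be sketched as a small resolver. This is an illustration of the ordering only, not the library's code: `resolve_document_engine` is a hypothetical function, the flat `document_engine` config key is a simplification of the YAML layout, and raising on an unrecognized value is an assumption about error handling.

```python
import os
from typing import Mapping, Optional

VALID_DOCUMENT_ENGINES = ("auto", "simple", "docling")


def resolve_document_engine(
    config: Mapping[str, str],
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the document engine: environment variable, then config value, then 'auto'."""
    environ = os.environ if environ is None else environ
    candidate = (
        environ.get("CCORE_DOCUMENT_ENGINE")
        or config.get("document_engine")
        or "auto"
    )
    if candidate not in VALID_DOCUMENT_ENGINES:
        # Assumed behavior: reject unknown names instead of silently falling back.
        raise ValueError(f"unknown document engine: {candidate}")
    return candidate
```

The same shape would apply to `CCORE_URL_ENGINE` with its own list of valid values (auto, simple, firecrawl, jina).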

Custom Prompt Templates

Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the prompts directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the PROMPT_PATH environment variable in your .env file or system environment.

Example .env with custom prompt path:

OPENAI_API_KEY=your-key-here
GOOGLE_API_KEY=your-key-here
PROMPT_PATH=/path/to/your/custom/prompts

When a prompt template is requested, Content Core will first look in the custom directory specified by PROMPT_PATH (if set and exists). If the template is not found there, it will fall back to the default built-in prompts. This allows you to override specific prompts while still using the default ones for others.
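
The lookup order described above (custom directory first, built-in prompts second) can be sketched as follows. This is a minimal illustration under assumptions: `find_prompt_template` is a hypothetical function, the built-in directory is passed in explicitly because its exact location is not given here, and template file names are whatever the caller supplies.

```python
import os
from pathlib import Path
from typing import Optional


def find_prompt_template(
    name: str,
    builtin_dir: Path,
    custom_dir: Optional[Path] = None,
) -> Path:
    """Return the custom template if present, otherwise the built-in one."""
    if custom_dir is None:
        # Fall back to the PROMPT_PATH environment variable when no directory is passed.
        env_value = os.environ.get("PROMPT_PATH")
        custom_dir = Path(env_value) if env_value else None

    # The custom directory is only consulted if it is set and actually exists.
    if custom_dir is not None and custom_dir.is_dir():
        candidate = custom_dir / name
        if candidate.is_file():
            return candidate

    fallback = builtin_dir / name
    if fallback.is_file():
        return fallback
    raise FileNotFoundError(f"prompt template not found: {name}")
```

Because the fallback is per template, a custom directory containing a single file overrides only that prompt and leaves the rest on the built-in versions.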

Development

To set up a development environment:

# Clone the repository
git clone <repository-url>
cd content-core

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv sync --group dev

# Run tests
make test

# Lint code
make lint

# See all commands
make help