Archive-agent MCP: Intelligent File Indexer with AI Search & OCR, Multifile Support

Archive Agent

Archive Agent is an intelligent file indexing tool that supports searching and querying file content through natural language. It combines AI search (RAG engine), automatic OCR, and the MCP interface, and can handle various file types, including text, documents, PDFs, and images.

update time : 2025-07-24

What is an MCP server?

The MCP server is a companion service of Archive Agent that lets IDEs and AI extensions interact with it through the Model Context Protocol (MCP). It exposes read-only tools for listing patterns and tracked files, searching files, and answering questions about file content.

How to use the MCP server?

After starting the MCP server, users can call its tools directly from an IDE or AI extension, for example to list tracked files, search for relevant files, or get answers to questions about file content. The IDE or AI extension only needs to be configured with the server's connection parameters.

Applicable scenarios

Suitable for work environments that require frequent file access, information querying, and AI-assisted analysis. For example, developers can quickly find documentation while writing code, and researchers can obtain key information when dealing with large amounts of data.

Main Features

Automated file indexing

Track file changes (added, removed, or modified files) and sync them to the index with the track, commit, and update commands, so the index reflects the current state of your files.

AI-enhanced search

Combine AI models to provide intelligent search and answer generation, improving information retrieval efficiency.

Multi-platform compatibility

Support multiple operating systems and development environments, including Linux, Windows, and mainstream IDEs.

Security and privacy

Locally run AI models protect user data privacy and prevent sensitive information leakage.

Advantages

Improve work efficiency and reduce manual operations

Support the integration of complex AI models to enhance functional diversity

Limitations

Initial configuration may be difficult for beginners

Depends on network connection (for remote AI models)

How to Use

Install Archive Agent

Follow the guide in the README to complete the installation of Archive Agent.

Start the MCP server

Run archive-agent mcp to start the MCP server. Press CTRL+C in the console to stop it.

Configure the IDE or AI extension

Set the connection parameters of the MCP server according to the documentation of your IDE or AI extension.

Usage Examples

Quickly find files

Use the GitHub Copilot agent mode in VS Code and enter #get_files_tracked to view the list of tracked files.

Query specific information

Use Archive Agent to query file content, for example, ask "Which files mention donuts?".

Frequently Asked Questions

Does the MCP server have to run all the time?

How to stop the MCP server?

Which AI models does the MCP server support?

Related Resources

Official Documentation

Complete documentation and source code repository for Archive Agent

MCP Hub

Official page of the MCP server, providing detailed technical instructions and usage guides

YouTube Tutorials

Video tutorials for Archive Agent to help users better understand its functions and usage

🚀 Archive Agent

Find your files with natural language and ask questions.

Archive Agent is a smart file indexer equipped with an AI search (RAG engine), automatic OCR, and an MCP interface. It enables users to locate files using natural language queries and interact with their file content in an intelligent way.

✨ Features

File Indexing: Indexes various file types including plaintext, documents, PDFs, images.
Image Processing: Utilizes automatic OCR and entity extraction for image content handling.
AI Search: Enables search and query of files using popular AI providers such as OpenAI, Ollama, and LM Studio.
MCP Integration: Includes an MCP server for automation through IDE or AI extensions.

📦 Installation

Please install these requirements before proceeding:

🐳 Docker (for running Qdrant server)
🐍 Python >= 3.10 (core runtime) (usually already installed)

Ubuntu / Linux Mint

This installation method should work on any Linux distribution derived from Ubuntu (e.g., Linux Mint).

To install Archive Agent in the current directory of your choice, run this once:

git clone https://github.com/shredEngineer/Archive-Agent
cd Archive-Agent
chmod +x install.sh
./install.sh

The install.sh script will execute the following steps in order:

Download and install uv (used for Python environment management).
Install the custom Python environment.
Install the spaCy tokenizer model (used for chunking).
Install pandoc (used for document parsing).
Download and install the Qdrant docker image with persistent storage and auto-restart.
Install a global archive-agent command for the current user.

🚀 Archive Agent is now installed!

👉 Please complete the AI provider setup next.
(Afterward, you'll be ready to Run Archive Agent!)

💻 Usage Examples

Basic Usage

For example, to track your documents and images, run this:

archive-agent include "~/Documents/**" "~/Images/**"
archive-agent update

To start the GUI, run this:

archive-agent gui

Or, to ask questions from the command line:

archive-agent query "Which files mention donuts?"

Advanced Usage

To see additional information on embedding, retrieval, reranking and querying, pass the --verbose option. To bypass the AI cache (embedding, reranking) for this query, pass the --nocache option:

archive-agent query "Which files mention donuts?" --verbose --nocache

📚 Documentation

How Archive Agent Works

Which files are processed

Archive Agent currently supports these file types:

Text:
- Plaintext: .txt, .md
- Documents:
  - ASCII documents: .html, .htm
  - Binary documents: .odt, .docx (including images)
- PDF documents: .pdf (including images, see OCR strategies)
Images: .jpg, .jpeg, .png, .gif, .webp, .bmp

How files are processed

Ultimately, Archive Agent decodes everything to text like this:

Plaintext files are decoded to UTF-8.
Documents are converted to plaintext, images are extracted.
PDF documents are decoded according to the OCR strategy.
Images are decoded to text using AI vision. The vision model will reject unintelligible images.

Archive Agent uses Pandoc for documents, PyMuPDF4LLM for PDFs, and Pillow for images.

📌 Note: Unsupported files are tracked but not processed.

OCR strategies

For PDF documents, there are different OCR strategies supported by Archive Agent:

strict OCR strategy:
- PDF OCR text layer is ignored.
- PDF pages are treated as images.
- Expensive and slow, but more accurate.
relaxed OCR strategy:
- PDF OCR text layer is extracted.
- PDF foreground images are decoded, but background images are ignored.
- Cheap and fast, but less accurate.
auto OCR strategy:
- Selects best OCR strategy for each page based on the number of characters extracted from the PDF OCR text layer, if any.
- Decides based on ocr_auto_threshold, the minimum number of characters for auto OCR strategy to resolve to relaxed instead of strict.
- Trade-off between cost, speed, and accuracy.

See Archive Agent settings: ocr_strategy, ocr_auto_threshold
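
The per-page decision made by the auto OCR strategy can be sketched as follows. This is a minimal illustration, not Archive Agent's actual code: the function name resolve_ocr_strategy and the example threshold of 32 are assumptions, and only the strategy names and the ocr_auto_threshold setting come from the description above.

```python
def resolve_ocr_strategy(strategy: str, text_layer_chars: int, ocr_auto_threshold: int) -> str:
    """Resolve the effective OCR strategy for a single PDF page (hypothetical helper)."""
    if strategy in ("strict", "relaxed"):
        # Explicit strategies are used as-is.
        return strategy
    if strategy != "auto":
        raise ValueError(f"Unknown OCR strategy: {strategy!r}")
    # auto: a page whose text layer has at least ocr_auto_threshold characters
    # resolves to relaxed; a page with less text resolves to strict.
    return "relaxed" if text_layer_chars >= ocr_auto_threshold else "strict"


# Example with a made-up threshold of 32 characters.
print(resolve_ocr_strategy("auto", text_layer_chars=500, ocr_auto_threshold=32))
print(resolve_ocr_strategy("auto", text_layer_chars=0, ocr_auto_threshold=32))
```

A scanned page with an empty text layer therefore falls back to strict, while a page with a usable text layer takes the cheaper relaxed path.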

📌 Note: The strict OCR strategy is recommended for best results. PDF documents often contain small images related to page style/layout which cause overhead while contributing little information or even cluttering the result.

💡 Good to know: You will be prompted to choose an OCR strategy at startup (see Run Archive Agent).

How smart chunking works

Archive Agent processes decoded text like this:

Decoded text is sanitized and split into sentences.
Sentences are grouped into reasonably-sized blocks.
Each block is split into smaller chunks using an AI model.
- Block boundaries are handled gracefully (last chunk carries over).
Each chunk is prefixed with a context header (improves search).
Each chunk is turned into a vector using AI embeddings.
Each vector is turned into a point with file metadata.
Each point is stored in the Qdrant database.

See Archive Agent settings: chunk_lines_block
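
The block grouping and carry-over described above can be sketched roughly like this. It is an illustration under assumptions: chunk_sentences and split_in_pairs are hypothetical names, and the AI model that splits a block is replaced by a trivial stand-in so the example runs offline.

```python
from typing import Callable, List

Chunk = List[str]


def chunk_sentences(
    sentences: List[str],
    lines_per_block: int,
    split_block: Callable[[List[str]], List[Chunk]],
) -> List[Chunk]:
    """Group sentences into blocks, split each block, and carry the last chunk over."""
    chunks: List[Chunk] = []
    carry: Chunk = []
    for start in range(0, len(sentences), lines_per_block):
        # Prepend the chunk carried over from the previous block.
        block = carry + sentences[start:start + lines_per_block]
        parts = split_block(block)
        is_last_block = start + lines_per_block >= len(sentences)
        if is_last_block:
            chunks.extend(parts)
            carry = []
        else:
            # The final chunk may be cut off by the block boundary,
            # so it is re-split together with the next block.
            chunks.extend(parts[:-1])
            carry = parts[-1]
    return chunks


def split_in_pairs(block: List[str]) -> List[Chunk]:
    """Stand-in for the AI chunking model: two sentences per chunk."""
    return [block[i:i + 2] for i in range(0, len(block), 2)]


print(chunk_sentences(["s1", "s2", "s3", "s4", "s5"], 3, split_in_pairs))
```

In this sketch the role of chunk_lines_block is played by lines_per_block; the context header, embedding, and Qdrant storage steps are left out.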

💡 Good to know: This smart chunking improves the accuracy and effectiveness of the retrieval.

How chunk references work

To ensure that every chunk can be traced back to its origin, Archive Agent maps the text contents of each chunk to the corresponding line numbers or page numbers of the source file.

Line-based files (e.g., .txt) use the range of line numbers as reference.
Page-based files (e.g., .pdf) use the range of page numbers as reference.

📌 Note: References are only approximate due to paragraph/sentence splitting/joining in the chunking process.

How chunks are retrieved

Archive Agent retrieves chunks related to your question like this:

The question is turned into a vector using AI embeddings.
Points with similar vectors are retrieved from the Qdrant database.
Only chunks of points with sufficient score are kept.

See Archive Agent settings: retrieve_score_min, retrieve_chunks_max
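
As a rough illustration of the retrieval steps above, the following sketch scores chunks in memory instead of querying Qdrant. Cosine similarity is an assumption (the distance metric is not stated here), and the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vector: Sequence[float],
    points: List[Tuple[Sequence[float], str]],
    retrieve_score_min: float,
    retrieve_chunks_max: int,
) -> List[Tuple[float, str]]:
    """Return up to retrieve_chunks_max (score, chunk) pairs with sufficient score."""
    scored = [(cosine_similarity(query_vector, vector), chunk) for vector, chunk in points]
    kept = [item for item in scored if item[0] >= retrieve_score_min]
    kept.sort(key=lambda item: item[0], reverse=True)  # best match first
    return kept[:retrieve_chunks_max]


# Toy 2-dimensional "embeddings"; real embeddings have ai_vector_size dimensions.
points = [([1.0, 0.0], "chunk a"), ([0.0, 1.0], "chunk b"), ([1.0, 1.0], "chunk c")]
for score, chunk in retrieve([1.0, 0.0], points, retrieve_score_min=0.5, retrieve_chunks_max=2):
    print(f"{score:.2f} {chunk}")
```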

How chunks are reranked and expanded

Archive Agent filters the retrieved chunks:

The retrieved chunks are reranked by relevance to your question.
Only the top relevant chunks are kept (the other chunks are discarded).
Each selected chunk is expanded to get a larger context from the relevant documents.

See Archive Agent settings: rerank_chunks_max, expand_chunks_radius
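
The two settings can be pictured with the following sketch. It assumes that reranking yields a numeric relevance score per chunk and that a file's chunks are addressable by index; both are simplifications for illustration, and keep_top and expand are hypothetical names.

```python
from typing import List, Tuple


def keep_top(retrieved: List[Tuple[int, float]], rerank_chunks_max: int) -> List[int]:
    """Keep the indices of the most relevant chunks; the rest are discarded."""
    ranked = sorted(retrieved, key=lambda item: item[1], reverse=True)
    return [index for index, _ in ranked[:rerank_chunks_max]]


def expand(index: int, file_chunks: List[str], expand_chunks_radius: int) -> List[str]:
    """Return the chunk at index plus up to radius neighbors on each side."""
    start = max(0, index - expand_chunks_radius)
    end = min(len(file_chunks), index + expand_chunks_radius + 1)
    return file_chunks[start:end]


file_chunks = ["c0", "c1", "c2", "c3", "c4"]
retrieved = [(0, 0.2), (3, 0.9), (1, 0.5)]  # (chunk index, made-up relevance score)

for index in keep_top(retrieved, rerank_chunks_max=2):
    print(index, expand(index, file_chunks, expand_chunks_radius=1))
```

The expansion is clamped at the start and end of the file, so chunks near a file boundary simply receive fewer neighbors.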

How answers are generated

Archive Agent answers your question using the reranked and expanded chunks like this:

The LLM receives the chunks as context to the question.
The LLM's answer is returned as structured output and formatted.

💡 Good to know: Archive Agent uses an answer template that aims to be universally helpful.

How files are selected for tracking

Archive Agent uses patterns to select your files:

Patterns can be actual file paths.
Patterns can be paths containing wildcards that resolve to actual file paths.
💡 Patterns must be specified as (or resolve to) absolute paths, e.g. /home/user/Documents/*.txt (or ~/Documents/*.txt).
💡 Use the wildcard * to match any file in the given directory.
💡 Use the wildcard ** to match any files and zero or more directories, subdirectories, and symbolic links to directories.

There are included patterns and excluded patterns:

The set of resolved excluded files is removed from the set of resolved included files.
Only the remaining set of files (included but not excluded) is tracked by Archive Agent.
Hidden files are always ignored!

This approach gives you the best control over the specific files or file types to track.
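
The set logic above (included minus excluded, hidden files dropped) can be sketched with Python's standard glob module. This is an approximation for illustration, not Archive Agent's implementation; the function names are hypothetical, and glob's wildcard semantics may differ in details from the rules listed above.

```python
import glob
import os
from typing import Iterable, Set


def resolve_patterns(patterns: Iterable[str]) -> Set[str]:
    """Resolve path patterns (with optional * and ** wildcards) to absolute file paths."""
    files: Set[str] = set()
    for pattern in patterns:
        expanded = os.path.expanduser(pattern)  # turns ~/Documents into an absolute path
        for path in glob.glob(expanded, recursive=True):
            if os.path.isfile(path):  # skip directories matched by the wildcard
                files.add(os.path.abspath(path))
    return files


def select_files(included: Iterable[str], excluded: Iterable[str]) -> Set[str]:
    """Included files minus excluded files, with hidden files always ignored."""
    selected = resolve_patterns(included) - resolve_patterns(excluded)
    return {path for path in selected if not os.path.basename(path).startswith(".")}


# Example: track everything below ~/Documents except Markdown files at the top level.
for path in sorted(select_files(["~/Documents/**"], ["~/Documents/*.md"])):
    print(path)
```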

Run Archive Agent

💡 Good to know: At startup, you will be prompted to choose the following:

Profile name
AI provider (see AI Provider Setup)
OCR strategy (see OCR strategies)

Quickstart

As shown in the basic usage example above.

Show list of commands

To show the list of supported commands, run this:

archive-agent

Create or switch profile

To switch to a new or existing profile, run this:

archive-agent switch "My Other Profile"

📌 Note: Always use quotes for the profile name argument, or skip it to get an interactive prompt.

💡 Good to know: Profiles are useful to manage independent Qdrant collections (see Qdrant database) and Archive Agent settings.

Open current profile config in nano

To open the current profile's config (JSON) in the nano editor, run this:

archive-agent config

See Archive Agent settings for details.

Add included patterns

To add one or more included patterns, run this:

archive-agent include "~/Documents/*.txt"

📌 Note: Always use quotes for the pattern argument (to prevent your shell's wildcard expansion), or skip it to get an interactive prompt.

Add excluded patterns

To add one or more excluded patterns, run this:

archive-agent exclude "~/Documents/*.txt"

📌 Note: Always use quotes for the pattern argument (to prevent your shell's wildcard expansion), or skip it to get an interactive prompt.

Remove included / excluded patterns

To remove one or more previously included / excluded patterns, run this:

archive-agent remove "~/Documents/*.txt"

📌 Note: Always use quotes for the pattern argument (to prevent your shell's wildcard expansion), or skip it to get an interactive prompt.

List included / excluded patterns

To show the list of included / excluded patterns, run this:

archive-agent patterns

Resolve patterns and track files

To resolve all patterns and track changes to your files, run this:

archive-agent track

List tracked files

To show the list of tracked files, run this:

archive-agent list

📌 Note: Don't forget to track your files first.

List changed files

To show the list of changed files, run this:

archive-agent diff

📌 Note: Don't forget to track your files first.

Commit changed files to database

To sync changes to your files with the Qdrant database, run this:

archive-agent commit

To see additional information on chunking and embedding, pass the --verbose option. To bypass the AI cache (vision, chunking, embedding) for this commit, pass the --nocache option.

💡 Good to know: Changes are triggered by:

File added
File removed
File changed:
- Different file size
- Different modification date
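
The change triggers listed above amount to comparing two snapshots of file metadata. A minimal sketch, assuming each snapshot maps a file path to its size and modification time (the FileMeta type and diff_files function are hypothetical names):

```python
from typing import Dict, List, NamedTuple


class FileMeta(NamedTuple):
    size: int     # file size in bytes
    mtime: float  # modification time as a Unix timestamp


def diff_files(tracked: Dict[str, FileMeta], current: Dict[str, FileMeta]) -> Dict[str, List[str]]:
    """Compare the tracked snapshot with the current one and classify the differences."""
    added = sorted(path for path in current if path not in tracked)
    removed = sorted(path for path in tracked if path not in current)
    # A file counts as changed if its size or its modification time differs.
    changed = sorted(
        path for path in current
        if path in tracked and current[path] != tracked[path]
    )
    return {"added": added, "removed": removed, "changed": changed}


tracked = {"a.txt": FileMeta(10, 1000.0), "b.txt": FileMeta(20, 1000.0)}
current = {"a.txt": FileMeta(10, 2000.0), "c.txt": FileMeta(5, 1500.0)}
print(diff_files(tracked, current))
```

Comparing only size and modification date is cheap, but it means that a file rewritten with identical size and timestamp would not be detected.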

📌 Note: Don't forget to track your files first.

Combined track and commit

To track and then commit in one go, run this:

archive-agent update

To see additional information on chunking and embedding, pass the --verbose option. To bypass the AI cache (vision, chunking, embedding) for this commit, pass the --nocache option.

Search your files

archive-agent search "Which files mention donuts?"

Lists files relevant to the question.

📌 Note: Always use quotes for the question argument, or skip it to get an interactive prompt.

To see additional information on embedding, retrieval, reranking and querying, pass the --verbose option. To bypass the AI cache (embedding, reranking) for this search, pass the --nocache option.

Query your files

archive-agent query "Which files mention donuts?"

Answers your question using RAG.

📌 Note: Always use quotes for the question argument, or skip it to get an interactive prompt.

To see additional information on embedding, retrieval, reranking and querying, pass the --verbose option. To bypass the AI cache (embedding, reranking) for this query, pass the --nocache option.

Launch Archive Agent GUI

To launch the Archive Agent GUI in your browser, run this:

archive-agent gui

📌 Note: Press CTRL+C in the console to close the GUI server.

Start MCP Server

To start the Archive Agent MCP server, run this:

archive-agent mcp

📌 Note: Press CTRL+C in the console to close the MCP server.

💡 Good to know: Use these MCP configurations to let your IDE or AI extension automate Archive Agent:

MCP Tools

Archive Agent exposes these tools via MCP:

| MCP tool | Equivalent CLI command(s) | Argument(s) | Description |
|---|---|---|---|
| `get_patterns` | `patterns` | None | Get the list of included / excluded patterns. |
| `get_files_tracked` | `track` and then `list` | None | Get the list of tracked files. |
| `get_files_changed` | `track` and then `diff` | None | Get the list of changed files. |
| `get_search_result` | `search` | `question` | Get the list of files relevant to the question. |
| `get_answer_rag` | `query` | `question` | Get answer to question using RAG. |

📌 Note: These commands are read-only, preventing the AI from changing your Qdrant database.

💡 Good to know: Just type #get_answer_rag (e.g.) in your IDE or AI extension to call the tool directly.
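
Under the hood, an MCP client invokes such a tool with a JSON-RPC 2.0 request using the tools/call method defined by the Model Context Protocol. The sketch below only builds that request body; the helper name build_tool_call is hypothetical, and the transport, the initialization handshake, and the server URL are deliberately left out because they depend on your client.

```python
import json
from typing import Any, Dict, Optional


def build_tool_call(tool: str, arguments: Optional[Dict[str, Any]] = None, request_id: int = 1) -> str:
    """Build the JSON-RPC 2.0 request body for an MCP tools/call invocation."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments or {}},
    }
    return json.dumps(request)


# get_answer_rag takes a single question argument (see the table above).
print(build_tool_call("get_answer_rag", {"question": "Which files mention donuts?"}))
# Tools without arguments, such as get_patterns, send an empty arguments object.
print(build_tool_call("get_patterns", request_id=2))
```

In practice your IDE or AI extension builds and sends these requests for you; the sketch is only meant to show what a tool call consists of.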

Update Archive Agent

This step is not needed right away if you just installed Archive Agent. However, to get the latest features, you should update your installation regularly.

To update your Archive Agent installation, run this in the installation directory:

./update.sh

📌 Note: If updating doesn't work, try removing the installation directory and then Install Archive Agent again. Your config and data are safely stored in another place; see Archive Agent settings and Qdrant database for details.

💡 Good to know: To also update the Qdrant docker image, run this:

sudo ./manage-qdrant.sh update

Archive Agent settings

Archive Agent settings are organized as profile folders in ~/.archive-agent-settings/.

E.g., the default profile is located in ~/.archive-agent-settings/default/.

The currently used profile is stored in ~/.archive-agent-settings/profile.json.

📌 Note: To delete a profile, simply delete the profile folder. This will not delete the Qdrant collection (see Qdrant database).

Profile configuration

The profile configuration is contained in the profile folder as config.json.

💡 Good to know: Use the config CLI command to open the current profile's config (JSON) in the nano editor (see Open current profile config in nano).

💡 Good to know: Use the switch CLI command to switch to a new or existing profile (see Create or switch profile).

| Key | Description |
|---|---|
| `config_version` | Config version |
| `mcp_server_port` | MCP server port (default `8008`) |
| `ocr_strategy` | OCR strategy in |
| `ocr_auto_threshold` | Minimum number of characters for `auto` OCR strategy to resolve to `relaxed` instead of `strict` |
| `chunk_lines_block` | Number of lines per block for chunking |
| `qdrant_server_url` | URL of the Qdrant server |
| `qdrant_collection` | Name of the Qdrant collection |
| `retrieve_score_min` | Minimum similarity score of retrieved chunks (`0`...`1`) |
| `retrieve_chunks_max` | Maximum number of retrieved chunks |
| `rerank_chunks_max` | Number of top chunks to keep after reranking |
| `expand_chunks_radius` | Number of preceding and following chunks to prepend and append to each reranked chunk |
| `ai_provider` | AI provider in |
| `ai_server_url` | AI server URL |
| `ai_model_chunk` | AI model used for chunking |
| `ai_model_embed` | AI model used for embedding |
| `ai_model_rerank` | AI model used for reranking |
| `ai_model_query` | AI model used for queries |
| `ai_model_vision` | AI model used for vision (`""` disables vision) |
| `ai_vector_size` | Vector size of embeddings (used for Qdrant collection) |
| `ai_temperature_query` | Temperature of the query model |
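
If you want to inspect a profile's settings from a script, a small reader along these lines should do. It is a sketch that relies only on the folder layout described above (a config.json inside each profile folder); the helper names are hypothetical and no assumption is made about which keys are present.

```python
import json
from pathlib import Path

# Settings root as described above; pass another root for testing.
SETTINGS_ROOT = Path.home() / ".archive-agent-settings"


def config_path(profile: str, settings_root: Path = SETTINGS_ROOT) -> Path:
    """Return the path of config.json for the given profile folder."""
    return settings_root / profile / "config.json"


def load_config(profile: str, settings_root: Path = SETTINGS_ROOT) -> dict:
    """Load a profile's config.json as a dictionary."""
    with config_path(profile, settings_root).open(encoding="utf-8") as handle:
        return json.load(handle)


if config_path("default").exists():
    config = load_config("default")
    # .get() avoids a KeyError if a key is missing in an older config version.
    print("MCP server port:", config.get("mcp_server_port"))
    print("OCR strategy:", config.get("ocr_strategy"))
```

Treat this as read-only tooling; to change settings, use the config CLI command so the file is edited in the form Archive Agent expects.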

Watchlist

The profile watchlist is contained in the profile folder as watchlist.json.

The watchlist is managed by these commands only:

include / exclude / remove
track / commit / update

AI cache

Each profile folder also contains an ai_cache folder.

The AI cache ensures that, in a given profile:

The same image is only OCR-ed once.
The same text is only chunked once.
The same text is only embedded once.
The same combination of chunks is only reranked once.

This way, Archive Agent can quickly resume where it left off if a commit was interrupted.

To bypass the AI cache for a single commit, pass the --nocache option to the commit or update command (see Commit changed files to database and Combined track and commit).

💡 Good to know: Queries are never cached, so you always get a fresh answer.

📌 Note: To clear the entire AI cache, simply delete the profile's ai_cache folder.

📌 Technical Note: Archive Agent keys the cache using a composite hash made from the text/image bytes and the AI model names for chunking, embedding, reranking, and vision. Cache keys are deterministic and change whenever you change the chunking, embedding or vision AI model names. Since cache entries are retained forever, switching back to a prior combination of AI model names will again access the "old" keys.
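
The idea of a composite cache key can be illustrated as follows. This is a sketch only: SHA-256, the separator byte, and the model names are assumptions chosen for the example, not Archive Agent's actual key format.

```python
import hashlib
from typing import Sequence


def cache_key(content: bytes, model_names: Sequence[str]) -> str:
    """Derive a deterministic key from the content bytes and the AI model names."""
    digest = hashlib.sha256()
    digest.update(content)
    for name in model_names:
        digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
        digest.update(name.encode("utf-8"))
    return digest.hexdigest()


text = "Donuts are mentioned here.".encode("utf-8")
key_old = cache_key(text, ["chunk-model-a", "embed-model-a"])  # hypothetical model names
key_new = cache_key(text, ["chunk-model-a", "embed-model-b"])

print(key_old == cache_key(text, ["chunk-model-a", "embed-model-a"]))  # same inputs, same key
print(key_old == key_new)  # a different model name yields a different key
```

Because the key depends only on its inputs, switching back to an earlier set of model names reproduces the earlier keys, which is why old cache entries become reachable again.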

Qdrant database

The Qdrant database is stored in ~/.archive-agent-qdrant-storage/.

📌 Note: This folder is created by the Qdrant Docker image running as root.

💡 Good to know: Visit your Qdrant dashboard to manage collections and snapshots.

Developer's guide

Archive Agent was written from scratch for educational purposes (on either end of the software).

💡 Good to know: Tracking the test_data/ folder gives you some test data to get started with.

Important modules

To get started, check out these epic modules:

Files are processed in
The app context is initialized in
The default config is defined in
The CLI commands are defined in
The commit logic is implemented in
The CLI verbosity is handled in
The GUI is implemented in
The AI API prompts for chunking, embedding, vision, and querying are defined in
The AI provider registry is located in

If you miss something or spot bad patterns, feel free to contribute and refactor!

Code testing and analysis

To run unit tests, check types, and check style, run this:

./audit.sh

🔧 Technical Details

The development of Archive Agent involves multiple technical aspects, including file processing, AI integration, and database management. For example, in file processing, different libraries are used for different file types to ensure accurate decoding and extraction of information. The AI integration part requires careful configuration of different AI providers and models to achieve efficient search and query functions. The Qdrant database is used for storing file metadata and vectors, which requires proper management and maintenance.

📄 License

This project is licensed under the GNU GPL v3.0 license.

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

See LICENSE for details.