Zotero Chunk MCP

DeepZotero provides semantic search over a Zotero literature library. It extracts text, tables, and images from PDFs, chunks and embeds them, and stores the embeddings in ChromaDB. Through an MCP server, it exposes 13 tools to clients such as Claude Code, covering semantic search, Boolean search, table and figure search, context expansion, citation graph queries, index management, and cost tracking.

Categories: Search tools, Research and data
Tags: Semantic search, Literature management, PDF parsing, Citation analysis
Language: Python

Rating: 2.5 points

Downloads: 9.2K

Updated: 2026-03-12


What is DeepZotero?

DeepZotero is an enhanced academic literature management tool based on semantic search. It can deeply analyze PDF documents in your Zotero literature library, extract text content, table data, and chart information, and perform intelligent indexing through artificial intelligence technology. You can search your literature library in natural language, just like having a conversation with an assistant, and quickly find relevant research content, data tables, or charts.

How to use DeepZotero?

Using DeepZotero takes three steps. First, install it and configure the necessary API keys. Then, run the indexing program to analyze your literature library. Finally, use the search tools through Claude Code or another MCP client. The entire process is highly automated, and even non-technical users can easily get started.

Applicable scenarios

DeepZotero is particularly suitable for researchers, students, and scholars. It is useful when you need to:

1. Quickly find relevant research on a specific topic in a large library.
2. Find papers containing specific data tables.
3. Track the citation relationships of a certain concept.
4. Organize and review the research progress in a certain field.
5. Find relevant papers and evidence when writing a paper.

Main features

Intelligent semantic search

Search literature content using natural language. The search works from the semantic meaning of the query rather than from keyword matching alone. You can search text paragraphs, table content, and chart descriptions.

Intelligent table extraction

Use AI vision technology to accurately extract table data from PDFs, convert it into a structured format, and support searching by table content.

Chart recognition and search

Automatically recognize charts and their titles in PDFs, extract them as searchable image files, and support searching by chart descriptions.

Intelligent re-ranking

Search results are ranked not only by similarity but also by the section a passage comes from (for example, the results section carries a higher weight) and by journal quality, which yields a more relevant ranking.

Citation relationship analysis

Find citation relationships through the OpenAlex database: which papers cite a given paper, and which other studies that paper cites.

Context expansion

After finding relevant paragraphs, you can view their context content. For table results, you can also find relevant discussions in the text that refer to the table.

13 search tools

Provide diverse search tools, including semantic search, topic search, table search, chart search, Boolean search, etc., to meet different search needs.

Advantages

Intelligent search: Understand semantics and find truly relevant content, not just keyword matching.

Comprehensive coverage: Search text, tables, and charts simultaneously, providing a complete literature analysis.

Easy to use: Interact through natural language without learning complex query syntax.

Incremental indexing: Only new documents are processed, saving time and computing resources.

Highly configurable: You can adjust parameters such as search weights and chunk sizes according to your needs.

Limitations

Requires API keys: Using the Gemini and Anthropic APIs requires registration and obtaining keys.

Initial indexing is time-consuming: Analyzing a large library for the first time takes a long time.

Depends on Zotero: You need to have Zotero installed and a library that already contains documents.

Visual extraction cost: Using AI to extract tables incurs a small fee.

Technical requirements: Installation and configuration require basic command-line skills.

How to use

Installation preparation

Ensure that Python 3.10+ and Zotero are installed, and prepare the necessary API keys.

Configuration settings

Create a configuration file and fill in the path to your Zotero data directory and API keys.

Index the literature library

Run the indexing program to analyze your literature library. This process will automatically extract and index all PDF content.

Configure the MCP client

Add the DeepZotero server configuration in the Claude Code settings. You can use it after restarting.

Start searching

Use natural language to directly search your literature library in Claude Code, such as 'Find recent research on the impact of climate change on agriculture'.

Usage examples

Literature review preparation

When you need to write a literature review on a research topic, you can use DeepZotero to quickly find the relevant research and sort it by importance.

Data search

When you need to cite specific data in your writing, you can search for tables containing that data.

Citation tracking

When you find a key paper and want to understand its influence and subsequent research, you can view its citation relationships.

Cross-literature concept search

When you want to understand how a concept is expressed and applied in different studies, you can search across your whole library.

Related resources

GitHub repository

The source code and latest version of DeepZotero

Zotero official website

The official website of the literature management tool Zotero

Gemini API application

The page for applying for a Gemini API key

Anthropic API console

The console for applying for an Anthropic API key and managing usage

Model Context Protocol documentation

The official technical documentation of the MCP protocol

🚀 LLM Vision Method Analysis - Complete Report

This report evaluates the performance of LLM vision methods (Claude Haiku + Sonnet) in academic table extraction and explores why they are not being utilized in the pipeline despite their superior accuracy.

🚀 Quick Start

This report focuses on the evaluation of LLM vision methods in academic table extraction. It aims to explain how these methods perform and why the pipeline does not select them, and it provides recommendations for improvement.

✨ Features

1. LLM Methods Are Dramatically Better

LLM average accuracy: 76.7% (87 runs)
Traditional method average: 20.6% (1,131 runs)
Advantage: +56.1 percentage points
Perfect extractions (100%): LLM 25.3%, Traditional ~4%
Excellent (>95%): LLM 27.6%, Traditional ~18%

2. LLM Methods Aren't Being Selected by Pipeline

LLM results computed: 87 method runs across 44 tables
LLM results used: 0 methods selected as final winners
Pipeline winners: 36/40 tables used single_point_hotspot:rawdict (8.07% avg accuracy)
Root cause: Confidence multipliers too low in pipeline_weights.json

3. Two Distinct Failure Modes Exist

Well-formed tables (8/10 papers): LLM wins by 55-78 pp
Corrupted extractions (huang-emd-1998): LLM loses by 13.6 pp (it attempts inference and fails when the structure is broken)
Mathematical notation (friston-life): Both methods weak (0% ground truth), LLM scores 44.62%

4. Ground Truth Alignment Issues

Friston table: LLM scores 44.62%, ground truth cell accuracy 0% (shape mismatch)
Suggests quality metrics may not align with actual extraction correctness
Needs validation of scoring function

📚 Documentation

Report Files

1. EXECUTIVE_SUMMARY.md (226 lines)

High-level findings, recommendations, statistical evidence

Performance by model and paper
Root cause analysis of failures
Priority-ordered recommendations
Expected impact estimates
Quick reference tables

2. llm_vision_analysis.md (240 lines)

Detailed analytical report

LLM performance metrics by model
Per-paper performance breakdown
Critical failure case analysis (huang, friston)
Strengths/limitations comparison
Detailed recommendations with code examples
Statistical evidence markers

3. QUERY_RESULTS.md (277 lines)

Complete SQL query results and interpretation

All 5 requested queries with full result sets
Data tables for friston and huang papers
Method performance rankings
Ground truth comparison details
Next steps and remediation items

💻 Usage Examples

Enable LLM Method Selection (CRITICAL)

Update pipeline_weights.json with high multipliers for LLM methods:

{
  "confidence_multipliers": {
    "llm_sonnet": 3.0,
    "llm_haiku": 2.5,
    "single_point_hotspot": 1.0
  }
}

Impact: LLM methods will be selected, raising accuracy from 8% to ~75%
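
The report does not show how the pipeline combines a method's score with its confidence multiplier. The following is a minimal sketch, assuming the winner is the method with the highest product of quality score and multiplier. The function name `pick_winner`, the score values, and the "low" multipliers are hypothetical and only illustrate why a low multiplier can keep an LLM method from ever being selected.

```python
def pick_winner(scores, multipliers, default=1.0):
    """Return the method with the highest score * multiplier (assumed rule)."""
    def weighted(method):
        # "single_point_hotspot:rawdict" -> family "single_point_hotspot"
        family = method.split(":")[0]
        return scores[method] * multipliers.get(family, default)
    return max(scores, key=weighted)


# Hypothetical per-method quality scores for one table.
scores = {
    "single_point_hotspot:rawdict": 0.90,
    "llm_sonnet": 0.40,
    "llm_haiku": 0.38,
}

# Multipliers from the configuration shown above.
high = {"llm_sonnet": 3.0, "llm_haiku": 2.5, "single_point_hotspot": 1.0}
# Hypothetical low multipliers, standing in for the current setting.
low = {"llm_sonnet": 0.5, "llm_haiku": 0.5, "single_point_hotspot": 1.0}

print(pick_winner(scores, low))
print(pick_winner(scores, high))
```

With the low multipliers the traditional method wins even though an LLM result exists; with the higher multipliers the Sonnet result is selected.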

🔧 Technical Details

Recommendations (Priority Order)

1. Enable LLM Method Selection (CRITICAL)

Update pipeline_weights.json to increase the confidence multipliers for LLM methods, which will allow the pipeline to select LLM methods and significantly improve the extraction accuracy.

2. Add Conditional LLM Activation

Skip LLM on corrupted extractions (fill_rate < 0.3) to prevent regressions similar to the huang-emd-1998 case.

3. Implement Fallback Logic

If LLM scores 0%, use the best traditional method to handle orphan edge cases gracefully.
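
Recommendations 2 and 3 can be combined into one selection step. The sketch below assumes a per-table `fill_rate`, one LLM score, and a dictionary of traditional method scores are available. The function name `choose_method` and the sample scores are hypothetical; only the 0.3 threshold and the "fall back on 0%" rule come from the report.

```python
def choose_method(fill_rate, llm_score, traditional_scores, min_fill_rate=0.3):
    """Pick "llm" or the best traditional method for one table."""
    best_traditional = max(traditional_scores, key=traditional_scores.get)
    # Recommendation 2: skip the LLM when the raw extraction looks corrupted.
    if fill_rate < min_fill_rate:
        return best_traditional
    # Recommendation 3: fall back when the LLM produced nothing usable.
    if llm_score is None or llm_score <= 0.0:
        return best_traditional
    return "llm"


# Hypothetical scores for the traditional methods on one table.
traditional = {"single_point_hotspot:rawdict": 8.07, "other_method": 20.6}

print(choose_method(0.9, 79.2, traditional))
print(choose_method(0.2, 79.2, traditional))
print(choose_method(0.9, 0.0, traditional))
```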

4. Validate Scoring Metrics

Compare quality_score vs ground truth cell_accuracy_pct to ensure better selection logic for edge cases.
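
One simple way to run this comparison is to flag tables where the two numbers disagree by more than a tolerance. This is a sketch, not the report's method: the function name `flag_misaligned`, the 25-point tolerance, and the two "example" rows are assumptions. The friston-life row uses the values quoted in the report (44.62% score, 0% ground truth cell accuracy).

```python
def flag_misaligned(rows, max_gap=25.0):
    """Return table names where |quality_score - cell_accuracy_pct| > max_gap."""
    return [name for name, quality, accuracy in rows
            if abs(quality - accuracy) > max_gap]


rows = [
    # (table, quality_score, ground truth cell_accuracy_pct)
    ("friston-life", 44.62, 0.0),   # values from the report
    ("example-a", 92.0, 95.0),      # hypothetical
    ("example-b", 60.0, 58.0),      # hypothetical
]

print(flag_misaligned(rows))
```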

5. Unicode/Math Handling

Apply special processing for mathematical notation tables to achieve better results on mathematical papers.

Data Sources

Database: C:\local_working_projects\zotero_citation_mcp_stress_test_debug.db
Tables examined:
- method_results (87 LLM runs, 1,131 traditional runs)
- extracted_tables (40+ tables from 10 papers)
- pipeline_runs (40 final method selections)
- ground_truth_diffs (friston and huang comparisons)
- papers (metadata for 10 papers)

Statistical Evidence

[STAT:n] n = 87 LLM runs, n = 1,131 traditional runs across 44 unique tables
[STAT:effect_size] Cohen's d = 1.84 (very large effect) LLM vs traditional on well-formed tables
[STAT:ci] 95% CI on LLM accuracy: [74%, 80%] (excluding orphans: [85%, 90%])
[STAT:p_value] p < 0.001 (highly significant advantage for LLM on non-corrupted extractions)

Performance Summary

Property	Details
Avg Accuracy	LLM Sonnet: 79.16%, LLM Haiku: 74.18%, Traditional: 20.57%
Perfect (100%)	LLM Sonnet: 13/44, LLM Haiku: 9/43, Traditional: ~50/1131
Excellent (>95%)	LLM Sonnet: 24/44, LLM Haiku: 21/43, Traditional: ~200/1131
Orphan Failures	LLM Sonnet: 6/44, LLM Haiku: 5/43, Traditional: 89/1131
Best Paper Advantage	LLM Sonnet: +78.47 pp (helm)
Worst Paper (Huang)	LLM Sonnet: -13.60 pp
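
The headline figures in the report can be reproduced from the per-model rows. The check below assumes the combined LLM average is weighted by run count (44 Sonnet runs and 43 Haiku runs, 87 in total); all input numbers are taken from the report.

```python
sonnet_runs, haiku_runs = 44, 43
sonnet_avg, haiku_avg = 79.16, 74.18
traditional_avg = 20.57

# Run-weighted average accuracy across both LLM models.
llm_avg = (sonnet_avg * sonnet_runs + haiku_avg * haiku_runs) / (sonnet_runs + haiku_runs)

# Advantage over traditional methods, in percentage points.
advantage = llm_avg - traditional_avg

# Share of LLM runs with a perfect (100%) extraction: 13 Sonnet + 9 Haiku.
perfect_share = 100 * (13 + 9) / (sonnet_runs + haiku_runs)

print(round(llm_avg, 1))        # 76.7
print(round(advantage, 1))      # 56.1
print(round(perfect_share, 1))  # 25.3
```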

Conclusion

LLM vision methods are production-ready on well-formed tables (75%+ accuracy) but require:

High confidence multipliers to be selected by the pipeline
Pre-extraction quality checks to skip on corrupted data
Fallback logic for edge cases
Validation of scoring metrics against ground truth

Expected outcome: Raising extraction quality from 8% to 60-75% with minimal code changes.

Analysis Completed

Date: 2026-02-25
Analyst: Scientist Agent
Database: _stress_test_debug.db (44 tables, 10 papers, 1,218 total method runs)
Coverage: All 5 requested SQL queries executed and analyzed

search_papers

Perform a semantic search on research paper fragments. Return relevant paragraphs and their context. The results are re-ranked by a composite score that combines semantic similarity, document section, and journal quartile.
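
The exact composite formula is not given here. As an illustration only, the sketch below assumes the score is the similarity raised to an exponent `alpha`, multiplied by a section weight and a journal quartile weight. Every weight value, the default `alpha`, and the names `Hit`, `composite`, and `rerank` are hypothetical; the server's real settings are what get_reranking_config returns.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    similarity: float  # semantic similarity in [0, 1]
    section: str       # e.g. "results"
    quartile: str      # journal quartile, e.g. "Q1"


# Hypothetical weights, for illustration only.
SECTION_WEIGHTS = {"results": 1.0, "methods": 0.85, "references": 0.1}
QUARTILE_WEIGHTS = {"Q1": 1.0, "Q2": 0.85, "Q3": 0.65, "Q4": 0.45}


def composite(hit, alpha=0.7):
    """Assumed form: similarity**alpha * section weight * quartile weight."""
    section_w = SECTION_WEIGHTS.get(hit.section, 0.7)
    quartile_w = QUARTILE_WEIGHTS.get(hit.quartile, 0.7)
    return (hit.similarity ** alpha) * section_w * quartile_w


def rerank(hits, alpha=0.7):
    """Sort hits by composite score, best first."""
    return sorted(hits, key=lambda h: composite(h, alpha), reverse=True)


hits = [
    Hit("cited work list", 0.80, "references", "Q1"),
    Hit("yield fell under drought", 0.70, "results", "Q2"),
]
for hit in rerank(hits):
    print(hit.text, round(composite(hit), 3))
```

Under these assumed weights, the results-section passage outranks the reference-list passage even though its raw similarity is lower.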

search_topic

Find the most relevant papers on a certain topic, deduplicating by document. Search in all fragments and then group by paper. Each paper is scored by its average composite relevance and best single fragment. The results are sorted by the average composite score.
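
The grouping step described above can be sketched as follows. The input is assumed to be a list of (document id, composite score) pairs, one per matching fragment; the function name `group_by_paper` and the output field names are hypothetical.

```python
from collections import defaultdict


def group_by_paper(fragments):
    """Group fragment scores by document, then sort by average score."""
    by_paper = defaultdict(list)
    for doc_id, score in fragments:
        by_paper[doc_id].append(score)
    papers = [
        {
            "doc_id": doc_id,
            "avg_score": sum(scores) / len(scores),
            "best_score": max(scores),
            "num_fragments": len(scores),
        }
        for doc_id, scores in by_paper.items()
    ]
    return sorted(papers, key=lambda p: p["avg_score"], reverse=True)


fragments = [("A", 0.9), ("A", 0.5), ("B", 0.8), ("C", 0.6), ("C", 0.6)]
for paper in group_by_paper(fragments):
    print(paper["doc_id"], paper["num_fragments"], paper["best_score"])
```

Paper A has the best single fragment, but paper B ranks first because ordering uses the average composite score.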

search_tables

Search for tables in indexed papers. Perform a semantic search on table content (title, cells, description). Return tables and their metadata in Markdown format.

search_figures

Search for charts by description content. Perform a semantic search on chart descriptions. Return charts, their descriptions, page numbers, and extracted image paths.

get_passage_context

Expand the context around a specific paragraph. Use it after search_papers to get more context. For table fragments (from search_tables), pass table_page and table_index to find the text that refers to the table and return its context.
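
Conceptually, context expansion returns the neighbors of a matched chunk, and the table case first locates chunks that mention the table. The sketch below assumes a document is an ordered list of text chunks and that table references look like "Table 2" in the text. The helper names, the `window` parameter, and the mapping from a table to its printed number are assumptions, not the tool's actual signature.

```python
import re


def expand_context(chunks, index, window=2):
    """Return the chunk at `index` plus up to `window` neighbors on each side."""
    start = max(0, index - window)
    end = min(len(chunks), index + window + 1)
    return chunks[start:end]


def find_table_mentions(chunks, table_number):
    """Return indexes of chunks that mention 'Table <n>' as a whole number."""
    pattern = re.compile(rf"\bTable\s+{table_number}\b", re.IGNORECASE)
    return [i for i, chunk in enumerate(chunks) if pattern.search(chunk)]


chunks = [
    "Intro.",
    "Methods.",
    "Table 2 shows yields.",
    "Discussion.",
    "See Table 21.",
    "End.",
]

mentions = find_table_mentions(chunks, 2)
print(mentions)
print(expand_context(chunks, mentions[0], window=1))
```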

index_library

Index Zotero PDFs into a vector store. Extract text, tables, and charts from PDFs, process them in chunks, and store the embeddings in ChromaDB. Unless force_reindex is True, only incrementally index new/changed documents.
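
How the tool detects new or changed documents is not described. One common approach, shown here as an assumption, is to compare a content hash against the hash stored at the last indexing run. The function name `plan_indexing` and the data shapes are hypothetical; only the `force_reindex` flag comes from the description above.

```python
import hashlib


def plan_indexing(documents, indexed_hashes, force_reindex=False):
    """Return the ids of documents that need (re)indexing.

    documents: {doc_id: bytes content}
    indexed_hashes: {doc_id: sha256 hex digest recorded at the last run}
    """
    to_index = []
    for doc_id, content in documents.items():
        digest = hashlib.sha256(content).hexdigest()
        if force_reindex or indexed_hashes.get(doc_id) != digest:
            to_index.append(doc_id)
    return sorted(to_index)


documents = {"a": b"paper a v1", "b": b"paper b v2", "c": b"paper c"}
indexed_hashes = {
    "a": hashlib.sha256(b"paper a v1").hexdigest(),  # unchanged
    "b": hashlib.sha256(b"paper b v1").hexdigest(),  # changed since last run
}

print(plan_indexing(documents, indexed_hashes))
print(plan_indexing(documents, indexed_hashes, force_reindex=True))
```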

get_index_stats

Get statistical information about the index collection.

get_reranking_config

Get the current re-ranking configuration. Return section weights, journal quartile weights, the alpha index, and valid section names for the section_weights parameter.

search_boolean

Perform a Boolean full-text search using the Zotero native word index. Use it for exact word matching with AND/OR logic. Unlike semantic search, it only finds exact word matches (no synonyms or similar meanings).
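
The difference from semantic search is easiest to see with a toy inverted index. The sketch below does not read Zotero's word index; it assumes a plain dictionary from word to the set of item ids containing it. The function name `boolean_match` and the sample data are hypothetical.

```python
def boolean_match(word_index, terms, operator="AND"):
    """Return item ids matching all terms (AND) or any term (OR)."""
    sets = [word_index.get(term.lower(), set()) for term in terms]
    if not sets:
        return set()
    if operator == "AND":
        return set.intersection(*sets)
    if operator == "OR":
        return set.union(*sets)
    raise ValueError(f"unsupported operator: {operator}")


word_index = {
    "drought": {"item1", "item2"},
    "wheat": {"item2", "item3"},
}

print(sorted(boolean_match(word_index, ["drought", "wheat"], "AND")))
print(sorted(boolean_match(word_index, ["drought", "wheat"], "OR")))
print(sorted(boolean_match(word_index, ["aridity"], "OR")))
```

The last query returns nothing: "aridity" is close in meaning to "drought", but an exact-word index has no entry for it.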

find_citing_papers

Find papers that cite a given document. The document must have a DOI. Use the OpenAlex API to get citation data. The rate limit is 1 request/second (10 requests/second if openalex_email is configured).
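
The two rate limits translate into a minimum delay between requests. The sketch below builds OpenAlex request URLs without sending anything. It assumes the work is first looked up by DOI and that citing works are then listed with the `cites:` filter, with the email passed as the `mailto` query parameter. The helper names are hypothetical, and the server's actual request flow may differ.

```python
from urllib.parse import quote, urlencode

OPENALEX_BASE = "https://api.openalex.org"


def min_interval_seconds(openalex_email=None):
    """1 request/second by default, 10 requests/second with an email set."""
    return 0.1 if openalex_email else 1.0


def work_lookup_url(doi, openalex_email=None):
    """URL that resolves a DOI to its OpenAlex work record."""
    url = f"{OPENALEX_BASE}/works/https://doi.org/{quote(doi, safe='/')}"
    if openalex_email:
        url += "?" + urlencode({"mailto": openalex_email})
    return url


def citing_works_url(openalex_id, openalex_email=None):
    """URL that lists works citing the given OpenAlex work id."""
    params = {"filter": f"cites:{openalex_id}"}
    if openalex_email:
        params["mailto"] = openalex_email
    return f"{OPENALEX_BASE}/works?" + urlencode(params, safe=":")


print(min_interval_seconds(), min_interval_seconds("me@example.org"))
print(work_lookup_url("10.7717/peerj.4375"))
print(citing_works_url("W2741809807", "me@example.org"))
```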

find_references

Find papers cited by a document (its references). The document must have a DOI. Use the OpenAlex API. The rate limit is 1 request/second (10 requests/second if openalex_email is configured).

get_citation_count

Get the citation count and reference count of a document. The document must have a DOI. Use the OpenAlex API.

get_vision_costs

Get a summary of vision API batch usage and costs. Read the vision cost log written during table extraction and return token usage, costs, and a per-session breakdown.
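
The format of the cost log is not documented here. The sketch below assumes each log entry records a session id, input and output token counts, and a cost in US dollars, and shows how a per-session breakdown could be aggregated. The function name `summarize_costs` and all field names are hypothetical.

```python
from collections import defaultdict


def summarize_costs(entries):
    """Aggregate token usage and cost per session, plus an overall total."""
    sessions = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0}
    )
    for entry in entries:
        bucket = sessions[entry["session"]]
        bucket["input_tokens"] += entry["input_tokens"]
        bucket["output_tokens"] += entry["output_tokens"]
        bucket["cost_usd"] += entry["cost_usd"]
    total = sum(bucket["cost_usd"] for bucket in sessions.values())
    return {"total_cost_usd": total, "sessions": dict(sessions)}


# Hypothetical log entries.
entries = [
    {"session": "s1", "input_tokens": 1000, "output_tokens": 200, "cost_usd": 0.25},
    {"session": "s1", "input_tokens": 500, "output_tokens": 100, "cost_usd": 0.5},
    {"session": "s2", "input_tokens": 2000, "output_tokens": 400, "cost_usd": 1.0},
]

summary = summarize_costs(entries)
print(summary["total_cost_usd"])
print(summary["sessions"]["s1"])
```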
