MCP Speaker Diarization

The MCP Speaker Diarization and Recognition System is a complete solution that integrates GPU-accelerated speaker separation, speech recognition, emotion detection, and a web interface. It combines pyannote.audio speaker separation with faster-whisper transcription, and supports persistent speaker recognition (register once, recognize permanently), dual-detector emotion analysis (combining a general AI model with personalized voiceprints), real-time stream processing, a REST API, and an MCP server. It is designed for AI agent integration and hobby projects.

What is MCP Speaker Diarization?

This is an advanced speech processing system that automatically performs the following tasks:

1. **Speaker separation**: Identify how many different people are speaking in an audio recording.
2. **Speaker recognition**: Remember each person's voice characteristics and recognize them automatically the next time they speak.
3. **Emotion detection**: Analyze emotional states during speech (happy, angry, neutral, and so on).
4. **Speech-to-text**: Convert the conversation into a text record.

The system is particularly suited to meeting records, customer service analysis, and multi-round conversations, and can provide complete conversation context memory for AI assistants.
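At its core, combining tasks 1 and 4 means aligning diarization turns with transcript segments by time. A minimal sketch of that alignment step (the data shapes here are assumptions for illustration, not the project's actual API):

```python
from dataclasses import dataclass

@dataclass
class Turn:
    start: float   # seconds
    end: float
    speaker: str

@dataclass
class Segment:
    start: float
    end: float
    text: str

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(turns, segments):
    """Assign each transcript segment to the speaker whose turn overlaps it most."""
    labelled = []
    for seg in segments:
        best = max(turns, key=lambda t: overlap(t.start, t.end, seg.start, seg.end),
                   default=None)
        has_overlap = best and overlap(best.start, best.end, seg.start, seg.end) > 0
        labelled.append((best.speaker if has_overlap else "unknown", seg.text))
    return labelled

turns = [Turn(0.0, 4.2, "SPEAKER_00"), Turn(4.2, 9.0, "SPEAKER_01")]
segments = [Segment(0.5, 3.8, "Hello everyone."), Segment(4.5, 8.7, "Thanks for joining.")]
print(label_segments(turns, segments))
```

In the real system the turns would come from the diarization model and the segments from the transcriber; the alignment logic itself is this simple.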

How to use MCP Speaker Diarization?

It is easy to use:

1. **Upload audio files**: Common formats such as MP3 and WAV are supported.
2. **Real-time recording**: Record directly in the web page and process in real time.
3. **View results**: The system automatically works out who said what, and how they felt.
4. **AI integration**: Let AI assistants access the conversation history through the MCP protocol.

No specialist knowledge is required; the system learns automatically and improves its recognition accuracy over time.

Applicable scenarios

This system is particularly suitable for the following scenarios:

• **Team meeting records**: Automatically distinguish speakers and record the meeting content.
• **Customer service quality analysis**: Analyze the emotions and content of conversations between agents and customers.
• **AI assistant enhancement**: Let AI remember the conversation history and identities of different users.
• **Interview transcription**: Quickly turn interview recordings into transcripts with speaker labels.
• **Language learning**: Analyze the speech patterns and emotional expression in conversations.

Main features

Intelligent speaker recognition
The system remembers each person's voice characteristics. Once a person's voice has been registered, they are recognized automatically in all subsequent recordings and are no longer shown as 'unknown speaker'.
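Persistent recognition of this kind is typically done by comparing a new voice embedding against the enrolled voiceprints. A minimal sketch, assuming cosine similarity against stored embeddings with a fixed threshold (the threshold value and the tiny embeddings are illustrative, not the project's actual parameters):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify(embedding, enrolled, threshold=0.75):
    """Return the enrolled speaker with the highest similarity,
    or 'unknown speaker' if nothing clears the threshold."""
    best_name, best_score = None, -1.0
    for name, ref in enrolled.items():
        score = cosine(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else "unknown speaker"

enrolled = {"alice": [0.9, 0.1, 0.0], "bob": [0.0, 0.2, 0.98]}
print(identify([0.88, 0.12, 0.05], enrolled))  # prints "alice"
```

Real voiceprint embeddings have hundreds of dimensions, but the matching logic is the same.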
Dual-mode emotion detection
Combining a general emotion AI model with individual voice characteristics provides more accurate emotion recognition. It can recognize 9 emotion categories, including happy, angry, sad, neutral, fearful, surprised, and disgusted.
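One plausible fusion rule for two detectors is a weighted average of their per-label scores. A sketch under that assumption (the 0.4 weight on the personalized detector is illustrative, not the project's actual fusion algorithm):

```python
def fuse_emotions(general, personal, weight_personal=0.4):
    """Blend a general emotion model's scores with a speaker-specific
    detector's scores; return the winning label and its fused score."""
    labels = set(general) | set(personal)
    fused = {
        lbl: (1 - weight_personal) * general.get(lbl, 0.0)
             + weight_personal * personal.get(lbl, 0.0)
        for lbl in labels
    }
    top = max(fused, key=fused.get)
    return top, round(fused[top], 3)

general = {"happy": 0.55, "neutral": 0.30, "angry": 0.15}
personal = {"happy": 0.20, "neutral": 0.70, "angry": 0.10}
print(fuse_emotions(general, personal))  # → ('neutral', 0.46)
```

Here the personalized detector, which knows this speaker's baseline voice, overrides the general model's "happy" guess.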
Real-time speech processing
Supports real-time recording in the web page, analyzing audio as it is recorded. Voice activity detection automatically segments the conversation into fragments, keeping responses quick.
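Voice activity detection of this kind can be sketched as an energy threshold over audio frames. This is a simplified stand-in for a real VAD model, just to show how a stream gets cut into fragments:

```python
def segment_speech(frames, threshold=0.02, min_frames=3):
    """Split a stream of per-frame energies into (start, end) frame ranges
    that stay above the threshold for at least min_frames frames."""
    segments, start = [], None
    for i, energy in enumerate(frames):
        if energy >= threshold and start is None:
            start = i                      # speech begins
        elif energy < threshold and start is not None:
            if i - start >= min_frames:    # ignore very short blips
                segments.append((start, i))
            start = None                   # back to silence
    if start is not None and len(frames) - start >= min_frames:
        segments.append((start, len(frames)))
    return segments

frames = [0.0, 0.03, 0.05, 0.04, 0.0, 0.0, 0.06, 0.07, 0.05, 0.04, 0.0]
print(segment_speech(frames))  # → [(1, 4), (6, 10)]
```

Each detected range would then be sent on to the diarization and transcription stages.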
AI assistant integration
Through the MCP protocol, AI assistants such as Claude and Flowise can directly access the conversation history and remember the identities and conversation content of different speakers.
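Under the MCP HTTP transport, a client reaches such a server with JSON-RPC 2.0 messages. A sketch of the request shape (the tool name `get_conversation_history` and its arguments are hypothetical; a real client would first call `tools/list` to discover the server's actual tools):

```python
import json

def mcp_call(request_id, tool, arguments):
    """Build a JSON-RPC 2.0 request for an MCP tools/call,
    as an AI client would send it over the server's HTTP transport."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Hypothetical tool name; check the server's tools/list response for real names.
payload = mcp_call(1, "get_conversation_history", {"speaker": "alice", "limit": 5})
print(payload)
```

Clients like Claude Desktop build these messages for you; the point is only that any MCP-capable assistant can query the conversation store this way.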
Multi-language support
Supports speech-to-text for 99 languages, including mainstream languages such as Chinese, English, Japanese, and Korean.
Intelligent error correction learning
When you correct the system's recognition errors, the system will learn and improve, and will be more accurate the next time it encounters similar situations.
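One common way to implement this kind of learning is to fold each corrected sample into a running mean of the speaker's embedding, so future matches drift toward the corrected voiceprint. A sketch under that assumption (not necessarily the project's actual update rule):

```python
def apply_correction(profiles, counts, corrected_name, embedding):
    """Fold a user-corrected sample into that speaker's running mean embedding."""
    if corrected_name not in profiles:
        profiles[corrected_name] = list(embedding)   # first sample for a new speaker
        counts[corrected_name] = 1
        return
    n = counts[corrected_name]
    profiles[corrected_name] = [
        (old * n + new) / (n + 1)                    # incremental mean update
        for old, new in zip(profiles[corrected_name], embedding)
    ]
    counts[corrected_name] = n + 1

profiles = {"alice": [0.8, 0.2]}
counts = {"alice": 1}
apply_correction(profiles, counts, "alice", [0.6, 0.4])
print(profiles["alice"])  # roughly [0.7, 0.3], the mean of both samples
```

Each correction nudges the stored voiceprint, which is why accuracy improves with use.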
Batch processing
Supports uploading multiple audio files for batch processing at the same time, which is suitable for processing a large number of historical recordings.
Data backup and recovery
All speaker voice characteristics and settings can be backed up, which is convenient for migrating or recovering data.
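Backup can be as simple as serializing the voiceprint store to a file. A minimal sketch (the JSON layout is an assumption for illustration, not the project's actual backup format):

```python
import json
import os
import tempfile

def backup_profiles(profiles, path):
    """Write all speaker names and voiceprint embeddings to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"version": 1, "profiles": profiles}, f)

def restore_profiles(path):
    """Load the voiceprints back, e.g. after migrating to a new machine."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["profiles"]

path = os.path.join(tempfile.gettempdir(), "speakers_backup.json")
backup_profiles({"alice": [0.9, 0.1, 0.0]}, path)
print(restore_profiles(path))
```

A round trip through such a file is what makes migration and recovery possible.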
Advantages
🎯 High accuracy: By combining multiple AI models, it achieves recognition accuracy far exceeding that of a single system.
🧠 Continuous learning: The system will learn from your corrections and become more accurate over time.
⚡ Fast processing: Accelerated by GPU, it only takes a few minutes to process a one-hour recording.
🔄 Automatic update: After recognizing a speaker, all historical recordings will be automatically updated.
🔌 Easy to integrate: Provides standard APIs and MCP protocol, which is convenient for integration with other systems.
📱 Multi-terminal support: Covers web interfaces, API interfaces, and AI assistant integration.
Limitations
💻 Requires GPU: The best performance requires an NVIDIA graphics card, and the CPU mode is slower.
🎙️ Dependent on audio quality: Noisy environments or low-quality recordings will affect the recognition accuracy.
👥 Requires initial learning: New speakers require a certain amount of voice samples to be accurately recognized.
🌐 Network requirements: Real-time recording requires a stable network connection.
🔊 Handling of simultaneous speaking: Overlapping speech may not be separated perfectly when multiple people talk at once.

How to use

Environment preparation
Ensure that your computer has an NVIDIA graphics card (recommended) or sufficient CPU performance. Install Docker or a Python environment.
Get an API key
Visit the HuggingFace website to register an account and obtain an access token (Token) for downloading AI models.
Configure the system
Copy the configuration file template and fill in your HuggingFace Token and other settings.
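An illustrative configuration sketch, assuming the template is a `.env` file (the variable names here are assumptions; use the names from the project's own template):

```ini
# Illustrative .env sketch — check the project's template for the real variable names
HUGGINGFACE_TOKEN=hf_your_token_here
DEVICE=cuda        ; or "cpu" if no NVIDIA GPU is available
PORT=8000          ; matches the http://localhost:8000/mcp URL used by MCP clients
```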
Start the system
Start the service, and the system will automatically download the required AI models (it may take a few minutes for the first startup).
Access the interface
Open a browser to access the system interface and start uploading audio or recording in real-time.
Train the system
When using it for the first time, enter the voice samples of known speakers through the 'Speaker management' function.

Usage examples

Team meeting records
Record weekly team meetings, automatically generate meeting minutes with speaker labels, and tally each person's speaking time and emotional changes.
Customer service quality monitoring
Analyze customer service call recordings, automatically identify customer service representatives and customers, detect customer emotional changes, and find areas that need improvement.
AI assistant memory enhancement
Enable AI assistants to remember the conversation history and preferences of family members and provide personalized responses.
Interview transcription and analysis
Convert expert interview recordings into text, automatically distinguish interviewers and interviewees, and analyze the emotions and key content of the interviewees.

Frequently Asked Questions

What kind of computer configuration is required?
What preparations are needed for the first use?
How accurate is the system's recognition?
How to improve the recognition accuracy?
Which languages are supported?
How is data security ensured?
How long of a recording can it handle?
How to integrate with an AI assistant?

Related resources

Official GitHub repository
Project source code, latest version, and issue feedback
Web front-end project
A beautiful web interface that provides complete user operation functions
HuggingFace model
Speaker separation model, the usage terms need to be accepted
API interaction documentation
Complete REST API documentation and test interface
MCP protocol documentation
Official specification of the Model Context Protocol
Question discussion area
Submit questions, feature suggestions, and technical discussions
Docker installation guide
Installation tutorial for Docker and Docker Compose
CUDA installation guide
Download and installation of the NVIDIA CUDA toolkit

Installation

Copy one of the following configurations into your MCP client.

HTTP transport (connect directly to the running server):

```json
{
  "mcpServers": {
    "speaker-diarization": {
      "url": "http://localhost:8000/mcp",
      "transport": "http"
    }
  }
}
```

Stdio transport (for clients that launch a local command, via a proxy script):

```json
{
  "mcpServers": {
    "speaker-diarization": {
      "command": "node",
      "args": ["/path/to/mcp-proxy.js", "http://localhost:8000/mcp"]
    }
  }
}
```
Note: Your key is sensitive information; do not share it with anyone.