🚀 MCP Speaker Diarization
An all-in-one package combining GPU-accelerated speaker diarization and speaker recognition with a web interface and REST API. It integrates pyannote.audio diarization with faster-whisper transcription and is designed for AI agent integration and hobby projects.

🚀 Quick Start
Prerequisites
- Get a HuggingFace Token
- Install NVIDIA Container Toolkit (for Docker deployment):
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Option 1: Docker Deployment (Recommended)
git clone <repository-url>
cd speaker-diarization-app
cp .env.example .env          # then add your HF_TOKEN
docker-compose up --build     # first run: build the image; models download on first start
docker-compose up -d          # subsequent runs: start in the background
docker-compose logs -f        # follow the logs
Access the application:
- API Documentation: http://localhost:8000/docs
- API Endpoint: http://localhost:8000/api/v1
- MCP Server: http://localhost:8000/mcp
For a web interface, see the separate Next.js frontend repository.
Option 2: Local Development (Python venv)
sudo apt-get update
sudo apt-get install -y ffmpeg git portaudio19-dev
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
./run_local.sh
First Run:
- Models will auto-download (~3 - 5GB).
- Startup may take 2 - 3 minutes for model loading.
- GPU memory will be allocated (check with nvidia-smi).
✨ Features
- Persistent Speaker Recognition: Enroll speakers once, recognize them across all future recordings and conversations (not just "SPEAKER_00, SPEAKER_01" labels).
- Dual-Detector Emotion System: Combines general AI (emotion2vec+) with personalized voice profiles for improved emotion detection accuracy across 9 emotions (angry, happy, sad, neutral, fearful, surprised, disgusted, other, unknown).
- Personalized Learning: The system learns each speaker's unique emotional voice patterns from corrections with weighted embedding merging (no re-enrollment needed).
- Retroactive Intelligence: Identify one segment → all past segments with that voice automatically update.
- Transcription: faster-whisper 1.2.1 (configurable models: tiny/base/small/medium/large-v3, supports 99 languages, CTranslate2 backend).
- Live Streaming: Real-time recording with WebSocket streaming, VAD, and instant processing.
- AI-Ready Architecture: Built-in MCP server enables seamless integration with AI assistants (Claude Desktop, Flowise, custom agents) providing the contextual memory needed for natural multi-party conversations.
- REST API: Full programmatic access at /api/v1/* (see /docs for interactive documentation).
- Backup/Restore: Export/import speaker profiles and voice settings.
- Production Ready: Handles thousands of conversations, batch processing, live streaming, MP3 conversion, and scales efficiently.
📦 Installation
Hardware
- GPU: NVIDIA GPU with CUDA 12.x support
- Tested on: NVIDIA RTX 3090 (24GB VRAM) - excellent performance
- VRAM Requirements (faster-whisper is very efficient):
  - Diarization + embeddings: ~2 - 3GB base requirement
  - Emotion detection: ~2GB (emotion2vec_plus_large)
  - Whisper model adds (choose based on available VRAM):
    - tiny/base: ~400 - 500MB (total: ~5GB minimum with emotion)
    - small: ~1GB (total: ~6GB recommended with emotion)
    - medium: ~2GB (total: ~7GB recommended with emotion)
    - large-v3: ~3 - 4GB (total: ~8 - 9GB recommended with emotion, default)
- Works on: Consumer GPUs (GTX 1060 6GB+, 1080, 2060, 3060, 3090, 4080, 4090, etc.)
- CPU Fallback: Runs on CPU but significantly slower (GPU strongly recommended)
- RAM: 8GB minimum, 16GB+ recommended
- Storage: ~10GB for models, plus space for audio recordings
Software
- Operating System: Linux (tested on Ubuntu), macOS (via Docker), Windows (via WSL2 + Docker)
- Python: 3.11 or 3.12
- CUDA: 12.4 (included in Docker image)
- cuDNN: 9.x (auto-installed)
- Docker (optional but recommended): 20.10+ with NVIDIA Container Toolkit
System Dependencies
- ffmpeg: Audio processing and format conversion
- git: HuggingFace model downloads
- portaudio19-dev: Live microphone recording (optional)
💻 Usage Examples
Basic Usage
import requests

# Upload an audio file for diarization, transcription, and emotion detection
with open("meeting.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/v1/process",
        files={"audio_file": f},
    )

conversation = response.json()
for segment in conversation["segments"]:
    print(f"{segment['speaker_name']}: {segment['text']}")
    print(f"  Emotion: {segment['emotion_category']} ({segment['emotion_confidence']})")
Advanced Usage
With the MCP server connected (see MCP Server Integration below), an AI assistant can resolve unknown speakers conversationally:
Assistant: "I heard multiple voices. Who were you speaking with?"
User: "That was my colleague Sarah"
📚 Documentation
Screenshots
Example Next.js frontend interface (available at github.com/snailbrainx/speaker_identity_nextjs):
Use Cases
- AI Integration: Enable AI assistants to remember and distinguish multiple speakers across conversations.
- Meeting Transcription: Automatic labeling with emotion context.
- Research & Analysis: Multi-party conversation analysis with persistent identity.
- Customer Support: Separate agents from customers with emotion tracking.
Technical Stack
| Component | Details |
|---|---|
| Diarization | pyannote.audio 4.0.1 (pyannote/speaker-diarization-community-1) |
| Embeddings | pyannote.audio (pyannote/embedding) |
| Emotion Recognition | emotion2vec_plus_large via FunASR (ACL 2024, 9 emotion categories) |
| Transcription | faster-whisper 1.2.1 (configurable models: tiny/base/small/medium/large-v3, 99 languages, CTranslate2 backend) |
| Backend API | FastAPI 0.115.5 with WebSocket streaming support |
| ML Framework | PyTorch 2.5.1 with CUDA 12.4 support |
| Database | SQLAlchemy 2.0.36 with SQLite + Pydantic 2.11.0 |
| Audio Processing | pydub, soundfile, ffmpeg |
| MCP Integration | MCP 1.21.0 for AI agent connectivity |
Emotion Detection
Dual-detector system combining general AI with personalized voice profiles for dramatically improved accuracy.
How It Works
Two complementary detectors work together:
- emotion2vec+ Detector (1024-D emotion embeddings)
  - General emotion AI trained on large datasets.
  - Works for all speakers (known/unknown).
  - 9 categories: angry, happy, sad, neutral, fearful, surprised, disgusted, other, unknown.
- Voice Profile Detector (512-D speaker embeddings)
  - Learns each speaker's unique emotional voice patterns.
  - Requires 3+ voice samples per emotion to activate.
  - Checks general + all emotion-specific profiles (Andy, Andy_angry, Andy_happy, etc.).
Best match wins: If Andy_angry voice profile matches at 92% vs emotion2vec's 78% neutral, the voice detector wins.
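As a rough illustration (not the project's actual code), the "best match wins" rule reduces to comparing detector confidences; the full dual-detector comparison applies the five decision rules shown in the architecture diagram further down:

```python
def pick_emotion(emotion2vec_result, voice_profile_result):
    """Each result is a (label, confidence) pair; voice_profile_result may be None."""
    if voice_profile_result is None:
        return emotion2vec_result  # no learned profile yet: general AI only
    # The stronger signal wins, so a 92% Andy_angry match beats a 78% neutral
    return max(emotion2vec_result, voice_profile_result, key=lambda r: r[1])

print(pick_emotion(("neutral", 0.78), ("angry", 0.92)))  # -> ('angry', 0.92)
```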
Threshold Configuration
Environment Variables:
- EMOTION_THRESHOLD=0.6 - Emotion matching sensitivity (0.3 - 0.9, higher = stricter).
- SPEAKER_THRESHOLD=0.30 - Voice matching sensitivity (0.20 - 0.35, higher = stricter).
Both thresholds can be customized per-speaker or per-emotion via the API for fine-grained control.
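For example, per-speaker and per-emotion overrides can be set through the Emotion Profiles endpoints listed in the API section; the JSON body shape used here is an assumption, so check /docs for the exact schema:

```python
import requests

BASE = "http://localhost:8000/api/v1"
speaker_id = 3  # example ID

# Read the current per-speaker emotion threshold
print(requests.get(f"{BASE}/speakers/{speaker_id}/emotion-threshold").json())

# Tighten emotion matching for this speaker only (body shape assumed)
requests.patch(f"{BASE}/speakers/{speaker_id}/emotion-threshold", json={"threshold": 0.7})

# Per-emotion override, e.g. make "angry" matches stricter for this speaker
requests.patch(
    f"{BASE}/speakers/{speaker_id}/emotion-profiles/angry/threshold",
    json={"threshold": 0.75},
)
```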
Personalized Learning
Correct any segment's emotion → the system learns automatically:
- Stores an emotion embedding (1024-D) for emotion2vec matching.
- Stores a voice embedding (512-D) for voice profile matching.
- Merges using weighted averaging (older samples have more weight).
- Updates the general speaker profile too.
- After 3+ corrections per emotion → the voice detector activates.
Manual correction = 100% confidence. No need to re-identify the speaker.
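A correction is a single call against the segment's correct-emotion endpoint; how the emotion value is passed is an assumption (see /docs):

```python
import requests

BASE = "http://localhost:8000/api/v1"

# Correct segment 123 of conversation 5; the system stores a 1024-D emotion
# embedding and a 512-D voice embedding and merges them into the speaker's profile.
requests.post(
    f"{BASE}/conversations/5/segments/123/correct-emotion",
    params={"emotion": "angry"},  # parameter name is assumed
)
```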
Performance
- Speed: ~37ms per segment (+5ms for voice matching).
- VRAM: ~2GB emotion2vec + ~1GB speaker embeddings (shared).
- Activation: 3+ voice samples required per emotion.
Remote Access
If you're running the application on a remote server (e.g., a headless Ubuntu server with GPU), you can access the web interface via SSH port forwarding.
SSH Tunnel (Windows)
Using PowerShell or Command Prompt:
ssh -L 8000:localhost:8000 username@remote-server-ip
Using PuTTY:
- Open PuTTY and enter your server hostname/IP.
- Navigate to: Connection → SSH → Tunnels.
- Add a forwarding rule:
  - Source port: 8000
  - Destination: localhost:8000
- Click "Add".
- Return to the Session tab and connect.
After connecting:
- Open a browser on your Windows machine.
- Navigate to http://localhost:8000/docs (API documentation).
SSH Tunnel (Linux/Mac)
ssh -L 8000:localhost:8000 username@remote-server-ip
Then access the API docs at http://localhost:8000/docs.
Configuration
All settings are configured via environment variables in the .env file:
Required
HF_TOKEN=your_huggingface_token_here
Optional (with optimized defaults)
DATABASE_URL=sqlite:////app/volumes/speakers.db
SPEAKER_THRESHOLD=0.30
CONTEXT_PADDING=0.15
SILENCE_DURATION=0.5
FILTER_HALLUCINATIONS=true
EMOTION_THRESHOLD=0.6
WHISPER_MODEL=large-v3
WHISPER_LANGUAGE=en
Recommended Settings
Default settings are optimized for normal home usage:
- SPEAKER_THRESHOLD = 0.30: Good balance of accuracy and matching for home conversations.
- CONTEXT_PADDING = 0.15: Optimal for audio with background noise/music.
- SILENCE_DURATION = 0.5: Balances responsiveness with complete sentence capture.
- WHISPER_MODEL = large-v3: Best accuracy, requires ~3 - 4GB VRAM. Use small (~1GB) or base (~500MB) for weaker GPUs.
- WHISPER_LANGUAGE = en: English only (fastest). Use auto for multilingual auto-detection or specify a language code.
For challenging conditions such as movie audio (background music and effects), reduce SPEAKER_THRESHOLD to 0.20 for more lenient matching.
How It Works
Architecture Overview
┌──────────────────────────────────────────────────────────────────┐
│ User Input │
│ (Upload Audio / Live Recording) │
└──────────────────┬───────────────────────────────────────┘
│
▼
┌───────────────────────┐
│ Audio Format │
│ Conversion │
│ (if needed) │
│ │
│ MP3/M4A → WAV │
│ Live: 48kHz chunks │
└───────────┬───────────┘
│
╔══════════════╧════════════════╗
║ PARALLEL PROCESSING ║ ← ~50% faster!
║ ThreadPoolExecutor ║ Both run
║ (2 workers) ║ simultaneously
╚══════════════╤════════════════╝
│
┌─────────────────┴─────────────────┐
│ │
▼ ▼
┌────────────────────┐ ┌───────────────────────┐
│ Transcription │ │ Diarization │
│ (faster-whisper) │ │ (pyannote.audio) │
│ │ │ │
│ "What was said" │ │ "Who spoke when" │
│ │ │ │
│ • Speech → Text │ │ • Detect speaker │
│ • Word timestamps │ │ turns │
│ • Confidence │ │ • Assign labels │
│ scores │ │ (SPEAKER_00, etc.) │
│ • VAD filtering │ │ • Time boundaries │
│ │ │ │
│ ~40 - 100ms │ │ ~40 - 100ms │
│ (live segments) │ │ (live segments) │
└─────────┬──────────┘ └───────────┬───────────┘
│ │
└──────────────┬───────────────────┘
│
▼
┌───────────────────────┐
│ Segment Alignment │
│ │
│ Match transcription │
│ to speaker labels │
│ by timestamp overlap │
└───────────┬───────────┘
│
┌───────────────┴───────────────┐
│ │
▼ ▼
┌────────────────────┐ ┌────────────────────────┐
│ Embedding │ │ Speaker Matching │
│ Extraction │ │ (Cosine Similarity) │
│ (pyannote) │ │ │
│ │ │ Compare embeddings │
│ • Extract voice │──────→ to known speakers │
│ signature │ │ │
│ • 512-D vectors │ │ Threshold: 0.20 - 0.30 │
│ • Context padding │ │ │
│ (0.15s) │ │ Match or Unknown? │
│ • Skip if <0.5s │ │ │
└────────────────────┘ └───────────┬────────────┘
│
┌──────────────┴──────────────┐
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────┐
│ Known Speaker │ │ Unknown Speaker │
│ "Alice" │ │ "Unknown_01" │
│ │ │ │
│ • Has ID │ │ • No ID yet │
│ • Confidence │ │ • Auto-enrolled │
│ score │ │ • Embedding │
│ │ │ stored │
└────────┬────────┘ └────────┬─────────┘
│ │
└──────────┬────────────────┘
│
▼
┌───────────────────────┐
│ EMOTION DETECTION │
│ (Sequential) │
│ │
│ "How they felt" │
└───────────┬───────────┘
│
┌───────────▼───────────┐
│ Step 1: Speaker │
│ Matching │
│ │
│ • Extract 512-D │
│ voice embedding │
│ • Check ALL profiles:│
│ - Andy (general) │
│ - Andy_angry │
│ - Andy_happy │
│ (if ≥3 samples) │
│ │
│ Returns: │
│ • speaker_name │
│ • matched_emotion │
│ (or None) │
└───────────┬───────────┘
│
┌───────────▼───────────┐
│ Step 2: emotion2vec+ │
│ (ALWAYS RUNS) │
│ │
│ • Extract 1024-D │
│ emotion embedding │
│ • 9 categories │
│ │
│ Returns: │
│ • emotion: "neutral" │
│ • confidence: 0.78 │
│ │
│ ~30ms per segment │
└───────────┬───────────┘
│
┌───────────▼───────────┐
│ Decision: Did Step 1 │
│ find emotion profile?│
└───────┬───────────────┘
│
┌─────────┴─────────┐
YES │ │ NO
▼ ▼
┌────────────────────┐ ┌──────────────────┐
│ FAST PATH │ │ Check profiles? │
│ Override emotion │ │ │
│ │ │ If has profiles │
│ Use Step 1 result │ │ (≥3 samples): │
│ emotion = "angry" │ │ │
│ conf = 0.92 │ │ → Dual-detector │
│ │ │ comparison │
│ Skip comparison │ │ │
│ │ │ Else: │
│ ~0ms (instant) │ │ → Use emotion2vec│
└────────┬───────────┘ └────────┬─────────┘
│ │
│ ┌────────────▼────────────┐
│ │ SLOW PATH: │
│ │ Dual-Detector Compare │
│ │ │
│ │ • emotion2vec match │
│ │ (1024-D) │
│ │ • Voice profile match │
│ │ (512-D, ≥3 samples) │
│ │ │
│ │ Rules: │
│ │ 1. Both agree → Avg │
│ │ 2. neutral → Trust it │
│ │ 3. Voice >85% → Voice │
│ │ 4. Disagree → neutral │
│ │ 5. Else → emotion2vec │
│ │ │
│ │ ~5ms additional │
│ └────────┬────────────────┘
│ │
└───────────────────┘
│
┌──────────────▼─────────────┐
│ Final Emotion │
│ │
│ With detector_breakdown: │
│ • emotion2vec result │
│ • voice profile result │
│ • final decision + reason │
└────────────────────────────┘
│
▼
┌───────────────────────┐
│ Database Storage │
│ │
│ ConversationSegment: │
│ • text │
│ • speaker_name │
│ • speaker_id │
│ • confidence │
│ • emotion_category │
│ • emotion_confidence │
│ • emotion_corrected │
│ • emotion_misidentified│
│ • start/end times │
│ • word-level data │
└───────────┬───────────┘
│
┌───────────────┴───────────────┐
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────────┐
│ Auto-Clustering │ │ User Identifies │
│ │ │ Unknown Speaker │
│ Group similar │ │ │
│ Unknown speakers│ │ "Unknown_01 is Bob" │
│ by embedding │ │ │
│ similarity │ │ → Embedding Merging │
└──────────────────┘ │ → Retroactive │
│ Updates (all │
│ past segments) │
└──────────────────────┘
│
▼
┌──────────────────────┐
│ User Corrects │
│ Emotion │
│ │
│ "Actually angry, │
│ not neutral" │
│ │
│ → Extract & store: │
│ • 1024-D emotion │
│ embedding │
│ • 512-D voice │
│ embedding │
│ → Merge into │
│ SpeakerEmotion │
│ Profile (weighted│
│ averaging) │
│ → Updates: │
│ • Andy_angry │
│ (emotion profile)│
│ • General Andy │
│ (speaker profile)│
│ → If changing │
│ emotion: recalc │
│ OLD profile too │
│ → Confidence = 100% │
│ (manual confirm) │
│ → After 3+ samples: │
│ voice detector │
│ activates! │
└──────────────────────┘
Key Points:
- Parallel Processing: Transcription (Whisper) and Diarization (Pyannote) run simultaneously using ThreadPoolExecutor, achieving ~50% speedup (see the sketch after this list).
- Processing Speed (per segment on GPU):
- Transcription + Diarization: ~40 - 100ms (parallel).
- Alignment + Voice Embedding Extraction: ~20 - 40ms.
- Speaker Matching (checks all profiles including emotion-specific): ~0.5ms.
- emotion2vec+ Extraction (ALWAYS runs): ~30ms.
- Decision Path:
- Fast path override (if emotion profile matched): ~0ms instant.
- Dual-detector comparison (if profiles exist): ~5ms additional.
- Fallback to emotion2vec only: ~0ms (already extracted).
- Audio Conversion: Automatic format conversion (MP3 → WAV) before processing; live recording saves 48kHz chunks.
- Sequential Operations: Alignment → Voice Embedding Extraction (512-D) → Speaker Matching → emotion2vec Extraction (1024-D) → Decision (override or dual-detector or fallback).
- Emotion Detection Flow:
- Speaker matching checks ALL profiles (general + emotion-specific like Andy_angry).
- emotion2vec ALWAYS extracts emotion (runs for every segment).
- IF the speaker matched an emotion profile → Override emotion2vec (fast path).
- ELSE IF the speaker has learned profiles (≥3 samples) → Dual-detector comparison (5 decision rules).
- ELSE → Use the emotion2vec result only.
- Dual-Detector System: Stores BOTH 1024-D emotion embeddings (emotion2vec) AND 512-D voice embeddings (speaker recognition) per emotion; the voice profile detector requires ≥3 samples to activate.
- Personalized Learning: User corrections extract and store BOTH embedding types, merge using weighted averaging; changing emotions recalculates BOTH old and new profiles; confidence is set to 100% after manual correction.
- Why Sequential?: The bottleneck (transcription + diarization) is parallelized. Post-processing (~35ms total) is fast enough that further parallelization adds complexity without meaningful speedup.
- Sample Rates: Browser (48kHz) → Whisper/Pyannote (auto-resample) → Emotion2vec (16kHz) → Storage (WAV 48kHz for streaming, MP3 192k for uploads).
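The parallel step above can be pictured as a two-worker thread pool. This is a simplified sketch with placeholder transcribe/diarize callables, not the project's actual functions:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(audio_path, transcribe, diarize):
    """Run transcription and diarization concurrently.

    `transcribe` and `diarize` stand in for the faster-whisper and pyannote calls;
    the heavy lifting happens in native GPU code, so a two-thread pool is sufficient.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        transcript_future = pool.submit(transcribe, audio_path)
        diarization_future = pool.submit(diarize, audio_path)
        return transcript_future.result(), diarization_future.result()
```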
Processing Pipeline
- Audio Input
- Upload: MP3/WAV files are automatically converted and saved to data/recordings/.
- Live: Browser microphone → streaming chunks are saved to data/stream_segments/.
- Parallel Processing (faster than sequential)
- Diarization (pyannote): Detects speaker turns, outputs segments with anonymous labels (SPEAKER_00, SPEAKER_01, etc.).
- Transcription (Whisper): Converts speech to text with timestamps.
- Both run simultaneously using ThreadPoolExecutor.
- Segment Alignment
- Match transcription segments to speaker labels by timestamp overlap.
- Uses the segment midpoint for matching: (start + end) / 2.
- Falls back to the closest segment if there is no exact overlap (see the sketch after this list).
- Embedding Extraction
- For each segment, extract a 512-dimensional voice signature using the pyannote embedding model.
- Context padding (0.15s) is added before/after for robustness with background noise.
- Minimum segment duration: 0.5 seconds.
- Speaker Matching
- Compare the segment embedding to known speaker embeddings.
- Cosine similarity calculation (0.0 - 1.0).
- If similarity > threshold (default 0.30): Identified as a known speaker.
- If similarity ≤ threshold: Labeled as "Unknown_XX".
- Unknown Speaker Handling
- Embedding verification: Check if multiple Unknown segments are the same person.
- Group similar unknowns (same threshold).
- Each unique voice gets a unique Unknown_XX identifier.
- Embeddings are stored for future auto-enrollment.
- Auto-Enrollment (when the user identifies an unknown)
- The user provides a speaker name for any segment.
- If it's a new name: Creates a speaker profile automatically.
- Embedding merging: Averages embeddings from all segments of the same speaker.
- Retroactive updates: All past segments with the same Unknown label get updated.
- Continuous improvement: Each identification strengthens the speaker profile.
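A minimal sketch of the alignment and matching steps described above. It is illustrative only; the dictionary keys and the exact fallback behavior are assumptions:

```python
import numpy as np

def assign_speaker(transcript_seg, diar_turns):
    """Match a transcription segment to the diarization turn containing its midpoint."""
    mid = (transcript_seg["start"] + transcript_seg["end"]) / 2
    for turn in diar_turns:
        if turn["start"] <= mid <= turn["end"]:
            return turn["label"]  # e.g. "SPEAKER_00"
    # Fallback: the turn whose midpoint is closest when nothing overlaps
    return min(diar_turns, key=lambda t: abs((t["start"] + t["end"]) / 2 - mid))["label"]

def match_speaker(embedding, known_speakers, threshold=0.30):
    """Cosine-similarity match of a 512-D segment embedding against enrolled speakers."""
    best_name, best_sim = "Unknown", -1.0
    for name, ref in known_speakers.items():
        sim = float(np.dot(embedding, ref) / (np.linalg.norm(embedding) * np.linalg.norm(ref)))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return (best_name, best_sim) if best_sim > threshold else ("Unknown", best_sim)
```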
Voice Activity Detection (VAD)
Two independent VAD systems work together:
- Live Recording VAD (energy-based)
- Calculates RMS energy: sqrt(mean(audio^2)) (see the sketch after this list).
- Threshold: 0.005 (configurable).
- Detects speech vs. silence in real-time.
- Shows a live indicator in the UI: "🟢 Speech Detected" or "⚪ Idle".
- After X seconds of silence (default 0.5s), triggers segment processing.
- Transcription VAD (Whisper's built-in)
- Uses the Silero VAD model.
- Filters non-speech before transcription.
- Reduces hallucinations ("thank you.", "thanks for watching").
- Enabled via the vad_filter=True parameter.
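The energy-based live VAD boils down to an RMS check per audio chunk. A rough sketch using the default values above (chunk handling is an assumption):

```python
import numpy as np

ENERGY_THRESHOLD = 0.005   # default live-recording VAD threshold
SILENCE_DURATION = 0.5     # seconds of silence that trigger segment processing

def is_speech(chunk: np.ndarray) -> bool:
    """RMS energy check on a float audio chunk (samples roughly in [-1, 1])."""
    rms = float(np.sqrt(np.mean(np.square(chunk))))
    return rms > ENERGY_THRESHOLD

def should_finalize_segment(consecutive_silent_chunks: int, chunk_seconds: float) -> bool:
    """True once enough consecutive silent chunks add up to the silence duration."""
    return consecutive_silent_chunks * chunk_seconds >= SILENCE_DURATION
```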
Misidentification Correction
- Mark as Misidentified: Exclude a segment from embedding calculations.
- Reassign to Correct Speaker: Updates both speakers' embeddings.
- Automatic Recalculation: Averages embeddings from all non-misidentified segments.
- Prevents Embedding Corruption: Ensures speaker profiles remain accurate.
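Both operations map onto the segment endpoints listed in the API section below; the request bodies shown here are assumptions, so check /docs for the exact schema:

```python
import requests

BASE = "http://localhost:8000/api/v1"

# Exclude a wrongly attributed segment from the speaker's embedding average
requests.patch(
    f"{BASE}/conversations/5/segments/123/misidentified",
    json={"misidentified": True},  # body shape assumed
)

# Reassign it to the correct speaker; both speakers' embeddings are recalculated
requests.post(
    f"{BASE}/conversations/5/segments/123/identify",
    params={"speaker_name": "Alice"},
)
```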
REST API & MCP Server
API Overview
Base URL: http://localhost:8000/api/v1
Interactive Docs: http://localhost:8000/docs (Swagger UI with test interface)
Key Endpoints:
- System
GET /status - Health check, GPU status, system stats.
- Settings
GET/POST /settings/voice - Runtime configuration (thresholds, padding, filtering).
POST /settings/voice/reset - Reset to defaults.
- Speakers
GET /speakers - List all enrolled speakers with segment counts.
POST /speakers/enroll - Enroll a new speaker with an audio sample.
PATCH /speakers/{id}/rename - Rename a speaker (auto-updates all past segments).
DELETE /speakers/{id} - Delete a speaker profile.
DELETE /speakers/unknown/all - Delete all Unknown_* speakers.
- Emotion Profiles
GET /speakers/{id}/emotion-profiles - View learned emotion profiles.
DELETE /speakers/{id}/emotion-profiles - Reset emotion profiles.
GET/PATCH /speakers/{id}/emotion-threshold - Per-speaker emotion threshold.
PATCH /speakers/{id}/emotion-profiles/{emotion}/threshold - Per-emotion threshold.
- Conversations
GET /conversations - List all conversations (paginated).
GET /conversations/{id} - Get the full transcript with all segments.
PATCH /conversations/{id} - Update conversation metadata.
DELETE /conversations/{id} - Delete a conversation and its audio.
POST /conversations/{id}/reprocess - Re-run diarization with current speakers.
POST /conversations/{id}/recalculate-emotions - Recalculate emotions for all segments.
POST /process - Upload and process an audio file.
- Segments
POST /conversations/{id}/segments/{seg_id}/identify - Identify a speaker (auto-updates all past).
POST /conversations/{id}/segments/{seg_id}/correct-emotion - Correct and learn an emotion.
PATCH /conversations/{id}/segments/{seg_id}/misidentified - Mark a speaker as misidentified.
PATCH /conversations/{id}/segments/{seg_id}/emotion-misidentified - Mark an emotion as wrong.
GET /conversations/segments/{seg_id}/audio - Download segment audio.
- Streaming
WS /streaming/ws - WebSocket for live recording.
- Backup/Restore
POST /profiles - Create a new backup profile.
GET /profiles - List all backup profiles.
GET /profiles/{name} - Get specific profile details.
PATCH /profiles/{name} - Save the current state to a profile.
DELETE /profiles/{name} - Delete a backup profile.
POST /profiles/{name}/checkpoints - Create a checkpoint.
POST /profiles/restore - Restore from a backup.
GET /profiles/download/{name} - Download a backup JSON.
POST /profiles/import - Import a backup JSON.
📖 Full documentation with examples: http://localhost:8000/docs
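A quick programmatic tour of the read-only endpoints, assuming they return JSON as listed (exact response shapes are in /docs):

```python
import requests

BASE = "http://localhost:8000/api/v1"

print(requests.get(f"{BASE}/status").json())     # health check, GPU status, system stats

for speaker in requests.get(f"{BASE}/speakers").json():
    print(speaker)                               # enrolled profiles with segment counts

for conv in requests.get(f"{BASE}/conversations").json():
    print(conv)                                  # conversation list (paginated)
```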
MCP Server Integration
Model Context Protocol (MCP) enables AI assistants to directly interact with the speaker diarization system.
MCP Endpoint: http://localhost:8000/mcp
Protocol: JSON-RPC 2.0 over HTTP with Server-Sent Events
Compatible With: Claude Desktop, Flowise, custom MCP clients
Available MCP Tools (11):
list_conversations - Get conversation IDs and metadata.
get_conversation - Get the full transcript with speaker labels.
get_latest_segments - Get recent segments across conversations.
identify_speaker_in_segment - Label an unknown speaker (auto-updates all past segments).
rename_speaker - Rename an existing speaker (auto-updates all past segments).
list_speakers - Get all enrolled speaker profiles.
delete_speaker - Remove a speaker profile.
delete_all_unknown_speakers - Cleanup Unknown_* speakers.
update_conversation_title - Set a conversation title.
reprocess_conversation - Re-run recognition with updated speaker profiles.
search_conversations_by_speaker - Find all conversations where a specific speaker appears.
Key Features:
- Automatic Retroactive Updates: Identifying/renaming a speaker updates ALL past segments automatically.
- No Reprocessing Needed: The system maintains speaker identity across sessions.
- Auto-Enrollment: Can create speaker profiles from any segment.
- Conversation Context: AI can retrieve the full "who said what" history.
Example MCP Client Configuration (Flowise/Claude Desktop):
{
  "mcpServers": {
    "speaker-diarization": {
      "url": "http://localhost:8000/mcp",
      "transport": "http"
    }
  }
}
Usage Example:
Assistant: "I heard multiple voices. Who were you speaking with?"
User: "That was my colleague Sarah"
AI Assistant Integration Examples
Integration Approaches
- Option 1: REST API (Full Control)
- Your app manages audio recording and streaming.
- POST audio to /api/v1/process or use the WebSocket /streaming/ws.
- Receive segments with speaker labels and emotions.
- Query conversation history via /conversations endpoints.
- Option 2: MCP Server (AI-Native)
- Connect Claude Desktop, Flowise, or custom MCP clients.
- The AI assistant directly calls the 11 MCP tools for speaker management.
- Automatic retroactive updates when identifying/renaming speakers.
- Zero code - just configure the MCP endpoint.
Example Workflow
- Scenario: An AI assistant having a multi-party conversation.
- Unknown speaker detected
User: "Alright mate, how are you doing?"
Unknown: "Good mate, you?"
AI: "Who are you speaking to?"
User: "That's Nick"
- AI identifies the speaker via MCP
identify_speaker_in_segment(
    segment_id=145,
    speaker_name="Nick",
    auto_enroll=true
)
- The system auto-updates all past segments
- Creates Nick's voice profile.
- Updates ALL previous Unknown segments with Nick's voice.
- Future recordings recognize Nick automatically.
- AI remembers Nick in future conversations
Nick: "Hey, remember what we discussed yesterday?"
AI: "Yes Nick, you mentioned the project deadline..."
Advanced Features
Embedding Merging
When identifying unknown speakers or re-identifying existing speakers:
- Never replaces embeddings (would lose historical data).
- Always merges using averaging: (existing_embedding + new_embedding) / 2 (see the sketch after this list).
- Continuous improvement: Each recording strengthens the speaker profile.
- Handles variability: Averages across different audio conditions, emotions, etc.
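In embedding terms the merge is just an average; a sketch with numpy arrays standing in for the stored 512-D vectors:

```python
import numpy as np

def merge_embeddings(existing: np.ndarray, new: np.ndarray) -> np.ndarray:
    """Fold a new voice embedding into an existing profile without discarding history."""
    return (existing + new) / 2

def profile_from_segments(segment_embeddings: list[np.ndarray]) -> np.ndarray:
    """Auto-enrollment: average the embeddings of every segment attributed to one speaker."""
    return np.mean(np.stack(segment_embeddings), axis=0)
```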
Retroactive Identification
Identify any segment's speaker → all past segments with that voice automatically update:
curl -X POST "http://localhost:8000/api/v1/conversations/5/segments/123/identify?speaker_name=Alice&enroll=true"
Backup & Restore
Export and restore speaker profiles:
- Backup:
- Exports all speakers and their embeddings to JSON.
- Includes segment assignments for full state recovery.
- Saves to backups/backup_YYYYMMDD_HHMMSS.json.
- Does NOT include audio files (only speaker data).
- Restore:
- Reconstructs the speaker database from a backup.
- Restores embeddings and segment assignments.
- Useful for testing different configurations.
- Useful for migrating between deployments.
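Backups can be moved between deployments with the download and import endpoints; the multipart field name used for import is an assumption (see /docs):

```python
import requests

BASE = "http://localhost:8000/api/v1"

# Download an existing backup profile as JSON ("my_profile" is an example name)
backup = requests.get(f"{BASE}/profiles/download/my_profile")
with open("my_profile.json", "wb") as f:
    f.write(backup.content)

# Import it on another deployment (multipart field name assumed)
with open("my_profile.json", "rb") as f:
    requests.post(f"{BASE}/profiles/import", files={"file": f})
```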
Ground Truth Labeling
Test and optimize recognition accuracy:
- Manually label segments with true speaker identities.
- Labels are stored separately (doesn't affect actual segments).
- Run tests comparing predictions vs. labels.
- Optimize threshold and padding parameters.
- Current optimal settings are derived from this testing.
Data Persistence
Directory Structure
speaker-diarization-app/
├── data/
│ ├── recordings/ # Permanent audio storage
│ │ ├── conv_7_full.mp3 # Live recordings (MP3)
│ │ ├── uploaded_1_tommy_converted.wav # Uploads
│ │ └── 20251109_160230_meeting.wav # Timestamped uploads
│ │
│ ├── stream_segments/ # Live recording segments (temporary)
│ │ └── conv_7/
│ │ ├── seg_0001.wav
│ │ ├── seg_0002.wav
│ │ └── ...
│ │
│ └── temp/ # Temporary segment extractions
│ └── segment_123_456.wav
│
├── volumes/
│ ├── speakers.db # SQLite database
│ └── huggingface_cache/ # Downloaded models
│
├── backups/ # Backup snapshots (JSON)
│ └── backup_20251109_120000.json
│
├── scripts/ # Utility scripts
│ ├── migrate_temp_audio.py # Fix audio paths
│ ├── diagnose_speakers.py # Debug issues
│ └── ...
│
└── tests/ # Test files
└── test_*.py
Docker Volumes
All data persists via volume mounts in docker-compose.yml:
volumes:
- ./volumes:/app/volumes
- ./data:/app/data
- ./backups:/app/backups
What Persists:
- ✅ Speaker profiles and embeddings.
- ✅ All conversations and segments.
- ✅ Audio recordings.
- ✅ Downloaded models (~3 - 5GB).
- ✅ Backup snapshots.
What Doesn't Persist:
- ❌ Container state (rebuild-safe).
- ❌ Logs (use docker-compose logs -f to monitor).
Troubleshooting
Installation Issues
- "HuggingFace token not found"
- Ensure HF_TOKEN is set in the .env file.
- Accept the model terms at HuggingFace (links in Prerequisites).
- Check that the token has no extra spaces/quotes.
- "Unable to load libcudnn_cnn.so.9"
- Standalone: run_local.sh sets LD_LIBRARY_PATH automatically.
- Docker: The Dockerfile installs cuDNN via pip.
- Manual: pip install nvidia-cudnn-cu12==9.* nvidia-cublas-cu12.
- Permission errors
  sudo chown -R $USER:docker data/ volumes/ backups/
- Verify GPU access from Docker:
  docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Processing Issues
- "CUDA out of memory"
- Close other GPU applications.
- Process shorter audio segments.
- Enable transcription selectively (disable for diarization-only).
- Fallback: Run on CPU (set CUDA_VISIBLE_DEVICES="" - very slow).
- Speaker not recognized
- Enrollment audio should be at least 10 seconds (ideally 10 - 30 seconds).
- Use clear audio with minimal background noise.
- Check SPEAKER_THRESHOLD: lower values match more leniently (try the 0.20 - 0.35 range, default 0.30).
- Re-enroll with better quality audio.
- "Audio file not found" errors
- Old uploads: Run python scripts/migrate_temp_audio.py.
- New uploads: Should auto-save to data/recordings/.
- Verify that the data/ directory is accessible.
- Whisper hallucinations ("thank you.", "thanks for watching")
- Already filtered via energy thresholding and text filtering.
- Set FILTER_HALLUCINATIONS=true in the .env file.
- Ensure vad_filter=True in transcription (default).
Performance Issues
- Slow processing
- Verify that the GPU is in use: check nvidia-smi during processing.
- Docker: Ensure runtime: nvidia in docker-compose.yml.
- Check if CUDA is available: python -c "import torch; print(torch.cuda.is_available())".
- First run: Models download (~3 - 5GB), subsequent runs are much faster.
- High memory usage
- Normal: Models load ~4 - 6GB VRAM.
- Transcription adds ~2 - 3GB.
- Multiple simultaneous processes multiply memory usage.
- Reduce the batch size or process sequentially.
Audio Issues
- No audio playback in the UI
- Check if audio files exist: ls data/recordings/.
- Verify that the API endpoint returns audio: /api/v1/conversations/segments/{id}/audio.
- Check the browser console for errors.
- Try a different browser (tested: Chrome, Firefox, Safari).
- Live recording not working
- Browser permission: Allow microphone access.
- Standalone: Install PortAudio: sudo apt-get install portaudio19-dev.
- Check browser microphone settings.
- Try a different browser.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
Dependency Licenses
All major dependencies use permissive open-source licenses compatible with MIT:
- pyannote.audio (4.0.1): MIT License
- Models require a HuggingFace token and terms acceptance.
- The models themselves remain open-source and MIT licensed.
- faster-whisper (1.2.1): MIT License (SYSTRAN)
- FastAPI (0.115.5): MIT License
- Next.js (15.x): MIT License
- PyTorch (2.5.1): BSD 3-Clause License
- SQLAlchemy (2.0.36): MIT License
- Pydantic (2.11.0): MIT License
- MCP (1.21.0): MIT License
Note: While the software licenses are permissive, pyannote's pretrained models require:
- A HuggingFace account.
- An access token.
- Acceptance of the model terms of use.
This is an authentication requirement, not a licensing restriction. The models remain open-source.
Credits
This project builds upon exceptional open-source work:
- pyannote.audio by Hervé Bredin - State-of-the-art speaker diarization and embedding models.
- faster-whisper by SYSTRAN - Optimized Whisper implementation using CTranslate2.
- OpenAI Whisper - The original speech recognition model.
- FastAPI by Sebastián Ramírez - A modern web framework.
Thank you to these projects and their contributors for making this application possible.
Planned Features
Automatic Conversation Summarization and Titling
- AI-powered conversation summarization when recording finishes.
- Automatic title generation based on conversation content.
- Triggers when the current conversation ends and a new one begins.
- Replaces generic "Conversation 15" with meaningful titles like "Discussion about project deadline with Nick".
- Helps with conversation discovery and context retrieval.
Vector Database Search for Transcriptions
- Store transcription text in a vector database for semantic search.
- Query conversations by topic or content, not just speaker.
- Each vector entry references a conversation ID for easy retrieval.
- Enables long-term memory and contextual conversation lookup.
- Use cases:
- "What did we discuss about the budget last month?"
- "Find conversations where we talked about product features"
- "Show me all discussions related to the new project"
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Areas for contribution:
- Additional language support (the default configuration is English-only).
- Performance optimizations.
- UI/UX improvements.
- Documentation improvements.
Disclaimer
This software is provided "as-is" without warranty of any kind. The developers make no guarantees about the accuracy of speaker identification or transcription. While we've implemented best practices and extensive testing, speaker recognition is inherently probabilistic and may produce errors.
Use responsibly:
- Verify important identifications manually.
- Test thoroughly in your environment.
- Respect privacy and obtain consent before recording.
- This is a tool to assist, not replace, human judgment.
Some portions of this codebase were developed collaboratively with Claude Code (AI pair programming assistant). While thoroughly tested, we recommend reviewing the code before deploying in critical applications.
Questions or issues? Open an issue on GitHub or check existing issues for solutions.
Want to use this with AI agents? See the API Reference section for MCP integration guidance.