MCP Video Extraction Plus

This project extends the video speech recognition feature. It originally supported only the local Whisper model and now also supports the online speech recognition services of CapCut and Bcut, giving a flexible multi-service architecture.


Rating: 2.5 points

Downloads: 8.6K

Update time: 2025-12-12


What is MCP Video Extraction Plus?

MCP Video Extraction Plus is a video processing tool that extracts the audio track from video files and transcribes it into editable text. It integrates several speech recognition backends, so you can choose the one that best fits your needs, whether that is local processing or an online service.

How to use MCP Video Extraction Plus?

Configure your preferred speech recognition method, then provide a video or audio file; the system processes it and returns the text result. Choose local processing to keep your data on your own machine, or an online service for faster processing.

Applicable scenarios

It is suitable for various scenarios such as video subtitle generation, meeting record collation, educational video content extraction, podcast transcription, and multimedia content analysis. It is especially suitable for content creators, educators, researchers, and users who need to process a large amount of video materials.

Main features

Multi-mode speech recognition

Supports three recognition methods: the local Whisper model, the online service of CapCut, and the online service of Bcut, meeting the needs of different scenarios.

Intelligent timestamp

Automatically adds precise timestamps to each segment of text, facilitating positioning and editing.

Cache optimization

Supports result caching to avoid repeated processing of the same content and improve efficiency.

Flexible configuration

Easily adjust various parameters through configuration files or environment variables to meet different needs.

Progress tracking

Displays the processing progress in real time, so you can see the current status at a glance.

Error recovery

Built-in error handling keeps processing stable and reliable.

Advantages

Multiple recognition methods are available, offering high flexibility

Local processing protects privacy, and online services are fast

Supports mixed recognition of Chinese and English

Simple configuration, easy to integrate into existing workflows

Open-source and free, with active community support

Limitations

Online services require an internet connection

Processing large files may take a long time

The recognition accuracy of some dialects or professional terms may be low

Some technical knowledge is required for configuration

How to use

Installation and configuration

First, install the necessary dependency packages, then edit the configuration file according to your needs, and select your preferred speech recognition method.

Select the recognition method

Set the asr.provider parameter in the configuration file. The available values are: whisper (local), jianying (CapCut), bcut (Bcut).

Run the extraction service

Start the video extraction service and prepare to process your video or audio file.

Submit the processing task

Submit your video file through the API or command line, and the system will automatically process it and return the text result.

Usage examples

Educational video subtitle generation

Convert an online course video into a text script with timestamps, facilitating students' review and search.

Meeting record collation

Quickly convert team meeting recordings into text records, improving meeting efficiency.

Multilingual video processing

Process videos containing mixed Chinese and English content to obtain an accurate bilingual transcription.

Frequently Asked Questions

What are the differences between the three recognition methods?

Which video formats are supported?

How is the processing speed?

Is programming knowledge required?

How to improve the recognition accuracy?

Related resources

GitHub repository

Project source code and latest updates

Detailed documentation

Complete usage instructions and API documentation

Whisper official documentation

Detailed information about the OpenAI Whisper model

Example configuration file

Complete configuration example

Community discussion

Exchange usage experiences with other users

🚀 MCP Video Extraction Plus: Speech Recognition Service Integration Notes

This project is forked from MCP Video Extraction. The modification extends the project's speech recognition service, which originally supported only Whisper (local model), to three options:

Whisper – OpenAI's local speech recognition model (original)
JianYing (CapCut) – ByteDance's JianYing/CapCut online speech recognition service
Bcut (Bilibili Cut) – Bilibili's online speech recognition service

📦 Newly Added Files

Core ASR Module

File	Description
`src/mcp_video_service/services/asr_data.py`	Definition of ASR data structures (ASRDataSeg, ASRData)
`src/mcp_video_service/services/status.py`	Enumeration of ASR task statuses
`src/mcp_video_service/services/base_asr.py`	Base abstract class for ASR, parent class of all ASR implementations
`src/mcp_video_service/services/jianying_asr.py`	Implementation of JianYing (CapCut) speech recognition
`src/mcp_video_service/services/bcut_asr.py`	Implementation of Bilibili Cut (Bcut) speech recognition
`src/mcp_video_service/services/__init__.py`	Initialization file for the Services package

📝 Modified Files

1. `src/mcp_video_service/services/video_service.py`

Major Changes:

Added support for ASR provider configuration (asr_provider).
Modified the __init__ method to support reading the ASR provider selection from the configuration file.
Added a new _create_asr_instance() method to create the corresponding ASR instance based on the configuration.
Modified the extract_text() method to support multiple ASR providers:
- If Whisper is selected, use the local model.
- If JianYing or Bcut is selected, call the online API.
Added import statements to support both relative and absolute imports.
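
The provider dispatch described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual `_create_asr_instance()`: the stub classes and the `create_asr_instance` function are hypothetical stand-ins, and only the provider names (`whisper`, `jianying`, `bcut`) and the `jianying` time settings come from the configuration shown in this document.

```python
from typing import Optional


class JianYingASRStub:
    """Hypothetical stand-in for the JianYing online ASR client."""

    def __init__(self, audio_path: str, start_time: int = 0, end_time: int = 6000):
        self.audio_path = audio_path
        self.start_time = start_time
        self.end_time = end_time


class BcutASRStub:
    """Hypothetical stand-in for the Bcut online ASR client."""

    def __init__(self, audio_path: str):
        self.audio_path = audio_path


def create_asr_instance(config: dict, audio_path: str) -> Optional[object]:
    """Return an online ASR client, or None when the local Whisper path applies."""
    provider = str(config.get("asr", {}).get("provider", "whisper")).lower()
    if provider == "whisper":
        return None  # the caller falls back to the local Whisper model
    if provider == "jianying":
        jy = config.get("jianying", {})
        return JianYingASRStub(
            audio_path, jy.get("start_time", 0), jy.get("end_time", 6000)
        )
    if provider == "bcut":
        return BcutASRStub(audio_path)
    raise ValueError(f"Unknown ASR provider: {provider!r}")
```

Raising on an unknown provider, rather than silently falling back to Whisper, is a design choice of this sketch; it surfaces typos in `asr.provider` early.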

Configuration Parameters:

self.config = {
    'asr': {
        'provider': 'whisper',  # Options: whisper, jianying, bcut
        'use_cache': False,
        'need_word_time_stamp': False,
    },
    'jianying': {
        'start_time': 0,
        'end_time': 6000,
    },
    # ... Other configurations
}

2. `config.yaml`

Newly Added Configuration Items:

# ASR (Automatic Speech Recognition) provider configuration
asr:
  provider: "whisper"  # Options: whisper, jianying, bcut
  use_cache: false
  need_word_time_stamp: false

# JianYing (CapCut) speech recognition configuration
jianying:
  start_time: 0      # Audio start time (milliseconds)
  end_time: 6000     # Audio end time (milliseconds)

# Bcut (Bilibili Cut) speech recognition configuration
# No special configuration for now, use default values
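
One way the built-in defaults and a partial `config.yaml` might be combined is a recursive merge, sketched below. How the project actually merges them is not stated here, so `merge_config` and `DEFAULT_CONFIG` are hypothetical names, and `user_config` stands in for the dictionary a YAML parser would return.

```python
import copy

# Defaults mirroring the configuration parameters listed in this document.
DEFAULT_CONFIG = {
    "asr": {"provider": "whisper", "use_cache": False, "need_word_time_stamp": False},
    "jianying": {"start_time": 0, "end_time": 6000},
}


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` on a deep copy of `defaults`."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Stands in for the parsed contents of a config.yaml that sets only two keys.
user_config = {"asr": {"provider": "bcut", "use_cache": True}}
config = merge_config(DEFAULT_CONFIG, user_config)
```

With this approach a user file only needs to list the keys it changes; everything else keeps its default value.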

💻 Usage Examples

1. Use Whisper (Local Model, Default)

import asyncio

from mcp_video_service.services.video_service import VideoService

service = VideoService(config_path='config.yaml')
# Set in config.yaml: asr.provider = "whisper"

async def main():
    return await service.extract_text('audio.mp3')

text = asyncio.run(main())

2. Use JianYing (Online Service)

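# Modify config.yaml
# asr:
#   provider: "jianying"
#   use_cache: false
#   need_word_time_stamp: false
# jianying:
#   start_time: 0
#   end_time: 6000

import asyncio

from mcp_video_service.services.video_service import VideoService

service = VideoService(config_path='config.yaml')

async def main():
    return await service.extract_text('audio.mp3')

text = asyncio.run(main())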

3. Use Bcut (Online Service)

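# Modify config.yaml
# asr:
#   provider: "bcut"
#   use_cache: true
#   need_word_time_stamp: false

import asyncio

from mcp_video_service.services.video_service import VideoService

service = VideoService(config_path='config.yaml')

async def main():
    return await service.extract_text('audio.mp3')

text = asyncio.run(main())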

4. Configure via Environment Variables

# Use JianYing
export ASR_PROVIDER=jianying
export ASR_USE_CACHE=false
export JIANYING_START_TIME=0
export JIANYING_END_TIME=6000

# Use Bcut
export ASR_PROVIDER=bcut
export ASR_USE_CACHE=true

🏗️ Architecture Design

Class Inheritance Relationship

BaseASR (Abstract Base Class)
├── JianYingASR
├── BcutASR
└── VideoService (calls whisper.load_model directly when using Whisper)
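
A shared abstract base class of this kind might look like the sketch below. The method names `run`, `_run`, and `_make_segments`, along with the `EchoASR` toy subclass and its response shape, are assumptions made for illustration and are not taken from the project's `base_asr.py`.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class BaseASRSketch(ABC):
    """Hypothetical outline of a shared ASR interface."""

    def __init__(self, audio_path: str, use_cache: bool = False):
        self.audio_path = audio_path
        self.use_cache = use_cache

    def run(self) -> List[Tuple[str, float, float]]:
        """Template method: call the provider, then normalize its response."""
        raw = self._run()
        return self._make_segments(raw)

    @abstractmethod
    def _run(self) -> dict:
        """Contact the provider and return its raw response."""

    @abstractmethod
    def _make_segments(self, raw: dict) -> List[Tuple[str, float, float]]:
        """Convert a raw response into (text, start_ms, end_ms) tuples."""


class EchoASR(BaseASRSketch):
    """Toy subclass that returns a canned response instead of calling an API."""

    def _run(self) -> dict:
        return {"utterances": [{"text": "hello", "start": 0, "end": 500}]}

    def _make_segments(self, raw: dict) -> List[Tuple[str, float, float]]:
        return [(u["text"], u["start"], u["end"]) for u in raw["utterances"]]
```

Because `run()` lives in the base class, callers can treat every provider the same way, and each subclass only supplies the provider-specific request and response parsing.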

Key Features of BaseASR

Unified Interface: All ASR implementations inherit from BaseASR.
Cache Support: Generates cache keys from CRC32 checksums.
Rate Limiting: A built-in rate-limiting mechanism prevents API overload.
Progress Callback: Supports asynchronous progress callback functions.
Error Handling: Unified error handling and logging.
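
The CRC32-based cache key could be sketched as follows, assuming the checksum is computed over the audio bytes and combined with the provider name. The key format, the in-memory dictionary, and the function names are hypothetical; only the use of CRC32 comes from the feature list above.

```python
import zlib
from typing import Callable, Dict


def crc32_hex(data: bytes) -> str:
    """Return the CRC32 checksum of `data` as 8 lowercase hex digits."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")


def cache_key(provider: str, audio_bytes: bytes) -> str:
    """Build a key that differs per provider and per audio content."""
    return f"{provider}-{crc32_hex(audio_bytes)}"


def transcribe_cached(
    cache: Dict[str, str],
    provider: str,
    audio_bytes: bytes,
    transcribe: Callable[[bytes], str],
) -> str:
    """Return a cached transcript, calling `transcribe` only on a cache miss."""
    key = cache_key(provider, audio_bytes)
    if key not in cache:
        cache[key] = transcribe(audio_bytes)
    return cache[key]
```

CRC32 is fast but not collision-resistant, so it suits deduplicating your own files rather than guarding against deliberately crafted inputs.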

📊 Data Structures

ASRDataSeg (Segmented Data)

from dataclasses import dataclass

@dataclass
class ASRDataSeg:
    text: str          # Segmented text
    start_time: float  # Start time (milliseconds)
    end_time: float    # End time (milliseconds)

ASRData (Complete Data)

from dataclasses import dataclass
from typing import List

@dataclass
class ASRData:
    text: str                    # Complete text
    segments: List[ASRDataSeg]   # List of segments
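
Because segment times are stored in milliseconds, turning segments into subtitles is mostly a formatting step. The sketch below repeats the `ASRDataSeg` definition so that it runs on its own; `format_timestamp` and `to_srt` are hypothetical helpers, not functions from the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ASRDataSeg:
    text: str
    start_time: float  # milliseconds
    end_time: float    # milliseconds


def format_timestamp(ms: float) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(ms))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(segments: List[ASRDataSeg]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg.start_time)
        end = format_timestamp(seg.end_time)
        cues.append(f"{index}\n{start} --> {end}\n{seg.text}\n")
    return "\n".join(cues)
```

For example, `to_srt([ASRDataSeg("Hi", 0, 1500)])` yields a single cue spanning `00:00:00,000 --> 00:00:01,500`.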

📈 Status Management

The status of ASR tasks is managed through the ASRStatus enumeration:

Status	Progress	Description
UPLOADING	20	Uploading
SUBMITTING	40	Submitting
QUERYING_RESULT	60	Querying results
CREATING_TASK	40	Creating a task
TRANSCRIBING	60	Transcribing
COMPLETED	100	Completed
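
A status enumeration with the values in the table above could be sketched as shown below. The class name `ASRStatusSketch`, the tuple layout, and the `report` helper are assumptions rather than the contents of `status.py`. Storing a (description, progress) pair per member matters because Python's `Enum` treats members that share a value as aliases, and two pairs of statuses here share the same progress number.

```python
from enum import Enum
from typing import Callable


class ASRStatusSketch(Enum):
    """Hypothetical status enum; each value is (description, progress)."""

    UPLOADING = ("Uploading", 20)
    SUBMITTING = ("Submitting", 40)
    QUERYING_RESULT = ("Querying results", 60)
    CREATING_TASK = ("Creating a task", 40)
    TRANSCRIBING = ("Transcribing", 60)
    COMPLETED = ("Completed", 100)

    @property
    def description(self) -> str:
        return self.value[0]

    @property
    def progress(self) -> int:
        return self.value[1]


def report(status: ASRStatusSketch, callback: Callable[[int, str], None]) -> None:
    """Forward a status to a progress callback as (percent, message)."""
    callback(status.progress, status.description)
```

A caller can pass any function that accepts a percentage and a message, for example one that updates a progress bar or writes a log line.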

📥 Import Compatibility

All newly added ASR modules support both relative imports and absolute imports:

# Relative import (used within the package)
from .asr_data import ASRDataSeg
from .base_asr import BaseASR

# Absolute import (used when directly loading the module)
from asr_data import ASRDataSeg
from base_asr import BaseASR

This ensures that the modules work properly in different import scenarios.

📦 Dependencies

Newly Added Dependencies

requests - Used for HTTP requests (JianYing and Bcut APIs)

Existing Dependencies

yt-dlp - Video download
whisper - Local speech recognition (only required when using Whisper)
pyyaml - Configuration file parsing

🌐 Environment Variable Support

Environment Variable	Description	Default Value
`ASR_PROVIDER`	ASR provider	whisper
`ASR_USE_CACHE`	Whether to use cache	false
`ASR_WORD_TIME_STAMP`	Whether word-level timestamps are required	false
`JIANYING_START_TIME`	JianYing audio start time	0
`JIANYING_END_TIME`	JianYing audio end time	6000
`WHISPER_MODEL`	Whisper model size	base
`WHISPER_LANGUAGE`	Whisper language	auto
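
Reading these variables with their defaults might be sketched as follows. The variable names and default values come from the table above; the mapping onto configuration sections (for instance `ASR_WORD_TIME_STAMP` to `need_word_time_stamp`, and a `whisper` section with `model` and `language` keys), the set of accepted boolean spellings, and the function names are assumptions.

```python
import os
from typing import Mapping, Optional


def _as_bool(raw: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# name -> (config section, key, converter, default); section and key are assumed.
ENV_SPEC = {
    "ASR_PROVIDER": ("asr", "provider", str, "whisper"),
    "ASR_USE_CACHE": ("asr", "use_cache", _as_bool, False),
    "ASR_WORD_TIME_STAMP": ("asr", "need_word_time_stamp", _as_bool, False),
    "JIANYING_START_TIME": ("jianying", "start_time", int, 0),
    "JIANYING_END_TIME": ("jianying", "end_time", int, 6000),
    "WHISPER_MODEL": ("whisper", "model", str, "base"),
    "WHISPER_LANGUAGE": ("whisper", "language", str, "auto"),
}


def config_from_env(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Build a nested config dict from environment variables and defaults."""
    environ = os.environ if environ is None else environ
    config: dict = {}
    for name, (section, key, convert, default) in ENV_SPEC.items():
        raw = environ.get(name)
        value = default if raw is None else convert(raw)
        config.setdefault(section, {})[key] = value
    return config
```

Passing an explicit mapping instead of relying on `os.environ` makes the function easy to exercise in tests, for example `config_from_env({"ASR_PROVIDER": "bcut"})`.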