K8s GPU MCP Server
An instant SRE diagnostic agent for NVIDIA GPU hardware in Kubernetes clusters, providing real-time hardware detection and fault troubleshooting through the MCP protocol.
What is k8s-gpu-mcp-server?
This is a diagnostic tool designed specifically for NVIDIA GPUs in Kubernetes clusters. It integrates with AI assistants (such as Claude and Cursor) through the Model Context Protocol (MCP), so you can ask the assistant directly about GPU health, temperature, errors, and more, without manually running complex command-line tools.
How to use k8s-gpu-mcp-server?
Install and configure it once in your AI assistant (Claude Desktop or Cursor IDE). After that, asking GPU-related questions is as simple as having a conversation, for example: 'Check the GPU temperature of node gpu-worker-5' or 'Analyze recent GPU errors'.
Applicable scenarios
When running GPU-intensive workloads such as AI training and inference in your Kubernetes cluster, this tool helps you quickly diagnose performance degradation, job failures, or abnormal GPU resource behavior. It is particularly suited to operations teams, AI engineers, and researchers.
Main features
Real-time GPU monitoring
Obtain key metrics such as GPU temperature, power draw, memory usage, and utilization in real time, without installing a separate monitoring system.
Hardware health check
Comprehensively check GPU hardware health, including ECC errors, XID error-code analysis, and thermal-throttling status.
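The XID analysis mentioned above can be pictured with a minimal sketch. The code below is illustrative only, not this tool's actual implementation; the code meanings in the table follow NVIDIA's public XID error documentation for a few common codes.

```python
# Minimal sketch of XID error-code analysis (illustrative, not the
# tool's real implementation). Meanings follow NVIDIA's public XID
# documentation for a handful of common codes.
XID_MEANINGS = {
    13: "Graphics engine exception (often an application fault)",
    31: "GPU memory page fault (usually an application bug)",
    48: "Double-bit ECC error (possible hardware problem)",
    63: "ECC page retirement or row remapping recorded",
    79: "GPU has fallen off the bus (serious hardware/power issue)",
}

def analyze_xid(xid: int) -> str:
    """Return a human-readable description for an XID error code."""
    return XID_MEANINGS.get(
        xid, f"XID {xid}: unknown or uncommon code, consult NVIDIA XID docs"
    )

print(analyze_xid(79))
```

A real diagnostic tool would extract these codes from the kernel log or driver events before mapping them to descriptions and suggested actions.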
Kubernetes integration
Automatically correlate GPU hardware with Kubernetes Pods, so you can see which Pod is using which GPU and how resources are allocated.
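The GPU-to-Pod correlation can be sketched as a simple mapping inversion. The data structure and function names below are assumptions for illustration; in a real cluster this allocation data would typically come from the kubelet's pod-resources API or the device plugin, not a hard-coded dict.

```python
# Sketch of correlating GPUs with Pods (illustrative; field names and
# the allocation source are assumptions, not the tool's actual API).
from typing import Dict, List, Tuple

# Hypothetical allocation data: GPU UUID -> (namespace, pod name)
allocations: Dict[str, Tuple[str, str]] = {
    "GPU-1a2b": ("ml-team", "trainer-0"),
    "GPU-3c4d": ("ml-team", "trainer-1"),
}

def pods_using_gpus(alloc: Dict[str, Tuple[str, str]]) -> Dict[str, List[str]]:
    """Invert the mapping: 'namespace/pod' -> list of GPU UUIDs it holds."""
    by_pod: Dict[str, List[str]] = {}
    for uuid, (ns, pod) in alloc.items():
        by_pod.setdefault(f"{ns}/{pod}", []).append(uuid)
    return by_pod

print(pods_using_gpus(allocations))
```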
AI assistant friendly
Designed for AI assistants such as Claude and Cursor, so complex GPU diagnostics can be driven entirely through natural language.
Historical data recording
A built-in flight recorder continuously records GPU metrics, so GPU status at past points in time can be queried.
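A flight recorder of this kind is essentially a bounded ring buffer of timestamped samples that supports nearest-timestamp lookup. The sketch below shows the idea; the class and method names are assumptions for illustration, not the tool's actual code.

```python
# Sketch of a "flight recorder" ring buffer for GPU metrics
# (illustrative; names and structure are assumptions).
from collections import deque
from bisect import bisect_left

class FlightRecorder:
    def __init__(self, capacity: int = 1000):
        # deque(maxlen=...) silently evicts the oldest sample when full
        self.samples = deque(maxlen=capacity)  # (timestamp, metrics dict)

    def record(self, ts: float, metrics: dict) -> None:
        self.samples.append((ts, metrics))

    def at(self, ts: float) -> dict:
        """Return the recorded sample closest to the requested timestamp."""
        times = [t for t, _ in self.samples]
        i = bisect_left(times, ts)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        best = min(candidates, key=lambda j: abs(times[j] - ts))
        return self.samples[best][1]

rec = FlightRecorder(capacity=3)
rec.record(1.0, {"temp_c": 65})
rec.record(2.0, {"temp_c": 80})
rec.record(3.0, {"temp_c": 72})
print(rec.at(1.9))  # nearest sample is the one recorded at t=2.0
```

The fixed capacity is what keeps idle memory usage low: old samples are dropped automatically instead of growing without bound.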
Safe read-only mode
Runs in read-only mode by default and makes no modifications to the GPU or the system, keeping production environments safe.
Advantages
No need to learn complex commands: Interact with the AI assistant in natural language
Rapid deployment: One-click installation, usable within minutes
Low resource consumption: Only 15–20 MB of memory when idle
Production-ready: Tested on real Tesla T4 GPUs
Open source and free: Released under the Apache 2.0 license
Multi-platform support: Works with Claude Desktop, Cursor IDE, and more
Limitations
NVIDIA GPUs only: Does not support AMD or other vendors' GPUs
Requires NVIDIA drivers: Depends on the NVML library and correctly installed drivers
Kubernetes environment: Designed mainly for K8s clusters; limited for single-machine use
Read-only diagnostics: The current version is primarily a diagnostic tool, and repair operations are limited
How to use
Installation and configuration
Add the MCP server settings to the configuration file of the AI assistant you are using (Claude Desktop or Cursor).
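For orientation, MCP servers are typically registered in the assistant's configuration file (for Claude Desktop this is claude_desktop_config.json). The server name and binary path below are placeholders, not this project's documented values; consult its quick start guide for the exact command.

```json
{
  "mcpServers": {
    "k8s-gpu": {
      "command": "/usr/local/bin/k8s-gpu-mcp-server"
    }
  }
}
```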
Start the AI assistant
Restart your AI assistant (Claude Desktop or Cursor IDE) to make the configuration take effect.
Start the conversation
Directly ask GPU-related questions in the AI assistant, and the assistant will automatically call the corresponding diagnostic tools.
View the results
The AI assistant will display the diagnostic results in a clear and understandable format, including problem analysis and suggestions.
Usage examples
Case 1: Diagnose training task failure
AI training jobs repeatedly fail on a specific node, and you need to quickly pinpoint any GPU hardware problem.
Case 2: Monitor GPU temperature
Server-room temperatures rise in summer, and you need to check whether GPUs are overheating and throttling performance.
Case 3: Troubleshoot resource contention
Multiple teams report insufficient GPU resources, and you need to see the actual usage.
Case 4: Analyze historical problems
There was a brief GPU failure last night, and you need to inspect the GPU status at the time of the failure.
Frequently Asked Questions
Do I need to have an NVIDIA GPU to use this tool?
Is this tool safe? Will it affect the production environment?
Which AI assistants are supported?
Do I need to install it on each Kubernetes node?
Will the diagnostic data be sent to the cloud?
How to update to the new version?
Related resources
GitHub repository
Source code, issue tracking, and contribution guidelines
Quick start guide
Detailed steps to get started in 5 minutes
Kubernetes deployment guide
A complete guide for deploying in a production K8s cluster
Model Context Protocol official website
Understand the technical details and specifications of the MCP protocol
Helm Chart repository
Use Helm to deploy to Kubernetes with one click
