Forge MCP Server: Optimize PyTorch Models with 32 Parallel AI Agents, MCP Tool Compatible with Multiple AI Assistants

Forge MCP Server

Forge MCP Server is a tool that automatically optimizes PyTorch models into high-performance CUDA/Triton kernels through 32 parallel AI agents, increasing inference speed by up to 14x and supporting multiple MCP-compatible AI programming assistants.

Developer tools Artificial intelligence chatbots #GPU optimization #AI agents #Performance acceleration #PyTorch .TypeScript

rating : 2.5 points

downloads : 7.5K

update time : 2026-03-13

Open Site

What is Forge MCP Server?

Forge MCP Server is a bridge connecting AI programming assistants with GPU optimization services. It allows you to automatically convert PyTorch code into optimized high-performance GPU kernels through AI assistants such as Claude, Cursor, and VS Code. The service uses 32 parallel AI agents to test and optimize code on real data center GPUs, ensuring optimal performance.

How to use Forge MCP Server?

Using Forge is very simple: 1) Install Forge MCP Server in your AI assistant; 2) Log in and authenticate with one click through the browser; 3) Submit PyTorch code or describe the operations you need; 4) Wait for Forge to optimize and return high-performance kernel code. The entire process is fully automated, eliminating the need to manually write CUDA code.

Applicable scenarios

Forge is most suitable for the following scenarios: accelerating custom PyTorch operations, optimizing the performance of existing GPU kernels, generating new high-performance kernels from scratch, converting research code into production-level performance, and achieving optimal performance on various GPU architectures. It is particularly suitable for deep learning researchers, AI engineers, and developers who need high-performance inference.

Main features

Automatic code optimization

Submit PyTorch code to automatically convert it into optimized Triton or CUDA kernels. Compare with the torch.compile(max-autotune) benchmark to ensure performance improvement.

Intelligent kernel generation

Automatically generate high-performance GPU kernel code by describing operation requirements in natural language, supporting multiple data formats and GPU architectures.

Optimization with 32 parallel AI agents

Use 32 Coder+Judge agents to explore different optimization strategies in parallel, including tensor core utilization, memory coalescing, shared memory tiling, and kernel fusion.

Real GPU benchmark testing

All kernels are compiled, tested for correctness, and analyzed for performance on real data center GPUs, supporting various GPUs such as B200, H200, H100, and A100.

Intelligent detection and suggestions

Automatically identify optimization opportunities in the code, such as custom operations, attention mechanisms, and fused operations, and provide optimization suggestions.

One-click browser authentication

There is no need to manage API keys. Log in securely in the browser through the OAuth 2.0 PKCE process, and the token is automatically refreshed.

Advantages

Up to 14x performance improvement, surpassing torch.compile(max-autotune)

Supports multiple mainstream AI programming assistants (Claude, Cursor, VS Code, etc.)

Tested on real data center GPUs, with reliable results

100% numerical correctness guarantee

Fast result return (in minutes rather than hours)

Obtain high-performance kernels without CUDA programming experience

Pay-as-you-go, with transparent costs

Free trial for one kernel optimization

Limitations

Requires an internet connection to access the optimization service

Each optimization consumes 1 credit (only charged for successful optimizations)

The maximum code input limit is 500KB

Only supports PyTorch code optimization

Requires modern GPU support (T4 and above)

How to use

Install Forge MCP Server

Add Forge MCP Server to the configuration file according to the AI assistant you are using. For example, edit the claude_desktop_config.json file in Claude Desktop.

Authentication and login

Call the forge_auth tool in the AI assistant, and the system will automatically open the browser to complete the login. You only need to log in once, and the token will be automatically refreshed.

Submit code for optimization

When the AI assistant detects PyTorch code that can be optimized, it will automatically call the forge_optimize tool. You can also manually describe the operations to be optimized.

Get optimization results

Wait for the optimization to complete (usually a few minutes), and the system will return the optimized kernel code, performance improvement data, and integration suggestions.

Integrate the optimized code

Integrate the returned optimized kernel code into your project, replacing the original PyTorch implementation, and enjoy the performance improvement.

Usage examples

Optimize custom attention mechanism

Researchers have developed a new attention variant, but the PyTorch implementation is slow. Use Forge to automatically optimize it into a high-performance Triton kernel.

Generate a fused normalization layer

It is necessary to fuse LayerNorm, GELU activation, and Dropout into a single efficient kernel to reduce memory access.

Accelerate existing model inference

The inference speed of the model in the production environment does not meet the requirements, and it is necessary to optimize the key computational bottlenecks.

Frequently Asked Questions

How long does Forge optimization take?

Will I be charged if the optimization fails?

Which GPUs are supported?

How to ensure the correctness of the optimized code?

How large a code can be optimized?

How to purchase credits?

Is local deployment supported?

Are there any licensing restrictions on the optimized code?

Related resources

Official documentation

Detailed technical documentation and benchmark test results of the Forge service

GitHub repository

Source code and issue tracking for Forge MCP Server

MCP protocol official website

Official documentation and specifications of the Model Context Protocol

RightNow AI console

Purchase credits, view usage statistics, and manage accounts

NVIDIA Inception program

RightNow AI is a member of the NVIDIA Inception program

🚀 Forge MCP Server

Swarm agents that turn slow PyTorch into fast CUDA/Triton kernels, from any AI coding agent.

🚀 Quick Start

Forge transforms PyTorch models into production - grade CUDA/Triton kernels through automated multi - agent optimization. Using 32 parallel AI agents with inference - time scaling, it achieves up to 14x faster inference than torch.compile(mode='max - autotune - no - cudagraphs') while maintaining 100% numerical correctness.

This MCP server connects any MCP - compatible AI coding agent to Forge. Your agent submits PyTorch code, Forge optimizes it with swarm agents on real datacenter GPUs, and returns the fastest kernel as a drop - in replacement.

✨ Features

What it does

Optimize existing kernels - Submit PyTorch code, get back an optimized Triton/CUDA kernel benchmarked against torch.compile(max - autotune).
Generate new kernels - Describe an operation (e.g. "fused LayerNorm + GELU + Dropout"), get a production - ready optimized kernel.
32 parallel swarm agents - Coder+Judge agent pairs compete to discover optimal kernels, exploring tensor core utilization, memory coalescing, shared memory tiling, and kernel fusion simultaneously.
Real datacenter GPU benchmarking - Every kernel is compiled, tested for correctness, and profiled on actual datacenter hardware.
250k tokens/sec inference - Results in minutes, not hours.
Smart detection - The agent automatically recognizes when your code would benefit from GPU optimization.
One - click auth - Browser - based OAuth sign - in. No API keys to manage.

Supported GPUs

All optimization and benchmarking runs on datacenter - grade hardware:

GPU	Architecture
B200	Blackwell
H200	Hopper
H100	Hopper
L40S	Ada Lovelace
A100	Ampere
L4	Ada Lovelace
A10	Ampere
T4	Turing

Supported clients

Client	Status
Claude Code	Fully supported
Claude Desktop	Fully supported
OpenCode	Fully supported
Cursor	Fully supported
Windsurf	Fully supported
VS Code + Copilot	Fully supported
Any MCP client	Fully supported via stdio

📦 Installation

Claude Code

macOS / Linux:

claude mcp add forge - mcp -- npx -y @rightnow/forge - mcp - server

Windows:

claude mcp add forge - mcp -- cmd /c npx -y @rightnow/forge - mcp - server

Claude Desktop

Add to your claude_desktop_config.json:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "forge": {
      "command": "npx",
      "args": ["-y", "@rightnow/forge - mcp - server"]
    }
  }
}

Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "forge": {
      "command": "cmd",
      "args": ["/c", "npx", "-y", "@rightnow/forge - mcp - server"]
    }
  }
}

VS Code / Copilot

Add to your .vscode/mcp.json (workspace) or user settings:

{
  "servers": {
    "forge": {
      "command": "npx",
      "args": ["-y", "@rightnow/forge - mcp - server"]
    }
  }
}

Windows: Use "command": "cmd" with "args": ["/c", "npx", "-y", "@rightnow/forge - mcp - server"]

Cursor

Add to your Cursor MCP settings (~/.cursor/mcp.json):

{
  "mcpServers": {
    "forge": {
      "command": "npx",
      "args": ["-y", "@rightnow/forge - mcp - server"]
    }
  }
}

Windows: Use "command": "cmd" with "args": ["/c", "npx", "-y", "@rightnow/forge - mcp - server"]

Windsurf

Add to your Windsurf MCP configuration:

{
  "mcpServers": {
    "forge": {
      "command": "npx",
      "args": ["-y", "@rightnow/forge - mcp - server"]
    }
  }
}

Windows: Use "command": "cmd" with "args": ["/c", "npx", "-y", "@rightnow/forge - mcp - server"]

OpenCode

Add to your opencode.json:

{
  "mcp": {
    "forge": {
      "command": "npx",
      "args": ["-y", "@rightnow/forge - mcp - server"]
    }
  }
}

💻 Tools

`forge_auth`

Authenticate with the Forge service. Opens your browser to sign in via the RightNow dashboard. Required before using any other tool.

Inputs:
- force (boolean, optional): Force re - authentication even if valid tokens exist.
Returns: Authentication status, email, plan type, and credit balance.

`forge_optimize`

Submit PyTorch code for GPU kernel optimization. 32 swarm agents generate optimized Triton or CUDA kernels, evaluate them on real datacenter GPUs, and return the best result with speedup metrics.

The agent will automatically use this tool when it detects:

PyTorch custom operations (torch.autograd.Function, custom forward/backward).
Manual CUDA kernels that could be faster.
Performance - critical tensor operations (attention, convolution, normalization, softmax).
Code with comments like "slow", "bottleneck", "optimize".
torch.compile() targets or triton.jit kernels.
Any nn.Module with significant compute in forward().
Matrix multiplication, reduction, or scan operations.
Custom loss functions with reduction operations.
Fused operation opportunities (e.g., LayerNorm + activation).
Inputs:
- pytorch_code (string, required): Complete PyTorch code to optimize. Max 500 KB.
- kernel_name (string, required): Short name for the kernel (e.g., "flash_attention").
- output_format (enum, optional): "triton" (default) or "native_cuda".
- target_speedup (number, optional): Target speedup multiplier. Default 2.0.
- max_iterations (number, optional): Max optimization iterations (1 - 100). Default 10.
- gpu (enum, optional): Target GPU. Default "H100". Options: B200, H200, H100, L40S, A100, L4, A10, T4.
- user_prompt (string, optional): Guidance for the optimizer (e.g., "focus on memory bandwidth").
Returns: Optimized kernel code, speedup metrics, latency comparison, iteration history.

`forge_generate`

Generate an optimized GPU kernel from scratch based on a natural - language specification. Forge creates a PyTorch baseline, then optimizes it into Triton or CUDA.

Inputs:
- operation (string, required): Operation name (e.g., "fused_attention", "softmax").
- description (string, required): Detailed description of what the kernel should do.
- input_shapes (number[][], required): Input tensor shapes (e.g., [[8, 512, 768]]).
- output_shape (number[], optional): Expected output shape.
- dtype (string, optional): Data type. Default "float16".
- output_format (enum, optional): "triton" (default) or "native_cuda".
- target_speedup (number, optional): Target speedup. Default 2.0.
- max_iterations (number, optional): Max iterations (1 - 100). Default 10.
- gpu (enum, optional): Target GPU. Default "H100".
- user_prompt (string, optional): Additional guidance.
Returns: Generated kernel code, speedup metrics, iteration history.

`forge_credits`

Check your current Forge credit balance.

Inputs: None.
Returns: Credit balance, total purchased, total used, plan type.

`forge_status`

Check the status of a running or completed optimization job.

Inputs:
- session_id (string, required): Session ID from forge_optimize or forge_generate.
Returns: Job status, current iteration, best speedup.

`forge_cancel`

Cancel a running optimization job.

Inputs:
- session_id (string, required): Session ID of the job to cancel.
Returns: Cancellation confirmation.

`forge_sessions`

List past optimization sessions with results.

Inputs:
- limit (number, optional): Number of sessions to return (1 - 100). Default 10.
- status (enum, optional): Filter by status: "all", "completed", "failed", "running". Default "all".
Returns: Table of sessions with task name, GPU, speedup, status, and date.

Tool Annotations

Tool	Read - only	Idempotent	Destructive
`forge_auth`	No	Yes	No
`forge_optimize`	No	No	No
`forge_generate`	No	No	No
`forge_credits`	Yes	Yes	No
`forge_status`	Yes	Yes	No
`forge_cancel`	No	No	Yes
`forge_sessions`	Yes	Yes	No

📚 Resources

URI	Description
`forge://auth/status`	Current authentication state (authenticated, token expiry, has refresh token)
`forge://credits`	Credit balance, usage, and plan information

📚 Prompts

`forge-optimize`

Guided workflow for optimizing a GPU kernel. Instructs the agent to:

Check credit balance.
Analyze the code for optimization targets.
Call forge_optimize with appropriate parameters.
Explain the results and suggest integration.

`forge-analyze`

Teaches the agent to scan a codebase for GPU optimization opportunities, ranked by expected impact:

Priority	Pattern
HIGH	Custom autograd functions, attention mechanisms, fused operations
MEDIUM	Standard `nn.Module` compositions, normalization + activation fusion
LOW	Element - wise operations, simple reductions

🔧 How It Works

┌──────────────┐     stdio      ┌──────────────────┐     HTTPS      ┌──────────────────┐
│  AI Agent    │ ──────────────>│  Forge MCP       │ ──────────────>│  Forge API       │
│  (Claude,    │                │  Server          │                │  (RightNow AI)   │
│   Cursor,    │<──────────────│                  │<──────────────│                  │
│   etc.)      │   MCP result   │  - OAuth + PKCE  │   SSE stream   │  - 32 swarm      │
└──────────────┘                │  - SSE streaming │                │    agents        │
                                │  - Token mgmt    │                │  - Real GPU      │
                                └──────────────────┘                │    benchmarking  │
                                                                    └──────────────────┘

Authenticate: The agent calls forge_auth, which opens your browser. Sign in once, tokens are stored locally at ~/.forge/tokens.json and auto - refresh.
Optimize: The agent sends your PyTorch code via forge_optimize. The MCP server POSTs to the Forge API and streams SSE events in real time.
Benchmark: 32 parallel Coder+Judge agents generate kernels, compile them, test correctness against the PyTorch reference, and profile performance on real datacenter GPUs.
Return: The MCP server collects all results and returns the optimized code, speedup metrics, and iteration history. The output is a drop - in replacement for your original code.

Each optimization costs 1 credit. Credits are only charged for successful runs (speedup >= 1.1x). Failed runs and cancelled jobs are not charged.

📚 Configuration

Authentication

No API keys needed. The server uses OAuth 2.0 with PKCE for secure browser - based authentication:

Agent calls forge_auth.
Your default browser opens to dashboard.rightnowai.co.
Sign in or create an account.
Authorization completes automatically.
Tokens are stored locally at ~/.forge/tokens.json (mode 0600).
Access tokens auto - refresh, you only sign in once.

Credits

Forge uses a pay - as - you - go credit system. Each optimization or generation run costs 1 credit.

Credits	Price	Per Credit
1 - 9	$15.00 each	$15.00
10+	25% off	$11.25
50	$562.50	$11.25
Enterprise	Custom volume pricing	Contact us

Free trial: optimize 1 kernel, no credit card required.

100% refund guarantee: if Forge doesn't beat torch.compile, you get your credit back.

Purchase credits at dashboard.rightnowai.co.

📊 Benchmarks

End - to - end latency on NVIDIA B200. Forge vs torch.compile(mode='max - autotune - no - cudagraphs'):

Model	torch.compile	Forge	Speedup
Llama - 3.1 - 8B	42.3ms	8.2ms	5.16x
Qwen2.5 - 7B	38.5ms	9.1ms	4.23x
Mistral - 7B	35.2ms	10.4ms	3.38x
Phi - 3 - mini	18.7ms	6.8ms	2.75x
SDXL UNet	89.4ms	31.2ms	2.87x
Whisper - large	52.1ms	19.8ms	2.63x
BERT - large	12.4ms	5.1ms	2.43x

See the full benchmarks at rightnowai.co/forge.

🔒 Security

Token Protection

No tokens in errors: All error messages are sanitized through regex filters that strip JWTs, Bearer tokens, hex tokens, and credential parameters before reaching the agent.
Local storage only: Tokens are stored at ~/.forge/tokens.json with file mode 0600 (owner read/write only).
Auto - refresh: Access tokens expire in 1 hour and auto - refresh using the stored refresh token.
PKCE flow: OAuth uses Proof Key for Code Exchange (SHA - 256), preventing authorization code interception.
No secrets in config: The MCP server requires zero environment variables or API keys.

Input Validation

PyTorch code input is capped at 500 KB to prevent memory exhaustion.
User prompts are capped at 10 KB.
All string inputs have maximum length validation via Zod schemas.
Numeric inputs have min/max bounds (e.g., max_iterations: 1 - 100).

Network Security

All API communication uses HTTPS.
Non - SSE requests have a 30 - second timeout to prevent hanging.
SSE streams have a 10 - minute timeout with automatic cleanup.
Token refresh uses a mutex to prevent race conditions from concurrent requests.

What the server can access

Network: Only dashboard.rightnowai.co and forge - api.rightnowai.co.
Filesystem: Only reads/writes ~/.forge/tokens.json.
No codebase access: The MCP server never reads your files. The agent passes code to it explicitly through tool parameters.

🛠️ Development

Build from source

git clone https://github.com/RightNow - AI/forge - mcp - server.git
cd forge - mcp - server
npm install
npm run build

Run locally

npm run dev

Type check

npm run typecheck

Debug with MCP Inspector

npx @modelcontextprotocol/inspector node dist/index.js

This opens a web UI where you can invoke each tool, inspect inputs/outputs, and debug the server interactively.

Project structure

forge - mcp - server/
├── src/
│   ├── index.ts              # Entry point (McpServer + StdioServerTransport)
│   ├── server.ts             # Registers all tools, resources, prompts
│   ├── constants.ts          # URLs, client IDs, timeouts, limits
│   ├── types.ts              # TypeScript interfaces + type guards + sanitization
│   ├── auth/
│   │   ├── oauth - client.ts   # PKCE flow, token refresh, access token management
│   │   └── token - store.ts    # ~/.forge/tokens.json read/write/clear
│   ├── api/
│   │   ├── forge - client.ts   # HTTP client for all Forge API endpoints
│   │   └── sse - consumer.ts   # SSE stream parser via native fetch + ReadableStream
│   ├── tools/                # 7 MCP tools
│   ├── resources/            # 2 MCP resources
│   └── prompts/              # 2 MCP prompts
├── .github/workflows/
│   ├── ci.yml                # Typecheck + build on push/PR
│   └── release.yml           # npm publish on version tags
├── package.json
├── tsconfig.json
└── tsup.config.ts