🚀 AutoPentest
An agentic pentesting MCP server that automates web application penetration testing using the full OWASP Web Security Testing Guide and PortSwigger Web Security Academy technique references.
Point it at a target — it crawls your app, maps every endpoint, then spawns role-specialized agents (Scout, Analyzer, Exploiter, Reporter) to test for XSS, SQLi, SSRF, SSTI, IDOR and more. No false positives — every finding is backed by real, reproducible evidence with quality gates enforcing proof at every phase. Includes 31 PortSwigger technique guides, adaptive WAF evasion for 12 vendors, cross-phase vulnerability chaining, and risk-weighted endpoint prioritization. Run it with Claude Code, the API, or go fully offline using Ollama models.
Think of it as: A senior pentester's methodology encoded into an MCP server — 109 OWASP tests, 31 PortSwigger attack technique guides, 68+ MCP tools, 27 security tools, 4 specialized agent roles, 7 structured phases, automated quality assurance, and a zero-context final review.
🚀 Quick Start
Prerequisites
Installation
git clone https://github.com/bhavsec/autopentest-ai.git
cd autopentest-ai
cd server && uv sync && cd ..
make setup
That's it. All 27 security tools are now installed and ready inside the Docker container.
Verify Installation
make verify-tools
Start Testing
claude
Then tell Claude what to test:
Run a full WSTG assessment against https://target.example.com
✨ Features
Comprehensive OWASP Coverage
- 109 WSTG test cases across 12 categories — from information gathering to API testing.
- Each test includes step-by-step CLI procedures, context-specific payloads, detection criteria, and severity rubrics.
- Tests are prioritized (MUST/SHOULD) with conditional triggers so nothing relevant is skipped.
31 PortSwigger Attack Technique Guides
- Sourced from PortSwigger Web Security Academy — detection methods, exploitation techniques, payloads, cheat sheets, and WAF bypass patterns.
- Organized by vulnerability class (SQLi, XSS, SSRF, JWT, OAuth, etc.) for direct use during testing.
- Integrated into every testing phase — agents automatically load the relevant technique guide before testing each vulnerability class.
- Database/platform-specific payload tables (Oracle vs MySQL vs PostgreSQL vs MSSQL for SQLi, Jinja2 vs Twig vs Freemarker for SSTI, etc.).
- WAF bypass patterns organized by bypass level (basic → intermediate → advanced).
27 Pre-Configured Security Tools
- All tools pre-installed in a single Docker image — `make setup` and you're ready.
- Tools organized by phase: discovery, injection testing, authentication, cryptography, API testing.
- Automatic Burp Suite proxy integration for passive traffic monitoring.
Structured 7-Phase Workflow
- Phase 0: Application Discovery & Mapping
- Phase 1: Information Gathering & Reconnaissance
- Phase 2: Configuration & Deployment Testing
- Phase 3: Identity, Authentication, Authorization & Session Management
- Phase 4: Input Validation Testing (parallel XSS/SQLi/SSRF pipelines)
- Phase 5: Error Handling, Cryptography, Business Logic, Client-Side & API Testing
- Phase 6: Coverage Verification & Reporting
- Phase 7: Final Judge Review & Remediation
Quality Assurance System
- Automated phase gates — each phase must pass quality checks before proceeding.
- Quality Reviewer subagent at every phase transition identifies gaps and suggests improvements.
- Final Judge — a zero-context agent reviews the entire engagement cold, like an external QA reviewer.
- Exhaustion gates — "not vulnerable" requires proof of sufficient testing effort (minimum techniques and bypass attempts).
Evidence-Based Findings
- Every finding requires reproducible curl commands and full request/response evidence.
- Three-tier classification: EXPLOITED (proven impact), POTENTIAL (blocked by control), FALSE_POSITIVE (control holds).
- Anti-hallucination framework — "no exploit = no finding" enforced at every level.
- Evidence checklists per vulnerability class verified before any finding is logged.
Role-Specialized Subagents
- 4 dedicated roles with focused prompt templates, tool guidance, and anti-patterns:
- Scout — reconnaissance only, maps attack surface without sending payloads (Phases 0-1).
- Analyzer — identifies potential sinks with canary/witness payloads, builds exploitation queues (Phases 2-5 analysis).
- Exploiter — consumes Analyzer output, proves exploitation with evidence, logs confirmed findings (Phase 4 exploitation).
- Reporter — quality review and Final Judge, reviews data without sending requests (QA + post-report).
- Validation checkpoint between analysis and exploitation prevents wasted effort.
- Each role has explicit allowed/restricted tool lists and input/output contracts.
Pipelined Exploitation (Phase 4)
- 3 independent two-stage pipelines run in parallel: XSS, Injection (SQLi/CMDi), SSRF/SSTI.
- Each pipeline: Analyzer (discover → analyze → queue) → validation checkpoint → Exploiter (exploit → log).
- Each pipeline loads its PortSwigger technique guide for detection methods, cheat sheets, and WAF bypass patterns.
- WAF intelligence shared across all pipelines.
- Context-aware witness payloads for 13 sink types.
Adaptive WAF Evasion
- Automatic WAF fingerprinting from response headers, body, and status codes — identifies 12 WAF vendors (Cloudflare, AWS WAF, Akamai, Imperva, ModSecurity, F5, FortiWeb, Sucuri, Barracuda, Wordfence, NAXSI, Citrix).
- Vendor-specific bypass payloads organized by complexity level (basic → intermediate → advanced).
- WAF intelligence shared across all agents via deliverable system.
- Agents automatically identify WAF on first block response and switch to tailored bypass payloads.
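The fingerprint-then-switch behavior can be sketched as follows. This is a minimal illustration: the signature table, names, and matching rules below are assumptions based on common public WAF indicators, not AutoPentest's actual detection logic.

```python
# Hypothetical WAF fingerprinting sketch keyed on response signals.
# Signatures here are well-known public indicators, used for illustration.
WAF_SIGNATURES = {
    "cloudflare": lambda h, b: "cloudflare" in h.get("Server", "").lower()
                  or "cf-ray" in {k.lower() for k in h},
    "aws-waf":    lambda h, b: "x-amzn-requestid" in {k.lower() for k in h},
    "modsecurity": lambda h, b: "mod_security" in b.lower() or "modsecurity" in b.lower(),
}

def fingerprint_waf(headers, body):
    """Return the list of WAF vendors whose signature matches the response."""
    return [name for name, match in WAF_SIGNATURES.items() if match(headers, body)]

# A blocked response carrying Cloudflare's telltale CF-RAY header:
hits = fingerprint_waf({"Server": "cloudflare", "CF-RAY": "abc123"}, "")
```

Once a vendor is identified, agents would select that vendor's bypass payload set instead of generic ones.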
Cross-Phase Knowledge Graph
- Entity-relationship graph tracks endpoints, parameters, technologies, findings, cookies, domains, and user roles.
- Automated vulnerability chaining via BFS path finding with 7 predefined chain patterns:
- XSS + missing CSP, XSS + weak cookie (no HttpOnly), Open redirect + OAuth callback.
- IDOR + admin role, SSRF + cloud metadata, No lockout + no MFA, CORS + sensitive endpoint.
- Severity upgrades when chaining materially increases impact.
- Populated throughout testing, queried after Phase 4 for chain discovery.
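Chain discovery over the graph can be sketched as a label-guided BFS. The node naming, edge labels, and pattern encoding below are illustrative, not the real AutoPentest API; the XSS + missing CSP pattern mirrors the example above.

```python
from collections import defaultdict, deque

class KnowledgeGraph:
    """Toy entity-relationship graph: nodes joined by labeled edges."""
    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(label, node)]

    def add(self, src, label, dst):
        self.edges[src].append((label, dst))

    def find_chain(self, start, pattern):
        """BFS for a path whose edge labels match `pattern` in order."""
        queue = deque([(start, 0, [start])])
        while queue:
            node, idx, path = queue.popleft()
            if idx == len(pattern):
                return path          # full pattern matched
            for label, nxt in self.edges[node]:
                if label == pattern[idx]:
                    queue.append((nxt, idx + 1, path + [nxt]))
        return None

g = KnowledgeGraph()
g.add("finding:xss-search", "affects", "endpoint:/search")
g.add("endpoint:/search", "lacks", "header:CSP")
# "XSS + missing CSP" as an edge-label pattern:
chain = g.find_chain("finding:xss-search", ["affects", "lacks"])
```

A matched chain would then trigger the severity-upgrade logic described above.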
Hierarchical Task Tree
- Persistent tree structure (phases as branches, tests as leaves) prevents LLM depth-first bias and context loss.
- Main agent maintains strategic macro view; subagents update only their assigned leaf nodes.
- Auto-propagation: when all children complete, parent auto-completes.
- Phase-level completion percentages for informed decision-making.
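The auto-propagation rule can be illustrated with a small sketch; the class and method names are hypothetical, not AutoPentest's actual tree implementation.

```python
class TaskNode:
    """Toy task tree: phases are branches, tests are leaves."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.done = False

    def complete_leaf(self):
        self.done = True

    def refresh(self):
        """Bottom-up pass: a parent auto-completes when all children are done."""
        if self.children:
            for c in self.children:
                c.refresh()
            self.done = all(c.done for c in self.children)

    def percent_complete(self):
        if not self.children:
            return 100.0 if self.done else 0.0
        return sum(c.percent_complete() for c in self.children) / len(self.children)

phase = TaskNode("Phase 2", [TaskNode("CONF-01"), TaskNode("CONF-13")])
phase.children[0].complete_leaf()
phase.refresh()
```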
Endpoint Risk Prioritization
- Score and sort endpoints by risk for prioritized testing — highest risk tested first.
- Scoring factors: parameter count, technology risk indicators, taint chain confidence, tool convergence, auth requirements, injectable parameter names.
- Integrated into Phase 0 endpoint map generation.
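A simplified version of the scoring idea is sketched below. The factor weights, field names, and the "risky parameter" list are assumptions for illustration, not AutoPentest's actual formula.

```python
# Hypothetical weights; real scoring also considers technology risk
# indicators and taint chain confidence per the factors listed above.
RISKY_PARAM_NAMES = {"id", "q", "url", "file", "redirect", "path"}

def risk_score(endpoint):
    params = endpoint.get("params", [])
    score = 2.0 * len(params)                                  # parameter count
    score += 5.0 * sum(1 for p in params if p in RISKY_PARAM_NAMES)
    score += 3.0 * len(endpoint.get("tool_hits", []))          # tool convergence
    if not endpoint.get("requires_auth", True):
        score += 4.0                                           # open attack surface
    return score

endpoints = [
    {"path": "/search", "params": ["q"], "tool_hits": ["ffuf", "katana"], "requires_auth": False},
    {"path": "/about", "params": [], "tool_hits": [], "requires_auth": False},
]
ranked = sorted(endpoints, key=risk_score, reverse=True)  # highest risk first
```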
Tool Output Parsing
- 13 built-in parsers for common CLI tools (nmap, nuclei, sqlmap, ffuf, httpx, whatweb, testssl, nikto, dalfox, katana, gau, wapiti, commix).
- Condenses raw tool output 3-5x while preserving key findings, endpoints, and errors.
- Configurable verbosity: summary (~15 lines), detailed (~50 lines), full (complete parsed output).
CLI Tool Results Verification
- Automatic validation of CLI tool output quality — detects empty output, proxy errors, permission issues, and suspicious results.
- 10 per-tool validators (nmap, nuclei, sqlmap, ffuf, feroxbuster, testssl, dalfox, wapiti, katana, httpx) with corrected command suggestions.
- When a tool produces empty or suspicious output, the validator suggests fixes (e.g., add `-Pn` for nmap, remove proxy env vars, try different flags).
- Integrated into the tool execution workflow — agents call `verify_tool_result()` after every CLI tool run.
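A minimal sketch of what such a validator might look like. The heuristics and suggestion strings are illustrative; only the `-Pn` hint mirrors the example above.

```python
def verify_tool_result(tool, output):
    """Flag empty output, down-host reports, and proxy errors (illustrative)."""
    issues, suggestions = [], []
    if not output.strip():
        issues.append("empty output")
        suggestions.append("re-run with increased verbosity or different flags")
    if tool == "nmap" and "Host seems down" in output:
        issues.append("host reported down (likely ICMP filtering)")
        suggestions.append("add -Pn to skip host discovery")
    if "proxyconnect" in output or "Connection refused" in output:
        issues.append("proxy error")
        suggestions.append("unset HTTP_PROXY/HTTPS_PROXY and retry")
    return {"ok": not issues, "issues": issues, "suggestions": suggestions}

result = verify_tool_result("nmap", "Note: Host seems down. If it is really up, try -Pn")
```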
Progressive Context Compression
- Phase summaries (~500-800 words) auto-generated when phase gates pass — capturing findings, coverage, tool results, and attack surface in compressed form.
- Prevents context degradation in long-running engagements by replacing raw historical data with structured summaries.
- `get_engagement_summary()` combines all phase summaries into a single overview for injecting into new subagent prompts.
- Summaries stored as deliverables — accessible by any downstream agent without requiring full engagement history.
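In sketch form, `get_engagement_summary()` might stitch phase summaries together like this; the output formatting and data shape are assumptions.

```python
def get_engagement_summary(phase_summaries):
    """Concatenate per-phase summaries in phase order (illustrative sketch)."""
    sections = [f"## Phase {phase}\n{text.strip()}"
                for phase, text in sorted(phase_summaries.items())]
    return "\n\n".join(sections)

overview = get_engagement_summary({
    1: "Recon found 42 endpoints across 2 domains.",
    0: "Target mapped; 3 domains in scope.",
})
```

The combined string can then be injected verbatim into a fresh subagent prompt in place of raw history.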
Counterfactual Analysis (Second-Pass Discovery)
- After an Analyzer completes with vulnerabilities found, a second Analyzer is spawned with instructions to "assume those vulns are patched".
- The counterfactual Analyzer searches for additional vulnerabilities: different endpoints, different parameters, different injection contexts, logic flaws.
- Results are appended to the existing exploitation queue (automatic merge with deduplication by endpoint+parameter and auto-incrementing IDs).
- Based on PenHeal ablation research showing +71% vulnerability coverage with counterfactual prompting.
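The merge-with-deduplication step can be sketched as follows; the queue-item field names are assumed for illustration.

```python
def merge_queues(existing, extra):
    """Append counterfactual queue items, deduplicating by (endpoint, param)
    and assigning auto-incrementing IDs. Illustrative sketch only."""
    seen = {(item["endpoint"], item["param"]) for item in existing}
    next_id = max((item["id"] for item in existing), default=0) + 1
    merged = list(existing)
    for item in extra:
        key = (item["endpoint"], item["param"])
        if key not in seen:
            merged.append({**item, "id": next_id})
            seen.add(key)
            next_id += 1
    return merged

q1 = [{"id": 1, "endpoint": "/search", "param": "q"}]
q2 = [{"id": 1, "endpoint": "/search", "param": "q"},       # duplicate, dropped
      {"id": 2, "endpoint": "/profile", "param": "uid"}]    # new, re-numbered
merged = merge_queues(q1, q2)
```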
Multi-Domain Support
- Automatic SSO/OAuth/OIDC/SAML detection and handling.
- Per-domain scope registration, crawling, and testing.
- Cookie jar management for cross-domain session persistence.
- 6-level authentication failure escalation (alternative grants → PKCE → headless browser → token extraction → user provision → unauthenticated).
Crash-Safe Engagement Management
- Append-only `findings.md` and `progress.log` survive crashes.
- Git workspace checkpointing with rollback capability.
- Auto-resume on interruption — `resume-prompt.md` auto-generated at every checkpoint with full context (target, credentials, current phase, remaining tests, scope). Paste it into a new session to continue exactly where you left off.
- Mid-phase checkpoint granularity — tracks which tests within a phase are completed, not just phase-level state.
- Full audit trail of every MCP tool call with timestamps.
Professional Reporting
- Markdown reports with executive summary, findings by severity, test coverage matrix, and tool coverage.
- Per-category coverage percentages and gap analysis.
- Vulnerability chaining analysis documented.
- Final Judge observations and quality notes included.
📦 Installation
git clone https://github.com/bhavsec/autopentest-ai.git
cd autopentest-ai
cd server && uv sync && cd ..
make setup
💻 Usage Examples
Option A: Interactive Mode
Launch Claude Code and provide the target:
Run a full pentest against https://app.example.com
Credentials: admin / P@ssw0rd123
Claude will ask for any missing information (like credentials) and begin the 7-phase workflow.
Option B: Config-Driven Mode (Recommended)
Create a YAML config file for repeatable, consistent assessments:
target:
  url: https://app.example.com
  scope:
    - app.example.com
    - api.example.com
  exclude:
    - cdn.example.com
authentication:
  login_type: form
  login_url: https://app.example.com/login
  credentials:
    username: testuser@example.com
    password: secret123
  login_flow:
    - "Type $username into the email field"
    - "Type $password into the password field"
    - "Click the 'Sign In' button"
  success_condition:
    type: url_contains
    value: "/dashboard"
rules:
  avoid:
    - description: "Do not test logout"
      type: path
      url_path: "/logout"
  focus:
    - description: "Prioritize API endpoints"
      type: path
      url_path: "/api"
reporting:
  tester_name: "Security Team"
Then in Claude Code:
Load the config from configs/my-target.yaml and run the pentest
Option C: Targeted Testing
Run specific WSTG tests against specific endpoints:
Run WSTG-INPV-05 (SQL Injection) against https://app.example.com/search?q=
Test https://app.example.com for CORS misconfiguration (WSTG-CONF-13)
Run all authentication tests (WSTG-ATHN) against https://app.example.com
Option D: Resume an Interrupted Engagement
Resume engagement pentest-2026-02-11-myapp
📚 Documentation
Agent Role System
AutoPentest uses 4 specialized agent roles instead of generic subagents. Each role has a dedicated prompt template with focused tool guidance, input/output contracts, and anti-patterns.
| Role | Template | Purpose | Phases |
|---|---|---|---|
| Scout | `templates/agent-roles/scout.md` | Reconnaissance and attack surface mapping | Phases 0-1, source code discovery |
| Analyzer | `templates/agent-roles/analyzer.md` | Vulnerability discovery with canary/witness payloads | Phases 2-5 analysis |
| Exploiter | `templates/agent-roles/exploiter.md` | Exploitation proof with evidence | Phase 4 exploitation |
| Reporter | `templates/agent-roles/reporter.md` | Quality review and Final Judge | Phase transitions, post-report |
How the Pipeline Works
Phase 4 (highest-impact testing) uses a two-stage pipeline per vulnerability class:
┌──────────────────────────────────────────────────────────────┐
│ Pipeline 1: XSS │
│ │
│ Analyzer (75 turns) Exploiter (75 turns) │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Discover endpoints │ │ Load Analyzer queue │ │
│ │ Send canary payloads│─────▶│ Attempt exploitation│ │
│ │ Build exploit queue │ gate │ Prove impact │ │
│ │ Save deliverable │ │ Log findings │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ ▲ │
│ validate_exploitation_queue() │
└──────────────────────────────────────────────────────────────┘
Three pipelines (XSS, Injection, SSRF/SSTI) run in parallel. The validation checkpoint between Analyzer and Exploiter ensures only well-formed exploitation queues proceed.
Role Boundaries
Each role has explicit tool restrictions enforced through prompts:
- Scouts cannot call `log_finding()` or send attack payloads.
- Analyzers can log configuration findings (missing headers, weak cookies) but not injection-class findings.
- Exploiters cannot create new queues — they consume what the Analyzer produced.
- Reporters cannot send HTTP requests to the target — they review data only.
For CTF challenges and small apps (<3 input endpoints), a legacy monolithic pipeline is available as a fallback.
Testing Phases
Phase 0: Application Discovery & Mapping
The critical foundation phase. Claude autonomously:
- Pre-flight checks — verifies target reachability, detects redirects and cross-domain auth.
- Launches 10+ background tools in parallel (katana, ffuf, nuclei, whatweb, gau, nmap, feroxbuster, wapiti, httpx).
- Recursive crawling — follows links to depth 2-3, parses HTML/JS for endpoints.
- Directory brute-forcing — common paths + technology-specific wordlists.
- Tool result ingestion — reads all background tool outputs and merges into unified endpoint map.
- Builds structured endpoint inventory with parameters, auth requirements, and priority rankings.
Output: A complete endpoint map organized by domain, ready for systematic testing.
Phases 1-2: Reconnaissance & Configuration
- Server fingerprinting, technology detection, metadata review.
- Security header analysis (HSTS, CSP, CORS, X-Frame-Options).
- TLS configuration testing, admin interface discovery.
- HTTP methods testing, file extension handling.
Phase 3: Authentication, Authorization & Session Management
- Role/privilege lattice built before testing (maps guards, middleware, and bypass tests).
- IDOR testing with multiple alternate IDs per endpoint.
- CSRF testing on every state-changing endpoint.
- Session fixation, hijacking, and token analysis.
- JWT vulnerability testing (if applicable).
- OAuth/OIDC weakness testing (if applicable).
Phase 4: Input Validation (Highest Impact)
Three independent two-stage pipelines run in parallel, each using the Analyzer→Exploiter role split:
| Pipeline | Vulnerability Classes | Tools | Technique Guides |
|---|---|---|---|
| XSS Pipeline | Reflected XSS, Stored XSS, DOM XSS | dalfox, Playwright | XSS, DOM |
| Injection Pipeline | SQL Injection, Command Injection, NoSQL Injection | sqlmap, commix, nosqli | SQLI, CMDI, NOSQLI |
| SSRF/SSTI Pipeline | SSRF, SSTI, Path Traversal | sstimap, ssrfmap | SSRF, SSTI, PTRAV |
Each pipeline: Analyzer (discover → analyze → build exploitation queue) → validation checkpoint → Exploiter (attempt exploitation → prove impact → log findings). WAF evasion intelligence is shared across all pipelines.
Phase 5: Error Handling, Crypto, Business Logic, Client-Side & APIs
- Stack trace and error message disclosure.
- TLS/SSL testing via testssl.sh.
- Business logic bypass (workflow circumvention, request forgery).
- Client-side testing (clickjacking, open redirects, DOM manipulation).
- GraphQL and REST API testing.
- Vulnerability chaining analysis across all findings.
Phase 6: Reporting
- Coverage verification (test coverage + tool coverage).
- Finding deduplication and severity calibration.
- Markdown report generation with executive summary, findings, coverage matrices.
Phase 7: Final Judge Review
A zero-context agent reviews the entire engagement cold — no knowledge of testing decisions or difficulties. It examines:
- Coverage integrity — rubber-stamped tests, missing endpoints.
- N/A cascade detection — categories with excessive "not applicable" markings.
- Finding quality — evidence completeness, severity consistency, chaining opportunities.
- Tool utilization — tools run but output never reviewed, lazy skip reasons.
- Missed attack surface — untested endpoints, untested parameters, untested domains.
The verdict (PASS/CONDITIONAL_PASS/FAIL) triggers specific remediation actions before the report is delivered.
Security Tools
Discovery & Reconnaissance (Phase 0)
| Tool | Purpose | Key Flags |
|---|---|---|
| katana | Web crawler with JS rendering | `-jc` for JavaScript crawling |
| httpx | HTTP probing, tech detection | `-tech-detect -status-code -title` |
| ffuf | Directory/parameter fuzzing | `-w wordlist -mc all -fc 404` |
| feroxbuster | Recursive directory enumeration | `--smart --auto-tune` |
| nuclei | Template-based vuln scanner | `-t cves/ -t misconfigurations/` |
| nikto | Web server misconfiguration | `-Tuning 1234567890` |
| whatweb | Technology fingerprinting | `--aggression 3` |
| nmap | Port and service scanning | `-sV -sC --top-ports 1000` |
| gau | Historical URL discovery | `--blacklist png,jpg,gif` |
| subfinder | Subdomain enumeration | `-silent -all` |
Injection Testing (Phase 4)
| Tool | Purpose | Key Flags |
|---|---|---|
| sqlmap | SQL injection (all techniques) | `--batch --risk 3 --level 5` |
| dalfox | XSS scanning & exploitation | `--skip-bav --deep-domxss` |
| commix | Command injection | `--batch --all` |
| sstimap | Server-Side Template Injection | `-u <url>` |
| ssrfmap | SSRF exploitation | `-r request.txt` |
| nosqli | NoSQL injection | `-u <url>` |
| crlfuzz | CRLF injection / HTTP splitting | `-u <url>` |
| smuggler | HTTP request smuggling | `-u <url>` |
Authentication & Session (Phase 3)
| Tool | Purpose | Key Flags |
|---|---|---|
| hydra | Credential brute-force | `-L users.txt -P pass.txt` |
| jwt_tool | JWT token analysis & exploitation | `-t <token> -M at` |
Cryptography & APIs (Phase 5)
| Tool | Purpose | Key Flags |
|---|---|---|
| testssl.sh | TLS/SSL configuration testing | `--severity HIGH --sneaky` |
| graphql-cop | GraphQL security testing | `-t <url>` |
| websocat | WebSocket testing | `ws://<url>` |
Infrastructure (Phase 2)
| Tool | Purpose |
|---|---|
| corscanner | CORS misconfiguration scanning |
| dnsreaper | Subdomain takeover detection |
Browser Automation
| Tool | Purpose |
|---|---|
| Playwright | DOM XSS proof, clickjacking, JS-rendered login, client-side storage inspection |
WSTG Knowledge Base
109 test cases across 12 OWASP categories, each with CLI-specific procedures:
| Code | Category | Tests | Examples |
|---|---|---|---|
| INFO | Information Gathering | 10 | Search engine discovery, server fingerprinting, metadata review |
| CONF | Configuration & Deployment | 14 | Security headers, CORS, CSP, HSTS, admin interfaces |
| IDNT | Identity Management | 5 | Role definitions, registration, account enumeration |
| ATHN | Authentication | 11 | Default creds, lockout, auth bypass, MFA, password policy |
| ATHZ | Authorization | 5 | Directory traversal, auth bypass, privilege escalation, IDOR |
| SESS | Session Management | 11 | Cookie attributes, CSRF, session fixation/hijacking, JWT |
| INPV | Input Validation | 20 | XSS, SQLi, CMDi, SSTI, SSRF, path traversal, XXE, LDAP |
| ERRH | Error Handling | 2 | Error messages, stack traces |
| CRYP | Cryptography | 4 | TLS config, padding oracle, weak encryption |
| BUSL | Business Logic | 10 | Workflow bypass, request forgery, file upload, rate limits |
| CLNT | Client-Side | 14 | DOM XSS, clickjacking, open redirects, WebSockets, storage |
| APIT | API Testing | 3 | GraphQL, REST, SOAP |
Each test file includes:
- Step-by-step CLI procedures (curl commands, tool invocations).
- Payloads organized by bypass level (basic, intermediate, advanced).
- Detection criteria with severity assessment rubrics.
- Remediation guidance with references.
PortSwigger Technique Guides
31 attack technique reference guides sourced from PortSwigger Web Security Academy, organized by vulnerability class for direct use during real pentesting engagements.
What's Included
| Code | Category | WSTG Mapping | Key Content |
|---|---|---|---|
| SQLI | SQL Injection | INPV-05 | UNION/blind/error/time-based/OOB techniques, database-specific cheat sheets (Oracle, MySQL, PostgreSQL, MSSQL), WAF bypass |
| XSS | Cross-Site Scripting | INPV-01, INPV-02, CLNT-01 | Reflected/stored/DOM contexts, tag & event handler payloads, CSP bypass, filter evasion |
| CMDI | OS Command Injection | INPV-12 | Separator characters, blind techniques (time-delay, OOB), OS-specific payloads |
| SSTI | Server-Side Template Injection | INPV-18 | Jinja2/Twig/Freemarker/Velocity/ERB detection & exploitation, sandbox escapes |
| SSRF | Server-Side Request Forgery | INPV-19 | URL scheme tricks, IP obfuscation, DNS rebinding, cloud metadata, filter bypass |
| PTRAV | Path Traversal | INPV-04 | Encoding variations, null byte injection, wrapper bypass |
| XXE | XML External Entities | INPV-07 | File retrieval, SSRF via XXE, blind XXE with OOB, parameter entities |
| AUTHN | Authentication | ATHN-01 to ATHN-07 | Brute force, 2FA bypass, password reset poisoning, credential stuffing |
| AUTHZ | Access Control | ATHZ-01 to ATHZ-04 | IDOR, privilege escalation, horizontal/vertical bypass, referer-based controls |
| JWT | JSON Web Tokens | SESS-10 | Algorithm confusion (none/HS256→RS256), kid injection, JWK/JKU exploitation |
| OAUTH | OAuth 2.0 | ATHZ-05 | Authorization code theft, open redirect, scope upgrade, CSRF on OAuth flows |
| CSRF | Cross-Site Request Forgery | SESS-05 | Token bypass, SameSite bypass, referer validation bypass |
| SMUGGLE | HTTP Request Smuggling | INPV-15 | CL.TE, TE.CL, TE.TE, HTTP/2 downgrade, request tunneling |
| DOM | DOM-Based Vulnerabilities | CLNT-01 | Sources/sinks, DOM clobbering, prototype pollution gadgets |
| CORS | Cross-Origin Resource Sharing | CONF-13, CLNT-07 | Origin reflection, null origin, subdomain trust exploitation |
| NOSQLI | NoSQL Injection | INPV-05 | MongoDB operator injection, JavaScript injection, blind extraction |
| GRAPHQL | GraphQL | APIT-01 | Introspection, field suggestion, batching attacks, authorization bypass |
| RACE | Race Conditions | BUSL-04 | Limit overrun, TOCTOU, single-endpoint races, last-frame sync |
| UPLOAD | File Upload | BUSL-08, BUSL-09 | Extension bypass, content-type manipulation, web shells, polyglot files |
| HOST | Host Header Injection | INPV-17 | Password reset poisoning, cache poisoning, routing-based SSRF |
Plus 11 more: CLICK, WS, CACHEPOIS, CACHEDEC, DESER, INFO, BUSL, PROTO, API, LLM, SKILLS.
How They're Used
Technique guides are integrated into every testing phase via the get_technique_guide() MCP tool:
Phase 2 → CORS guide for CONF-13 testing
Phase 3 → AUTHN, AUTHZ, CSRF, JWT, OAUTH guides for auth/session testing
Phase 4 → SQLI, XSS, CMDI, SSTI, SSRF, PTRAV, XXE guides for input validation
Phase 5 → DOM, CLICK, GRAPHQL, RACE, UPLOAD guides for client-side & business logic
Each parallel testing agent automatically loads its relevant technique guide before testing, providing:
- Detection payloads — what to inject to identify the vulnerability.
- Exploitation techniques — organized by attack method with step-by-step procedures.
- Cheat sheets — database/platform-specific syntax tables for quick reference.
- WAF bypass patterns — encoding, obfuscation, and filter evasion strategies.
Adding Custom Guides
See for instructions on adding new technique guides to the knowledge base.
Quality Assurance System
AutoPentest has a multi-layered QA system that prevents shallow testing:
1. Phase Gates (Automated)
After each phase, phase_gate_check() validates:
- All MUST-priority tests were executed.
- Minimum coverage thresholds are met.
- Tool coverage is adequate.
- No critical gaps exist.
Blocked phases cannot proceed until all issues are resolved.
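In sketch form, a gate check like `phase_gate_check()` might look like the following; the threshold value and record fields are assumptions, not the actual implementation.

```python
def phase_gate_check(tests, min_coverage=0.8):
    """Pass only if every MUST test ran and coverage meets the threshold
    (illustrative sketch of the gate logic described above)."""
    must_missing = [t["id"] for t in tests
                    if t["priority"] == "MUST" and t["status"] == "pending"]
    executed = sum(1 for t in tests if t["status"] != "pending")
    coverage = executed / len(tests) if tests else 0.0
    passed = not must_missing and coverage >= min_coverage
    return {"pass": passed, "must_missing": must_missing, "coverage": coverage}

gate = phase_gate_check([
    {"id": "CONF-01", "priority": "MUST", "status": "done"},
    {"id": "CONF-02", "priority": "SHOULD", "status": "pending"},
], min_coverage=0.5)
```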
2. Quality Reviewer (Per-Phase)
A subagent spawned at every phase transition that:
- Checks for 16 known anti-patterns (rubber-stamping, N/A cascades, finding inflation).
- Identifies untested endpoints and parameters.
- Suggests vulnerability chaining opportunities.
- Recommends alternative approaches for blocked tests.
3. Final Judge (Post-Report)
A zero-context agent that reviews the completed engagement with fresh eyes:
- Analyzes coverage integrity across all domains.
- Detects N/A cascades and their root causes.
- Validates finding quality and evidence completeness.
- Identifies missed attack surface.
- Issues a verdict: PASS, CONDITIONAL_PASS, or FAIL.
4. Exhaustion Gates
Marking a vulnerability as "not exploitable" requires proof of effort:
| Vuln Class | Min Techniques | Min Bypass Attempts |
|---|---|---|
| XSS | 3 | 5 |
| SQL Injection | 3 | 5 |
| Command Injection | 3 | 5 |
| SSTI | 2 | 3 |
| SSRF | 3 | 5 |
| Path Traversal | 3 | 5 |
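The thresholds above translate directly into a small gate check. This is a sketch; the real implementation presumably tracks attempts per endpoint and parameter.

```python
# Minimums taken from the exhaustion gate table: (techniques, bypass attempts).
EXHAUSTION_MINIMUMS = {
    "xss": (3, 5), "sqli": (3, 5), "cmdi": (3, 5),
    "ssti": (2, 3), "ssrf": (3, 5), "ptrav": (3, 5),
}

def exhaustion_gate(vuln_class, techniques_tried, bypasses_tried):
    """Allow a 'not exploitable' verdict only after minimum effort."""
    min_t, min_b = EXHAUSTION_MINIMUMS[vuln_class]
    return techniques_tried >= min_t and bypasses_tried >= min_b

ok = exhaustion_gate("ssti", techniques_tried=2, bypasses_tried=3)
too_few = exhaustion_gate("xss", techniques_tried=1, bypasses_tried=5)
```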
5. Evidence Checklists
Before logging any finding, evidence requirements are verified:
- Reproducible curl command.
- Full HTTP request and response.
- Proof of actual exploitation (not theoretical impact).
- Correct classification tier (EXPLOITED vs POTENTIAL).
6. Live Engagement Logging
Every MCP tool call is automatically logged to engagements/<eid>/logs.txt with full arguments, results, and execution duration. Run tail -f logs.txt in a separate terminal to watch all agent activity in real time. 100% coverage via automatic tool wrapper — no manual instrumentation needed.
7. Phase Gate Timing
Phase gates enforce a minimum 60-second interval between calls (15 seconds in CTF mode), preventing premature phase completion. Inter-gate work verification warns if fewer than 3 work events occur between consecutive gates.
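The interval enforcement can be sketched with a monotonic clock; the class below is illustrative, not AutoPentest's actual timer.

```python
import time

class GateTimer:
    """Reject gate calls spaced closer than min_interval seconds."""
    def __init__(self, min_interval=60.0):
        self.min_interval = min_interval
        self.last_call = None

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last_call is not None and now - self.last_call < self.min_interval:
            return False          # premature call, timestamp not updated
        self.last_call = now
        return True

timer = GateTimer(min_interval=60.0)   # would be 15.0 in CTF mode
first = timer.allow(now=0.0)
premature = timer.allow(now=30.0)
later = timer.allow(now=70.0)
```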
Benchmarking
AutoPentest includes integration with the XBOW Validation Benchmarks — 104 CTF-style Docker challenges used as the industry standard for benchmarking AI pentest agents.
Benchmark Scores (Reference)
| Agent | Score | Source |
|---|---|---|
| Shannon | 96.2% | KeygraphHQ (2024) |
| PentestGPT | 86.5% | USENIX Sec 2024 |
Usage
cd benchmarks/xbow && make setup
make solve ID=XBEN-001-24
make solve ID=XBEN-001-24 RAW=1
make solve-tag TAG=sqli
make solve-all
make solve-all RAW=1
make score
make compare
The solver has two modes:
- autopentest (default): Runs Claude Code from the project root, loading `.mcp.json` (MCP server with 68+ tools) and `CLAUDE.md` (pentest methodology). Measures AutoPentest's full capability.
- raw (`RAW=1`): Runs bare Claude Code with no MCP server or methodology. Baseline for measuring AutoPentest's value-add over raw LLM capability.
Each challenge is a Docker Compose app with a flag injected at build time. Flag extraction from Claude's output determines pass/fail. Results are scored per challenge, per tag, and per difficulty level.
CTF Mode
For CTF challenges and small apps, enable CTF mode for relaxed quality gates:
mode: ctf
target:
  url: https://target.com
CTF mode reduces phase gate timing (15s vs 60s), skips QA Reviewer requirements, and halves completion thresholds — while maintaining finding quality and evidence standards.
Example Report
A complete example report from a pentest against PortSwigger's Gin & Juice Shop (a deliberately vulnerable application) is included in the repository:
View Full Report
What the Report Includes
The report demonstrates AutoPentest's output against a real target with 23 findings across all severity levels:
| Severity | Count | Examples |
|---|---|---|
| Critical | 2 | UNION-based SQL injection with full data extraction, access control bypass via X-Original-URL header |
| High | 5 | Reflected XSS via JS string escape bypass, IDOR on order details, XXE with local file read, DOM XSS via prototype pollution |
| Medium | 6 | Missing security headers, no account lockout, missing CSP, CRLF injection, DOM-based open redirect |
| Low | 5 | Infrastructure info disclosure, EOL AngularJS, insecure ALB cookies, weak TLS config |
| Informational | 5 | Consolidated duplicates and secondary evidence for primary findings |
Report Structure
1. Executive Summary — Target scope, finding summary, domain architecture
2. Detailed Findings — Each finding with description, evidence (curl commands), and remediation
3. Vulnerability Chaining — Cross-finding analysis (e.g., XSS + no CSP = severity upgrade)
4. Test Coverage Matrix — Per-category WSTG coverage (100% across 12 categories)
5. Tool Coverage Matrix — 27/27 tools tracked, 8 actively run
Sample Finding (SQL Injection)
From the report — a Critical SQL injection finding with full exploitation evidence:
FINDING-017: SQL Injection in /catalog category parameter — Full Data Extraction
Severity: Critical
WSTG Reference: WSTG-INPV-05
The category parameter is vulnerable to UNION-based SQL injection.
The attacker can:
1. Inject a single quote to cause a 500 error (confirming injection)
2. Use UNION SELECT with 8 columns to extract arbitrary data
3. Enumerate tables: PRODUCTS, TRACKING, USERS
4. Extract credentials from the USERS table
Evidence (reproducible curl command):
curl -sk "https://ginandjuice.shop/catalog?category='+UNION+SELECT+1,USERNAME,PASSWORD,1,1,USERNAME,1,USERNAME+FROM+USERS+LIMIT+10--"
Every finding includes reproducible curl commands, full request/response evidence, and actionable remediation guidance.
Configuration
Engagement Config (YAML)
Config-driven pentests skip interactive questions and ensure consistency:
target:
  url: https://app.example.com
  scope: [app.example.com, api.example.com]
authentication:
  login_type: sso
  login_url: https://app.example.com/login
  credentials:
    username: testuser
    password: secret123
  sso:
    provider: keycloak
    auth_domain: auth.example.com
    realm: myrealm
    client_id: my-app
rules:
  avoid:
    - { type: path, url_path: "/logout", description: "Skip logout" }
    - { type: endpoint, method: DELETE, url_path: "/api/admin/*", description: "No destructive admin ops" }
  focus:
    - { type: path, url_path: "/api", description: "Prioritize API" }
reporting:
  tester_name: "Security Team"
MCP Server Configuration
The .mcp.json file registers two MCP servers:
{
  "mcpServers": {
    "wstg-pentest": {
      "command": "uv",
      "args": ["--directory", "./server", "run", "server.py"]
    },
    "playwright": {
      "command": "npx",
      "args": ["-y", "@playwright/mcp"]
    }
  }
}
Burp Suite Integration (Optional)
For passive traffic monitoring through Burp Suite Professional:
- Start Burp Suite and enable the proxy on all interfaces (`0.0.0.0:8080`).
- The Docker container automatically routes traffic through `host.docker.internal:8080`.
- All HTTP requests appear in Burp's proxy history for manual review.
Multi-Domain Testing
AutoPentest has first-class support for applications with multiple domains (e.g., a SPA frontend + API backend + SSO provider):
Automatic Detection
During Phase 0, AutoPentest detects cross-domain authentication by following login redirects:
app.example.com → redirects to → auth.example.com/login
→ after login → app.example.com/callback
All domains are automatically registered in scope with their type (app, auth_provider, api, cdn).
Per-Domain Testing
Every WSTG test is evaluated per domain — not just the primary:
- Discovery tools (katana, ffuf, nuclei) run against all domains.
- Input validation tools (sqlmap, dalfox) target endpoints on every domain with server-side processing.
- A test is "not applicable" only when no domain has the tested feature.
Cross-Domain Authentication
Supported SSO protocols:
- OAuth 2.0 / OIDC (Authorization Code, PKCE, Password Grant, Client Credentials).
- SAML (SP-initiated flow).
- Keycloak, Auth0, Okta, Azure AD.
- Custom SSO (redirect chain following with cookie jar).
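The "redirect chain following with cookie jar" approach boils down to one shared cookie jar across domains, so a session cookie set by auth.example.com rides along on subsequent requests to app.example.com. A stdlib-only sketch (the login request is commented out because it needs a live target):

```python
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical sketch of custom-SSO handling: one CookieJar shared across
# domains, attached to an opener that follows redirects automatically.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
# opener.open("https://app.example.com/login")  # follows redirects, accumulating session cookies
print(len(jar))  # 0 — empty until a real login flow runs
```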
A 6-level authentication escalation procedure ensures testing can proceed even with complex auth flows.
Crash Recovery
AutoPentest is designed to survive interruptions:
Automatic Checkpointing
- Phase gates auto-save checkpoints on PASS.
- git_checkpoint() creates git snapshots of the engagement workspace.
- Append-only logs (findings.md, progress.log) survive crashes.
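Append-only logging is what makes the logs crash-safe: opening in append mode means an interrupted run never truncates earlier entries. An illustrative helper (not AutoPentest's actual logger):

```python
import datetime
import pathlib
import tempfile

def log_event(engagement_dir, message):
    """Append a timestamped line to progress.log. Append mode ("a") means a
    crash mid-run never truncates earlier entries. Illustrative helper only."""
    ts = datetime.datetime.now(datetime.timezone.utc).isoformat()
    path = pathlib.Path(engagement_dir) / "progress.log"
    with path.open("a", encoding="utf-8") as f:
        f.write(f"{ts} {message}\n")

with tempfile.TemporaryDirectory() as d:
    log_event(d, "PHASE_GATE phase=2 result=PASS")
    log_event(d, "CHECKPOINT saved")
    lines = (pathlib.Path(d) / "progress.log").read_text().splitlines()
print(len(lines))  # 2
```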
Auto-Resume via resume-prompt.md (Recommended)
Every checkpoint and phase gate automatically generates engagements/<eid>/resume-prompt.md — a complete, self-contained prompt with everything a fresh session needs:
- Target URL, authentication credentials, and scope domains.
- Current phase and which specific tests remain (mid-phase precision).
- Cookie jar status and re-authentication instructions.
- Avoid/focus rules and endpoint map references.
To resume after an interruption:
- Open a new Claude Code session.
- Paste the contents of engagements/<eid>/resume-prompt.md.
- Claude picks up exactly where it left off — no manual context needed.
Resume from Checkpoint (Alternative)
Resume engagement pentest-2026-02-11-myapp
This restores:
- All findings and test tracking data.
- Coverage statistics and phase gate results.
- Scope registrations and deliverables.
- Mid-phase remaining tests (not just phase-level state).
- Instructions for what to do next.
Manual Checkpoints
Save at any time:
Save a checkpoint before starting Phase 4 exploitation
Rollback on Failure
If a phase produces bad results, roll back to the previous checkpoint:
Roll back the engagement to the last checkpoint
Project Structure
autopentest-ai/
├── CLAUDE.md # Master pentest workflow (drives Claude Code)
├── .mcp.json # MCP server configuration
├── Dockerfile # Multi-stage Docker build (27 tools)
├── docker-compose.yml # Docker Compose alternative
├── Makefile # setup, start, stop, verify-tools, shell
│
├── server/
│ ├── server.py # FastMCP server (68+ MCP tools)
│ ├── task_tree.py # Hierarchical task tree (6 MCP tools)
│ ├── tool_parsers.py # Tool output parsing (2 MCP tools, 13 parsers)
│ ├── endpoint_priority.py # Endpoint risk prioritization (2 MCP tools)
│ ├── waf_evasion.py # Adaptive WAF evasion (3 MCP tools, 12 vendors)
│ ├── knowledge_graph.py # Cross-phase knowledge graph (5 MCP tools)
│ ├── tool_verification.py # CLI tool results verification (1 MCP tool, 10 validators)
│ ├── context_compression.py # Progressive context compression (2 MCP tools)
│ └── pyproject.toml # Python dependencies
│
├── knowledge-base/
│ ├── web-security-testing-guide/ # OWASP WSTG knowledge base (109 test procedures)
│ │ ├── 01-information-gathering/ # 10 tests (WSTG-INFO-01 → 10)
│ │ ├── 02-configuration/ # 14 tests (WSTG-CONF-01 → 14)
│ │ ├── 03-identity-management/ # 5 tests (WSTG-IDNT-01 → 05)
│ │ ├── 04-authentication/ # 11 tests (WSTG-ATHN-01 → 11)
│ │ ├── 05-authorization/ # 5 tests (WSTG-ATHZ-01 → 05)
│ │ ├── 06-session-management/ # 11 tests (WSTG-SESS-01 → 11)
│ │ ├── 07-input-validation/ # 20 tests (WSTG-INPV-01 → 20)
│ │ ├── 08-error-handling/ # 2 tests (WSTG-ERRH-01 → 02)
│ │ ├── 09-cryptography/ # 4 tests (WSTG-CRYP-01 → 04)
│ │ ├── 10-business-logic/ # 10 tests (WSTG-BUSL-01 → 10)
│ │ ├── 11-client-side/ # 14 tests (WSTG-CLNT-01 → 14)
│ │ └── 12-api-testing/ # 3 tests (WSTG-APIT-01 → 03)
│ └── portswigger-academy/ # 31 PortSwigger attack technique guides
│ ├── sql-injection.md # UNION, blind, error-based, OOB, WAF bypass
│ ├── cross-site-scripting.md # Reflected, stored, DOM, CSP bypass, filter evasion
│ ├── ssrf.md # URL schemes, cloud metadata, DNS rebinding
│ ├── ssti.md # Jinja2, Twig, Freemarker sandbox escapes
│ ├── jwt.md # Algorithm confusion, kid injection, JWK exploitation
│ ├── oauth.md # Auth code theft, redirect exploitation, scope upgrade
│ └── ... (31 total) # One per vulnerability class
│
├── templates/ # Testing guides and procedures
│ ├── input-validation-guide.md # Phase 4 step-by-step procedures
│ ├── testing-strategies.md # Test matrices, chaining, parallel strategy
│ ├── cli-tools-guide.md # Tool setup and Docker management
│ ├── tools.md # Per-tool command reference
│ ├── quality-gates.md # Phase quality checklists and anti-patterns
│ ├── cross-domain-auth-guide.md # SSO/OIDC/SAML procedures
│ ├── source-code-analysis.md # Security-focused code review template
│ ├── pipelined-testing.md # Phase 4 pipelined exploitation strategy
│ ├── agent-roles/ # Role-specialized subagent templates
│ │ ├── README.md # Role index and selection guide
│ │ ├── scout.md # Reconnaissance role (Phase 0-1)
│ │ ├── analyzer.md # Vulnerability discovery role (Phase 2-5)
│ │ ├── exploiter.md # Exploitation proof role (Phase 4)
│ │ └── reporter.md # QA review + Final Judge role
│ ├── shared/
│ │ ├── honesty-framework.md # Anti-hallucination guardrails
│ │ ├── exploit-classification.md # Three-tier finding classification
│ │ ├── reproducibility.md # Evidence format requirements
│ │ └── scope-rules.md # Avoid/focus rule templates
│ └── wordlists/ # Tech-specific fuzzing wordlists
│
├── benchmarks/
│ └── xbow/ # XBOW benchmark suite (104 CTF challenges)
│ ├── runner.py # Challenge orchestration
│ ├── solver.py # Automated solver (Claude Code CLI)
│ ├── Makefile # solve, solve-all, score, compare
│ └── results/ # Run reports
│
├── docs/
│ ├── ROADMAP.md # Competitive analysis + improvement roadmap
│ └── adding-knowledge-base-resources.md # Guide for adding new technique guides
│
├── configs/
│ ├── example-config.yaml # Example engagement configuration
│ └── config-schema.md # YAML schema documentation
│
├── scripts/
│ ├── install-tools.sh # Docker build + container start
│ ├── browser-auth.py # Headless Chromium auth (JS-rendered logins)
│ ├── pkce-auth.py # OAuth 2.0 PKCE flow automation
│ └── status.sh # Engagement status dashboard
│
└── engagements/ # Runtime output (git-ignored)
└── <engagement-id>/
├── logs.txt # Live engagement log (tail -f to watch)
├── findings.md # Append-only findings log
├── progress.log # Timestamped event log
├── resume-prompt.md # Auto-resume prompt (paste into new session)
├── report.md # Final pentest report
├── cookies.txt # Cross-domain cookie jar
└── tool-output/ # Raw CLI tool outputs
🔧 Technical Details
Why AutoPentest?
Manual penetration testing is thorough but slow. Automated scanners are fast but shallow. AutoPentest bridges the gap:
| Capability | Manual Pentest | Automated Scanner | AutoPentest |
|---|---|---|---|
| Full OWASP WSTG coverage | Depends on tester | Partial | 109 tests |
| Business logic testing | Yes | No | Yes |
| Multi-step exploitation | Yes | Limited | Yes |
| Vulnerability chaining | Yes | No | Yes |
| Evidence-based findings | Yes | Template output | Reproducible curl commands |
| Consistent quality | Varies | Yes | Phase gates + Final Judge |
| Speed | Days | Minutes | Hours |
| Cross-domain auth (SSO/OIDC) | Manual setup | Usually fails | Automated handling |
Architecture
┌─────────────────────────────────────────────────────────────┐
│ LLM Orchestrator (Claude) │
│ │
│ Reads CLAUDE.md workflow, manages phases, │
│ spawns role-specialized subagents │
└──────────┬──────────┬──────────┬──────────┬─────────────────┘
│ │ │ │
┌─────▼────┐ ┌───▼─────┐ ┌──▼───────┐ ┌▼─────────┐
│ Scout │ │Analyzer │ │Exploiter │ │ Reporter │
│ (recon) │ │ (vuln │ │ (proof) │ │ (QA / │
│ │ │ disc.) │ │ │ │ judge) │
└──────────┘ └─────────┘ └──────────┘ └──────────┘
│ │ │ │
│ MCP │ │ MCP │
▼ ▼ ▼ ▼
┌──────────────────────────┐ ┌──────────────────────┐
│ WSTG MCP Server │ │ Playwright MCP │
│ (68+ tools) │ │ (Browser Testing) │
│ │ │ │
│ ◦ 109 WSTG tests │ │ ◦ DOM XSS proof │
│ ◦ 31 technique guides │ │ ◦ Clickjacking │
│ ◦ Task tree │ │ ◦ JS-rendered auth │
│ ◦ Knowledge graph │ └──────────────────────┘
│ ◦ WAF evasion │
│ ◦ Tool output parser │
│ ◦ Results verification │ docker exec
│ ◦ Context compression │ │
│ ◦ Endpoint priority │ ▼
│ ◦ Quality gates │ ┌──────────────────────┐
│ ◦ Report generation │ │ autopentest-tools │
└──────────────────────────┘ │ (Docker Container) │
│ │
│ 27 security tools: │
│ nuclei, sqlmap, │
│ dalfox, katana, │
│ ffuf, nmap ... │
│ │
│ Burp proxy │
│ passthrough │
└──────────────────────┘
How it works:
- Claude Code reads CLAUDE.md for the complete pentest methodology and orchestrates the 7-phase workflow.
- Role-specialized subagents (Scout, Analyzer, Exploiter, Reporter) execute focused tasks with dedicated prompt templates, tool guidance, and anti-patterns.
- WSTG MCP Server (68+ tools) provides OWASP test procedures, 31 PortSwigger technique guides, hierarchical task tree, knowledge graph, WAF evasion, endpoint prioritization, results verification, context compression, quality gates, and report generation.
- Docker Container runs all 27 security tools — traffic optionally routes through Burp Suite for passive monitoring.
- Playwright MCP handles browser-based testing (DOM XSS, clickjacking, JS-rendered login pages).
📄 License
This tool is intended for authorized security testing only. Only use AutoPentest against applications you have explicit permission to test. Unauthorized access to computer systems is illegal. The authors are not responsible for any misuse of this tool.
Always ensure you have:
- Written authorization from the application owner.
- A clearly defined scope of what can and cannot be tested.
- An understanding of the testing environment (production vs staging).
- Appropriate avoid rules configured for destructive or sensitive endpoints.
Built with Model Context Protocol