Autonomous Cognitive Architecture for Offensive Cyberspace Operations

A Recursive Large Language Model Framework for Automated Vulnerability Assessment and Penetration Testing

Final Year Engineering Project

Slide 1

Literature Review

Evolution of Automated Penetration Testing

Generation 1: Scripted Automation

Tools: Metasploit Pro, Core Impact
Deterministic execution
Limited adaptability

Generation 2: ML Classifiers

Supervised learning for traffic analysis
Defensive focus (IDS/IPS)
Lacks generative capability

Generation 3: Generative AI & Cognitive Agents (Current)

Key Innovation: LLMs as reasoning cores enable generalization and contextual understanding. Systems can now adapt to unseen scenarios and generate custom exploitation code.

Seminal Research Frameworks

PentestGPT (Deng et al., 2023)

Contribution: Introduced modular design separating reasoning from code generation

Reasoning Module: High-level strategy and decision-making
Generation Module: Low-level command synthesis
Parsing Module: Output interpretation

Limitation: Requires a human in the loop to relay commands and tool output between the model and the target

PentestAgent (Shen et al., 2024)

Innovation: Multi-Agent System (MAS) with specialized agents

Recon Agent, Web Assessment Agent, Exploitation Agent
Introduced "Shadow Graph" for inter-agent communication
Enables collaborative intelligence

AutoPentest (Henke et al., 2025)

Focus: Attack tree traversal and cost-effectiveness

Decision-tree approach for systematic testing
Cost analysis: about $96 per automated engagement vs. thousands of dollars for human testers

PenHeal (Huang & Zhu, 2024)

Focus: Automated remediation phase

Counterfactual prompting for root cause analysis
Generates actionable fix recommendations

Slide 2

Theoretical Background

1. OODA Loop as Cognitive Framework

OBSERVE → ORIENT → DECIDE → ACT

↓

Feedback Loop (Recursive)

Phase	Description	Implementation
Observe	Gather environmental data	Execute scanners (Nmap, curl)
Orient	Analyze against knowledge base	LLM + RAG pattern matching
Decide	Select optimal action	Priority algorithm: Impact × Probability / Cost
Act	Execute command	Sandboxed shell execution
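A minimal sketch of how the four phases in the table could be chained into a recursive loop. All function names are hypothetical, and the phases are stubbed with plain Python functions instead of real scanners or an LLM.

```python
def ooda_loop(state, observe, orient, decide, act, max_iterations=10):
    """Run Observe -> Orient -> Decide -> Act until no action remains."""
    history = []
    for _ in range(max_iterations):
        observation = observe(state)          # Observe: gather data
        candidates = orient(observation)      # Orient: rank possible actions
        action = decide(candidates)           # Decide: pick one, or stop
        if action is None:
            break
        state = {**state, **act(action)}      # Act, then feed the result back
        history.append(action["name"])
    return state, history


# Stub phases for illustration only.
def observe(state):
    return dict(state)


def orient(observation):
    pending = [name for name, done in observation.items() if not done]
    return [{"name": name, "score": 1.0 / (i + 1)} for i, name in enumerate(pending)]


def decide(candidates):
    return max(candidates, key=lambda c: c["score"], default=None)


def act(action):
    return {action["name"]: True}


final_state, history = ooda_loop(
    {"scan": False, "analyze": False}, observe, orient, decide, act
)
```

The `max_iterations` cap is an added safeguard so that a loop driven by model output always terminates.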

2. LLMs as Probabilistic Reasoning Engines

Chain-of-Thought (CoT) Prompting

Forces model to articulate reasoning before action:

Thought: Target runs Jenkins → Plan: Try default creds → Command: curl -u admin:admin

Benefit: Reduces hallucinations by 40-60%
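One way the orchestrator might consume a CoT response is to parse it into fields and reject anything that skips the reasoning steps. This is a sketch that assumes the single-line "Thought → Plan → Command" layout shown above; `parse_cot` is a hypothetical helper.

```python
import re

# Assumes the single-line "Thought: ... → Plan: ... → Command: ..." layout.
STEP_PATTERN = re.compile(
    r"Thought:\s*(?P<thought>.+?)\s*→\s*"
    r"Plan:\s*(?P<plan>.+?)\s*→\s*"
    r"Command:\s*(?P<command>.+)"
)


def parse_cot(response: str) -> dict:
    """Split a CoT response into thought, plan, and command fields."""
    match = STEP_PATTERN.search(response)
    if match is None:
        raise ValueError("response does not follow Thought → Plan → Command")
    return {key: value.strip() for key, value in match.groupdict().items()}


parsed = parse_cot(
    "Thought: Target runs Jenkins → Plan: Try default creds → Command: curl -u admin:admin"
)
```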

In-Context Learning

System prompt defines persona
Few-shot examples guide behavior
Generalizes to novel targets

Key Capabilities

Natural language understanding
Code generation (Python, Bash)
Pattern recognition in outputs

3. Retrieval-Augmented Generation (RAG)

Problem: Knowledge Cutoff

LLMs lack information about vulnerabilities discovered post-training

Solution: Dynamic Knowledge Injection

Detect software version (e.g., "Apache Struts 2.5.12")
Query vector database for CVEs
Inject relevant documents into LLM context
Generate exploitation strategy
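The retrieval and injection steps above can be sketched with a toy in-memory index. This is not the ChromaDB or Pinecone API: a bag-of-words vector stands in for a learned embedding, and the knowledge base entries are placeholders, not real CVE records.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9.]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list, k: int = 1) -> list:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return ranked[:k]


def build_prompt(service: str, documents: list) -> str:
    """Inject retrieved documents into the context passed to the LLM."""
    context = "\n".join(f"- [{d['id']}] {d['text']}" for d in documents)
    return f"Detected service: {service}\nRelevant advisories:\n{context}"


# Placeholder entries (hypothetical IDs and text).
KNOWLEDGE_BASE = [
    {"id": "ADVISORY-A", "text": "placeholder advisory for apache struts 2.5.12"},
    {"id": "ADVISORY-B", "text": "placeholder advisory for openssh 7.2"},
    {"id": "ADVISORY-C", "text": "placeholder advisory for jenkins 2.60"},
]

top = retrieve("Apache Struts 2.5.12", KNOWLEDGE_BASE, k=1)
prompt = build_prompt("Apache Struts 2.5.12", top)
```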

Slide 3

Implementation Environment

Technology Stack

Core Infrastructure

LLM: GPT-4o / Claude 3.5 Sonnet
Orchestration: Python 3.11+
Containerization: Docker (Kali Linux base)
Vector DB: ChromaDB / Pinecone

Security Tools

Recon: Nmap, Masscan, Sublist3r
Web: Burp Suite API, Gobuster
Exploit: Metasploit, SQLMap
Analysis: Custom parsers

Development Environment Setup

Sandboxed Execution Environment

Rationale: Prevent unintended network access and ensure reproducibility

Docker container with restricted network policies
Tool Abstraction Layer (TAL) for command sanitization
Scope enforcement via cryptographic validation
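A minimal sketch of the two guards listed above. It assumes that "cryptographic validation" means an HMAC signature over the authorized range, which the deck does not specify. The allowlist contents and all function names are also assumptions.

```python
import hashlib
import hmac
import ipaddress
import shlex

ALLOWED_TOOLS = {"nmap", "curl", "gobuster"}  # assumed allowlist


def sign_scope(cidr: str, key: bytes) -> str:
    """Sign the authorized range when the engagement is approved."""
    return hmac.new(key, cidr.encode(), hashlib.sha256).hexdigest()


def in_scope(target_ip: str, cidr: str, signature: str, key: bytes) -> bool:
    """True only if the scope signature verifies and the IP lies in the range."""
    if not hmac.compare_digest(sign_scope(cidr, key), signature):
        return False  # scope was altered after approval
    return ipaddress.ip_address(target_ip) in ipaddress.ip_network(cidr)


def sanitize(command: str) -> list:
    """Split a command into argv and reject tools outside the allowlist."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not permitted: {command!r}")
    return argv
```

Returning an argv list lets the runner start the tool without a shell, so characters such as `;` or `|` in model output are passed as literal arguments and are not interpreted.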

Testing & Benchmarking Environments

Environment	Type	Purpose
Metasploitable 2/3	Linux VM	Network service vulnerabilities
OWASP Juice Shop	Web Application	Modern web vulnerabilities (XSS, SQLi)
VulHub	Diverse scenarios	Standardized benchmark suite
Hack The Box	Live platform	Complex attack chains

Slide 4

System Architecture

User Input (Scope Definition)
↓
ORCHESTRATOR (Central Reasoning Unit)
↓
Agent Selection & Task Decomposition
↓
AGENT SWARM (Specialized Workers)
↓
EXECUTION ENGINE (Sandboxed Tool Runner)
↓
Output Parsing & Semantic Extraction
↓
SHADOW GRAPH (Memory Update)
↓
Feedback to Orchestrator (Recursive Loop)

Component 1: The Orchestrator

Central Reasoning Unit (GPT-4o / Claude 3.5 Sonnet)

Responsibilities:

Task Planning: Decompose high-level goals into atomic sub-tasks
Decision Making: Apply priority algorithm
Scope Management: Enforce authorized IP ranges

Priority Formula: Priority = Impact × Probability × (1 / Cost)
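The formula can be sketched as a small scoring helper. The deck does not define the scales, so this assumes impact on a 0-10 scale, probability in [0, 1], and cost as any positive number. `CandidateTask` and `select_next` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateTask:
    name: str
    impact: float       # assumed 0-10 scale
    probability: float  # assumed 0-1
    cost: float         # assumed positive, e.g. time or API spend

    @property
    def priority(self) -> float:
        """Priority = Impact × Probability × (1 / Cost)."""
        if self.cost <= 0:
            raise ValueError("cost must be positive")
        return self.impact * self.probability / self.cost


def select_next(tasks):
    """Pick the task with the highest priority score."""
    return max(tasks, key=lambda t: t.priority)


tasks = [
    CandidateTask("high impact, unlikely", impact=9, probability=0.2, cost=3),
    CandidateTask("moderate impact, likely", impact=5, probability=0.8, cost=2),
]
best = select_next(tasks)
```

Here the second task wins (5 × 0.8 / 2 = 2.0 against roughly 0.6): a likely, cheap action can outrank a higher-impact one.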

Component 2: Agent Swarm (Specialized Workers)

Reconnaissance Agent

Tools: Nmap, Masscan
Prompt: Broad, inquisitive
Goal: Map attack surface

Web Assessment Agent

Tools: Burp, Gobuster
Understands: HTTP, DOM
Goal: Web-specific vulns

Slide 5

Proposed Methodology

Stage 1: Autonomous Reconnaissance

Input: Target scope (IP range / Domain)

Step 1: Initial Scope Analysis
→ If Domain: DNS enumeration
→ If IP: Port scanning

Step 2: Iterative Scanning
→ Fast scan: nmap -F -T4 (Top 100 ports)
→ Analyze results: "Port 80, 443 open"
→ Deep scan: nmap -sV -sC -p 80,443

Output: Service version details → Shadow Graph update
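A sketch of the step between the fast scan and the deep scan. It assumes the fast scan is also run with Nmap's grepable output option (`-oG`), which the commands above do not show. The sample line and the helper names are illustrative.

```python
import re

# Matches entries such as "80/open/tcp//http///" in a grepable "Ports:" field.
PORT_ENTRY = re.compile(r"(\d+)/open/tcp/")


def open_tcp_ports(grepable_line: str) -> list:
    """Extract open TCP port numbers from one line of `nmap -oG` output."""
    return [int(port) for port in PORT_ENTRY.findall(grepable_line)]


def deep_scan_argv(target: str, ports: list) -> list:
    """Build the argument list for the follow-up service scan."""
    port_list = ",".join(str(p) for p in sorted(ports))
    return ["nmap", "-sV", "-sC", "-p", port_list, target]


# Illustrative sample line, not captured from a real scan.
SAMPLE = (
    "Host: 10.0.0.5 ()\tPorts: 22/closed/tcp//ssh///, "
    "80/open/tcp//http///, 443/open/tcp//https///"
)

ports = open_tcp_ports(SAMPLE)
argv = deep_scan_argv("10.0.0.5", ports)
```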

Stage 2: Recursive Vulnerability Analysis

Hypothesis Generation: LLM queries Shadow Graph for untested services
RAG Retrieval: Query CVE database for known vulnerabilities
Exploit Planning: Generate custom test payload
Execution: Run exploit via Execution Engine
Verification: Parse output for success indicators
Graph Update: Store confirmed vulnerability with proof
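The control flow of this stage can be sketched as one pass over the untested services in the graph. The graph is a plain dictionary here. Retrieval and verification are passed in as functions (`lookup`, `verify`) and stubbed, so the sketch covers only the bookkeeping. All names are hypothetical.

```python
def analyze_services(graph, lookup, verify):
    """Run one Stage 2 pass over every service not yet marked as tested."""
    confirmed = []
    for service in graph["services"]:
        if service["tested"]:
            continue  # hypothesis generation considers untested services only
        for advisory_id in lookup(service["name"], service["version"]):
            evidence = verify(service, advisory_id)
            if evidence is not None:
                confirmed.append(
                    {"service": service["name"], "advisory": advisory_id, "proof": evidence}
                )
        service["tested"] = True
    graph["findings"].extend(confirmed)  # graph update with proof attached
    return confirmed


# Stubs standing in for RAG retrieval and sandboxed verification.
def lookup(name, version):
    return ["ADVISORY-A"] if name == "http" else []


def verify(service, advisory_id):
    return "stub evidence"


graph = {
    "services": [
        {"name": "http", "version": "1.0", "tested": False},
        {"name": "ssh", "version": "2.0", "tested": True},
    ],
    "findings": [],
}
first_pass = analyze_services(graph, lookup, verify)
second_pass = analyze_services(graph, lookup, verify)
```

Marking each service as tested is what lets the recursive loop terminate: a second pass over the same graph finds nothing new to do.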

Slide 6

Database Design / Dataset Description

1. Shadow Graph Schema (Neo4j)

Node: Target

Attributes:
- id: UUID
- ip_address: String
- scan_status: Enum (pending, in_progress, complete)

Node: Service

Attributes:
- service_name: String
- version: String
- cpe: String
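A sketch of how the Target node could be represented in Python and turned into a parameterized Cypher statement. It only builds the query text and parameters and does not call a Neo4j driver. The class and method names are assumptions.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class ScanStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETE = "complete"


@dataclass
class TargetNode:
    ip_address: str
    scan_status: ScanStatus = ScanStatus.PENDING
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def merge_query(self):
        """Return a Cypher MERGE statement and its parameters."""
        query = (
            "MERGE (t:Target {id: $id}) "
            "SET t.ip_address = $ip_address, t.scan_status = $scan_status"
        )
        params = {
            "id": self.id,
            "ip_address": self.ip_address,
            "scan_status": self.scan_status.value,
        }
        return query, params


node = TargetNode("10.0.0.5")
query, params = node.merge_query()
```

Passing values as parameters keeps scan-derived strings out of the query text.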

2. RAG Vector Database (ChromaDB)

CVE Database: Embeddings of 200,000+ CVE descriptions
Exploit Methodologies: OWASP Top 10, MITRE ATT&CK

Slide 7

Input Design (User Interface)

1. Scope Definition Interface

Primary Input Form

Target Specification: IP address, CIDR range, or domain name
Scan Intensity: Slider (Stealth → Normal → Aggressive)
Test Scope: Checkboxes (Web, Network, DB)
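The Target Specification field accepts three input kinds, so the form needs to tell them apart before anything is scheduled. A minimal sketch, with `classify_target` as a hypothetical helper and a deliberately simple domain pattern:

```python
import ipaddress
import re

# Simplified hostname pattern; it does not cover internationalized names.
DOMAIN = re.compile(
    r"^(?=.{1,253}$)([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$",
    re.IGNORECASE,
)


def classify_target(spec: str) -> str:
    """Return 'ip', 'cidr', or 'domain', or raise ValueError."""
    spec = spec.strip()
    try:
        ipaddress.ip_address(spec)
        return "ip"
    except ValueError:
        pass
    if "/" in spec:
        try:
            ipaddress.ip_network(spec, strict=False)
            return "cidr"
        except ValueError:
            pass
    if DOMAIN.match(spec):
        return "domain"
    raise ValueError(f"unrecognized target specification: {spec!r}")
```

For example, `classify_target("192.168.1.0/24")` returns `"cidr"`, while free text such as `"not a target!"` raises `ValueError`.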

2. Real-time Monitoring Dashboard

Live feed panel displaying agent activity, progress metrics (hosts scanned), and finding alerts.

Slide 8

Module Design

Class: PentestOrchestrator

+ initialize_engagement(scope: TargetScope)
+ plan_next_step() → Task
+ delegate_to_agent(task: Task) → Result

Class: ToolExecutor

+ execute_nmap(target, flags) → NmapResult
+ parse_output(raw_output, tool_type) → StructuredData
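A skeleton of the `PentestOrchestrator` interface listed above. The data types (`TargetScope`, `Task`, `Result`) are named in the signatures but not defined in the deck, so their fields are assumptions. Planning is reduced to a simple queue, and agents are plain callables so the sketch runs with a stub.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TargetScope:
    cidr: str  # assumed field


@dataclass
class Task:
    agent: str        # assumed: name of the agent that should handle the task
    description: str


@dataclass
class Result:
    task: Task
    output: str


class PentestOrchestrator:
    def __init__(self, agents: Dict[str, Callable[[Task], str]]):
        self._agents = agents
        self._queue: List[Task] = []
        self.scope: Optional[TargetScope] = None

    def initialize_engagement(self, scope: TargetScope) -> None:
        """Record the scope and seed the queue with a reconnaissance task."""
        self.scope = scope
        self._queue = [Task("recon", f"map attack surface of {scope.cidr}")]

    def plan_next_step(self) -> Optional[Task]:
        """Return the next task, or None when nothing is left (an assumption)."""
        return self._queue.pop(0) if self._queue else None

    def delegate_to_agent(self, task: Task) -> Result:
        """Hand the task to the named agent and wrap its output."""
        return Result(task, self._agents[task.agent](task))


orchestrator = PentestOrchestrator({"recon": lambda task: "stub output"})
orchestrator.initialize_engagement(TargetScope("10.0.0.0/24"))
step = orchestrator.plan_next_step()
result = orchestrator.delegate_to_agent(step)
```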

Slide 9

UML Diagrams

1. System Component Diagram

[Figure: System component diagram]

2. Core Entity Relationships

Target

+ ip_address: String
+ status: ScanStatus

+ add_port(Port)
+ get_vulnerabilities()

Port

+ number: Integer
+ state: PortState

+ assign_service(Service)

Service

+ name: String
+ version: String

+ check_vulns_in_kb()

Vulnerability

+ cve_id: String
+ cvss: Float

+ generate_exploit()

Relationships: Target [1] → [*] Port → [1] Service → [*] Vulnerability
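The relationship chain can be sketched with dataclasses. `check_vulns_in_kb()` and `generate_exploit()` are left out because the deck does not define their behaviour. The status and state fields are plain strings in this sketch, and the CVE ID is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Vulnerability:
    cve_id: str
    cvss: float


@dataclass
class Service:
    name: str
    version: str
    vulnerabilities: List[Vulnerability] = field(default_factory=list)


@dataclass
class Port:
    number: int
    state: str
    service: Optional[Service] = None  # each port runs at most one service

    def assign_service(self, service: Service) -> None:
        self.service = service


@dataclass
class Target:
    ip_address: str
    status: str = "pending"
    ports: List[Port] = field(default_factory=list)

    def add_port(self, port: Port) -> None:
        self.ports.append(port)

    def get_vulnerabilities(self) -> List[Vulnerability]:
        """Walk Target -> Port -> Service -> Vulnerability."""
        return [
            vuln
            for port in self.ports
            if port.service is not None
            for vuln in port.service.vulnerabilities
        ]


target = Target("10.0.0.5")
web = Port(80, "open")
web.assign_service(Service("http", "1.0", [Vulnerability("CVE-0000-0000", 7.5)]))
target.add_port(web)
target.add_port(Port(22, "open"))  # no service identified yet
found = target.get_vulnerabilities()
```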

Slide 10

Key Contributions & Evaluation Metrics

Metric	Traditional Scanner	LLM Agent (Proposed)
Vulnerability Coverage	65%	89%
False Positive Rate	35%	8%

Cost Efficiency

Human Engagement: $2,500+ vs. Automated Engagement: $96

Thank You

Questions & Discussion

Base Paper: PentestGPT (USENIX Security 2024)
arXiv:2308.06782