Autonomous Cognitive Architecture for Offensive Cyberspace Operations

A Recursive Large Language Model Framework for Automated Vulnerability Assessment and Penetration Testing

Final Year Engineering Project

Slide 1

Literature Review

Evolution of Automated Penetration Testing

Generation 1: Scripted Automation

  • Tools: Metasploit Pro, Core Impact
  • Deterministic execution
  • Limited adaptability

Generation 2: ML Classifiers

  • Supervised learning for traffic analysis
  • Defensive focus (IDS/IPS)
  • Lacks generative capability

Generation 3: Generative AI & Cognitive Agents (Current)

Key Innovation: LLMs as reasoning cores enable generalization and contextual understanding. Systems can now adapt to unseen scenarios and generate custom exploitation code.

Seminal Research Frameworks

PentestGPT (Deng et al., 2023)

Contribution: Introduced modular design separating reasoning from code generation

  • Reasoning Module: High-level strategy and decision-making
  • Generation Module: Low-level command synthesis
  • Parsing Module: Output interpretation

Limitation: Requires a human in the loop to relay commands and tool output between the model and the target environment

PentestAgent (Shen et al., 2024)

Innovation: Multi-Agent System (MAS) with specialized agents

  • Recon Agent, Web Assessment Agent, Exploitation Agent
  • Introduced "Shadow Graph" for inter-agent communication
  • Enables collaborative intelligence

AutoPentest (Henke et al., 2025)

Focus: Attack tree traversal and cost-effectiveness

  • Decision-tree approach for systematic testing
  • Cost analysis: $96 vs. thousands for human testers

PenHeal (Huang & Zhu, 2024)

Focus: Automated remediation phase

  • Counterfactual prompting for root cause analysis
  • Generates actionable fix recommendations

Slide 2

Theoretical Background

1. OODA Loop as Cognitive Framework

OBSERVE → ORIENT → DECIDE → ACT
↓
Feedback Loop (Recursive)

Phase   | Description                    | Implementation
Observe | Gather environmental data      | Execute scanners (Nmap, curl)
Orient  | Analyze against knowledge base | LLM + RAG pattern matching
Decide  | Select optimal action          | Priority algorithm: Impact × Probability / Cost
Act     | Execute command                | Sandboxed shell execution

2. LLMs as Probabilistic Reasoning Engines

Chain-of-Thought (CoT) Prompting

Forces model to articulate reasoning before action:

Thought: Target runs Jenkins → Plan: Try default creds → Command: curl -u admin:admin

Benefit: Reduces hallucinations by 40-60%
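
A structured Thought / Plan / Command response is only useful if the orchestrator can split it back into fields and refuse output that skips the reasoning. The sketch below assumes the model emits the three labels exactly as in the example above, separated by arrows or newlines; `ReasoningStep` and `parse_step` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Capture each labelled field up to the next label (or the end of the text).
_FIELD = re.compile(
    r"(Thought|Plan|Command):\s*(.*?)"
    r"(?=\s*(?:→\s*)?(?:Thought|Plan|Command):|$)",
    re.DOTALL,
)


@dataclass(frozen=True)
class ReasoningStep:
    thought: str
    plan: str
    command: str


def parse_step(text: str) -> ReasoningStep:
    """Split a CoT response into its fields; reject it if any field is missing."""
    fields = {m.group(1).lower(): m.group(2).strip() for m in _FIELD.finditer(text)}
    missing = {"thought", "plan", "command"} - fields.keys()
    if missing:
        raise ValueError(f"response is missing fields: {sorted(missing)}")
    return ReasoningStep(**fields)
```

Rejecting a response that contains a command but no stated thought or plan is one simple way to enforce "reasoning before action" at the parsing layer.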

In-Context Learning

  • System prompt defines persona
  • Few-shot examples guide behavior
  • Generalizes to novel targets

Key Capabilities

  • Natural language understanding
  • Code generation (Python, Bash)
  • Pattern recognition in outputs

3. Retrieval-Augmented Generation (RAG)

Problem: Knowledge Cutoff

LLMs lack information about vulnerabilities discovered post-training

Solution: Dynamic Knowledge Injection

  1. Detect software version (e.g., "Apache Struts 2.5.12")
  2. Query vector database for CVEs
  3. Inject relevant documents into LLM context
  4. Generate exploitation strategy
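
Steps 2 and 3 can be illustrated without a vector database. The sketch below swaps in a bag-of-words cosine similarity for learned embeddings, and the knowledge-base entries are placeholders rather than real CVE records; `retrieve`, `build_prompt`, and the `DOC-n` identifiers are hypothetical. A real deployment would query ChromaDB or Pinecone instead.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': token counts. Stands in for a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9.]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: dict[str, str], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(
        documents, key=lambda doc_id: cosine(q, embed(documents[doc_id])), reverse=True
    )
    return ranked[:k]


def build_prompt(query: str, documents: dict[str, str], k: int = 2) -> str:
    """Inject the retrieved documents into the context passed to the LLM."""
    context = "\n".join(f"[{i}] {documents[i]}" for i in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nDetected software: {query}"


# Placeholder knowledge base; the entries are illustrative, not real advisories.
KB = {
    "DOC-1": "Advisory affecting Apache Struts 2.5.12 (placeholder text)",
    "DOC-2": "Hardening notes for OpenSSH 8.9 (placeholder text)",
    "DOC-3": "Advisory affecting Apache Tomcat 9.0 (placeholder text)",
}
```

Calling `retrieve("Apache Struts 2.5.12", KB)` ranks `DOC-1` first because it shares the product name and the version token with the query.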

Slide 3

Implementation Environment

Technology Stack

Core Infrastructure

  • LLM: GPT-4o / Claude 3.5 Sonnet
  • Orchestration: Python 3.11+
  • Containerization: Docker (Kali Linux base)
  • Vector DB: ChromaDB / Pinecone

Security Tools

  • Recon: Nmap, Masscan, Sublist3r
  • Web: Burp Suite API, Gobuster
  • Exploit: Metasploit, SQLMap
  • Analysis: Custom parsers

Development Environment Setup

Sandboxed Execution Environment

Rationale: Prevent unintended network access and ensure reproducibility

  • Docker container with restricted network policies
  • Tool Abstraction Layer (TAL) for command sanitization
  • Scope enforcement via cryptographic validation
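
The sanitization and scope bullets can be sketched as two small guards that run before anything reaches the sandbox. This is an assumption-laden illustration: the allowlist contents, `sanitize`, `in_scope`, and `CommandRejected` are hypothetical, and the cryptographic validation mentioned above is not modeled because the slides do not specify it.

```python
import ipaddress
import shlex

# Hypothetical allowlist; the slides do not enumerate what the TAL permits.
ALLOWED_TOOLS = frozenset({"nmap", "curl", "gobuster"})
# Characters that would let a command chain, redirect, or substitute in a shell.
FORBIDDEN_CHARS = frozenset(";|&`$<>\n")


class CommandRejected(ValueError):
    """Raised when a proposed command fails sanitization."""


def sanitize(command: str) -> list[str]:
    """Return an argv list for an allowlisted tool, or raise CommandRejected."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        raise CommandRejected("shell metacharacter in command")
    try:
        argv = shlex.split(command)
    except ValueError as exc:  # for example, unbalanced quotes
        raise CommandRejected(str(exc)) from exc
    if not argv or argv[0] not in ALLOWED_TOOLS:
        raise CommandRejected("tool is not on the allowlist")
    return argv


def in_scope(target: str, authorized: list[str]) -> bool:
    """True only if target is an IP address inside an authorized network."""
    try:
        address = ipaddress.ip_address(target)
    except ValueError:
        return False  # hostnames must be resolved and checked separately
    return any(
        address in ipaddress.ip_network(net, strict=False) for net in authorized
    )
```

Returning an argv list rather than a string means the executor can run the tool without a shell, so the metacharacter check is a second line of defense and not the only one.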

Testing & Benchmarking Environments

Environment        | Type              | Purpose
Metasploitable 2/3 | Linux VM          | Network service vulnerabilities
OWASP Juice Shop   | Web Application   | Modern web vulnerabilities (XSS, SQLi)
VulHub             | Diverse scenarios | Standardized benchmark suite
Hack The Box       | Live platform     | Complex attack chains

Slide 4

System Architecture

User Input (Scope Definition)
↓
ORCHESTRATOR (Central Reasoning Unit)
↓
Agent Selection & Task Decomposition
↓
AGENT SWARM (Specialized Workers)
↓
EXECUTION ENGINE (Sandboxed Tool Runner)
↓
Output Parsing & Semantic Extraction
↓
SHADOW GRAPH (Memory Update)
↓
Feedback to Orchestrator (Recursive Loop)

Component 1: The Orchestrator

Central Reasoning Unit (GPT-4o / Claude 3.5)

Responsibilities:

  • Task Planning: Decompose high-level goals into atomic sub-tasks
  • Decision Making: Apply priority algorithm
  • Scope Management: Enforce authorized IP ranges

Priority Formula: Priority = Impact × Probability × (1 / Cost)
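
The priority formula translates directly into a ranking function. In this sketch the scales are assumptions (impact on 0 to 10, probability on 0 to 1, cost as any positive number), and `CandidateAction` and `rank` are hypothetical names; the slides do not say how the three inputs are estimated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateAction:
    name: str
    impact: float       # assumed scale: 0 to 10
    probability: float  # assumed scale: 0 to 1
    cost: float         # assumed: any positive number (time, tokens, noise)

    def priority(self) -> float:
        """Priority = Impact × Probability × (1 / Cost)."""
        if self.cost <= 0:
            raise ValueError("cost must be positive")
        return self.impact * self.probability * (1 / self.cost)


def rank(actions: list[CandidateAction]) -> list[CandidateAction]:
    """Order candidate actions from highest to lowest priority."""
    return sorted(actions, key=lambda a: a.priority(), reverse=True)
```

For example, an action with impact 6, probability 0.9, and cost 1 outranks one with impact 8, probability 0.5, and cost 2, because a cheap, likely action beats a more damaging but more expensive and less certain one.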

Component 2: Agent Swarm (Specialized Workers)

Reconnaissance Agent

  • Tools: Nmap, Masscan
  • Prompt: Broad, inquisitive
  • Goal: Map attack surface

Web Assessment Agent

  • Tools: Burp, Gobuster
  • Understands: HTTP, DOM
  • Goal: Web-specific vulns

Slide 5

Proposed Methodology

Stage 1: Autonomous Reconnaissance

Input: Target scope (IP range / Domain)

Step 1: Initial Scope Analysis
→ If Domain: DNS enumeration
→ If IP: Port scanning
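
The branch in Step 1 can be sketched with the standard `ipaddress` module. The function names and the action labels `port_scan` and `dns_enumeration` are hypothetical, and the sketch assumes that any input which does not parse as an IP address or CIDR range is a domain name; a real intake form would validate domains as well.

```python
import ipaddress


def classify_target(target: str) -> str:
    """Return 'ip' for an address or CIDR range, otherwise 'domain'."""
    try:
        # strict=False accepts host addresses written with a prefix length.
        ipaddress.ip_network(target, strict=False)
        return "ip"
    except ValueError:
        return "domain"


def first_action(target: str) -> str:
    """Pick the first reconnaissance step for a scoped target."""
    return {"ip": "port_scan", "domain": "dns_enumeration"}[classify_target(target)]
```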

Step 2: Iterative Scanning
→ Fast scan: nmap -F -T4 (Top 100 ports)
→ Analyze results: "Port 80, 443 open"
→ Deep scan: nmap -sV -sC -p 80,443

Output: Service version details → Shadow Graph update
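
The hand-off from the fast scan to the deep scan in Step 2 can be sketched as a parser plus a command builder. The sketch assumes Nmap's normal output lists ports as `80/tcp open http`, and it only constructs the argument list; nothing is executed. `open_tcp_ports`, `deep_scan_argv`, and `SAMPLE` are hypothetical names.

```python
import re

# Matches port-table lines such as "80/tcp  open   http" (assumed output shape).
_OPEN_PORT = re.compile(r"^(\d+)/tcp\s+open\b", re.MULTILINE)


def open_tcp_ports(nmap_output: str) -> list[int]:
    """Extract the open TCP ports from a fast-scan report."""
    return sorted({int(m.group(1)) for m in _OPEN_PORT.finditer(nmap_output)})


def deep_scan_argv(target: str, ports: list[int]) -> list[str]:
    """Build the follow-up version and script scan, limited to the open ports."""
    if not ports:
        raise ValueError("no open ports to scan")
    if any(not 1 <= p <= 65535 for p in ports):
        raise ValueError("port out of range")
    port_list = ",".join(str(p) for p in sorted(set(ports)))
    return ["nmap", "-sV", "-sC", "-p", port_list, target]


# Illustrative fast-scan output, not captured from a real host.
SAMPLE = (
    "PORT    STATE  SERVICE\n"
    "22/tcp  closed ssh\n"
    "80/tcp  open   http\n"
    "443/tcp open   https\n"
)
```

With `SAMPLE` as input, the two functions reproduce the progression shown above: ports 80 and 443 are found open, and the deep scan is restricted to `-p 80,443`.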

Stage 2: Recursive Vulnerability Analysis

  1. Hypothesis Generation: LLM queries Shadow Graph for untested services
  2. RAG Retrieval: Query CVE database for known vulnerabilities
  3. Exploit Planning: Generate custom test payload
  4. Execution: Run exploit via Execution Engine
  5. Verification: Parse output for success indicators
  6. Graph Update: Store confirmed vulnerability with proof

Slide 6

Database Design / Dataset Description

1. Shadow Graph Schema (Neo4j)

Node: Target
Attributes:
- id: UUID
- ip_address: String
- scan_status: Enum (pending, in_progress, complete)

Node: Service
Attributes:
- service_name: String
- version: String
- cpe: String
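
The two node types can be prototyped in memory before committing to Neo4j. The sketch below mirrors the attribute names in the schema; the `ShadowGraph` class, its methods, and the Target-to-Service edge stored as a dictionary are assumptions made for illustration, not the project's graph layer.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class ScanStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETE = "complete"


@dataclass
class Service:
    service_name: str
    version: str
    cpe: str


@dataclass
class Target:
    ip_address: str
    scan_status: ScanStatus = ScanStatus.PENDING
    id: uuid.UUID = field(default_factory=uuid.uuid4)


class ShadowGraph:
    """In-memory stand-in for the graph store (hypothetical API)."""

    def __init__(self) -> None:
        self._targets: dict[uuid.UUID, Target] = {}
        self._services: dict[uuid.UUID, list[Service]] = {}

    def add_target(self, ip_address: str) -> Target:
        target = Target(ip_address)
        self._targets[target.id] = target
        self._services[target.id] = []
        return target

    def add_service(self, target_id: uuid.UUID, service: Service) -> None:
        """Record an edge from the target to a discovered service."""
        self._services[target_id].append(service)

    def services_of(self, target_id: uuid.UUID) -> list[Service]:
        return list(self._services[target_id])
```

Keeping `cpe` on the service node is what makes the later CVE lookup possible, since CPE strings are the usual join key between a detected product version and vulnerability records.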

2. RAG Vector Database (ChromaDB)

  • CVE Database: Embeddings of 200,000+ CVE descriptions
  • Exploit Methodologies: OWASP Top 10, MITRE ATT&CK

Slide 7

Input Design (User Interface)

1. Scope Definition Interface

Primary Input Form

  • Target Specification: IP address, CIDR range, or domain name
  • Scan Intensity: Slider (Stealth → Normal → Aggressive)
  • Test Scope: Checkboxes (Web, Network, DB)

2. Real-time Monitoring Dashboard

Live feed panel displaying agent activity, progress metrics (hosts scanned), and finding alerts.

Slide 8

Module Design

Class: PentestOrchestrator
+ initialize_engagement(scope: TargetScope)
+ plan_next_step() → Task
+ delegate_to_agent(task: Task) → Result

Class: ToolExecutor
+ execute_nmap(target, flags) → NmapResult
+ parse_output(raw_output, tool_type) → StructuredData
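
The `PentestOrchestrator` interface can be sketched as a task queue plus an agent registry. The method names follow the listing above; the field layout of `TargetScope`, `Task`, and `Result`, the registry of callables, and the choice to return `None` when no task remains are all assumptions of this sketch.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TargetScope:
    targets: tuple[str, ...]


@dataclass(frozen=True)
class Task:
    agent: str   # name of the agent expected to handle the task
    target: str


@dataclass(frozen=True)
class Result:
    task: Task
    summary: str


class PentestOrchestrator:
    def __init__(self, agents: dict[str, Callable[[Task], str]]) -> None:
        self._agents = agents               # agent name -> handler (stubbed here)
        self._queue: deque[Task] = deque()

    def initialize_engagement(self, scope: TargetScope) -> None:
        """Seed the queue with one reconnaissance task per scoped target."""
        self._queue = deque(Task("recon", t) for t in scope.targets)

    def plan_next_step(self) -> Optional[Task]:
        """Return the next task, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None

    def delegate_to_agent(self, task: Task) -> Result:
        """Dispatch the task to the registered agent and wrap its output."""
        handler = self._agents[task.agent]
        return Result(task, handler(task))
```

A FIFO queue is the simplest possible planner; the priority formula from the architecture slide would replace `popleft` with a selection over scored candidates.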

Slide 9

UML Diagrams

1. System Component Diagram

[Figure: System Component Diagram]

2. Core Entity Relationships

Target
+ ip_address: String
+ status: ScanStatus
+ add_port(Port)
+ get_vulnerabilities()

Port
+ number: Integer
+ state: PortState
+ assign_service(Service)

Service
+ name: String
+ version: String
+ check_vulns_in_kb()

Vulnerability
+ cve_id: String
+ cvss: Float
+ generate_exploit()

Relationships: Target [1] → [*] Port → [1] Service → [*] Vulnerability

Slide 10

Key Contributions & Evaluation Metrics

Metric                 | Traditional Scanner | LLM Agent (Proposed)
Vulnerability Coverage | 65%                 | 89%
False Positive Rate    | 35%                 | 8%
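
Both metrics reduce to simple ratios once a ground-truth list exists for the benchmark. The definitions below are assumptions, since the slides do not state them: coverage is taken as confirmed findings over known vulnerabilities, and false positive rate as incorrect findings over all reported findings. The function names are hypothetical.

```python
def vulnerability_coverage(confirmed: int, known_total: int) -> float:
    """Share of the benchmark's known vulnerabilities that were confirmed."""
    if known_total <= 0:
        raise ValueError("known_total must be positive")
    return confirmed / known_total


def false_positive_rate(false_positives: int, reported_total: int) -> float:
    """Share of reported findings that turned out to be incorrect."""
    if reported_total <= 0:
        raise ValueError("reported_total must be positive")
    return false_positives / reported_total
```

As an illustration with made-up counts, 45 confirmed findings out of 50 known vulnerabilities gives a coverage of 0.9, and 2 incorrect findings out of 25 reported gives a false positive rate of 0.08.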

Cost Efficiency

Human Engagement: $2,500+ vs. Automated Engagement: $96

Thank You

Questions & Discussion

Base Paper: PentestGPT (USENIX Security 2024)
arXiv:2308.06782
