---
name: pentest-r1-autonomous-penetration-testing
description: Two-stage reinforcement learning framework for training LLMs to perform autonomous penetration testing and CTF challenges
triggers:
- train a penetration testing AI model
- use pentest-r1 for autonomous security testing
- setup reinforcement learning for pentesting
- run pentest-r1 offline training
- configure pentest-r1 online RL environment
- train LLM on CTF challenges
- build autonomous penetration testing agent
- setup intercode-ctf docker environment
---
# Pentest-R1 Autonomous Penetration Testing
> Skill by [ara.so](https://ara.so), Security Skills collection

Pentest-R1 is a two-stage reinforcement learning framework that trains Large Language Models for autonomous penetration testing. It combines offline RL on expert walkthroughs with online RL in interactive CTF environments to develop robust attack reasoning capabilities.
## Installation
### Prerequisites
- Python 3.11.11
- Docker (for Stage 2 and reproducible environments)
- NVIDIA Container Toolkit (for GPU support)
- CUDA 12.4 runtime (optional, for GPU acceleration)
### Basic Setup
```bash
git clone https://github.com/KHenryAegis/Pentest-R1.git
cd Pentest-R1
pip install -r requirements.txt
```
### Docker Environment (Recommended for Reproducibility)
```bash
# Build the reproducible research environment
source setup-docker.sh
# Run container with optimized cache mounting
docker run --rm -it \
--name pentest-r1 \
-v "$(pwd)":/root/Pentest-R1 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v ~/.cache/triton:/root/.cache/triton \
-v ~/.cache/torch_extensions:/root/.cache/torch_extensions \
-w /root/Pentest-R1 \
--gpus all \
--net=host \
pentest-r1:ubuntu22.04
```
## Core Training Pipeline
### Stage 1: Offline Reinforcement Learning
Stage 1 trains the base LLM on a curated dataset of 500+ real-world expert penetration testing walkthroughs.
```bash
python grpo_stage1.py
```
**Key configuration in `grpo_stage1.py`:**
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from unsloth import FastLanguageModel
import torch

# Model configuration
model_name = "unsloth/Meta-Llama-3.1-8B-Instruct"
max_seq_length = 4096
dtype = None  # Auto-detect
load_in_4bit = True  # Use 4-bit quantization

# Load model with unsloth optimization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# PEFT configuration for efficient training
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
```
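To get a feel for what the LoRA settings above imply, the following sketch counts the adapter parameters they add. The layer shapes are assumptions based on the published Llama 3.1 8B configuration (hidden size 4096, 32 layers, 8 key/value heads of dimension 128), and `lora_param_count` is a hypothetical helper, not part of the repository.

```python
# Assumed (in_features, out_features) of each adapted projection in Llama 3.1 8B.
PROJECTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 32  # assumed number of transformer layers


def lora_param_count(rank, shapes=PROJECTION_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted weight gains two matrices: A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


rank, lora_alpha = 16, 32
total = lora_param_count(rank)
print(f"LoRA parameters: {total:,}")
print(f"Scaling factor (lora_alpha / r): {lora_alpha / rank}")
```

Under these assumptions the adapters add roughly 13.6 million trainable parameters, a small fraction of the 8B base model, and the adapter output is scaled by `lora_alpha / r = 2`.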
### Stage 2: Online Reinforcement Learning
Stage 2 fine-tunes the model in interactive CTF environments with real-time feedback.
**Setup CTF environment:**
```bash
cd train_ctf_env
docker build -t intercode-ctf .
cd ..
```
**Run Stage 2 training:**
```bash
python grpo_multi_turn_stage2.py
```
## Key Components
### Data Loading and Preprocessing
```python
from datasets import load_dataset

# Load expert walkthrough dataset
dataset = load_dataset("json", data_files="path/to/expert_walkthroughs.jsonl")

# Example dataset format
# {
#   "challenge": "SQL Injection in login form",
#   "steps": [
#     {"action": "reconnaissance", "command": "sqlmap -u http://target/login", "reasoning": "..."},
#     {"action": "exploit", "command": "sqlmap --dump", "reasoning": "..."}
#   ],
#   "flag": "CTF{...}"
# }


def preprocess_function(examples):
    """Format a batch of walkthroughs into prompt/response pairs."""
    prompts = []
    responses = []
    for challenge, steps in zip(examples["challenge"], examples["steps"]):
        prompt = f"Challenge: {challenge}\nWhat are the steps to solve this?"
        response = "\n".join(
            f"Step {i+1}: {step['reasoning']}\nCommand: {step['command']}"
            for i, step in enumerate(steps)
        )
        prompts.append(prompt)
        responses.append(response)
    return {"prompt": prompts, "response": responses}


# Adds "prompt" and "response" columns; no tokenization happens at this step
formatted_dataset = dataset.map(preprocess_function, batched=True)
```
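Before mapping over the dataset, it can help to confirm that each record matches the format shown in the comment above. The following is a minimal sketch: `validate_record` is a hypothetical helper, and the required keys are inferred only from the example record, not from the actual dataset schema.

```python
import json

# Keys each step is assumed to carry, based on the example record format.
REQUIRED_STEP_KEYS = {"action", "command", "reasoning"}


def validate_record(record):
    """Return a list of problems found in one walkthrough record (empty if OK)."""
    problems = []
    if not isinstance(record.get("challenge"), str):
        problems.append("missing or non-string 'challenge'")
    steps = record.get("steps")
    if not isinstance(steps, list) or not steps:
        problems.append("'steps' must be a non-empty list")
        return problems
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {i} is not an object")
            continue
        missing = REQUIRED_STEP_KEYS - set(step)
        if missing:
            problems.append(f"step {i} missing keys: {sorted(missing)}")
    return problems


# One JSONL line whose only step lacks the "reasoning" field
line = json.dumps({
    "challenge": "SQL Injection in login form",
    "steps": [{"action": "exploit", "command": "sqlmap --dump"}],
    "flag": "CTF{...}",
})
print(validate_record(json.loads(line)))
```

Records that fail validation would otherwise raise a `KeyError` inside `preprocess_function`, so filtering them out first gives a clearer error report.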
### Reward Model Configuration
```python
class PentestRewardModel:
    """Reward model for evaluating penetration testing actions"""

    def __init__(self):
        self.success_reward = 1.0
        self.partial_reward = 0.5
        self.failure_penalty = -0.1

    def calculate_reward(self, action, environment_feedback):
        """Calculate reward based on action outcome"""
        if "flag" in environment_feedback.lower():
            return self.success_reward
        elif "error" in environment_feedback.lower():
            return self.failure_penalty
        elif "progress" in environment_feedback.lower():
            return self.partial_reward
        return 0.0


reward_model = PentestRewardModel()
```
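The substring checks above are easy to trip by accident: output such as `No flag found` contains `flag` and would earn the full success reward. One possible tightening, sketched below, matches a `CTF{...}` flag pattern with a regular expression instead. The pattern and the `contains_flag` helper are assumptions for illustration; the actual flag format depends on the challenge set.

```python
import re

# Assumed flag format: "CTF{" followed by one or more non-brace characters and "}"
FLAG_PATTERN = re.compile(r"CTF\{[^{}]+\}")


def contains_flag(environment_feedback):
    """Return True only if the feedback contains a complete flag token."""
    return FLAG_PATTERN.search(environment_feedback) is not None


print(contains_flag("No flag found in /tmp"))
print(contains_flag("Contents: CTF{example_value}"))
```

The first call reports no flag even though the word appears in the output, while the second matches the complete token.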
### Interacting with CTF Environment
```python
import docker


class CTFEnvironment:
    """Wrapper for InterCode-CTF Docker environment"""

    def __init__(self, image_name="intercode-ctf"):
        self.client = docker.from_env()
        self.image_name = image_name
        self.container = None

    def start(self, challenge_id):
        """Start a CTF challenge container"""
        self.container = self.client.containers.run(
            self.image_name,
            detach=True,
            environment={"CHALLENGE_ID": challenge_id},
            network_mode="host",
            remove=True,  # container is deleted automatically once stopped
        )
        return self.container

    def execute_command(self, command):
        """Execute a command in the container"""
        if not self.container:
            raise RuntimeError("Container not started")
        exec_result = self.container.exec_run(command)
        return {
            # Replace undecodable bytes instead of raising on binary output
            "stdout": exec_result.output.decode("utf-8", errors="replace"),
            "exit_code": exec_result.exit_code,
        }

    def cleanup(self):
        """Stop the container (remove=True then deletes it)"""
        if self.container:
            self.container.stop()
            self.container = None


# Usage example
env = CTFEnvironment()
env.start(challenge_id="sql_injection_001")
try:
    result = env.execute_command("sqlmap -u http://localhost/login --batch")
    print(result["stdout"])
finally:
    env.cleanup()  # always stop the container, even if the command fails
```
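Commands generated by a model can hang, for example when an interactive tool waits for input. One way to bound each step is to wrap the command in the coreutils `timeout` utility before handing it to `exec_run`. This is a sketch under the assumption that the container image ships `timeout` and `sh`; `build_exec_argv` is a hypothetical helper.

```python
def build_exec_argv(command, timeout_s=30):
    """Wrap a shell command so it is terminated after timeout_s seconds."""
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    # `sh -c` keeps pipes and redirects inside the command working;
    # returning a list means the command string is not split again.
    return ["timeout", str(timeout_s), "sh", "-c", command]


argv = build_exec_argv("cat /etc/passwd | head -n 5", timeout_s=10)
print(argv)
```

The resulting list can be passed as the `cmd` argument of `container.exec_run`, which accepts either a string or a list. GNU `timeout` exits with status 124 when the limit is hit, so the returned `exit_code` can be used to detect a timed-out step.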
## Training Configuration
### GRPO (Group Relative Policy Optimization) Settings
```python
from transformers import TrainingArguments

# Generic Hugging Face training arguments. TRL's GRPOConfig subclasses
# TrainingArguments, so the same fields apply when configuring a GRPO trainer.
training_args = TrainingArguments(
    output_dir="./pentest-r1-checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    save_total_limit=3,
    fp16=True,  # Mixed precision training
    report_to="wandb",  # Optional: integration with Weights & Biases
    remove_unused_columns=False,
)
```
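The "group relative" part of GRPO refers to how advantages are computed: several completions are sampled for the same prompt, and each completion's reward is normalized against the mean and standard deviation of its group. A minimal sketch of that normalization follows, using plain Python and made-up rewards. It illustrates the general technique and is not taken from the repository's training scripts.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    # Population standard deviation is used here; implementations differ on
    # whether they use the population or the sample estimate.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled completions of one prompt
rewards = [1.0, 0.5, 0.0, -0.1]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Completions that beat their group's average receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.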
### Environment Variables
```python
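import os

# Model and training configuration
# Copy the token into HF_TOKEN only when it is set; assigning None to
# os.environ raises TypeError.
hf_token = os.getenv("HUGGINGFACE_TOKEN")
if hf_token:
    os.environ["HF_TOKEN"] = hf_token  # For model downloads
# WANDB_API_KEY is read directly from the environment; export it in your shell
if not os.getenv("WANDB_API_KEY"):
    print("WANDB_API_KEY is not set; Weights & Biases logging may prompt for login")
# GPU selection; must be set before CUDA is initialized
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Cache directories (mounted in Docker)
os.environ["HF_HOME"] = "/root/.cache/huggingface"
os.environ["TRITON_CACHE_DIR"] = "/root/.cache/triton"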
```
## Common Patterns
### Multi-Turn Reasoning
```python
class MultiTurnAgent:
    """Agent for multi-turn penetration testing reasoning"""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.conversation_history = []

    def generate_action(self, observation):
        """Generate next action based on current observation"""
        # Build prompt with conversation history
        prompt = self._build_prompt(observation)
        # Move the input tensors to the same device as the model
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            do_sample=True,
            top_p=0.95,
        )
        # generate() returns prompt + completion; decode only the new tokens
        prompt_length = inputs["input_ids"].shape[1]
        action = self.tokenizer.decode(
            outputs[0][prompt_length:], skip_special_tokens=True
        ).strip()
        self.conversation_history.append({
            "observation": observation,
            "action": action,
        })
        return action

    def _build_prompt(self, observation):
        """Build prompt with conversation history"""
        prompt = "You are a penetration testing expert. Analyze and exploit:\n\n"
        for turn in self.conversation_history[-3:]:  # Last 3 turns
            prompt += f"Observation: {turn['observation']}\n"
            prompt += f"Action: {turn['action']}\n\n"
        prompt += f"Current Observation: {observation}\n"
        prompt += "Next Action:"
        return prompt
```
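Tool output from scanners can run to thousands of lines, which would crowd the history out of a 4096-token context. A common workaround is to keep only the head and tail of a long observation before building the prompt. The sketch below does this by character count; `truncate_observation` and its default limit are hypothetical choices, not part of Pentest-R1.

```python
def truncate_observation(text, max_chars=2000):
    """Keep the start and end of long output, marking the omitted middle."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    omitted = len(text) - 2 * half
    head = text[:half]
    tail = text[len(text) - half:]
    return f"{head}\n[... {omitted} characters omitted ...]\n{tail}"


long_output = "A" * 1500 + "B" * 1500
short = truncate_observation(long_output, max_chars=1000)
print(len(long_output), "->", len(short))
```

Keeping both ends tends to preserve the command banner at the top and the final status or error at the bottom, which are often the most informative parts of tool output.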
### Evaluation Loop
```python
def evaluate_on_ctf_challenges(model, tokenizer, challenge_set):
    """Evaluate model on a set of CTF challenges"""
    results = []
    max_turns = 20
    for challenge in challenge_set:
        # Fresh agent per challenge so history does not leak between challenges
        agent = MultiTurnAgent(model, tokenizer)
        env = CTFEnvironment()
        env.start(challenge["id"])
        solved = False
        turns_used = 0
        try:
            # Initial observation: the challenge description
            observation = env.execute_command(
                "cat /challenge/description.txt"
            )["stdout"]
            for turn in range(max_turns):
                turns_used = turn + 1
                # Generate action
                action = agent.generate_action(observation)
                # Execute action
                result = env.execute_command(action)
                # Check for success
                if "CTF{" in result["stdout"]:
                    solved = True
                    break
                # Feed the command output back as the next observation
                observation = result["stdout"]
        finally:
            env.cleanup()  # stop the container even if a step raises
        results.append({
            "challenge_id": challenge["id"],
            "solved": solved,
            "turns": turns_used,
        })
    return results
```
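The list returned by `evaluate_on_ctf_challenges` can be reduced to a couple of headline numbers. The sketch below computes the solve rate and the mean number of turns on solved challenges; `summarize_results` is a hypothetical helper and the sample results are made up for illustration.

```python
def summarize_results(results):
    """Aggregate per-challenge results into solve rate and mean turns."""
    if not results:
        return {"solve_rate": 0.0, "mean_turns_when_solved": None}
    solved = [r for r in results if r["solved"]]
    mean_turns = sum(r["turns"] for r in solved) / len(solved) if solved else None
    return {
        "solve_rate": len(solved) / len(results),
        "mean_turns_when_solved": mean_turns,
    }


# Made-up results in the same shape the evaluation loop produces
sample = [
    {"challenge_id": "a", "solved": True, "turns": 4},
    {"challenge_id": "b", "solved": False, "turns": 20},
    {"challenge_id": "c", "solved": True, "turns": 8},
    {"challenge_id": "d", "solved": False, "turns": 20},
]
print(summarize_results(sample))
```

Unsolved challenges are excluded from the turn average because they always run to `max_turns` and would otherwise dominate it.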
## Troubleshooting
### CUDA Out of Memory
```python
# Use gradient checkpointing
model.gradient_checkpointing_enable()
# Reduce batch size
training_args.per_device_train_batch_size = 1
training_args.gradient_accumulation_steps = 8
# Use 4-bit quantization (pass this flag to FastLanguageModel.from_pretrained
# when loading the model; setting it afterwards has no effect)
load_in_4bit = True
```
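Changing the batch size and accumulation steps also changes the effective batch size, which is the product of per-device batch size, accumulation steps, and number of devices. The arithmetic below compares the values from the training configuration (4 and 4) with the reduced-memory values (1 and 8); `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


default = effective_batch_size(per_device=4, grad_accum=4)
reduced = effective_batch_size(per_device=1, grad_accum=8)
matched = effective_batch_size(per_device=1, grad_accum=16)
print(default, reduced, matched)
```

With the values shown, the effective batch size drops from 16 to 8. Raising `gradient_accumulation_steps` to 16 would keep it at 16, at the cost of more forward and backward passes per optimizer step.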
### Docker Container Issues
```bash
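# Check container logs. `docker logs` takes a container name or ID, not an
# image name, so look up the running container first.
docker ps --filter ancestor=intercode-ctf
docker logs <container-id>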
# Verify network connectivity
docker run --rm --net=host intercode-ctf ping -c 4 localhost
# Rebuild with no cache if issues persist
docker build --no-cache -t intercode-ctf train_ctf_env/
```
### Unsloth Version Compatibility
The research used `unsloth==2025.5.10`. If unavailable, the closest match is specified in `requirements.txt`:
```txt
unsloth_zoo==2025.5.11
unsloth @ git+https://github.com/unslothai/unsloth.git@45f26cda996ec0b9a2e28cb18a03251095aa29e8
```
### Model Loading Errors
```python
import os
import shutil

import torch
from unsloth import FastLanguageModel

# If model fails to load, try without quantization first
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=torch.float16,
    load_in_4bit=False,
)

# Clear cache if needed. This deletes every cached Hugging Face download,
# so all models are fetched again on next use.
shutil.rmtree(os.path.expanduser("~/.cache/huggingface"), ignore_errors=True)
```
## Performance Optimization
### Cache Mounting for Faster Training
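Mount the Hugging Face, Triton, and PyTorch extension caches when using Docker so that downloaded models and compiled kernels persist across container runs (the container is started with `--rm`, so anything not mounted is lost on exit):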
```bash
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v ~/.cache/triton:/root/.cache/triton \
-v ~/.cache/torch_extensions:/root/.cache/torch_extensions
```
### Distributed Training
```python
import os

# For multi-GPU training
training_args.ddp_find_unused_parameters = False
# torchrun sets LOCAL_RANK for each worker process; -1 means not distributed
training_args.local_rank = int(os.environ.get("LOCAL_RANK", -1))
# Launch with torchrun
# torchrun --nproc_per_node=4 grpo_stage1.py
```
## References
- Paper: [arXiv:2508.07382](https://arxiv.org/abs/2508.07382)
- Base model: Meta-Llama-3.1-8B-Instruct
- Framework: Unsloth for optimized fine-tuning
- Environment: InterCode-CTF for interactive training