if new_service_exploited:
    reward += 10
elif new_host_pivoted:
    reward += 50
elif privilege_escalation:
    reward += 100
elif detection_raised:
    reward -= 20
elif time_step > max_steps:
    reward -= 200  # Episode timeout penalty
The "Deep" aspect replaces traditional Q-tables (which cannot handle millions of possible network states) with deep neural networks that approximate value functions. For AutoPentest-DRL, the typical architecture includes:
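To illustrate the general idea of value-function approximation (this is not the specific AutoPentest-DRL architecture), the sketch below replaces a Q-table lookup with a tiny one-hidden-layer network that maps a state vector to one Q-value per action. The class name `TinyQNetwork`, the function `select_action`, the layer sizes, and the untrained random weights are all assumptions made for illustration; a real agent would use a deep-learning library and train the weights.

```python
import random


class TinyQNetwork:
    """Hypothetical one-hidden-layer MLP: state vector -> one Q-value per action."""

    def __init__(self, state_dim, hidden_dim, num_actions, seed=0):
        rng = random.Random(seed)
        # Small random weights; a trained agent would learn these values.
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(state_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                   for _ in range(num_actions)]
        self.b2 = [0.0] * num_actions

    def q_values(self, state):
        # Hidden layer with ReLU activation.
        hidden = [max(0.0, sum(w * s for w, s in zip(row, state)) + b)
                  for row, b in zip(self.w1, self.b1)]
        # Linear output layer: one estimated Q-value per action.
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(self.w2, self.b2)]


def select_action(network, state, epsilon, rng):
    """Epsilon-greedy choice: explore with probability epsilon, else take argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(network.b2))
    q = network.q_values(state)
    return max(range(len(q)), key=q.__getitem__)
```

The point of the design is that memory grows with the number of weights rather than with the number of distinct network states, which is what makes very large state spaces tractable.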
The agent learns the basics: scan → detect a vulnerable service → execute the correct exploit. Rewards are given immediately.
Real penetration testing requires stealth to avoid crashing services or alerting Security Operations Center (SOC) teams. Most DRL reward functions do not incorporate a "stealth budget." An agent trained to maximize compromise speed will often choose the loudest, fastest exploit, which is useless in a red-team engagement that requires low-and-slow tactics.
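One way such a stealth budget could be folded into a reward signal is sketched below. This is a hypothetical construct, not something taken from AutoPentest-DRL: the names `StealthBudget`, `shaped_reward`, `noise_cost`, and `overrun_weight`, as well as the numeric values, are assumptions chosen for illustration. Each action is charged a noise cost, and only the portion that pushes the episode past its budget is penalized.

```python
from dataclasses import dataclass


@dataclass
class StealthBudget:
    """Hypothetical tracker of cumulative 'noise' spent during one episode."""
    limit: float
    spent: float = 0.0

    def charge(self, noise_cost: float) -> float:
        """Record an action's noise and return the newly added budget overrun."""
        overrun_before = max(0.0, self.spent - self.limit)
        self.spent += noise_cost
        overrun_after = max(0.0, self.spent - self.limit)
        return overrun_after - overrun_before


def shaped_reward(base_reward: float, noise_cost: float,
                  budget: StealthBudget, overrun_weight: float = 5.0) -> float:
    """Subtract a penalty proportional to how far this action exceeds the budget."""
    overrun = budget.charge(noise_cost)
    return base_reward - overrun_weight * overrun
```

With a budget of 10, an action with base reward 100 and noise cost 8 taken after 4 units are already spent overruns the budget by 2 and nets 90 rather than 100, so quieter actions become relatively more attractive as the budget runs out.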
The framework integrates Nmap for initial vulnerability scanning and Metasploit to execute the suggested exploits automatically.
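As a rough sketch of the scanning half of that integration, the code below builds an Nmap command line and extracts open services from Nmap's XML output. It does not reproduce how AutoPentest-DRL itself calls Nmap or Metasploit. The function names `build_nmap_command` and `parse_open_services` are hypothetical, and `SAMPLE_XML` is a hand-written, heavily trimmed fragment in the shape of Nmap's XML report, not real scan output.

```python
import xml.etree.ElementTree as ET


def build_nmap_command(target):
    """Return an argument list for a service/version scan with XML on stdout."""
    # -sV enables service/version detection; "-oX -" writes the XML report to stdout.
    return ["nmap", "-sV", "-oX", "-", target]


def parse_open_services(nmap_xml):
    """Extract open ports and their service names from Nmap XML text."""
    root = ET.fromstring(nmap_xml)
    services = []
    for host in root.findall("host"):
        address = host.find("address")
        addr = address.get("addr") if address is not None else None
        for port in host.findall("ports/port"):
            state = port.find("state")
            if state is None or state.get("state") != "open":
                continue  # Skip closed or filtered ports.
            service = port.find("service")
            services.append({
                "host": addr,
                "port": int(port.get("portid")),
                "protocol": port.get("protocol"),
                "name": service.get("name") if service is not None else None,
                "product": service.get("product") if service is not None else None,
            })
    return services


# Hand-written sample for illustration only (192.0.2.0/24 is a documentation range).
SAMPLE_XML = """<nmaprun>
  <host>
    <address addr="192.0.2.10" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="22"><state state="open"/><service name="ssh" product="OpenSSH"/></port>
      <port protocol="tcp" portid="23"><state state="closed"/><service name="telnet"/></port>
    </ports>
  </host>
</nmaprun>"""
```

A list of open services like this is the kind of observation an agent could use to build its state before choosing which exploit to suggest.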
The AutoPentest-DRL framework works as follows: