Could AI Learn to Pentest Web Apps on Its Own?

Intro

Cybersecurity often feels like a high-stakes game of chess played on a sprawling, ever-changing board. Defenders patch holes, monitor traffic, and chase down elusive weaknesses, while attackers scout for the tiniest misstep to exploit. For decades, penetration testing has been a human-led craft: security experts map a network, probe forms, test credentials, and hunt for vulnerabilities with a mix of intuition and discipline. But what if a machine could learn to run that playbook, not by rote, but by discovering the best sequence of moves on its own?

The answer, at least in part, comes from a Madrid-based team at GMV. In a study led by Daniel López-Montero and colleagues—José L. Álvarez-Aldana, Alicia Morales-Martínez, Marta Gil-López, and Juan M. Auñón-García—the researchers trained a reinforcement learning agent to automate security testing of web applications. Their aim wasn’t to replace human experts but to automate the time-consuming, decision-heavy core of pentesting: deciding which tool to use, when to use it, and how aggressively to pursue a given vulnerability. The work comes out of GMV’s Department of Artificial Intelligence and Big Data in Tres Cantos, Madrid, with support from the Spanish National Cybersecurity Institute (INCIBE). The authors show a path to smarter, faster, and more scalable testing—one that learns from simulated worlds before facing the messy realities of real websites—and they’re careful to frame this as a tool for strengthening defenses rather than an autonomous attacker in the wild.

Automating the pentest hunt

The core idea is deceptively simple: teach a robot to decide which action to take next when exploring a website, with the goal of uncovering as many critical vulnerabilities as possible while spending as little time and compute as possible. The action space is huge. At each URL, the agent can perform a mix of actions drawn from five broad categories aligned with common pentesting techniques: crawling to discover more links, form detection to reveal inputs that can be manipulated, SQL injection attempts, brute-force attempts against credentials, and cross-site scripting payloads. The researchers stack all those options into a fixed menu of actions per URL, and then multiply by the number of URLs the agent has found. The total action space grows as the agent explores, which is exactly the kind of combinatorial explosion traditional software struggles with.
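
To make the combinatorics concrete, here is a minimal sketch (not the authors' code) of how such a growing action space might be represented in Python. The category names, payload counts, and example URLs are illustrative assumptions, not the paper's actual menu of actions:

    from itertools import product

    # Five broad tool categories, each with a handful of concrete configurations.
    # These names and counts are made up for illustration.
    TOOL_CONFIGS = {
        "crawl": ["shallow", "deep"],
        "form_detection": ["default"],
        "sql_injection": ["error_based", "union_based", "blind"],
        "brute_force": ["top_100_passwords", "top_1000_passwords"],
        "xss": ["reflected", "stored"],
    }

    # Flatten the categories into one menu of (category, configuration) actions per URL.
    ACTIONS_PER_URL = [(cat, cfg) for cat, cfgs in TOOL_CONFIGS.items() for cfg in cfgs]

    def global_action_space(discovered_urls):
        """Every (url, action) pair the agent could choose from right now."""
        return list(product(discovered_urls, ACTIONS_PER_URL))

    urls = ["/index.php", "/login.php", "/search.php"]
    print(len(ACTIONS_PER_URL), "actions per URL")
    print(len(global_action_space(urls)), "total choices with", len(urls), "URLs discovered")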

To make this tractable, the team deliberately narrows the focus to a curated toolkit of well-known attack patterns and configurations. This mirrors real-world practice: pentesters often start with a toolbox of reliable techniques rather than trying every conceivable probe at once. The agent’s objective isn’t simply to “break in” but to maximize the discovery of meaningful vulnerabilities fast. The reward function embodies this: it rewards discovering new, high-impact information while penalizing actions that are costly in time and resources. The reward is bi-objective, balancing the allure of a powerful vulnerability against the cost of the action that uncovers it. As the authors put it, the agent learns a strategy that optimizes the trade-off between depth and speed, a crucial skill for security testing in fast-moving environments.
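
The article does not reproduce the paper's exact formula, but a hedged sketch of a bi-objective reward along these lines might look as follows. The severity scores and cost weight are made-up values for illustration only:

    # Illustrative bi-objective reward: value of newly discovered findings minus
    # the cost of the action that produced them. All numbers are assumptions.
    SEVERITY = {
        "sql_injection": 10.0,
        "weak_credentials": 8.0,
        "xss": 6.0,
        "form_found": 2.0,
        "new_url": 1.0,
    }

    def reward(new_findings, action_time_s, cost_weight=0.1):
        """Reward new, high-impact information; penalize slow or expensive probes.

        new_findings  : finding types uncovered by this action that the agent had
                        not already seen (repeat discoveries earn nothing).
        action_time_s : wall-clock cost of running the chosen tool.
        """
        gain = sum(SEVERITY.get(f, 0.0) for f in new_findings)
        return gain - cost_weight * action_time_s

    print(reward([], action_time_s=45.0))                # -4.5 : slow probe, nothing new
    print(reward(["sql_injection"], action_time_s=3.0))  #  9.7 : quick probe, big payoff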

Highlight: the clever trick is teaching the agent to value the quality of discoveries while keeping a lid on wasted effort, a balance that mirrors how human pentesters triage findings during a long engagement.

Training in a safe, synthetic world

Training a capable pentesting agent in the real world would be dangerous and impractical. Instead, López-Montero and coauthors build richly structured simulated environments that mimic the key dynamics of a live website without risking real systems. The simulated world starts with a random web topology—rooted in a tree-like structure with realistic branching behavior—and then populates each page with vulnerabilities and tool configurations. The researchers use a probabilistic engine to assign status codes, form fields, and vulnerabilities, so the agent learns from diverse, repeatable scenarios while avoiding the chaos of a live site.
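
As a rough illustration of that probabilistic engine, here is a simplified sketch of how a random, tree-like site with assigned status codes, forms, and vulnerabilities could be generated. The branching limits and probabilities are assumptions for the sketch, not values from the study:

    import random

    VULNS = ["sql_injection", "xss", "weak_credentials"]

    def generate_site(max_depth=3, max_children=4, seed=None):
        """Build a random tree of pages, each with a status code, an optional form,
        and a probabilistically assigned set of vulnerabilities."""
        rng = random.Random(seed)
        pages = {}

        def make_page(path, depth):
            pages[path] = {
                "status": rng.choices([200, 403, 404], weights=[0.8, 0.1, 0.1])[0],
                "has_form": rng.random() < 0.4,
                "vulns": [v for v in VULNS if rng.random() < 0.15],
            }
            if depth < max_depth:
                for i in range(rng.randint(0, max_children)):
                    make_page(f"{path}page{i}/", depth + 1)

        make_page("/", 0)
        return pages

    site = generate_site(seed=42)
    print(len(site), "pages generated; root page:", site["/"])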

Why simulate so aggressively? Because you want speed, variety, and safety. The paper notes that simulating and parallelizing action execution can yield up to a 3000x speedup compared to real-world testing. That makes it feasible to conduct thousands of learning episodes, letting the agent practice and refine a strategy far beyond what a human could reasonably attempt in the same timeframe. The team trains and validates the agent using two well-known, intentionally vulnerable environments—the Damn Vulnerable Web Application (DVWA) and DockerLabs—first in simulation and then against the real deployments. The simulated world isn’t a fake toy; it’s a carefully engineered stand-in designed to cover a broad spectrum of common web vulnerabilities and configurations.

Highlight: speed and diversity matter—simulated worlds let researchers run many lifetimes of expert-like trial-and-error, something you can’t usually do with a real site without risking outages or breaking rules.

Geometric deep learning to tame the chaos

Even in a simulation, the problem remains huge: dozens of URLs, hundreds of potential actions per URL, and a history of results that matters for future decisions. A naive neural network would quickly buckle under the weight of this combinatorial mess. The researchers confront this head-on with a twist drawn from geometric deep learning: they impose permutation invariances and equivariances that reflect the symmetrical structure of a website’s pages. In plain terms, if you reorder the list of URLs, the agent’s decision-making should reorder its own planned actions in the same way. The critic (the component that assesses how good a given state is) is designed to be permutation-invariant; the actor (which proposes the next actions) is permutation-equivariant. This isn’t just mathematical fancy—it dramatically reduces the number of parameters the network must learn and lets the model generalize better across different layouts of the same underlying problem.

The architecture resembles a graph neural network, but with a network of isolated nodes representing each URL rather than a full web graph with interconnections. Each URL gets its own small module (a multilayer perceptron) that processes local observations, and then an aggregation step combines these per-URL signals into a global impression of the website. The critic sums the per-URL signals, while the actor concatenates the per-URL outputs to decide which action to take where. This design preserves the symmetry of the problem—treating pages as interchangeable units—while still enabling the model to act as if the environment were a single, coherent web world.

All told, this geometric-prior approach yields a lean model: the best-performing configuration clocks in at 69,304 parameters. That’s tiny by modern deep-learning standards, but it’s precisely what you want when you’re tackling a moving, high-dimensional control problem. The team frames the agent as a policy that maps the current state of all URL blocks to a distribution over 134 possible actions per URL, and they show how permutation-aware networks can process this information efficiently while preserving the mathematical properties the problem demands.
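
To make the symmetry concrete, here is a minimal PyTorch-style sketch of such a permutation-aware actor-critic. Only the structure mirrors the description above (a shared per-URL network, a sum-pooled critic, and per-URL action logits); the observation size and layer widths are assumptions, not the authors' exact architecture:

    import torch
    import torch.nn as nn

    OBS_DIM = 32           # assumed size of each URL's local observation vector
    ACTIONS_PER_URL = 134  # candidate actions at each URL (figure taken from the article)

    class PerURLEncoder(nn.Module):
        """Shared MLP applied independently to every URL's local observation."""
        def __init__(self, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(OBS_DIM, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())

        def forward(self, obs):       # obs: (num_urls, OBS_DIM)
            return self.net(obs)      # -> (num_urls, hidden)

    class Critic(nn.Module):
        """Permutation-invariant value head: sum-pool the per-URL embeddings."""
        def __init__(self, encoder, hidden=64):
            super().__init__()
            self.encoder = encoder
            self.head = nn.Linear(hidden, 1)

        def forward(self, obs):
            pooled = self.encoder(obs).sum(dim=0)   # reordering the URLs changes nothing
            return self.head(pooled)

    class Actor(nn.Module):
        """Permutation-equivariant policy head: one row of action logits per URL."""
        def __init__(self, encoder, hidden=64):
            super().__init__()
            self.encoder = encoder
            self.head = nn.Linear(hidden, ACTIONS_PER_URL)

        def forward(self, obs):
            return self.head(self.encoder(obs))     # (num_urls, ACTIONS_PER_URL)

    encoder = PerURLEncoder()
    actor, critic = Actor(encoder), Critic(encoder)

    obs = torch.randn(5, OBS_DIM)                   # observations for 5 discovered URLs
    logits = actor(obs)                             # permuting the URLs permutes these rows identically
    value = critic(obs)                             # ...while this value is unchanged
    probs = torch.softmax(logits.flatten(), dim=0)  # one joint distribution over every (URL, action) pair
    print(logits.shape, value.shape, probs.shape)   # torch.Size([5, 134]) torch.Size([1]) torch.Size([670])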

Highlight: symmetry-aware design isn’t cute math trivia here—it’s the practical trick that makes a giant decision space learnable, letting a small network reason about thousands of pages as if they were interchangeable pieces of a larger puzzle.

From simulation to real-world testing

The researchers don’t stop at virtual experiments. After training the agent in the synthetic environments, they evaluate its performance on real, deliberately vulnerable targets: DVWA and DockerLabs again, but now as testbeds for generalization. The experiments use a fairly robust hardware setup—a Tesla T4 GPU, a multi-core CPU, and substantial memory—and a learning budget of about one million timesteps, with episodes spanning hundreds of steps so the agent can form longer chains of reasoning rather than short, impulsive bursts.
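
For readers who want a feel for what such a training run looks like in code, here is a hedged sketch assuming the simulator is wrapped as a Gymnasium-style environment (the "PentestSim-v0" ID is hypothetical) and using the off-the-shelf PPO implementation from Stable-Baselines3; the paper's permutation-aware networks would replace the default policy:

    import gymnasium as gym
    from stable_baselines3 import PPO

    env = gym.make("PentestSim-v0")        # hypothetical ID for the simulated pentest environment

    model = PPO(
        "MlpPolicy",                        # placeholder for the custom permutation-aware policy
        env,
        n_steps=512,                        # long rollouts so multi-step attack chains fit in one update
        verbose=1,
    )
    model.learn(total_timesteps=1_000_000)  # roughly the budget reported in the article

    # After training, roll the learned policy against a held-out target such as DVWA.
    obs, _ = env.reset()
    done = False
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated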

Three key findings emerge. First, the on-policy PPO algorithm consistently outperforms the alternatives (SAC and DQN) in both training and validation phases, delivering a stable improvement across the board. Second, the agent is able to discover most of the vulnerabilities it’s capable of exploiting within the bounds of the simulated tools, and it does so by learning to sequence actions that build on one another—starting with reconnaissance, moving to parameter discovery, and then escalating to deeper exploits. Third, some vulnerabilities remain out of reach not because the agent misunderstands the problem, but because the available toolset in the tested environment cannot exploit them. In other words, the agent’s ceiling is partly defined by the tools it has at its disposal, not just its own learning prowess.

Across the experiments, the team also tracks how often the agent uses different tools. SQL injection payloads show up frequently, which makes sense given their relatively higher payoff under many configurations. But the agent isn’t reckless; it balances potential gains against costs wired into the reward function—an echo of how skilled pentesters juggle risk, time, and impact in real projects.

Highlight: the results aren’t a claim of flawless automation, but a demonstration that a well-constructed learning agent can learn, adapt, and improve in ways that mirror seasoned human practice—while running thousands of trials in a fraction of the time.

What this means for security and the future of testing

If you’ve followed the arc of AI and cybersecurity, this work feels like a natural next step in a long line of automation efforts. It doesn’t pretend that one tool can replace human judgment or that a robot can outpace the most cunning attacker overnight. Instead, it shows how a carefully designed learning agent can assume much of the mechanical burden of pentesting—planning actions, assessing outcomes, and learning from every probe—so human testers can focus on higher-value reasoning, complex defense design, and risk prioritization. In practice, this could translate into more frequent, thorough, and objective security checks integrated into development pipelines, especially as teams push toward continuous integration and continuous deployment models that demand rapid feedback loops about security posture.

There are caveats, of course. Simulated environments, while invaluable, are approximations. The authors acknowledge the potential biases introduced by simulator design and the fact that the agent’s capabilities are tethered to the tools provided in the training regime. Translating a trained agent from a sandbox to a production security program will require robust safeguards, ongoing validation, and coordination with human experts to ensure that automated testing doesn’t cause collateral disruption. The work’s funding from INCIBE and the authors’ emphasis on responsible deployment reflect this sober stance toward real-world use.

Another frontier is expanding the action repertoire beyond the five categories currently used, and integrating more sophisticated language-assisted perception to interpret unstructured data like HTML and JavaScript. The authors hint at coupling the agent with large language models to extract and encode richer signals from web content, potentially unlocking more nuanced testing strategies. They also point to a hybrid future where model-based reinforcement learning could complement the current model-free approach, guiding the agent with an internal model of typical web architectures to accelerate learning even further.

The work is a reminder that the most powerful demonstrations of AI in security often come not from flashy pivots or single-shot breakthroughs, but from building a cohesive system that learns to reason about a messy, interactive environment and then translates that reasoning into practical, auditable actions. It’s about turning deep learning into a kind of disciplined, scalable curiosity—one that can roam through a web app’s landscape, map its vulnerabilities, and report back in a way that helps organizations shore up defenses before real attackers arrive.

The study is a collaborative effort led by Daniel López-Montero and colleagues at GMV’s Department of Artificial Intelligence and Big Data in Madrid, with support from the Spanish National Cybersecurity Institute (INCIBE). The authors underscore that the end goal is not to substitute human testers but to augment them: to accelerate discovery, to reduce maintenance costs, and to bring a disciplined, repeatable approach to pentesting in an era where software moves fast and attackers never sleep. If the path they chart holds, automated pentesting could become a standard, scalable part of a defender’s toolkit—an intelligent partner that learns from every scan and helps keep the digital world a little safer.

Conclusion

What López-Montero and his team have built is less a robot apocalypse and more a pragmatic scaffolding for a smarter defense. By combining reinforcement learning with geometry-aware neural networks and a richly constructed simulated world, they demonstrate how an agent can learn to navigate the labyrinth of web vulnerabilities with disciplined tactics and a sense of strategic timing. It’s a glimpse of a future where automated testers run alongside humans in security operations, handling the repetitive grind while humans teach the system where it should push and where it should hold back. In the end, it’s about turning a sprawling, adversarial landscape into a sequence of informed, deliberate steps—and that’s an achievement worth watching as it matures.