When AI Agents Try to Extend Real Research

In labs and on laptops, researchers chase a new kind of lab partner: a tool that can read a paper, translate its methods into code, set up experiments, and tell us what happened. The dream once sounded like science fiction, but a flood of AI systems, many built on large language models, now makes it feel almost within reach. The recent paper introducing REXBENCH, a benchmark that tests whether coding agents can autonomously implement research extensions, offers a taste of that near-future reality. It isn’t about chatty AI writing a paper; it’s about AI writing, running, and evaluating the next experiment.

REXBENCH is the product of a cross-continental collaboration—University College London, Boston University, and the University of Vienna—led by Nicholas Edwards and Yukyung Lee. The aim is to stress-test a specific, realistic capability: given a research idea, can an agent modify an existing codebase in a controlled way so that the experiment actually tests the idea and yields numbers that match a manually implemented gold standard? The answer, at least in this first round, is sobering: the best systems still need a lot of human guidance, and their ability to step beyond what they already know remains limited.

A Benchmark for Autonomous Research

REXBENCH is not a single task but a suite of 12 realistic research extension tasks. Each task takes a published paper and its codebase as a starting point and adds a precise extension instruction—think of it as a formal prompt that asks, for example, what would happen if a different model were used, a different dataset were substituted, or a new evaluation metric were adopted. The goal isn’t just to write code that compiles; it’s to implement a faithful extension of the original idea that can be meaningfully tested and that yields numbers within a defined target range. In short, it’s a test of how well an AI agent can turn hypothesis into experiment into evidence.
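
To make that structure concrete, here is a minimal, hypothetical sketch of the kind of task specification the benchmark implies. The field names and example values are illustrative assumptions, not REXBENCH’s actual format: each task pairs a paper and its codebase with an expert-written instruction and a target range that the extension’s result must land in.

    # Illustrative sketch only: field names and example values are hypothetical,
    # not REXBENCH's actual task format.
    from dataclasses import dataclass

    @dataclass
    class ExtensionTask:
        paper: str                          # the published paper the task starts from
        repo_url: str                       # its accompanying codebase
        instruction: str                    # expert-written extension instruction
        target_metric: str                  # which number the extension must report
        target_range: tuple[float, float]   # range the result must fall in to count as success

    # A made-up example of what one task might look like.
    example = ExtensionTask(
        paper="(some published paper)",
        repo_url="https://example.org/original-codebase",
        instruction="Swap in a different pretrained model and rerun the main evaluation.",
        target_metric="accuracy",
        target_range=(0.72, 0.76),
    )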

The design is deliberate. The researchers wanted a test that approximates how a real scientist might extend prior work, rather than simply regurgitating a paper or reproducing a cookbook recipe. The extensions are scoped enough to be solvable in principle but nuanced enough to require reading code, understanding data flows, and aligning outputs with domain-specific criteria. Each extension is paired with background material and expert-written instructions, which keeps the bar high while staying concrete enough for automatic evaluation. A crucial feature is data contamination control: the gold solutions live in private repositories, and the evaluation infrastructure is private as well, so solvers can’t lean on memorized tricks from public web data. The result is a cleaner, harder test of genuine capability rather than memorization.

The backbone of the evaluation is a simple but powerful idea: the agent patches the codebase to implement the extension, the patched code runs on a controlled VM, and the system checks whether the final numbers line up with the gold solution’s outcomes. The setup also accounts for randomness by running multiple seeds. Three agent frameworks—aider, Claude Code, and OpenHands—were used with several large-language-model backbones, from Claude 3.7 Sonnet to OpenAI’s o1 and o4-mini and the DeepSeek family. The metrics boil down to three numbers: final success rate (did the end result land in the target range?), execution success rate (did the code run without crashing?), and file recall (did the agent touch the same files as the gold solution?). It’s a careful, numeric lens on a problem that often feels like a creative sprint rather than a mechanical one.
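
As a rough sketch of how those three numbers aggregate over the seeded runs of a single task, assume each run records whether it executed, the final number it produced, and the files its patch touched. The names and structure below are assumptions for illustration, not the authors’ evaluation code.

    # Minimal sketch, assuming a simple per-run record; names are illustrative,
    # not taken from the REXBENCH evaluation infrastructure.
    from dataclasses import dataclass

    @dataclass
    class RunResult:
        executed: bool              # did the patched code run to completion?
        final_value: float | None   # number the experiment reported, if any
        edited_files: set[str]      # files the agent's patch touched

    def final_success(run: RunResult, lo: float, hi: float) -> bool:
        # Success means the run produced a number inside the gold target range.
        return run.executed and run.final_value is not None and lo <= run.final_value <= hi

    def file_recall(run: RunResult, gold_files: set[str]) -> float:
        # Fraction of the gold solution's edited files that the agent also touched.
        return len(run.edited_files & gold_files) / len(gold_files) if gold_files else 1.0

    def aggregate(runs: list[RunResult], lo: float, hi: float, gold_files: set[str]) -> dict:
        # Average the three metrics across seeds for one task; the benchmark
        # then averages across tasks as well.
        n = len(runs)
        return {
            "final_success_rate": sum(final_success(r, lo, hi) for r in runs) / n,
            "execution_success_rate": sum(r.executed for r in runs) / n,
            "file_recall": sum(file_recall(r, gold_files) for r in runs) / n,
        }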

The Reality Check: How Far Can Coding Agents Go?

The headline finding is sobering but instructive. In the main experiment, the best-performing pairings—OpenHands and Claude Code, each with Claude 3.7 Sonnet as the backbone—achieved an average final success rate of about 25 percent. In other words, on one in four extensions, the agent managed to implement something that, when run, matched the gold standard’s outcome range. Other backbones, such as OpenAI’s o1 or the DeepSeek-R1 family, fared substantially worse, hovering near zero on many tasks. Code often ran to completion even when the final numbers missed the target, underscoring a divide between producing runnable code and delivering empirically correct results. It’s as if a student could type a passable patch, yet the experiment still wouldn’t pass the exam because the numbers don’t check out.

What’s striking is not only the rate but the pattern of difficulty. Many failures were explicit: the agent produced an empty diff, syntax errors, or timeouts. But a substantial portion were implicit: the code ran, but the resulting data drifted from the gold standard due to subtle logic mistakes or hyperparameters that didn’t line up with the experimental design. A recurring and telling motif was overthinking—some models produced long, meandering reasoning traces that burned compute time without yielding progress in code. That pattern isn’t just a quirk; it reveals a mismatch between how these models reason in prose and how they must reason when we ask them to patch a real software system under strict constraints.

The authors pushed beyond the surface numbers with a set of hint experiments. They introduced two levels of human guidance: brief localization hints that point to where to edit or what parameter to adjust, and a second level that lays out a step-by-step implementation plan. Hints helped, sometimes substantially, but not uniformly. In one standout case—OpenHands paired with Claude 3.7 Sonnet—the final success rate climbed from the mid-teens into the high thirties with guidance. Yet in other contexts, even the second level of hints didn’t yield gains, sometimes steering the solution onto a different, plausible path that didn’t match the gold standard. The result is a nuanced portrait: hints can lift performance, but there’s no universal recipe for turning a near-miss into a win.

Beyond the numbers, the study surfaces a broader pattern about the limits of current AI coding agents. The best performers showed real signs of understanding: their patches targeted the right files, and their edits often moved the project in the right direction. But translating intent into robust, reproducible experiments remains an uphill climb. The REXBENCH team also analyzed the cost side of the equation, showing that some combinations are more time- and compute-efficient than others. The Pareto frontier they mapped reveals a trade-off between speed, cost, and success that is familiar to any practitioner who has watched a long software project spiral through iterations. In short, we’re watching the rough edges of a tool that could someday accelerate science, but we’re not yet at a point where a machine can reliably replace the thoughtful planning and debugging a human researcher brings.

What the Numbers Tell Us

One of the most compelling takeaways is how fragile progress can be. The benchmark’s design isolates the challenge: it requires agents to understand a research question, inspect and modify code, run experiments, and assess whether the outcomes meet quantitative targets. That combination of skills is demanding even for human researchers, and it’s precisely what makes REXBENCH a tough litmus test. The fact that even the best performers stumble on roughly three-quarters of extensions—despite months of progress in natural-language reasoning, code synthesis, and tool use—speaks to the stubbornness of real-world research tasks. It’s one thing to write a script that performs a standard ML task; it’s another to navigate a complex, existing pipeline and generate a verifiable, numerically correct extension.

When the team digs into the failure modes, a few patterns emerge that are worth noting for anyone tracking progress in AI-assisted science. Syntax errors and missing files are not surprising in a setting that requires precise patching of real codebases. But the implicit failures—the miscalibration of a learning rate, the misalignment of a data split, or a slight difference in random seeds—are more instructive. They show that even small deviations in the experimental setup can derail the numeric checks that matter. And then there is the human factor: even small decisions about how to structure the extension or how to interpret the original paper can ripple through to the final results. The authors argue for more robust, process-level verification metrics in future benchmarks, to help separate genuine capability from lucky alignment or surface-level accuracy.

The study also provides a candid look at where AI-assisted science might make the most difference. The authors suggest that when an extension does not demand sweeping rewrites of a codebase—when it can be achieved with targeted edits and careful testing—the probability of success climbs. That insight matters: it hints at a practical path to building AI systems that can responsibly contribute to research, by focusing on specific, well-scoped extensions rather than trying to rewrite large swaths of a project in one go. It also reinforces the point that replicability and transparent testing are essential if we’re to trust results produced by AI-driven experiments in the wild.

The Road Ahead for AI-Assisted Science

REXBENCH’s authors are not simply delivering a verdict; they’re laying down a roadmap. The central research question—can a coding agent autonomously implement a research extension—gets a clear, data-driven answer: progress is real, but fragile, and current systems still require substantial human scaffolding. That may feel like a cautionary note, but it serves a practical purpose: it tells researchers where to invest their next efforts. The paper’s call for broader community participation is exactly the kind of push science needs: a shared repository of more challenging extensions across a wider array of domains, with standardized evaluation that protects against data leakage and ensures reproducibility.

What would it take to move the needle? Several threads seem promising. First, more sophisticated planning and debugging capabilities—systems that can decompose a complex extension into a sequence of smaller, verifiable steps and pause for sanity checks before proceeding—could reduce the “overthinking” failure mode. Second, stronger integration with tooling that supports robust testing and debugging in real-world codebases would help agents avoid silent misalignments between intent and outcome. Third, expanding beyond a single programming language or framework and encouraging cross-domain tasks could push agents to develop more generalizable reasoning skills rather than memorized patterns. The REXBENCH framework is designed to adapt to these directions, which is why the authors see it as a starting point for a broader, community-driven effort.

For science and society, the stakes are meaningful. If AI-assisted experimentation becomes more reliable, it could accelerate discovery, improve replication, and help researchers explore ideas at scale. But there’s also a cautionary flag: more powerful automation can magnify the impact of mistakes if verification isn’t rigorous. The authors are upfront about this risk and advocate for robust, automated evaluation pipelines to prevent the sort of trust erosion that can accompany unseen errors in scientific results. In a world where a future generation of AI scientists could patch experiments and publish results with less human intervention, the path to responsible deployment will hinge on transparent evidence, reproducible code, and a culture of verification that keeps pace with the tools we build.

Ultimately, REXBENCH is a milestone, not a verdict. It marks how far we’ve come in teaching machines to act like researchers and how far we still have to go before AI can be trusted as an autonomous participant in real scientific work. The paper, a collaboration between UCL, BU, and the University of Vienna, with Nicholas Edwards and Yukyung Lee among the lead authors, provides a clear, rigorous, and human-centered account of the gap between promise and practice. If the field keeps measuring itself this honestly, we’ll steadily move toward tools that reliably accelerate experimentation while preserving the critical discipline of scientific scrutiny. In the end, the future of AI in science will be written not only in clever prompts or flashy demonstrations but in benchmarks that demand real, reproducible extensions and in a community willing to build toward that standard together.