The problem of teaching machines to think across languages, images, and the real world isn’t new, but MARBLE takes a rare swing at the gnarly middle ground where perception and planning collide. It isn’t just about solving a riddle or recalling a fact. It asks an artificial agent to chart a careful, step-by-step path through a multimodal landscape—where vision, text, and physics all pull in different directions—and to do so in a way that a human can follow, critique, and improve. If you’ve ever watched a good chess player think five moves ahead, MARBLE aims to force AI to do something analogous with many kinds of information at once, not just a single move or a single kind of input.
The MARBLE benchmark—short for Multimodal Reasoning Benchmark for Language Models—arrives from the laboratories of EPFL and ETH Zurich, where researchers led by Yulun Jiang, Yekun Chai, Maria Brbić, and Michael Moor built a test that is intentionally hard to game. It isn’t content with whether an AI can spit out the right answer; it wants to know if the AI can generate, evaluate, and refine a chain of reasoning across multiple modalities. The two central tasks, M-Portal and M-Cube, push a model to plan and act in spatially constrained environments that resemble real-life puzzles more than trivia questions. The upshot is blunt: even big, well-funded multimodal models stumble when you ask them to reason through multi-step plans in context-rich spaces.
That bluntness isn’t a sign of failure so much as a diagnostic revelation. MARBLE’s experiments expose bottlenecks—perception, segmentation of visual input, the ability to keep track of intermediate steps, and the capacity to backtrack when a plan doesn’t line up with the constraints of space and physics. This isn’t a cosmetic critique of AI’s smarts; it’s a blueprint for where the field needs to strengthen the connective tissue between seeing, thinking, and acting. And because MARBLE is designed to measure the process as well as the product, it helps researchers separate a model that can imitate reasoning from one that can actually reason across modalities—and adjust its plan on the fly when new information arrives.
What MARBLE is and why it matters
MARBLE doesn’t just test a model’s final answer. It asks the model to produce an interpretable chain of thought that would, in principle, lead to a correct solution if followed in a real environment. The two tasks are different in flavor but united by a core demand: break down messy multimodal prompts into interpretable, stepwise plans that obey spatial and physical constraints. In M-Portal, a model puzzles through a Portal 2-inspired map with a textual guide and a sequence of visuals, then crafts a plan that could navigate through rooms, portals, and obstacles. In M-Cube, the model takes six interlocking 3D pieces and decides which piece goes on which face of a cube, oriented correctly so that every edge lines up. These tasks demand not just pattern recognition but careful sequencing, hypothesis testing, and error correction under constraints.
The people behind the project aren't treating this as a toy. The study's authors argue that progress in multimodal language models has leaned heavily on superficial accuracy on short prompts. MARBLE flips that script by rewarding the right reasoning path, even when the final answer is hard to pin down or multiple solutions exist. The designers want to see thought unfold through decisions, missteps, and corrections—something closer to how human problem-solving works when the puzzle blends space, physics, and perception.
And there’s a practical significance here. Real-world AI systems—robots, assistive devices, autonomous agents, or even software that has to “see” and reason about a scene—must manage a stream of sensory data and convert it into deliberate actions. If we want AI to collaborate with people in dynamic, multimodal environments, it needs to show its thinking in a way that humans can audit, refine, and trust. MARBLE isn’t just about making models smarter; it’s about making their thinking visible and accountable in situations where perception and planning are inseparable.
Two hard problems: M-Portal and M-Cube
MARBLE splits the challenge into two arenas that test different angles of multimodal reasoning. M-Portal is a spatial reasoning and planning diagnostic drawn from the physics-grounded puzzles of Portal 2. The model is fed a map, a textual description, and a set of visual snapshots. Its task is to generate a chain-of-thought style plan—step by step, and in a physically plausible order—that would lead to exiting the map. The evaluation doesn't merely check whether the model reaches the exit; it checks whether the model's plan would actually work when the steps are carried out in the simulated environment. This emphasis on plan fidelity and correct sequencing makes M-Portal a stern test of long, multi-step reasoning across modalities.
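What plan fidelity means in practice is easiest to see in pseudocode. The benchmark's actual evaluation harness is not described here, so every name in the sketch below is a placeholder; conceptually, though, an evaluator walks the proposed plan through a simulator one step at a time and fails as soon as a step violates the environment's constraints.

```python
def plan_is_faithful(initial_state, plan, simulator):
    """Walk a proposed plan through a simulator one step at a time.

    `simulator` is a hypothetical object with three placeholder methods:
    applicable(state, step) checks a step's preconditions, apply(state, step)
    advances the environment, and at_exit(state) tests the goal condition.
    """
    state = initial_state
    for i, step in enumerate(plan):
        if not simulator.applicable(state, step):
            return False, f"step {i} ('{step}') violates a spatial or physical constraint"
        state = simulator.apply(state, step)
    if not simulator.at_exit(state):
        return False, "plan ends without reaching the exit"
    return True, "every step is executable and the exit is reached"
```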
M-Cube, by contrast, is a 3D puzzle: six jigsaw-like pieces, each with bumps and notches on their edges, must be arranged to form a perfect 5×5×5 cube. The model must decide which piece goes on which face, how to rotate or flip each piece, and ensure the edges align with complementary patterns. The problem scales combinatorially, because there are 6 faces, 6! ways to assign pieces, and eight orientation states for each piece. The dataset even provides a “perception” prompt that reduces the input to a 5×5 array, where each cell is a 0 or 1 representing gaps and bumps. The catch is that the model must still infer a correct global assembly from local edge patterns, a classic example of how perception and reasoning co-create a solution in physical space.
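To make the combinatorics concrete, here is a minimal sketch that uses the 5×5 binary encoding described above but an invented piece pattern, showing how a single piece and its eight orientation states (four rotations, each optionally flipped) can be enumerated.

```python
# A hypothetical M-Cube piece encoded as a 5x5 binary grid:
# 1 = filled cell (interior or bump), 0 = a gap along the border.
# The pattern itself is invented, not taken from the dataset.
PIECE = [
    [0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 1, 1],
]

def rotate(grid):
    """Rotate a square grid 90 degrees clockwise."""
    n = len(grid)
    return [[grid[n - 1 - c][r] for c in range(n)] for r in range(n)]

def flip(grid):
    """Mirror the grid left to right, i.e. turn the piece over."""
    return [row[::-1] for row in grid]

def orientations(grid):
    """Yield the eight orientation states: four rotations, each optionally flipped."""
    g = grid
    for _ in range(4):
        yield g
        yield flip(g)
        g = rotate(g)

print(sum(1 for _ in orientations(PIECE)))  # 8 orientation states per piece
```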
In practice, the MARBLE team synthetically generates many instances to probe both perception and planning, and then subjects a broad panel of multimodal models to the test. They also introduce a solution validator for M-Cube, an analytic checker that can confirm that an arrangement fits together or point out where the edges clash. This feedback loop mirrors how a human solver might test a draft solution, observe where it fails, and revise the approach. The validator becomes not just a scoring tool but a cognitive assistant that helps the model refine its strategy across rounds of feedback.
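The paper does not spell out the validator's internals, so the sketch below is only a toy stand-in for the kind of edge test such a checker must perform: two adjacent faces interlock only if every interior position along their shared seam is filled by exactly one piece, while corner positions, which also involve a third face, are handled here in a deliberately simplified way.

```python
def edges_interlock(edge_a, edge_b):
    """Check that two 5-cell edges from adjacent faces interlock.

    edge_a and edge_b are lists of 0/1 cells read along the shared seam,
    aligned so that index i on one edge touches index i on the other.
    Interior positions (1..3) must be filled by exactly one piece; corner
    positions (0 and 4) also belong to a third face, so this simplified
    check only rules out overlaps there.
    """
    for i in range(5):
        filled = edge_a[i] + edge_b[i]
        if i in (0, 4):
            if filled > 1:          # two pieces colliding at a corner
                return False
        elif filled != 1:           # gap or overlap in the interior
            return False
    return True

# A jagged edge and its complement interlock...
print(edges_interlock([0, 1, 0, 1, 0], [1, 0, 1, 0, 1]))  # True
# ...but two identical edges collide.
print(edges_interlock([0, 1, 0, 1, 0], [0, 1, 0, 1, 0]))  # False
```

A real validator would have to track all twelve seams and the three-way corner claims at once; the sketch only conveys the flavor of the diagnostics such a tool can hand back to the model.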
One design choice stands out: even though M-Cube and M-Portal are conceptually distinct, both insist on a chain-of-thought that is coherent across modalities. The plan must link the visual cues with textual context and physical constraints in a way that remains self-consistent as the model elaborates each step. That alignment is the essence of MARBLE’s value. It creates a diagnostic lens into whether a model can keep track of a plan while juggling multiple streams of information and constraints, something that most existing benchmarks don’t pressure models to do.
What the results reveal about current models
The empirical findings are humbling in a productive way. The study evaluated twelve state-of-the-art multimodal language models, ranging from open-source offerings to high-profile closed-weight systems. Across the board, performance on M-Portal’s plan-correctness task hovered near random chance, with F1 scores around 6-7 percent. Even the best models were far from reliable: on the easier subtask of fill-the-blanks, only a handful of models nudged past the random baseline, and the top performer—GPT-o3—recorded a still-modest 17.6 percent accuracy. The takeaway is stark: when you ask a model to monitor a chain of reasoning over many steps and across modalities, today’s technology struggles to maintain coherence, let alone correctness.
The M-Cube results are even more striking. All the advanced models collapsed to 0 percent accuracy on the hard version (CUBE), even when they could spend tens of thousands of tokens "thinking." The simplified variant (CUBE-easy) allowed some models to do better than random, with GPT-o3 reaching a notable 72 percent accuracy. But that performance dropped off sharply as the task reintroduced the full complexity of perception and the full space of possible assemblies. Perception, in particular, emerged as a stubborn bottleneck: when researchers tested whether models could translate a rendered 3D piece into a clean 5×5 array, even the best performers hovered around 70 percent cell-level accuracy—meaning they consistently misread a sizable portion of the grid. Whole-piece accuracy stayed at zero for every model. In other words, even when given a manageable chunk of the problem, models struggle to extract the right structured information from vision alone, and that deficiency propagates into the planning steps that follow.
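The gap between cell-level and whole-piece accuracy is worth spelling out, because a grid that is 96 percent right at the cell level is still a wrong piece. Here is a minimal sketch of the two metrics; the function names are mine, not the paper's.

```python
def cell_accuracy(pred, truth):
    """Fraction of cells in a predicted 5x5 grid that match the ground truth."""
    pairs = [(p, t) for pred_row, true_row in zip(pred, truth)
             for p, t in zip(pred_row, true_row)]
    return sum(p == t for p, t in pairs) / len(pairs)

def whole_piece_accuracy(preds, truths):
    """Fraction of pieces whose predicted grid matches the ground truth exactly."""
    return sum(p == t for p, t in zip(preds, truths)) / len(preds)

# A single misread cell keeps whole-piece accuracy at zero even though
# cell-level accuracy looks respectable (24/25 = 96 percent here).
truth = [[1] * 5 for _ in range(5)]
pred = [row[:] for row in truth]
pred[0][0] = 0
print(cell_accuracy(pred, truth))             # 0.96
print(whole_piece_accuracy([pred], [truth]))  # 0.0
```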
Beyond perception, the study also quantifies the explorable search space and the mental gymnastics required to navigate it. The M-Cube problem in its full form packs 188,743,680 possible solutions, a number that dwarfs what most AI systems routinely handle. Even a reduced variant with fewer missing pieces and the option to ignore flips yields millions of possibilities. In an experiment that disentangles perception from reasoning by translating visuals to text arrays, DeepSeek-R1 managed 57 percent accuracy with one missing piece, but performance collapsed to 0 percent as missing pieces increased. The upshot is clear: the cognitive load isn’t just about seeing; it’s about exploring and pruning a gargantuan space of possibilities while maintaining a coherent line of thought across every move.
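The headline figure is consistent with a simple back-of-envelope count, assuming the full space is every assignment of six pieces to six faces (6!) multiplied by the eight orientation states available to each piece (8^6):

```python
from math import factorial

assignments = factorial(6)          # 720 ways to map six pieces onto six faces
orientation_combos = 8 ** 6         # 262,144 combinations of per-piece orientations
print(assignments * orientation_combos)  # 188,743,680 candidate assemblies
```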
One of the more hopeful threads in the paper is the role of feedback and tooling. The authors show that a solution validator can meaningfully improve performance when used with a capable model. In the CUBE-easy setting, GPT-o4-mini improved from about 10 percent to 28 percent accuracy after five rounds of validator-assisted interactions, especially when the feedback was detailed rather than binary. That suggests a path forward: give models interactive feedback loops, diagnostic signals, and the ability to use external tools to verify and refine their reasoning. Unfortunately, the same validator tool did not lift performance on the full CUBE task, underscoring how stubborn the hardest multimodal reasoning problems remain—even with help from a checker and iterative improvement cycles.
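The reported gains suggest a simple propose-validate-revise loop. The sketch below is a hypothetical outline, with `model` and `validator` as placeholder interfaces rather than the paper's actual API, of how diagnostics can be folded back into the next attempt.

```python
def refine_with_validator(model, validator, puzzle, max_rounds=5):
    """Propose, validate, and revise a solution across several feedback rounds."""
    feedback = None
    for round_idx in range(max_rounds):
        proposal = model.propose(puzzle, feedback)    # assembly attempt or plan
        ok, diagnostics = validator.check(proposal)   # e.g. which edges clash and where
        if ok:
            return proposal, round_idx + 1
        # Detailed diagnostics, rather than a bare pass/fail signal, are what
        # the study found most useful when fed back into the next attempt.
        feedback = diagnostics
    return None, max_rounds
```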
The marbles of perception and planning
One of MARBLE's most provocative discoveries is not a new magic trick for AI but a map of bottlenecks that keep modern models from scaling the heights of human-like multimodal planning. Perception—how a model interprets and encodes images into structured, usable information—emerges as a recurrent limiter. It's not enough to identify that an edge has a bump; a model must translate that into a robust, edge-to-edge map that aligns with the corresponding textual cues. If that translation slips even slightly, the downstream reasoning chain is anchored to a shaky foundation, and complex planning becomes brittle or incoherent. The zero whole-piece perception accuracy on M-Cube is a stark illustration of what happens when perception fails before the planning begins, like trying to assemble a 3D puzzle with blurry or misread pieces.
Reasoning, too, is exposed as a fragile process. The sheer combinatorial explosion in M-Cube's solution space means that any small misstep can cascade into a chain of wrong moves, and without a reliable internal audit trail, a model may never recover. In M-Portal, where steps unfold over dozens of actions, maintaining a correct, physically plausible sequence is equally difficult. The experiments reveal that even when models generate long chains of thought, the fidelity of those steps, their alignment with the map's constraints, and the model's capacity to adapt when something doesn't fit all remain poor in the current generation of multimodal models. This isn't just about being clever; it's about being consistent across many moving parts, something human problem-solvers do almost without thinking, but AI systems still struggle to emulate reliably.
And yet MARBLE is refreshingly specific about where improvements can happen. The authors argue for architectures and training practices that support true multimodal, multi-step reasoning rather than one-off cleverness on a single task. They point toward interactive, tool-augmented approaches—think of a model that can call a solver, attempt a plan, test it against a validator, receive precise diagnostic feedback, and then revise in light of that feedback. In the long arc of AI development, this is a reminder that progress toward "smarter" systems may hinge less on cranking up raw size and more on teaching machines to reason over their own inputs in a disciplined, human-like loop.
Looking ahead: what this means for AI in the real world
MARBLE's conclusions aren't a death sentence for multimodal AI; they're a roadmap. If a vision system, a language model, and a controller must operate together in real time, the next generation of systems will need to demonstrate not just what they know but how they reason across modalities. Robotics, autonomous vehicles, healthcare assistants, and educational tools all stand to gain from models that can map a visual scene into a coherent plan, justify each step, and adjust as physical constraints reveal themselves. The potential is enormous, but so is the risk of brittle behavior if we don't build in robust ways to check, revise, and ground the model's thinking in the real world.
There are practical implications for how researchers design benchmarks and how companies deploy multimodal AI. MARBLE amplifies the need for long-horizon planning in embodied settings, an area where many AI systems still stumble. It also emphasizes the importance of clear failure modes: when perception misreads a detail, or when a plan doesn't account for a critical constraint, there should be a mechanism to catch the error early and adapt. The validator concept—essentially a diagnostic tool that can highlight conflicts and guide correction—feels less like a gimmick and more like a design principle for durable AI. It's a nudge toward systems that can be told to reconsider and replan, in a way that mirrors human problem solving under uncertain information.
Yet the study is careful to acknowledge a broader caveat: MARBLE’s scenarios are puzzle-rich and abstracted from everyday social and ethical dimensions. That’s a deliberate design, not a limitation. The authors want to avoid socio-political risk while focusing on a core cognitive capability. Still, as these models travel from puzzles to practical applications, we’ll need to think carefully about how to translate the language of a cube or a Portal map into safe, robust, and interoperable behaviors in the real world. The path from puzzle bench to real-world AI is long and winding, but benchmarks like MARBLE illuminate the bends and bumps along the way, offering a shared yardstick for progress.
Who built MARBLE and where to look next
MARBLE is the product of collaboration between EPFL and ETH Zurich, with the authors Yulun Jiang, Yekun Chai, Maria Brbić, and Michael Moor at the helm. The work is as much about giving the community a meaningful, diagnostic yardstick as it is about presenting a completed set of difficult challenges. The authors don’t pretend to have solved multimodal reasoning; they instead provide a lens to understand where current models fall short and what kind of capabilities future systems must acquire to begin approaching human-level multimodal planning in physically constrained environments. This is a call for more than bigger models; it’s a call for smarter thinking about how to train, test, and deploy AI that can see, reason, and act in concert.
As researchers push MARBLE forward, two themes seem likely to shape the next wave of work. First, the integration of perception and reasoning will demand models that learn to extract stable, structured representations from messy sensory input, not just patterns that explain a single image or a single sentence. Second, the ability to use tools—whether validators, solvers, or physics engines—will become a core competency, akin to how people consult reference materials, test ideas, and adjust plans in the real world. The vision is not a gadget-filled dream of omnipotent AI, but a more grounded, testable trajectory toward systems that can reason through the world with the same disciplined, multi-step approach that humans bring to tough problems.
For readers and developers excited to see where this goes, MARBLE’s release offers more than data. It offers a philosophy: that we should care about how a thought is formed, not just whether it arrives at the right destination. When a model can explain the steps it took, and those steps make sense in a spatially aware, physically constrained environment, we gain not only trust but a clearer map of where to invest our research energies next. In a field racing toward embodied AI and more capable generalist agents, MARBLE hands us a compass for navigating the hard but transformative terrain ahead.
The minds behind MARBLE
The MARBLE benchmark is credited to a team grounded in two European powerhouses of science—EPFL in Lausanne and ETH Zurich—with leadership that highlights a new generation of AI researchers who bridge theoretical depth and practical, puzzle-grounded evaluation. The principal investigators and authors—Yulun Jiang, Yekun Chai, Maria Brbić, and Michael Moor—bring a mix of expertise in multimodal systems, cognitive reasoning, and robotics-oriented evaluation. Their work is an invitation to the field: design models that can reason step by step across a spectrum of inputs and constraints, and build benchmarks that reveal both strengths and gaps with surgical clarity. In other words, MARBLE is as much about what we learn from failure as what we celebrate in success, and it’s a reminder that even the most advanced machines still have a long way to go before they can truly plan in a world that looks and acts like ours.