A Few Layers Quietly Fuel AI Math Reasoning

Hidden inside the soaring performance of large language models is a stubborn mystery: where does math reasoning actually live in the network? For readers who have tracked AI progress, it’s tempting to think improvements come from sweeping changes across the whole brain of the model, like a software update that rewires every neuron to think faster. But a new study flips that assumption on its head. It shows that the brain-like architecture of these models—transformers—hides its reasoning in a small handful of layers, and those same layers keep their job even after the model is instruction-tuned, distilled, or reinforced with verifiable rewards. The research, conducted at New York University Abu Dhabi, is led by Aadim Nepal and collaborators, including Safal Shrestha, Anubhav Shrestha, Minwu Kim, and Keith Ross. Their message is deceptively simple: when it comes to mathematical reasoning, not all layers are created equal, and the ones that matter most were already being formed long before any post-training tweaks happened.

That observation matters for how we think about improving AI, how we test its limits, and how we audit its reasoning. It’s one thing to say a model has become better at math; it’s another to peel back the interior to discover which parts of the network actually carry the burden. The authors push us to imagine the transformer as a stack of processors, where only a few of the middle layers act like cognitive gears for multi-step math tasks. They ask whether post-training methods—instruction tuning, distillation, and reinforcement learning with verifiable rewards—shift those gears or simply polish the existing gears that were already grinding away in the base model. The answer, according to their experiments, is closer to the latter: the core structure of reasoning layers endures, while post-training refines behavior elsewhere. This reframes where we should focus our efforts if we want to push AI toward more robust math problem solving.

Where Math Reasoning Hides

To probe the mystery, the team used a careful, almost forensic technique known as zero-ablation. Think of it like temporarily removing a layer’s thoughts while leaving the rest of the brain intact. They do this by zeroing out all parameters within a target transformer layer—both the attention sublayer and the feed-forward MLP sublayer—so that layer’s contribution vanishes, yet the network remains structurally intact through its residual connections. The model’s output then becomes a test: does it still solve math problems if layer ℓ is silenced? If accuracy tanks, that layer was likely doing essential work for the task at hand. If performance barely moves, that layer may not be critical for the task. The researchers apply this across two popular families of models, Qwen and Llama, and across four variants of each: the base pre-trained model, an instruction-tuned version, a knowledge-distilled model, and a reinforcement-learning-with-verifiable-rewards (RLVR) trained variant.
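
To make the procedure concrete, here is a minimal sketch of what zero-ablating a single decoder layer could look like in PyTorch with the Hugging Face transformers library. The checkpoint name, layer index, and prompt are illustrative assumptions rather than the authors’ exact setup, and a real experiment would score the ablated model over the full GSM8K or MATH500 sets rather than a single question.

```python
# Minimal zero-ablation sketch; an illustration, not the authors' released code.
# Assumptions: a Hugging Face causal LM whose decoder layers live at
# model.model.layers (true for the Qwen2 and Llama architectures); the
# checkpoint name, layer index, and prompt below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B"   # assumed checkpoint, used here only for illustration
LAYER_TO_ABLATE = 23             # the layer the article reports as critical for Qwen

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# Zero every parameter in the target decoder layer: both the attention sublayer
# and the feed-forward MLP. With zeroed weights the sublayers emit zeros, so the
# residual connection carries the hidden state through unchanged, silencing the
# layer's contribution while leaving the rest of the network structurally intact.
with torch.no_grad():
    for param in model.model.layers[LAYER_TO_ABLATE].parameters():
        param.zero_()

# Try a single GSM8K-style question; a real run would loop over the benchmark
# and compare accuracy against the unablated model.
prompt = "Q: A baker sells 12 muffins for $3 each. How much money does she make?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```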

What they find is striking. For mathematical reasoning tasks such as GSM8K and MATH500, a small, distinct set of layers emerges as the “critical layers.” In Qwen, ablating layer 23 causes a dramatic drop in performance, even though linguistic fluency can survive the hit. In the Llama family, the heavy hitters sit at layers 15 and 18, and the same layers show up consistently across base, instruct, distilled, and RLVR variants. In other words, the model’s knack for stepping through multi-step math problems is concentrated in these architectural slices. Remove them, and accuracy plunges by as much as 60–80 percent relative to the unablated baseline. Remove other layers, and the effect is far milder. The effect is so pronounced that it survives the different post-training paradigms the authors test—a hint that the core mechanism is not something post-training creates from scratch, but something pre-training embeds in the network’s fabric.

Crucially, the same pattern does not appear for non-math tasks. When they repeat the exercise with TriviaQA, a factual-recall benchmark, there is no single critical layer. Instead, removing any single layer tends to produce a small, broad drop in performance, with no “cornerstone” layer whose silencing collapses the task. The contrast is meaningful: mathematical reasoning relies on a stable, layer-specific architecture, whereas factual recall appears more distributed and resilient to layer deletions. And that stability—across base and post-training variants—is the paper’s central claim about how reasoning grows and persists inside these models.

The Hidden Geometry of Reasoning

To illuminate what those critical layers are actually doing, the authors turn to a representational analysis based on Normalized Mutual Information, or NMI, a standard measure of how similar two clusterings are: a score of 1 means identical groupings, while a score near 0 means unrelated ones. It’s a way to quantify how the internal representations—the way the model groups tokens and concepts—change layer by layer. The idea is elegant: start from Layer 0, which tends to cluster tokens by broad, simple families (numbers, operators, variables, and so on). Then watch how those clusters morph as you move deeper into the network. If a layer leaves the baseline clustering largely intact, it’s preserving the old structure. If a layer drives a dramatic reorganization, it’s performing a major transformation that could be building the reasoning machinery necessary for the task.
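
The analysis can be approximated with off-the-shelf tools. The sketch below is a reconstruction of the idea rather than the paper’s exact pipeline: it clusters each layer’s token representations with k-means and scores every layer against the Layer 0 baseline, and both the cluster count and the commented usage are assumptions.

```python
# Sketch of a layer-by-layer NMI profile; a reconstruction of the idea, not the
# paper's exact pipeline. Token representations at each layer are clustered with
# k-means and compared against the Layer 0 clustering.
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score


def nmi_profile(hidden_states, n_clusters=8, seed=0):
    """hidden_states: tuple of tensors shaped [1, seq_len, dim], one per layer,
    e.g. the output of a forward pass with output_hidden_states=True."""
    def cluster(layer_tensor):
        tokens = layer_tensor[0].float().cpu().numpy()  # [seq_len, dim]
        return KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(tokens)

    baseline = cluster(hidden_states[0])                         # Layer 0 grouping
    return [normalized_mutual_info_score(baseline, cluster(h))   # 1.0 = identical clustering
            for h in hidden_states[1:]]                          # lower = more reorganization


# Usage sketch, assuming an unablated `model` and `tokenizer` like those above:
# inputs = tokenizer("If x = 3 and y = 4, then 2*(x + y) - 5 = 9.", return_tensors="pt")
# with torch.no_grad():
#     out = model(**inputs, output_hidden_states=True)
# profile = nmi_profile(out.hidden_states)   # one score per transformer layer
```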

What the data show is consistent with the ablation results. The layers identified as critical for math live in a region where the model’s token clusters diverge most from the Layer 0 baseline—the so-called elbow region in the NMI profile. For Qwen, that elbow sits around layers 20–25; for Llama, around layers 13–18. In these layers, tokens representing numbers, operators, parentheses, and intermediate results begin to intermix in new ways, mirroring the mental gymnastics of solving a multi-step problem. It’s as if the network is reorganizing its internal map to connect disparate token families—an essential capability for reasoning that ties together numbers with equations, constraints with solutions, and steps with justifications. The authors interpret this elbow as a signature of the network learning to relate different families of tokens, which is precisely what math problems demand. In TriviaQA, by contrast, the NMI remains steadier across layers, aligning with the observation that no single layer becomes a choke point for this purely factual task.
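
Given such a profile, the elbow can be read off with a simple heuristic. The helper below, which just finds the window of layers with the lowest average NMI, is one assumed way to locate the region programmatically, not the paper’s definition.

```python
# A crude heuristic for reading the "elbow" off an NMI profile: the window of
# layers whose clustering diverges most (lowest average NMI) from the Layer 0
# baseline. An assumption about how the region could be located, not the
# paper's definition.
def elbow_region(profile, width=6):
    """profile[i] holds the NMI between Layer 0 and layer i + 1; returns the
    (start, end) layer indices of the most-diverged window, inclusive."""
    averages = [sum(profile[i:i + width]) / width
                for i in range(len(profile) - width + 1)]
    start = min(range(len(averages)), key=averages.__getitem__)
    return start + 1, start + width   # +1 because profile[0] describes layer 1


# e.g. elbow_region(profile) would land around (20, 25) for Qwen and (13, 18)
# for Llama if the profiles match the ranges reported above.
```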

Beyond the big-picture insight, the team also digs one level deeper. When they probe the sub-components of the critical layers, they find a nuanced split: in some of these layers, both the attention mechanism and the MLP block contribute to the math-ready transformations, while in others the MLP carries most of the weight. In Qwen’s layer 23, removing either part noticeably hurts performance; in Llama’s layer 18, ablating the MLP alone accounts for nearly all of the damage done by removing the full layer. That pattern dovetails with a broader view in the literature that MLPs become important workhorses for higher-level transformations inside transformers, while attention heads can be highly specialized or redundant. It’s as if the math-reasoning gears are distributed, but their most critical gearwork happens in those few layers where the MLP and attention cooperate most deeply.
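
A sub-component ablation is a small variation on the earlier zero-ablation sketch. The attribute names self_attn and mlp below match the Llama and Qwen2 decoder layers in Hugging Face transformers, and the layer index in the usage comment reuses the Llama example above; the whole snippet is an illustration under those assumptions.

```python
# Sub-component ablation: a small variation on the earlier sketch. The attribute
# names self_attn and mlp match the Llama/Qwen2 decoder-layer implementations in
# Hugging Face transformers; the authors' tooling may differ.
import torch


def ablate_subcomponent(model, layer_idx, component="mlp"):
    """Zero only the attention weights or only the MLP weights of one decoder layer."""
    layer = model.model.layers[layer_idx]
    target = layer.mlp if component == "mlp" else layer.self_attn
    with torch.no_grad():
        for param in target.parameters():
            param.zero_()


# e.g. ablate_subcomponent(model, 18, component="mlp")   # Llama layer 18, MLP only
#      ablate_subcomponent(model, 18, component="attn")  # or attention only
# then re-run the math benchmark and compare the two accuracy drops.
```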

Why This Matters for Building and Auditing AI

So what does it mean for the future of AI development, safety, and education? For one, this study reframes where we should invest when we want to push a model toward better mathematical reasoning. If the core reasoning capabilities are carved into a small, persistent set of layers that survive across post-training methods, then chasing performance gains by indiscriminately scaling or retraining the entire network may be less effective than carefully strengthening those existing layers through targeted architectural or pre-training improvements. The insight invites a more disciplined strategy: map out whether your model’s math prowess sits in the same layers as others you care about, and then tailor your data, prompts, or training signals to reinforce those layers’ ability to form robust, generalizable relationships between tokens of different families.

The implications extend to evaluation and safety as well. If reasoning hinges on a stable spine of layers, one could envision diagnostic tools that test a model’s reliance on those layers, helping detect when a model’s reasoning might break down on adversarial inputs or out-of-distribution problems. It also raises practical questions about transfer: should we expect improvements in math reasoning only if we improve the pre-training phase that forges these layers, or can post-training methods tune the surrounding circuitry to support more reliable execution without reconfiguring the core gears? The answer, at least in this study, leans toward the former: post-training refines behavior, but it does not rewrite the foundational layer-importance map that math reasoning depends on.

There’s a cautionary note, too. The analysis rests on a particular experimental methodology—zero-ablation on two families of models and a specific pair of mathematical benchmarks. The authors are clear that their NMI angle is exploratory and that broader generalization awaits future work. Still, the convergence of results across model families and post-training variants is compelling enough to warrant a shift in how researchers think about the interior of the transformer as a cognitive landscape, not just a computational factory. If the math-reasoning spine is real and persistent, then the next generation of AI could become more trustworthy at solving multi-step problems, as long as we respect and strengthen the architecture that already encodes it.

Where the study came from matters, too. The work comes from the Department of Computer Science at New York University Abu Dhabi, with Aadim Nepal as first author and Keith Ross listed as corresponding author. The team’s framing is anchored in a larger movement to understand the inner machinery of large language models—an ecosystem in which interpretability and mechanistic understanding increasingly matter as much as raw accuracy. Their focus on the layers forged during pre-training, together with the observation that post-training methods do not recast the layer-importance map for math, offers a clean narrative: we may not need a revolution in every layer to get better math, but we do need a deeper respect for the layers that already matter most. If we treat those layers as the model’s cognitive core, then progress begins to look less like blind optimization and more like careful architectural work—strengthening the spine while teaching the limbs to work in harmony.