On Word Problems, AI Shines at Patterns, Not at Understanding

When your math homework pops up in a chat window, the AI sometimes feels almost like a helpful tutor. It can spin out long explanations, recite facts, and even offer step-by-step answers. But a new big-picture look at how these systems tackle word problems asks a tougher question: do they actually understand what the problem is about, or are they just really good at following patterns they learned during training?

The study behind this question comes from a collaboration anchored in universities and research labs across Europe and North America. The team—led by Anselm R. Strohmaier of the University of Education Freiburg, with coauthors from KU Leuven, Technical University of Munich, and Portland State University—surveyed how large language models (LLMs) handle mathematical word problems. Their finding isn’t a defeat for AI in education; it’s a sober map of where these tools shine and where they stumble when real-world sense-making matters. In short: LLMs are superb pattern solvers, but they don’t necessarily “get” the world the problems describe. This distinction matters if we want AI to help students learn to reason, not just to spit out numbers.

How humans solve word problems—and why that matters for AI

Math word problems pretend to be about the real world, and the way humans tackle them is a little like reading a short scene and then building a model of it in your mind. Students draw on a “situation model”—a mental picture of who is doing what to whom in a context—before translating that scene into a mathematical setup. The goal isn’t just to crunch numbers; it’s to understand what the numbers mean in the world described by the text, to decide what to measure, and to justify why a particular equation makes sense.

LLMs, by contrast, don’t build worldviews. They don’t glimpse a situation and reason through it the way people do. They’re statistical artisans: given a prompt, they predict the token most likely to come next, based on enormous amounts of training data. Computer scientists sometimes call this behavior “mathematical reasoning,” but the mathematics education community tends to reserve that term for genuine justification, argument, and modeling. The paper’s authors argue we should be precise: for LLMs, word problems are just another kind of token sequence to predict, not a scenario to model and argue about.
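To make that distinction concrete, here is a deliberately toy sketch of what “predicting the most likely next token” means. The lookup table is invented for illustration; real models estimate such distributions with neural networks trained on vast text corpora rather than storing them explicitly.

```python
# Toy illustration of next-token prediction; the probabilities are hand-written
# here, whereas a real LLM learns them from enormous amounts of training text.
toy_model = {
    ("Anna", "has", "3"): {"apples": 0.7, "euros": 0.2, "ideas": 0.1},
    ("has", "3", "apples"): {"and": 0.5, ".": 0.3, "left": 0.2},
}

def next_token(context, model):
    """Return the most probable continuation of the last three tokens, if known."""
    distribution = model.get(tuple(context[-3:]), {})
    return max(distribution, key=distribution.get) if distribution else None

print(next_token(["Anna", "has", "3"], toy_model))  # -> "apples"
```

The point of the toy is only this: nothing in that table knows what an apple is. The model continues the text in the statistically likeliest way, which is exactly the behavior the authors distinguish from modelling a situation.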

That difference matters because a problem’s difficulty isn’t just about arithmetic. It’s about whether you have to question the world described, decide what counts as relevant information, and choose the right mathematical lens. When the problem is straightforward—a standard “s-problem” that can be solved in a few arithmetic steps—the AI can mimic the human process well enough to produce a credible result. But when a problem requires sense-making—what mathematics educators call the modelling cycle—the AI’s edge fades. It may produce an answer that looks plausible, yet one that often isn’t grounded in a coherent, real-world interpretation.

A landscape of problems: what counts as a real challenge for AI

The authors divide word problems into two broad families. The first is s-problems, or standard problems, where you can reach a correct numerical answer by applying the given numbers through a straightforward sequence of arithmetic steps. The second family is p-problems, short for problematical problems, where you must consider the context, check for realism, and sometimes even decide that a question can’t be answered with the information provided. Real-world sense-making—checking whether a scenario is possible or whether data are compatible with the situation—is the hinge on which p-problems turn.

To study AI on these problems, the researchers pulled together a kind of fault line map of problem sets that have been used to test LLMs. Some of the oldest beds of rock in this landscape include MAWPS, ASDiv, and SVAMP, collections designed to probe pattern matching and basic reasoning in math word problems. Then came GSM8k, a widely used benchmark that deliberately forced models to step beyond single calculations and show multi-step reasoning. The most recent wave of corpora adds linguistic variety and more delicate real-world contexts—partly to test whether AI can cope when the surface looks familiar but the underlying world behaves differently.

Crucially, most of these corpora have a bias toward s-problems. They were built in a moment when people weren’t sure an AI could tackle complex word problems at all, so the design favored problems that could be solved by templates, patterns, or simple sequences. That means the current AI “challenge set” often tests whether a model can follow a recipe rather than whether it can ground a solution in the messy realities humans must navigate when modelling a real situation.

How the testing happened: five models, hundreds of problems

To push past the scattershot snapshots of earlier work, the team ran a careful, apples-to-apples evaluation. They picked five contemporary OpenAI models—ranging from GPT-3.5 Turbo to GPT-5, including GPT-4-family variants and models that emphasize different reasoning strategies. They ran each model on 287 word problems drawn from four sources: GSM8k and SVAMP (two widely used AI benchmarks), a mathematics-education corpus that includes modelling-style tasks, and a classical set of p-problems drawn from longstanding work in the field. The problems varied in type: some could be solved by a straightforward sequence of arithmetic steps; others demanded contextual reasoning; some presented odd or even nonsensical contexts that should make a sensible solver pause and question the problem statement.

The researchers didn’t rely on the models’ own self-reported processes. They gave every model the same prompt for each problem and then graded the outputs by hand, classifying each answer into five categories: wrong, solved, noticed, addressed, and declined. In other words, they looked not just at whether the answer was numerically correct, but at whether the model showed awareness of potential context issues and whether it adapted its approach when the problem’s world didn’t quite fit the data.
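The setup described—one fixed prompt per problem, raw outputs collected for later hand grading—could be reproduced with a loop roughly like the sketch below. The model names, prompt wording, and record format are illustrative assumptions, not the authors’ actual materials, and the five labels are applied afterwards by human raters, not by code.

```python
# Minimal sketch of the kind of evaluation loop the paper describes.
# Model names, prompt wording, and data layout are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

MODELS = ["gpt-3.5-turbo", "gpt-4o"]  # placeholders, not the study's exact list
PROMPT = "Solve the following word problem. Show your steps.\n\n{problem}"

# The five hand-grading labels used in the study; human raters assign them
# after reading the transcripts, so "label" starts out empty here.
LABELS = ("wrong", "solved", "noticed", "addressed", "declined")

def collect_responses(problems: list[str]) -> list[dict]:
    """Query every model on every problem with the same prompt and keep the raw output."""
    rows = []
    for model in MODELS:
        for problem in problems:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": PROMPT.format(problem=problem)}],
            )
            rows.append({
                "model": model,
                "problem": problem,
                "answer": response.choices[0].message.content,
                "label": None,  # later filled in by a human rater with one of LABELS
            })
    return rows
```

The design choice worth noticing is that nothing in the loop judges correctness: the machine only produces transcripts, and the sense-making judgment—did the model notice, address, or decline a problematic problem?—stays with human readers.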

Two things stand out from the results. First, on s-problems—the ones that read like a recipe with numbers—the models did astonishingly well. In modern models, correctness was near-perfect across most problem sets. One model even achieved a perfect score on all PISA-derived problems when those items were presented in a slightly different context. Second, and more sobering, is what happened on p-problems. Here performance diverged: the models’ ability to give an acceptable answer depended heavily on the problem type. Contextual problems could sometimes be solved sensibly if the model tried to reason about the real-world implications. But weird problems, nonsensical contexts, or questions with missing information frequently tripped the AI up—sometimes producing answers that felt precise but were logically inconsistent or contextually wrong. The newer, more capable models reduced some of these errors, but they did not eliminate the fundamental mismatch between pattern-based generation and world-grounded reasoning.

What this means for classrooms and learning

So where does this leave educators and students who want to use AI as a learning ally in mathematics? The core takeaway is not that AI is useless in schools; it’s that AI’s strength lies in pattern recognition, not in genuine sense-making about real-world contexts. In other words, the AI can imitate the surface of problem solving with impressive fluency, but it does not necessarily cultivate the deeper habits of modelling and justification that teachers prize in mathematical thinking.

This distinction matters for two big reasons. First, if students are guided by AI to produce quick numerical answers without asking whether the context makes sense, they may miss opportunities to develop critical thinking about when a problem is well-posed and when information is missing or misleading. Second, if teachers lean on AI tools that present convincing but contextually flawed solutions as credible, students may mistake confident rhetoric for real understanding. That isn’t a failure of the student; it’s a misalignment between the tool’s design and the educational aim of sense-making and modelling.

To address this gap, the authors argue for a new generation of benchmarks that actually test modelling, reflection, and robustness. They also call for classroom practices that foreground sense-making as a central objective—using AI not as a black-box solver but as a collaborator that invites critique and justification. In practice, that could mean pairing AI-generated steps with teacher-guided prompts that require students to justify why a particular modelling path makes sense in a real-world scenario, or using AI to surface multiple plausible modelling approaches and then debating their validity in context.

Rethinking the role of AI in math education

One vivid way to think about these findings is to imagine AI as a high-powered search engine for patterns rather than a teacher’s thinking partner. It can scan thousands of possible arithmetic paths quickly, it can restructure a problem’s surface wording to make it easier to see, and it can show you how a solution could unfold step by step. But grounding those steps in a meaningful, real-world story—checking whether a proposed model matches what would actually happen in the world—is something humans still do better, and teaching students to do it remains essential.

That doesn’t mean you should discard AI in math class. It means you should design use cases that leverage what AI does well—rapid pattern recall, reliability on well-posed, context-free steps, and helpful feedback on arithmetic reasoning—while keeping space for students to exercise sense-making, model construction, and justification. The study’s authors emphasize that if we want AI to support genuine mathematical learning, we need to train and evaluate models in ways that privilege sense-making, not just numerical accuracy.

There’s also a broader message for researchers: if the benchmarks remain weighted toward decontextualized problems, AI will optimize for that narrow corner of mathematical thinking. Real-world problem solving—whether in science, engineering, finance, or everyday life—almost never comes in a perfectly posed, fully contextualized package. So we should diversify the problem sets we use to probe AI—and the educational tasks we design to accompany it—so that the tools we build and teach with reflect the messiness and nuance of real thinking.

Looking ahead: a future of thoughtful, risk-aware AI tutors

Ultimately, this scoping review doesn’t declare AI an enemy of learning. It reframes what “solving word problems” means in an age when machines are fast at counting but not always wise about meaning. The work suggests a practical path forward: keep AI as a powerful partner for arithmetic and procedural practice, but pair it with human guidance that centers sense-making and modelling. In classrooms, that could look like AI-generated practice with built-in prompts that explicitly check for realism, or like a tool that helps students test their own models by asking them to predict what would happen if a number changed or if the scenario shifted in a subtle way.
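One concrete way to build such a realism check into AI-generated practice is to wrap every task in a prompt that forces the model—and the student reading its output—to examine the situation before any arithmetic happens. The template below is a hypothetical sketch for illustration, not a prompt taken from the study; the “captain’s age” item is a classic nonsense problem from the math-education literature.

```python
# Hypothetical prompt wrapper that puts sense-making before calculation.
# The wording is an illustrative assumption, not taken from the paper.
REALISM_CHECK_TEMPLATE = """Before solving, answer these questions:
1. Is the situation described realistic? Why or why not?
2. Is there enough information to answer the question? If not, say what is missing.
3. Only if steps 1 and 2 allow it, set up the mathematics, solve it, and explain
   why your model of the situation fits the story.

Problem: {problem}"""

def realism_prompt(problem: str) -> str:
    """Wrap a word problem in a prompt that demands a realism check before solving."""
    return REALISM_CHECK_TEMPLATE.format(problem=problem)

print(realism_prompt("A ship carries 26 sheep and 10 goats. How old is the captain?"))
```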

For researchers, the message is equally clear. Build and test with problem sets that require students and AI to justify, critique, and reason about real-world constraints. Embrace evaluation methods that look beyond the final answer to the quality of the reasoning process. And remember that language models, no matter how polished, reflect the patterns they’ve seen—patterns that may or may not align with sound mathematical sense-making. The goal is not to reinvent education around AI, but to align AI’s strengths with human thinking so that students come away with deeper mathematical understanding, not just polished outputs.

The study was conducted by researchers from the University of Education Freiburg, KU Leuven, Technical University of Munich, and Portland State University, with Anselm R. Strohmaier as the lead author. It sketches a landscape many teachers already know well: machines can imitate the rhythm of problem-solving, but the heart of sense-making—seeing the world clearly, modelling it, and arguing why a solution fits that world—belongs to human thinkers.