Ask a language model to pick the right word in a sentence and you’ll often get a clean, confident answer. But confidence isn’t proof of understanding. The field has long used Winograd-style puzzles to probe whether AI actually uses common sense or just clever statistical tricks, and the narrative around these tests has grown more tangled as models get bigger and training data gets noisier. A team from the University of Antwerp’s CLiPS research center—Ine Gevers, Victor De Marez, Luna De Bruyne, and Walter Daelemans—takes a hard look at what these benchmarks really reveal. Their study starts from a simple question: what happens when you twist the sentences just a little, and does that reveal whether the models truly reason or merely memorize patterns?
Their answer is provocative. They built WinoWhat, a parallel paraphrase of the WinoGrande validation set. In WinoWhat, each original sentence is rewritten so that the blank is at the end of the sentence, and the task remains the same in spirit. The result is a cleaner lens for testing coreference and bridging, the kinds of reasoning that require linking world knowledge to the text. Then they take a close look at five common sense knowledge categories—physical, social, numerical, spatial, and temporal—to see which kinds of knowledge trip up AI and whether some categories are easier than others. The headline takeaway isn’t just that paraphrasing hurts performance; it’s that the entire picture of what LLMs can do—how much they truly reason versus how much they memorize—needs rethinking.
What makes this work especially compelling is what it asks us to reassess: the trust we place in benchmark scores as a stand‑in for understanding. If a model chugs along when a sentence is worded one way but stumbles when it’s paraphrased, what we’re really seeing is sensitivity to surface form, not robustness of reasoning. And if the effect holds across model families from 2‑billion‑parameter minis to 70‑billion‑parameter behemoths, it’s a strong signal that the field needs benchmarks designed to test genuine cognitive ability, not artifacts of language data or evaluation quirks.
Paraphrase as a stress test for reasoning
The Winograd Schema Challenge (WSC) originally posed a deceptively small puzzle: a pronoun in a two‑clause sentence must be linked to the correct noun, but the answer hinges on world knowledge and how the sentence is structured. WinoGrande expanded this idea into a much larger adversarial benchmark, crafted to resist easy shortcuts or trivial correlations. The Antwerp team notices a core tension: if models perform well, is that because they’re really using common sense, or because they learned to game the data? Their approach is to flip the script just enough to reveal the difference.
They created WinoWhat by paraphrasing every instance in the WinoGrande validation set, then moving the target option to the end of the sentence. This slight reformatting makes the task friendlier to decoder‑only models, but more importantly, it tests whether a model can solve the same underlying puzzle when the surface form of the sentence changes. The authors also classified each instance by five common sense knowledge categories: physical, social, numerical, spatial, and temporal. The goal wasn’t to cherry‑pick a single slippery category; it was to map a landscape of knowledge types and see which kinds of reasoning models handle well—and which ones trip them up.
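To make the reformatting concrete, here is a minimal sketch of how a decoder‑only model can be scored when the option sits at the very end of the sentence: the model assigns a likelihood to each completed sentence, and the more probable completion wins. The library, model name, prompt wording, and example item below are illustrative assumptions, not the authors’ exact setup.

```python
# Minimal sketch of end-of-sentence option scoring with a decoder-only model.
# Assumes the Hugging Face transformers library; the model name, prompt
# wording, and example item are illustrative, not the authors' exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any causal LM (Gemma, Llama, OPT, ...)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_logprob(text: str) -> float:
    """Sum of token log-probabilities the model assigns to the sentence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so each position predicts the next token.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

def pick_option(context: str, option_a: str, option_b: str) -> str:
    """Choose whichever option makes the completed sentence more likely."""
    score_a = sentence_logprob(context + " " + option_a)
    score_b = sentence_logprob(context + " " + option_b)
    return option_a if score_a >= score_b else option_b

# A WinoWhat-style item: the blank, and hence the option, comes last.
context = ("The trophy didn't fit in the suitcase because it was too large. "
           "The thing that was too large was the")
print(pick_option(context, "trophy", "suitcase"))
```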
In addition to the paraphrase, the team asked a serious meta‑question: does the model rely on memorized content from training data? To answer this, they created two small test suites designed to isolate memorization effects. One suite comes from the classic Winograd Schema Challenge, which is likely to have leaked into public training data; the other is the WinoGrande test itself, which has private labels and is less likely to have appeared in pretraining. By comparing performance across these sets, they sought to separate genuine reasoning from data leakage.
What the results reveal about generalization, not memorization
Across model families—Gemma 2, LLaMA 2, and OPT—and across sizes spanning from a few billion to several tens of billions of parameters, the pattern is striking. On the original WinoGrande validation set, larger models generally perform better, but the gains aren’t uniform across knowledge types. When the same sentences are paraphrased for WinoWhat, every model shows a drop in accuracy. The drop isn’t tiny or limited to a corner of the dataset; it shows up across the board, across model families, and across all five knowledge categories. In other words, the paraphrase undermines the models’ apparent coreference and bridging capabilities in a robust, system‑wide way.
Even more telling is the cross‑category result. There isn’t a single category that’s always easy or always hard. For some models, spatial knowledge is easiest; for others, physical or social knowledge takes fewer hits. But the key point is consistency: paraphrasing weakens performance across the spectrum, and no category proves itself resilient to surface rewordings. The researchers describe this as a challenge to the assumption that the WinoGrande task requires genuine reasoning. The data suggests that models may be exploiting artifacts or memorized patterns in the original wording, rather than consistently applying common sense when faced with a paraphrase.
The upshot is as sobering as it is elegant: if a benchmark’s value rests on a subtle interplay between language form and world knowledge, then we’re testing the model’s sensitivity to language, not its internal reasoning. And because this holds across multiple model families, the implication is broader than a single dataset. It’s a nudge to the AI community to calibrate its faith in benchmark scores with more robust tests that isolate genuine inference from surface associations.
Memorization is not the full story, but it’s part of the plot
The team didn’t stop at paraphrase. They pressed deeper into the memorization question with a careful data‑contamination analysis. First, they checked how many WinoGrande validation instances were likely to appear in the pretraining corpora of popular open models. They found that, for some datasets and models, a nontrivial fraction of the training material could resemble or contain the same sentences. But when they focused on the WinoGrande validation set, the signal of direct memorization was modest. In other words, the performance gap between WinoGrande and WinoWhat could not be explained away by data leakage alone.
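The contamination check itself can be pictured as an overlap search between benchmark sentences and pretraining text. The toy sketch below uses plain n‑gram overlap to flag suspicious items; the authors’ actual analysis is more involved, and the corpus sample and n‑gram length here are placeholder assumptions.

```python
# Toy n-gram overlap check for data contamination between a benchmark and a
# pretraining corpus sample. Illustrative only; the study's procedure differs.
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """All lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark: Iterable[str],
                       corpus_docs: Iterable[str],
                       n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    items = list(benchmark)
    flagged = sum(1 for item in items if ngrams(item, n) & corpus_grams)
    return flagged / max(len(items), 1)

# Hypothetical usage: validation sentences vs. a sample of pretraining text.
benchmark_items = ["The trophy didn't fit in the suitcase because the _ was too large."]
corpus_sample = ["... a slice of web text that may or may not quote benchmark items ..."]
print(f"flagged fraction: {contamination_rate(benchmark_items, corpus_sample):.2%}")
```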
To broaden the check, they turned to models with publicly known pretraining data. Two such families—Llama 1 and Pythia—allowed a more controlled look at memorization. The trend persisted: as model size grew, accuracy on the original WinoGrande tended to improve, but the drop when moving to WinoWhat remained. The implication isn’t that memorization is entirely absent; it’s that memorization cannot fully account for the observed gap. Paraphrase effects survive across training regimes and model architectures, suggesting that there are other, subtler forces at work—perhaps a mix of how models internalize cues during training, how they are fine‑tuned, or how evaluation metrics interact with task structure.
To triangulate further, the researchers returned to the two small test suites introduced earlier. One was drawn from the Winograd Schema Challenge itself, which is likely to sit within many training corpora; the other was the WinoGrande test set, which remains private and less likely to be memorized. In both cases, paraphrasing consistently reduced performance, and the advantage models showed on the original data dissolved when the surface form changed. Even a benchmark that had helped some models “beat the odds” could not keep its edge in the paraphrased world. The message comes in two parts: data contamination matters, but even pristine test data cannot explain away the gap if the underlying evaluation is susceptible to surface tricks.
What this means for how we test AI minds
The Antwerp study doesn’t merely critique a single benchmark; it offers a blueprint for a more honest, future‑proof approach to evaluating AI reasoning. First, it shows the value of paraphrase, not as a novelty, but as a stress test that cuts to the heart of whether a model truly encodes world knowledge in a way that transfers when sentences are recast. The WinoWhat corpus, which the authors have made publicly available, gives researchers a concrete resource for probing generalization beyond specific phrasings. The act of moving the answer word to the end of the sentence is a simple, clever constraint that clarifies what the model must do: rely on the context, not the tail end of a familiar phrase.
Second, the work’s common sense categorization—five broad knowledge types—offers a structured way to diagnose where models struggle. If a system consistently falters on temporal reasoning but stumbles less on physical intuition, we learn something actionable about what to improve, whether through targeted data augmentation, tailored prompting, or architectural tweaks. The authors are careful to note that categorization isn’t a silver bullet; they see it as an error‑analysis tool that helps separate signal from noise, and patterns from quirks.
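As a concrete illustration of that kind of error analysis, the sketch below tallies accuracy per knowledge category on the original items and their paraphrases and reports the drop. The record schema and sample rows are hypothetical, not the released data format.

```python
# Per-category error analysis: accuracy on original vs. paraphrased items and
# the resulting drop. Field names and sample records are hypothetical.
from collections import defaultdict

def per_category_drop(records):
    """records: dicts with 'category', 'correct_original', 'correct_paraphrase'."""
    totals = defaultdict(lambda: {"n": 0, "orig": 0, "para": 0})
    for r in records:
        bucket = totals[r["category"]]
        bucket["n"] += 1
        bucket["orig"] += int(r["correct_original"])
        bucket["para"] += int(r["correct_paraphrase"])
    report = {}
    for category, b in totals.items():
        acc_orig = b["orig"] / b["n"]
        acc_para = b["para"] / b["n"]
        report[category] = {"original": acc_orig,
                            "paraphrase": acc_para,
                            "drop": acc_orig - acc_para}
    return report

# Hypothetical records spanning two of the five knowledge categories.
records = [
    {"category": "spatial", "correct_original": True, "correct_paraphrase": False},
    {"category": "temporal", "correct_original": True, "correct_paraphrase": True},
]
for category, stats in per_category_drop(records).items():
    print(category, stats)
```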
Third, the paper confronts a painful truth about benchmark design: even widely used datasets can be leaky. Contamination and artifacts have haunted AI evaluation for years, and this work reinforces the idea that we need benchmarks whose sources and splits resist leakage, retraining, and RLHF‑driven tuning. The authors’ two small test suites illustrate a practical path forward: if a test remains robust when you pare down and reframe the data, you’ve got a stronger claim about the model’s cognitive abilities.
Finally, there is a broader cultural implication. As language models settle in as household assistants and professional tools, we crave tests that distinguish “nice‑to‑have” tricks from genuine understanding. The WinoGrande/WinoWhat line doesn’t pretend to hand us a final answer about whether AI possesses humanlike common sense; it does, however, arm the community with a sharper instrument for asking what our models are actually doing when they read and predict. It’s a reminder that progress in AI isn’t always a straight line upward in a single metric, but a more nuanced story of where models can generalize, where they can’t, and why that matters for real‑world use.
In their closing notes, the University of Antwerp team argues for continued work along two intertwined threads: building better diagnostics that go beyond surface form, and developing benchmarks that stress the robust transfer of knowledge across rephrasings and contexts. The path forward is not a single fix but a plural effort—better data practices, sharper evaluation metrics, and a deeper curiosity about how models internalize the rules of human reasoning. If there’s a single takeaway, it’s this: the question of whether AI reasoned its way through a Winograd puzzle is less about a clever single flaw and more about the broader design of how we test, measure, and trust machine intelligence.
As the authors put it in their study, the WinoWhat corpus and the accompanying analysis are a call to move beyond “performance on a benchmark” toward a more faithful understanding of how models reason. The goal isn’t to shatter the entire enterprise of AI evaluation, but to raise the bar: to demand tests that resist shortcuts, reveal true generalization, and illuminate the kinds of knowledge AI will need as it becomes more woven into daily life. That’s not just a technical challenge; it’s a human one—figuring out how to measure something as subtle as understanding when the surface of a sentence is rearranged, and the underpinnings of reasoning remain the same.
In the end, the study is a reminder that progress in AI isn’t a victory lap after solving a riddle. It’s a reckoning with the complexity of human common sense itself, and a clarion call to build the tools that can responsibly chart AI’s growing capabilities. The University of Antwerp team’s work—grounded in careful experimentation, honest reporting, and a clear wish to sharpen our common sense about machines—pulls us toward a wiser, more cautious optimism about what our software can and cannot know.
About the study
The work comes from the CLiPS Research Center at the University of Antwerp, with Ine Gevers as the lead author and collaborators including Victor De Marez, Luna De Bruyne, and Walter Daelemans. The researchers introduced WinoWhat, a paraphrased parallel version of the WinoGrande validation set, and they analyzed performance across five common sense knowledge categories: physical, social, numerical, spatial, and temporal. They examined multiple open‑source model families—Gemma 2, LLaMA 2, and OPT—and conducted targeted tests to separate the effects of data memorization from genuine generalization. Their findings challenge the assumption that high benchmark scores on WinoGrande necessarily reflect robust reasoning, and they offer a concrete path for more reliable evaluation in the era of ever larger language models.