Your Next Prompt Could Recall a Fact or Copy a Fabrication

The inner life of a large language model is a busy place. Learned facts mingle with fragments of the prompt at hand and the stubborn pull of repetition. A team from the University of Amsterdam set out to peer into that hidden workshop and understand how models decide what to say when a prompt asks for a fact versus a counterfactual twist. Their work is a careful reproduction and extension of a line of studies that treats language models less like black boxes and more like crowded libraries where shelves contend for which memory gets checked out first. The researchers behind this study are Asen Dotsinski, Udit Thakur, Marko Ivanov, Mohammad Hafeez Khan, and Maria Heuss, all at the University of Amsterdam, and they build on the ideas first put forward by Ortu and colleagues in 2024.

In a sentence: these researchers are asking not just what a model knows, but how it structures what it knows inside its own architecture. They look at the tug between factual recall and counterfactual repetition and show that, in practice, the model seems to juggle two competing memories at once. The result is not a single straight line from question to answer, but a small army of components inside the network that vie for influence as the answer emerges. The study uses a careful mix of replication, extension, and a pinch of domain-specific testing to show where these mechanisms live, how robust they are across model families, and how sensitive they are to the exact shape of a prompt.

A hidden tug-of-war between memory and copying

The central idea is deceptively simple to state: factual knowledge and counterfactual copying inside a language model are not poured into one single destination. They are distributed across layers and token positions, and they compete for the final say. The Amsterdam team reproduces three core claims from the earlier work: first, that factual information tends to show up at certain positions in the sentence being formed, while counterfactual information tends to appear at other positions; second, that attention blocks, the parts of the network that decide which words to weigh more heavily, play the dominant role in this competition; and third, that only a small group of attention heads end up steering the final outcome, suggesting a phenomenon some interpretability researchers call head specialization.
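
To make that competition concrete, here is a minimal sketch of the kind of per-head attribution such studies lean on, written with the open-source TransformerLens library and GPT-2 small. The prompt, the candidate tokens, and the shortcut of ignoring the final layer-norm scale are our own illustrative choices, not the paper's exact pipeline: the score for each head is simply how hard that head pushes the final prediction toward the factual token and away from the counterfactual one.

```python
# pip install transformer-lens torch
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

# A hypothetical counterfactual prompt: the context asserts a false fact, then
# restarts the same statement. Which candidate wins at the final position?
prompt = "The Colosseum is in the city of Paris. The Colosseum is in the city of"
fact_id = model.to_single_token(" Rome")   # factual-recall candidate
copy_id = model.to_single_token(" Paris")  # counterfactual-copy candidate

tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)

# Direction in residual space whose dot product gives the fact-vs-copy logit difference.
direction = model.W_U[:, fact_id] - model.W_U[:, copy_id]  # [d_model]

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    z = cache["z", layer][0, -1]                     # [n_heads, d_head] at the last token
    for head in range(model.cfg.n_heads):
        head_out = z[head] @ model.W_O[layer, head]  # this head's write to the residual stream
        # Ignoring the final layer-norm scale: a common approximation for attribution.
        scores[layer, head] = head_out @ direction

# Positive scores push the prediction toward the fact, negative toward copying the prompt.
top = torch.topk(scores.abs().flatten(), k=5).indices
for idx in top:
    l, h = divmod(idx.item(), model.cfg.n_heads)
    print(f"L{l}H{h}: {scores[l, h].item():+.3f}")
```

If only a handful of heads carry large scores of opposite sign, that is the "small army of components" picture in miniature.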

In their replication, the researchers run experiments on GPT-2 small and Pythia 6.9B, mirroring the original setup, and they extend the test to Llama 3.1 8B to see if these patterns hold when the models get bigger and more capable. What they find is a mix of robustness and nuance. Across these models, the basic pattern holds: there is a tilt toward separate encoding of factual versus counterfactual information, and attention blocks are the primary engines behind the competition. Yet the exact locus of facts within the prompt is more complicated than a simple subject-token rule might suggest. The last token of the first relation and the tokens that define the relation itself carry the strongest signals for the factual outcome, while the counterfactual path is nudged by other positions in the sequence. The discovery is a reminder that the truth inside a model is not pinned to a single switch but emerges from a chorus of interacting components.

What this means for trust, safety, and design

Interpretability matters because we want models we can trust and steer without turning every interaction into a guessing game. If a model’s final answer can hinge on a small subset of attention heads, as these studies suggest, then a researcher or engineer might imagine ways to make the model more honest by either reinforcing factual recall or dampening opportunistic copying of the prompt. The Amsterdam team’s work puts a spotlight on two potential levers: the structure of the prompt and the architecture of the network itself.

One clear implication is that prompt design can dramatically shift a model’s behavior. In one set of experiments, reformulating a prompt into a question format (QnA style) changes the balance between factual recall and counterfactual copying. For GPT-2 small, this reformulation reduces the model’s tendency to repeat the counterfactual token, nudging it back toward the fact. For larger models like Pythia 6.9B and especially Llama 3.1 8B, the effect is more nuanced, but the trend persists: the way we phrase the prompt changes which mechanism gets to the finish line. This matters for everything from chatbots that must acknowledge uncertainty to retrieval-augmented generation systems that should prioritize sourced facts over blind repetition.
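
A rough way to see this effect for yourself is to compare the probability the model assigns to the factual and counterfactual tokens under two phrasings of the same prompt. The sketch below uses Hugging Face Transformers with GPT-2 small; the prompt templates and candidate tokens are illustrative assumptions rather than the study's exact formats.

```python
# pip install transformers torch
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def next_token_probs(prompt, candidates):
    """Probability mass the model puts on each candidate as the very next token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]       # logits for the next-token position
    probs = torch.softmax(logits, dim=-1)
    return {c: probs[tok.encode(c)[0]].item() for c in candidates}

counterfactual_context = "The Colosseum is in the city of Paris."
statement = f"{counterfactual_context} The Colosseum is in the city of"
qna = f"{counterfactual_context} Q: In which city is the Colosseum? A: The city of"

for name, prompt in [("statement", statement), ("QnA", qna)]:
    probs = next_token_probs(prompt, [" Rome", " Paris"])
    print(name, {k.strip(): round(v, 4) for k, v in probs.items()})
```

On a toy example like this, the absolute numbers matter less than the direction of the shift between the two phrasings.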

Another practical consequence is about domain bias. When the input topics cluster around domains like autos or electronics, the model’s factual tokens often align with what the prompt reveals about product-brand structures. In less represented domains, the balance tilts toward counterfactuals. The lesson is not that all models memorize the same facts in the same spots, but that the surrounding data distribution—what topics a prompt tends to dwell on—shapes how the model organizes its memory and, crucially, how it might be steered by an adversary or a misinformed user through carefully crafted prompts.

There are deeper takeaways for builders too. If one could reliably identify the handful of attention heads that dominate the mechanism competition, it might be possible to tune models to be less easily swayed by the prompt or, conversely, to amplify a “copy” mechanism when desired, as in certain retrieval tasks. The catch, as the Amsterdam team notes, is that those specialized heads do not appear in all models or domains. In the larger Llama family they tested, the same heads did not show the same dominance, suggesting that a one-size-fits-all interpretability knob may be out of reach. This cautions against overgeneralizing simple interpretability stories and argues for more nuanced, model- and domain-aware tools.
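
As a thought experiment, intervening on a single head is straightforward to sketch, again assuming TransformerLens and GPT-2 small. The layer and head indices below are placeholders rather than heads the paper identifies; setting the scale to zero ablates the head, while a value above one amplifies whatever mechanism it carries.

```python
# pip install transformer-lens torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small
prompt = "The Colosseum is in the city of Paris. The Colosseum is in the city of"
fact_id = model.to_single_token(" Rome")
copy_id = model.to_single_token(" Paris")

LAYER, HEAD, SCALE = 10, 7, 0.0   # placeholder head; 0.0 ablates it, >1.0 amplifies it

def scale_head(z, hook):
    # z has shape [batch, pos, n_heads, d_head]; damp or boost one head's output.
    z[:, :, HEAD, :] *= SCALE
    return z

def fact_minus_copy(logits):
    # How strongly the model prefers the factual token over the copied one.
    return (logits[0, -1, fact_id] - logits[0, -1, copy_id]).item()

clean = model(prompt)
patched = model.run_with_hooks(
    prompt, fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", scale_head)]
)

print("fact-minus-copy logit diff, clean:  ", round(fact_minus_copy(clean), 3))
print("fact-minus-copy logit diff, patched:", round(fact_minus_copy(patched), 3))
```

The paper's caution applies directly here: which heads matter, if any, varies by model and domain, so a fixed head index like this would not be expected to transfer.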

What surprised the researchers and what comes next

The paper’s extensions brought two big surprises. First, the original claims hold up across a wider family of models, which is encouraging, but the picture is not uniform. While GPT-2 small and Pythia 6.9B show the classic pattern, with a few heads leading the charge and attention blocks driving the competition, Llama 3.1 8B behaves differently. In Llama, early layers show almost no signal, with a burst of activity only in the final layer. The logit lens method, which researchers use to peek at what the network is predicting at different layers, becomes less reliable on very large models. In other words, a tool that helps you read the map may not always read it correctly when the map is huge and the terrain is unfamiliar.
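
For readers unfamiliar with the technique, the logit lens simply decodes the model's intermediate residual states through its own final layer norm and unembedding matrix, layer by layer. The sketch below shows the standard recipe on GPT-2 small with Hugging Face Transformers; the prompt is a hypothetical counterfactual example, not drawn from the study's data.

```python
# pip install transformers torch
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Colosseum is in the city of Paris. The Colosseum is in the city of"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# hidden_states[0] is the embedding output; each later entry follows one transformer block.
for layer, hidden in enumerate(out.hidden_states[1:], start=1):
    resid = hidden[0, -1]                      # residual stream at the last position
    # Decode through the final layer norm and the unembedding ("logit lens").
    # Note: HF applies ln_f to the last hidden state already, so the final row is approximate.
    logits = model.lm_head(model.transformer.ln_f(resid))
    print(f"layer {layer:2d}: {tok.decode(logits.argmax().item())!r}")
```

Watching where the top prediction snaps from a generic continuation to a specific city is the intuition behind "peeking" at intermediate layers, and it is this reading that the authors found harder to trust on the largest model they studied.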

Second, the domain sensitivity of the findings is striking. When prompts come from domains that naturally embed the factual token in the subject, the supposed importance of the subject position becomes muddier. In arts or law, where questions may revolve around language, citizenship, or non-product facts, the same heads that mattered in autos and electronics don’t wield the same influence. This is a humbling reminder that interpretability is not a universal dial you can turn; it’s a mosaic shaped by data, task, and architectural choices.

Beyond these surprises, the study presses toward a practical question: can we design models that are both powerful and principled in how they use facts? The authors of the original work argue for a cautious optimism: the mechanism competition framework offers a lens to study and eventually steer model behavior. The Amsterdam team echoes that sentiment, while also calling for more robust methods. A larger part of interpretability may lie in circuit-level analyses, alternative lenses tuned to different scales, and richer datasets that avoid the biases that creep into prompt banks. The goal is not to lock models into a single, easily explainable mode, but to cultivate a vocabulary for where truth hides in a network and how deliberate prompts or training choices can shift the balance toward reliable recall over copy.

A more nuanced map of a model’s memory

In the end, what makes this work compelling is not a single revelation about a mysterious algorithm, but a shift in how we picture a model’s memory. The model is not a single mind with a clearly labeled memory bank. It is a sprawling, layered ecosystem where different tokens and positions compete for influence, where attention heads act like specialized routes through a city, and where the next word can be shaped by the lingering pressure of the words that came before. The Amsterdam study reframes the narrative from a simple question of whether a model knows a fact to a richer story of how a fact travels through the network, how a counterfactual slips into place, and how the architecture and the prompt together choreograph the final act.

For readers who care about the futures we’re building with AI, the takeaway is this: the path from prompt to answer is not a straight line but a dance. The more we study that dance, the better we can design systems that respect truth, resist manipulation, and remain understandable even as they grow more capable. The research from the University of Amsterdam doesn’t close the book on interpretability, but it adds a chapter that is precise about where the internal arguments happen and how different prompts tilt the scales. It’s a reminder that progress in AI is as much about asking better questions as about building bigger models.