When AI Chooses Our Priors, What Is Uncertainty?

Bayesian statistics often feels like a delicate negotiation between what we already know and what the data will reveal. The priors are the first step in that conversation, the beliefs you bring to a model before you even glimpse the numbers. When those priors are well-chosen, the data can sing in tune with them; when they’re off, the whole analysis can stumble. The challenge isn’t just choosing priors; it’s choosing them in a way that reflects real-world knowledge without distorting what the data has to tell us.

New work from Simula Research Laboratory and OsloMet in Oslo, Norway, tests a provocative idea: could large language models—those vast digital libraries of text—help craft informative priors that are principled, transparent, and grounded in domain knowledge? The paper doesn’t claim that LLMs will replace careful literature reviews. Instead, it treats them as a knowledge reservoir to be consulted within a Bayesian framework, then tested against real data to see whether their suggestions actually make the analysis more trustworthy. The study is led by Michael A. Riegler and colleagues, with key authors including Kristoffer Herland Hellton, Vajira Thambawita, and Hugo L. Hammer. It’s a tidy reminder that in science, the machine can be a collaborator, not a substitute for judgment.

The Quiet Revolution of Priors

The central idea is bold but surprisingly old-fashioned in spirit: use the wisdom embedded in vast text corpora to inform those priors that are stubbornly difficult to translate from domain knowledge into numbers. The researchers don’t want a one-off number from a model; they want a structured elicitation that makes the model explain, justify, and reflect on its choices. They require the LLM to propose multiple prior sets—one that’s moderately informative and one that’s weakly informative—then to attach a confidence score to each suggested set. In short, the model must narrate its own reasoning, reveal competing considerations, and acknowledge uncertainty. It’s a pragmatic turn: if you’re going to rely on machine-generated prior knowledge, you should first force the machine to justify it and to acknowledge what it doesn’t know.
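To make the shape of such an elicitation concrete, here is a minimal sketch of what a structured response might look like once parsed into Python. The coefficient names, numbers, and confidence weights are illustrative assumptions on my part, not values from the paper.

```python
# Hypothetical parsed output of a structured prior-elicitation prompt.
# Each prior set gives a Normal(mean, sd) prior per coefficient, a short
# rationale, and a confidence weight the model assigns to that set.
elicited_priors = {
    "moderately_informative": {
        "priors": {
            "age":      {"mean": 0.05, "sd": 0.02},  # older age raises risk
            "sex_male": {"mean": 0.70, "sd": 0.30},  # male sex raises risk
        },
        "rationale": "Directions and rough magnitudes drawn from epidemiological literature.",
        "confidence_weight": 0.6,
    },
    "weakly_informative": {
        "priors": {
            "age":      {"mean": 0.05, "sd": 0.10},
            "sex_male": {"mean": 0.70, "sd": 1.00},
        },
        "rationale": "Same directions, but widths broadened so the data can dominate.",
        "confidence_weight": 0.4,
    },
}
```

The point of the structure is that direction, magnitude, width, and self-assessed confidence are all explicit, and therefore auditable.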

Two things help make this a story about more than just numbers. First, the study uses three well-known LLMs—Claude Opus, Gemini 2.5 Pro, and ChatGPT o4-mini—to test whether the model’s knowledge aligns with reality across different domains. Second, the researchers don’t stop at “does it look right on paper?” They test the priors on two real datasets: a heart-disease risk study and a cement-concrete strength study. The goal isn’t to prove that LLMs always beat traditional priors, but to investigate whether these models can reliably point in the right direction and how well they calibrate the strength of their beliefs. When the priors point in the correct direction but are overconfident about magnitude, that’s a different kind of mistake than if they’re merely wishy-washy. The nuance matters because it affects how the data and priors dance together in the posterior distribution.

Beyond the technicalities, the work nudges us toward a bigger question about knowledge in the age of AI. If machines can summarize decades of literature and offer priors, what does that do to the epistemology of statistics? Does this push researchers to become more discriminating about how confident the priors should be? The paper’s answer is nuanced and candid: the most valuable part of the approach isn’t the number the model hands you, but the meta-work—the moment when the LLM reflects on its sources, its assumptions, and how strongly it feels about each choice. In other words, the method is as much about understanding bias as about crystallizing belief.

How prompts choreograph thinking

At the heart of the study is a prompt that aims to unlock interpretable knowledge, not a single shiny statistic. The prompt asks the LLM to map its general knowledge to a concrete Bayesian model, to generate two prior sets for the model’s hyperparameters, and to justify, step by step, the chosen means and variances. It goes further: the model must explain how it weighed domain knowledge against the observed data and must assign a relative weight to each prior set. This is the kind of prompt that turns a model from a clever word predictor into a co-thinker that demonstrates its reasoning and its limits.
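A prompt along these lines might be assembled as follows. The wording below is a paraphrase for illustration, not the paper’s verbatim prompt, and the placeholder names are assumptions.

```python
# Illustrative prompt template for structured prior elicitation.
# The braced placeholders would be filled with the model description,
# predictor definitions, and scientific context for a given study.
PRIOR_ELICITATION_PROMPT = """
You are assisting with a Bayesian {model_type} analysis of {outcome}.
Predictors and their scientific context:
{predictor_descriptions}

Tasks:
1. Map your general domain knowledge to this specific model.
2. Propose TWO sets of Normal priors for the regression coefficients:
   (a) a moderately informative set, (b) a weakly informative set.
3. For each coefficient, justify the chosen mean and variance step by step.
4. Explain how you weighed domain knowledge against what this particular
   dataset can plausibly support.
5. Assign a relative confidence weight to each prior set, and state
   explicitly what you are uncertain about.
Return the priors in a structured, machine-readable format.
"""
```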

Practically, the researchers test two classic statistical tasks: a logistic regression for heart disease risk and a linear regression for concrete strength. They feed the LLM a careful description of the predictors, the outcome, and the scientific context behind each variable. The prompts lean on established bodies of knowledge—Framingham, MONICA, and meta-analyses—as anchors so the model can ground its priors in known relationships (for example, that age and male sex are risk-enhancing for heart disease; that more water tends to weaken concrete). But the model is not asked to copy textbook lore uncritically. It must translate that lore into priors whose center and width reflect what the data, in this specific study, can support.
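In generic form, the two analyses are standard regressions with Normal priors on the coefficients; the elicited hyperparameters are the prior means and variances. This is a sketch of the standard setup, and the paper’s exact parameterization may differ.

```latex
% Logistic regression for heart-disease risk (binary outcome y_i):
\Pr(y_i = 1) = \mathrm{logit}^{-1}\!\Big(\beta_0 + \textstyle\sum_{j=1}^{6} \beta_j x_{ij}\Big)

% Linear regression for concrete compressive strength s_i:
s_i = \beta_0 + \textstyle\sum_{j} \beta_j x_{ij} + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

% LLM-elicited priors on the coefficients:
\beta_j \sim \mathcal{N}(\mu_j, \tau_j^2)
```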

The upshot is a clear pattern: the LLMs do a credible job identifying directionality, but the magnitude and the exact confidence levels vary widely. The result is not a win-at-all-costs moment; it’s a careful portrait of where machine-generated priors can help, and where they need restraint. And the authors don’t pretend the priors will perfectly match the data out of the box. They stress calibration and evaluation as indispensable steps, not optional add-ons.

What the experiments reveal

In the heart-disease analysis, the Cleveland dataset—one of the standard testbeds for CAD risk factors—serves as the proving ground. The model is logistic regression with six predictors: age, sex, resting blood pressure, cholesterol, maximum heart rate, and exercise-induced ST depression. The LLMs propose priors for the corresponding regression coefficients. Across platforms, the directions line up with established medical understanding: older age, being male, and certain physiological markers tend to increase CAD risk. But the magnitude is where the speckled light of reality appears. The priors—especially the moderately informative ones—skewed toward stronger effects than the data suggests in this particular sample. This is a cautionary note: priors that feel textbook-correct may still miscalibrate the true magnitude in a given dataset, a reminder that context matters even for “known” relations.
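For readers who think in code, a minimal sketch of how elicited priors could be plugged into such a logistic regression is shown below, using PyMC. The prior means and standard deviations here are made up for illustration, and this is my own rendering of the setup rather than the paper’s code.

```python
import numpy as np
import pymc as pm

# X: (n, 6) array of predictors (age, sex, resting BP, cholesterol,
# max heart rate, ST depression); y: 0/1 CAD outcome.
# Hypothetical per-coefficient priors elicited from an LLM:
prior_means = np.array([0.05, 0.70, 0.02, 0.01, -0.03, 0.50])
prior_sds   = np.array([0.10, 0.50, 0.05, 0.05,  0.05, 0.50])

def fit_llm_prior_logistic(X, y):
    """Bayesian logistic regression with LLM-elicited Normal priors."""
    with pm.Model():
        intercept = pm.Normal("intercept", mu=0.0, sigma=2.5)
        beta = pm.Normal("beta", mu=prior_means, sigma=prior_sds, shape=6)
        logits = intercept + pm.math.dot(X, beta)
        pm.Bernoulli("y_obs", logit_p=logits, observed=y)
        idata = pm.sample(1000, tune=1000, target_accept=0.9)
    return idata
```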

The cement-concrete experiment offers a second proving ground. Here the goal is to predict compressive strength from curing time and a mix of components. The LLMs again correctly signaled the directions—cement generally raises strength, high water content reduces it, and other components modulate the effect. Yet once more, the magnitudes were the sticking point. Moderately informative priors tended to inflate the perceived impact of some ingredients, while weakly informative priors produced broader, more cautious estimates. The takeaway: a tighter, more confident prior isn’t always better if the data refuses to cooperate with that confidence. The study repeatedly finds that robust priors strike a balance—informative enough to guide learning, but broad enough to accommodate data-driven surprises.
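To make the contrast between the two regimes concrete with made-up numbers (not the paper’s), a moderately informative prior for, say, the water coefficient might be narrow, while a weakly informative one keeps the same center but widens the spread:

```latex
\beta_{\text{water}} \sim \mathcal{N}(-0.3,\ 0.1^2) \quad \text{(moderately informative)}
\qquad
\beta_{\text{water}} \sim \mathcal{N}(-0.3,\ 0.5^2) \quad \text{(weakly informative)}
```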

To quantify how well the LLMs’ priors matched reality, the authors use a practical metric: the KL divergence between the MLE distribution derived from data and the prior’s implied distribution. In both datasets, Claude Opus and Gemini 2.5 Pro frequently delivered priors that aligned better with the data-generating process than ChatGPT did, especially in the weakly informative regime. The pattern was not universal, but it was persistent enough to be compelling. An especially notable point: the weakly informative priors—those designed to express non-trivial but not overbearing beliefs—tended to perform better than the moderately informative priors in many cases. That’s a surprisingly human insight: when you’re dealing with real-world data, a little humility in your priors—expressed as width rather than bold certainty—often matters more than a bold center.
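For two univariate Normal summaries, that kind of divergence has a closed form. The sketch below is a simplification of my own in the spirit of the comparison, assuming both the MLE’s sampling distribution and the prior are treated as Normals, with hypothetical numbers.

```python
import numpy as np

def kl_normal(mu_mle, sd_mle, mu_prior, sd_prior):
    """KL( N(mu_mle, sd_mle^2) || N(mu_prior, sd_prior^2) ).

    Measures how far the prior sits from the (approximately Normal)
    sampling distribution of the maximum-likelihood estimate: 0 means
    a perfect match, larger values mean a poorly calibrated prior.
    """
    return (np.log(sd_prior / sd_mle)
            + (sd_mle**2 + (mu_mle - mu_prior)**2) / (2 * sd_prior**2)
            - 0.5)

# Hypothetical example: the MLE for a coefficient is 0.03 with standard
# error 0.02, while both candidate priors overstate it at 0.10.
# With these made-up numbers the narrow, off-center prior scores worse
# than the wider prior sharing the same center.
print(kl_normal(0.03, 0.02, 0.10, 0.03))   # moderately informative
print(kl_normal(0.03, 0.02, 0.10, 0.15))   # weakly informative
```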

Beyond the KL metric, the team also looked at predictive performance in five-fold cross-validation, using metrics like the Brier score, MNLS, and AUC. The Bayesian models driven by LLM priors showed some improvement over a standard frequentist logistic model, but the gains were not statistically significant. In other words, with large datasets, the likelihood dominates the posterior, and priors can nudge results but rarely rewrite the core story. This truth—data, not priors, often wins in big samples—doesn’t diminish the value of the approach; it reframes its sweet spot: precision contexts, small datasets, or out-of-distribution settings where priors can help anchor learning when data alone may mislead.
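As a rough sketch of that kind of evaluation, the loop below cross-validates the frequentist logistic baseline and reports Brier score and AUC; the Bayesian models with LLM-elicited priors would be scored on the same folds for a fair comparison. The helper name and setup are mine, not the paper’s.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cross_validate_baseline(X, y, n_splits=5, seed=0):
    """Five-fold CV of a frequentist logistic baseline on numpy arrays X, y."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    briers, aucs = [], []
    for train_idx, test_idx in skf.split(X, y):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        p = model.predict_proba(X[test_idx])[:, 1]   # predicted risk
        briers.append(brier_score_loss(y[test_idx], p))
        aucs.append(roc_auc_score(y[test_idx], p))
    return np.mean(briers), np.mean(aucs)
```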

In short, the experiments don’t declare a wholesale victory for AI-assisted priors. What they do declare is a measured, evidence-backed optimism: LLMs can identify the right direction for relationships and can generate priors that are informative yet not irrationally confident. The standout performer among the tested models, Claude Opus, often produced priors that balanced domain knowledge with appropriate width. Gemini offered competitive results, while ChatGPT tended to be more prone to overconfidence or, in weaker forms, excessive vagueness. The results hold across two different problem domains, suggesting a general pattern rather than a one-off quirk of a single dataset.

Why this matters beyond academia

Two practical takeaways matter beyond the corridor of statistical theory. First, AI models can serve as a bridge between a vast ocean of human knowledge and the messy, noisy world of real data. In domains where you have tons of data, the likelihood term often drowns out priors; in data-poor or noisy settings, informative priors can steer the inference toward plausible, scientifically grounded conclusions. This isn’t a turnkey solution, but it’s a concrete knob we can turn to help integrate what we think we know with what we observe in the lab and the field.

Second, the reliability of this tool hinges on good calibration and disciplined use. The paper’s strongest messages are about how to ask the AI the right questions and how to interpret its answers. The width of a prior—the degree of uncertainty encoded—matters just as much as the center. An overconfident prior can push the posterior into a biased region; a prior that’s too diffuse can yield little gain. The researchers’ emphasis on a structured prompting approach, including confidence scores and the presentation of multiple prior sets, offers a practical blueprint for responsibly using AI as a knowledge partner rather than a shortcut to certainty.
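The textbook conjugate-Normal case makes the point about width explicit: with a Normal prior on a mean and Normal data of known variance, the posterior mean is a precision-weighted average, so a narrow (overconfident) prior pulls the estimate hard toward its center, while a diffuse prior largely defers to the data. This is a standard identity, offered as an illustration rather than anything specific to the paper.

```latex
\theta \sim \mathcal{N}(\mu_0, \tau^2), \qquad
\bar{x} \mid \theta \sim \mathcal{N}\!\big(\theta, \sigma^2/n\big)
\;\;\Longrightarrow\;\;
\mathbb{E}[\theta \mid \bar{x}] \;=\;
\frac{\tfrac{1}{\tau^2}\,\mu_0 + \tfrac{n}{\sigma^2}\,\bar{x}}
     {\tfrac{1}{\tau^2} + \tfrac{n}{\sigma^2}}
```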

And there’s a broader, almost cultural implication. If LLMs can contribute meaningfully to the scaffolding of statistical models, we get a future where domain experts and data-driven insights can dance together—AI surfaces knowledge, humans judge its reliability, and the resulting models become more robust to the twists and turns of real data. The paper doesn’t claim to have solved uncertainty; it argues for a smarter way to approach uncertainty—one that makes explicit the sources of knowledge, the limits of the model, and the nature of the questions we’re asking. It’s a step toward a collaborative epistemology where language models help illuminate what we know and where we still need to learn more.

The work is a reminder that science is not a single moment of discovery but a continual dialogue among ideas, data, and methods. If LLMs can help surface well-structured priors, they should be held to the same standards as any scientific input: transparent prompts, explicit reasoning, and rigorous validation. The study’s authors make this point with care, showing both promise and caution in equal measure. The question now is less about whether AI can conjure priors at all and more about how we design the conversations that turn those priors into trustworthy decisions in the real world. The future of Bayesian analysis might look more like a duet between human judgment and machine knowledge—a partnership that respects uncertainty while offering a clearer map of where our beliefs come from and how much they should bend in light of new evidence.