Which Sensitivity Parameter Is Right for Omitted Variables?

In causal claims, the chain from cause to effect is only as strong as what we leave out. Economists have long wrestled with omitted-variable bias, the dreaded specter that lurks when a relevant factor isn’t observed or controlled. Across dozens, then hundreds, of studies, researchers have relied on sensitivity analyses to ask: how robust are my conclusions if unobserved variables matter more or less than the variables I can see? The problem, as Diegert, Masten, and Poirier show, is not just that there are many sensitivity parameters to choose from, but that there is no universally good way to pick among them using the data alone. Their new paper builds an axiomatic framework for comparing the sensitivity measures themselves, independent of any particular dataset’s quirks. The result is a principled way to say which parameter makes sense to trust, and when certain popular choices might mislead more than they illuminate. The study is a three-institution collaboration: Paul Diegert of the Toulouse School of Economics is the lead author, joined by Matthew A. Masten of Duke University and Alexandre Poirier of Georgetown University. They don’t just critique old tools; they propose a meta-toolkit for selecting sensitivity parameters that behave in predictable, defensible ways as the number of covariates grows.

To appreciate what’s new, think about a long regression: you want to explain an outcome Y using a treatment X and a set of observed controls W1, while suspecting there are unobserved controls W2 that could tilt the estimate on X. The bias that arises from omitting W2 is omitted-variable bias (OVB). Traditional sensitivity analyses parameterize how strongly unobservables could influence the treatment and the outcome, then translate that into bounds on the long-regression estimate. But different papers adopt different sensitivity parameters, and those parameters are not identified from the data. The data alone cannot tell you which value is right; the best you can do is declare a prior or a benchmark. The authors of the new work argue that we should think of sensitivity parameters the way we think about estimators in classical statistics: by their sampling distributions under a designed random process, one that imitates how covariates might come to be observed or unobserved in practice. In short, they push for an axiomatic, design-based way to compare sensitivity parameters, not just a way to compare results under different assumptions.
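To make the long-versus-short regression mechanics concrete, here is a minimal simulation sketch in Python. All coefficients, variable names, and the data-generating process are illustrative inventions, not taken from the paper; the point is only that the gap between the infeasible “long” coefficient on X (controlling for W2) and the feasible “short” one (omitting W2) is exactly the omitted-variable bias that sensitivity parameters are meant to bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Illustrative DGP: one observed control W1 and one unobserved confounder W2
# that shifts both the treatment X and the outcome Y.
W1 = rng.normal(size=n)
W2 = rng.normal(size=n)
X = 0.8 * W1 + 0.6 * W2 + rng.normal(size=n)
Y = 1.0 * X + 0.5 * W1 + 0.9 * W2 + rng.normal(size=n)   # true effect of X is 1.0

def ols(y, *regressors):
    """OLS coefficients for y on a constant plus the given regressors."""
    Z = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

long_beta = ols(Y, X, W1, W2)[1]   # infeasible long regression (controls for W2)
short_beta = ols(Y, X, W1)[1]      # feasible short regression (omits W2)

print(f"long-regression coefficient on X:  {long_beta:.3f}")   # close to 1.00
print(f"short-regression coefficient on X: {short_beta:.3f}")  # pushed away from 1.00
print(f"omitted-variable bias:             {short_beta - long_beta:.3f}")
```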

That shift matters because sensitivity analyses have become a kind of language of robustness in economics and beyond. Oster’s (2019) framework, for example, became a workhorse for translating “how strong would unobservables have to be to overturn this result?” into a number practitioners can quote. The new paper asks: if we turn this conversation into a systematic, formal comparison among eleven different what-if parameters, which ones actually behave in a way that is consistent with how covariates are distributed and selected in real data? The authors answer with a set of clear criteria (consistency and monotonicity) that any good sensitivity parameter should satisfy under a simple, abstract design: observe exactly d1 covariates out of K total, chosen uniformly at random. If a parameter’s values align with the intuition that unobserved factors become more influential when more covariates are unobserved, it passes the test. If not, that’s a flag that the parameter could mislead in some settings, even if it looks reasonable in others. In other words, they give us a yardstick to separate robust ideas from fragile ones.

The core idea: a design-based lens on sensitivity

Imagine you’re designing a lab-like experiment, but this time the “experiments” are the possible ways researchers might observe or ignore covariates in observational data. You start with a complete set of potential covariates W, and you decide which ones are observed (W1) and which ones are left unobserved (W2) in a given scenario. The binary vector S encodes this selection: Sk = 1 if the k-th covariate is observed, Sk = 0 if not. The distribution of S, the design distribution, plays the role of your experiment’s randomness. The clever move in the Diegert-Masten-Poirier paper is to treat the sensitivity parameter itself as a function of S, and therefore as a random variable once S is drawn at random. They then study its covariate-sampling distribution: how does the parameter behave when you randomly pick which covariates you will observe, holding fixed the joint distribution of (Y, X, W)? This is analogous to how classical statistics studies the sampling distribution of estimators over repeated samples, except here the randomness is about which covariates end up observed, not which units are sampled.

The central assumption they formalize is a kind of neutral, uniform covariate selection, called B1: among K covariates, you observe exactly d1 of them with equal probability for each subset. This creates a clean, benchmark notion of “equal selection”—the case where observed and unobserved covariates are equally informative on average. They also spell out two properties they’d like sensitivity parameters to satisfy as K grows large: Consistency (the parameter tends to 1 under equal selection) and Monotonicity in selection (the parameter should drift above 1 when unobservables loom larger, and below 1 when they loom smaller). These aren’t abstract decorative criteria; they’re designed to weed out sensitivity measures that look sensible in narrow circumstances but fall apart when you scale up the number of covariates or face different patterns of correlation among them.
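Here is a rough Monte Carlo sketch of that thought experiment in Python. It draws selection vectors S under the uniform design (exactly d1 of K covariates observed, every size-d1 subset equally likely) and tabulates the resulting covariate-sampling distribution of a sensitivity measure. The toy_sensitivity function below, a ratio of how much variation in X the unobserved versus observed covariates explain, is a deliberately simple stand-in chosen for illustration; it is not the paper’s definition of rX, kX, or any other parameter, and the exchangeable data-generating process is likewise invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 2_000, 40

# Illustrative, exchangeable covariates W and a treatment X built from all of them.
W = rng.normal(size=(n, K))
X = W.sum(axis=1) + rng.normal(size=n)

def r2(x, Z):
    """R-squared from regressing x on a constant plus the columns of Z."""
    Z = np.column_stack([np.ones(len(x)), Z])
    resid = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    return 1.0 - resid.var() / x.var()

def toy_sensitivity(S):
    """Stand-in sensitivity measure: variation in X explained by the unobserved
    covariates relative to the observed ones (NOT the paper's rX or kX)."""
    return r2(X, W[:, ~S]) / r2(X, W[:, S])

def covariate_sampling_distribution(d1, draws=200):
    """Design distribution B1: observe exactly d1 of the K covariates,
    with every size-d1 subset equally likely."""
    values = np.empty(draws)
    for i in range(draws):
        S = np.zeros(K, dtype=bool)
        S[rng.choice(K, size=d1, replace=False)] = True
        values[i] = toy_sensitivity(S)
    return values

for d1 in (10, 20, 30):   # d2 = K - d1 covariates remain unobserved
    vals = covariate_sampling_distribution(d1)
    print(f"d1 = {d1:2d}  (r = d2/d1 = {(K - d1)/d1:.1f}):  median = {np.median(vals):.2f}")
# Equal selection (d1 = 20, r = 1) should center near 1; the median should fall as
# more covariates are observed and rise as fewer are, mirroring Monotonicity.
```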

From this framework, the authors take several popular sensitivity parameters out for a test drive. They show early on that the best-known one, Oster’s δ_orig (and the companion δ_resid variant sometimes used to handle endogenous controls), does not satisfy Monotonicity in selection in general. In plain language: even when you imagine more unobservables, δ_orig doesn’t reliably say the unobservables are more important; in some regimes it moves in the opposite direction of what researchers would reasonably expect. That’s not a philosophical quibble; it’s a warning that the benchmark many papers cite could be biased in subtle ways that depend on the structure of the covariates. The paper also shows clear cases where residualizing the unobservables (the δ_resid variant) performs even worse on those core properties. By contrast, two other families of parameters, rX and kX (and their close cousins rY and kY), do satisfy Consistency and Monotonicity under standard regularity conditions. The upshot is a principled pointer toward parameters that behave in an interpretable, predictable way as the covariate landscape changes and K grows.

In one of the sharp results, the authors prove a clean limit for the rX parameter: as the number of covariates grows, its covariate-sampling limit is a simple function of the ratio r = d2/d1 (how many covariates are left unobserved versus observed) and a correlation-driven constant cπ that encodes how the covariates hang together. Importantly, rX(S) converges to a value that increases with r, crossing the intuitive equal-selection point at r = 1. That makes rX a parameter that feels faithful to the design’s geometry: if you observe more covariates, the unobservables matter less on average; if you observe fewer covariates, the unobservables loom larger. Similar results hold for rY(S) tied to the outcome equation. The paper also shows that an ACET-type parameter (a variant used in Altonji-Elder-Taber style bounds) behaves differently: it passes the consistency test but, unlike rX, doesn’t exhibit monotonic growth or decline with the selection regime under the same assumptions. Then there’s kX, a sensitivity measure based on R-squared changes, which under reasonable exchangeable covariate structures can satisfy both properties as well. The mathematics is intricate, but the narrative is pragmatic: not all parameters are created equal when you push the thought experiment far enough, and some have the right asymptotic compass to guide interpretation across many covariates and correlation patterns.
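Stated schematically, and only as a restatement of the properties just described (the paper derives the exact expression, which has more structure than shown here; f is simply a placeholder for that limit function):

$$
\operatorname*{plim}_{K \to \infty} \, r_X(S) \;=\; f(r, c_\pi),
\qquad r = \frac{d_2}{d_1},
\qquad \text{with } f \text{ increasing in } r \text{ and } f(1, c_\pi) = 1.
$$

So the limit sits exactly at the equal-selection benchmark when as many covariates are unobserved as observed, and moves above or below it as r does.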

Beyond the math, the paper makes a larger point: the data cannot tell you which sensitivity parameter to trust. The covariates themselves, their distribution, and their endogeneity shape the very meaning of a sensitivity parameter. The authors’ contribution is to formalize a method for comparing these parameters on their own terms, using the design distribution as a literal test bench. That reframes sensitivity analysis: from a toolbox of ad hoc choices to a disciplined comparison, with explicit criteria that any robust parameter should meet as the number of covariates grows.

From theory to practice: what this means for researchers

Practically, what the authors offer is a meta-guide for robustness analysis. If you’re a researcher who uses sensitivity analyses to argue that your treatment effect is robust to lurking confounders, you now have a principled way to pick among eleven different sensitivity parameters, rather than defaulting to the one that’s most familiar or most convenient. The guidance is not that you must abandon Oster or any other established tool, but that you should be explicit about which properties you require from a sensitivity parameter and why. If your analytic context matches exchangeable covariates or roughly fits the low-level conditions the authors discuss, you can justify using rX or kX as your primary sensitivity parameter, because these metrics have consistent, monotone behavior under the design distribution the authors analyze. If, instead, your context emphasizes endogeneity and you rely on a residualized approach, you may need to be especially cautious about the benchmark you use to declare “equal selection.” The paper’s empirical appendix underscores this point with a hands-on, non-asymptotic comparison: in a data-generating process calibrated to real-world data (Bazzi, Fiszbein, and Gebresilasse, 2020), the rX(S) distribution stays tightly centered around 1 under equal selection, and shifts intuitively with more or fewer observed covariates. By contrast, the |δ_resid(S)| distribution stays wide and can move in the wrong direction relative to the nominal interpretation. That contrast is not a math abstraction; it’s a concrete reminder that “equal selection” benchmarks and residualized variants can drift in unexpected ways when you confront the real world’s messy correlation structures.

The paper doesn’t merely critique; it offers a path forward. The authors propose that practitioners adopt the covariate-sampling-distribution viewpoint as a standard part of sensitivity analysis. Rather than citing a single threshold or benchmark, researchers would report how their chosen sensitivity parameter behaves under the design distribution and whether it satisfies the two core properties. They also acknowledge that asymptotic properties aren’t the entire story: they accompany, rather than replace, non-asymptotic checks in real datasets. The empirical part of their work confirms that even with finite, modest numbers of covariates, the asymptotic intuition about rX and kX tends to appear in practice, and the problematic behavior of resid-like parameters can emerge in real data as well. In other words, the theory maps onto the messy world, not as a guarantee but as a guide to better judgment.
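As one purely illustrative way such reporting could look, the Monte Carlo sketch from earlier can be reused: an appendix table might show the spread of the chosen sensitivity measure under the design distribution at equal selection and at more or fewer observed controls, something a referee could audit directly. The snippet below assumes the covariate_sampling_distribution helper and numpy import defined in that earlier sketch; the d1 values and quantiles are arbitrary placeholders.

```python
# Continuing the earlier sketch: summarize the design-distribution behavior of the
# chosen sensitivity measure across a few selection regimes.
for d1 in (10, 20, 30):                       # illustrative regimes; d1 = 20 is equal selection
    vals = covariate_sampling_distribution(d1)
    q10, q50, q90 = np.quantile(vals, [0.10, 0.50, 0.90])
    print(f"d1 = {d1:2d}:  10% = {q10:.2f}   median = {q50:.2f}   90% = {q90:.2f}")
```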

For anyone who writes about policy, education, health, or social programs, the message lands with particular clarity: robustness isn’t just a rhetorical flourish; it’s a methodological obligation. If you claim your result would hold under a plausible level of unobserved confounding, you should be explicit about which sensitivity parameter you’re using and why it is the right one for your covariate structure. The new framework helps articulate that justification in a way that is auditable, transparent, and, crucially, comparable across studies. It’s a step toward a shared language for robustness that isn’t hostage to a single parameter’s quirks or a paper’s particular data-generating process.

Why this matters beyond economics

Omitted-variable bias isn’t unique to economics; it travels across social sciences, epidemiology, education research, and public policy. Any field that relies on observational data to infer causal effects faces the same existential question: how confident can we be that an unseen factor didn’t drive our finding? The paper’s design-based, axiomatic approach offers a template adaptable to other disciplines. If epidemiologists or political scientists adopt a similar benchmarking approach—comparing sensitivity parameters not just by intuition or tradition but by how they behave under a well-specified design of covariate observation—they’d gain a common ground for claims about robustness. The idea of a covariate-sampling distribution could even inform machine-learning fairness and robustness debates, where the space of “covariates” can be enormous and their interdependencies complex. The authors’ emphasis on principled comparison over tacit trust is a universal impulse in science: we should be able to argue, with clarity, about why we trust one counterfactual assumption more than another, and why that trust should persist as the data scale up or as the covariate structure shifts.

In a field that loves to declare itself data-driven, this work is a reminder that not all robustness is created equal. The paper doesn’t give you a silver bullet; it gives you a framework to think more rigorously about which bullets you’re loading. It also shows how much of the uncertainty surrounding causal claims comes from the tools we choose to measure it. By reframing sensitivity parameters as objects to be compared, not just numbers to be reported, Diegert, Masten, and Poirier invite the research community to cultivate a more disciplined, comparable, and ultimately more trustworthy robustness conversation.

What’s next and what to watch for

The authors sketch a few directions that could sharpen this framework further. One is to move beyond probability limits and study the full asymptotic distribution of sensitivity parameters, which could reveal nuanced differences in how they bounce around under realistic covariate patterns. Another is to explore alternative design distributions that reflect more realistic covariate-selection mechanisms than uniform random sampling—think structured selection, missingness patterns, or covariate dependencies that mirror real-world data collection. The goal would be to understand whether the properties of consistency and monotonicity persist under these more intricate designs and what that means for practitioners who still want a grounded, interpretable benchmark. A third direction is to apply this framework to a broader class of robustness checks, perhaps even outside the linear regression world, to see whether the same axiomatic criteria help separate the wheat from the chaff in other sensitivity analyses. The work remains anchored in strong theoretical reasoning, but it also invites a broader conversation about how we build, compare, and trust robustness in a data-rich, message-driven age.

In the end, this paper is as much a manifesto as a method. It insists that the tools we use to assess robustness should themselves be judged by the same standards we apply to any scientific instrument: transparency, consistency, and a clear account of how our assumptions shape conclusions. If you’re a researcher, a policymaker, or a curious reader who wants to understand what lies beneath a robustness claim, this axiomatic approach offers a map. It’s a map drawn not from the fog of statistical folklore but from a design-based view of how covariates come and go, how unobserved variables might tilt the balance, and how we can choose parameters that honestly reflect that balance as our data grows more complex.

Institutional home and leadership: The study is a collaboration led by Paul Diegert of the Toulouse School of Economics, with co-authors Matthew A. Masten of Duke University and Alexandre Poirier of Georgetown University, anchoring the work in a triad of institutions renowned for causal inference and empirical methods. Their formal framework represents a thoughtful advance in how we reason about robustness, not just in economics, but in any discipline wrestling with the question of what we can believe when some pieces of the puzzle remain unseen.

In a world where data are plentiful but fears of missing information are endless, this paper offers a compass for navigating the murk. It doesn’t guarantee perfect knowledge, but it does give us a clearer, more defensible way to decide when a sensitivity analysis is telling the truth about unobserved factors and when it’s simply telling a story we want to hear. That’s the kind of clarity the scientific conversation needs more of, and the kind of clarity that could help all of us think more carefully about what robust evidence actually means in public life.