Data in health research rarely arrive neat and complete. Observational studies chase causal questions, but missing data is a constant companion: exposures people forget to report, confounders researchers wish they had, outcomes that arrive long after the study begins. The standard workaround is multiple imputation, a practical patch that smooths over gaps as if they weren’t there. But what if the gaps themselves carry information about the cause you’re trying to uncover? A fresh approach from the University of Waterloo asks exactly that question, recasting missing data not as a nuisance but as part of the causal story. The authors, Lan Wen and Glen McGee, show that you can still read a causal signal from incomplete data when you replace old assumptions with a more nuanced view of why data go missing.
Wen and McGee ground their work in a real-world question: the potential effect of prescription opioids on mortality among the elderly. In their analysis, about a quarter of observations had missing values in either the exposure (whether someone was prescribed opioids) or in potential confounders. That sort of incompleteness is ubiquitous in observational health data, and it has a habit of biasing conclusions if you pretend the data are fully observed. The paper doesn’t just point out the problem; it proposes a principled way to answer causal questions even when data misbehave, and it does so with tools that can be implemented by researchers who aren’t wedded to a single statistical recipe. The result, from the University of Waterloo in Ontario, Canada, builds a bridge between deep causal theory and practical data analysis for real-world health science.
Missing data and causality revisited
At the heart of causal inference in observational data is a simple yet stubborn problem: to tell whether A causes Y, you’d like to compare what would have happened to each person if they had experienced A versus if they had not, holding everything else fixed. In practice, we never observe both realities for the same person. Statisticians work around this with assumptions about how data come together. The classic one is missing at random (MAR): the probability that data are missing depends only on things you have observed. If MAR holds, standard tricks like multiple imputation can yield unbiased estimates when the rest of the model is well specified. But MAR is a strong assumption, and it is often implausible in health data, especially when sensitive exposures and confounders come into play.
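In symbols, the MAR condition has a compact form. Writing R for the indicator that a record is fully observed, X_obs for its observed part, and X_mis for its missing part (our notation, not necessarily the paper’s), MAR says the chance of missingness is untouched by the unseen values:

\[
P(R = 1 \mid X_{\text{obs}}, X_{\text{mis}}) = P(R = 1 \mid X_{\text{obs}}).
\]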
Wen and McGee push past MAR by developing a family of assumptions that permit missingness to be linked to unobserved factors, yet still allow identification of the average causal effect E(Y^a). They introduce two main MNAR families, which they call MNAR-A and MNAR-B. In MNAR-A, the exposure missingness and the missing confounders can share unmeasured causes, but the missingness mechanism for the exposure is independent of the outcome once you condition on observed data. In MNAR-B, the authors add a careful, ordered structure among the partially observed confounders, so that missingness can cascade in a controlled, stepwise way from one confounder to the next. The upshot is a tractable set of identifying formulas for the causal effect that rest on plausible, clearly articulated assumptions about what makes data appear or disappear in surveys and records.
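To give the flavor of the MNAR-A restriction in symbols, here is a schematic paraphrase in our notation rather than the paper’s exact statement. With R_A the indicator that the exposure is observed, the key independence is

\[
R_A \perp\!\!\!\perp Y \mid \text{observed data},
\]

while R_A is still allowed to share unmeasured causes with the confounders’ missingness, which is exactly the kind of dependence that MAR forbids.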
To a reader, the idea may sound abstract, but the payoff is concrete. Under these MNAR assumptions, the average causal effect can be written in terms of quantities you can estimate from the data you actually observe, without waiting for a miracle of complete data. This is not a magic trick; it is a careful accounting of what the observed information says about the unobserved world, guided by a graphical way of thinking about missingness. Wen and McGee also connect these identifying formulas to influence functions, which in turn lay the groundwork for efficient, robust estimation in a nonparametric setting. The mathematics is dense, but the practical message is clear: with the right assumptions about why data go missing, we can still extract credible causal conclusions from imperfect data.
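For intuition about what such an identifying formula can look like, here is one generic shape such expressions take, an inverse-probability-weighted average over complete records; it illustrates the genre, not the paper’s exact functional. With R = 1 flagging a complete record, \pi the probability of completeness given observed information, and g the propensity score,

\[
E(Y^a) = E\!\left[ \frac{R}{\pi} \cdot \frac{\mathbb{1}(A = a)}{g(a \mid L)} \, Y \right],
\]

so that each complete record is upweighted to stand in for the incomplete records it resembles.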
For practitioners, the MAR path remains attractive when it is reasonable to assume that the missingness is driven by variables you’ve measured. In that familiar corner of the universe, standard multiple imputation followed by complete-data estimators can work very nicely. But in the messier, more realistic world where exposure data and key confounders may both be missing for reasons tied to unobserved factors, the MNAR framework is a lifeline. It provides not just a diagnostic about why a dataset might mislead, but a concrete recipe for estimation that protects against certain forms of misspecification. Wen and McGee’s contribution is thus twofold: a clarifying conceptual map of missingness mechanisms and a practical, doubly robust toolkit for drawing causal inferences when the models are imperfect.
Two MNAR families and the TMLE toolkit
The heart of the methodological contribution is a pair of MNAR-centered estimation strategies built on targeted maximum likelihood estimation, TMLE for short. TMLE is a modern fusion of machine learning and causal thinking that aims to harvest all available information in a data-driven way while preserving rigorous guarantees about bias and variance. Wen and McGee adapt TMLE to the MNAR-A and MNAR-B worlds, showing how to construct estimators that remain consistent as long as at least one of two groups of nuisance models is correctly specified.
Under MNAR-A, the missingness of the exposure and the partially observed covariates can share unobserved causes, yet, assuming the exposure missingness is independent of the outcome conditional on observed data, one can still identify the causal effect from the observed pieces. The authors derive an influence function for the MNAR-A functional and translate it into a practical TMLE algorithm. The core idea is to start with initial, data-driven predictions for the outcome under each exposure level, then apply a targeting step that reweights and updates those predictions to align with the observed data distribution. Once the nuisance parts, the exposure model, the missingness mechanism, and the outcome model, are estimated, the TMLE-A estimator is doubly robust: it remains consistent if either the outcome models or the exposure-and-missingness models are correctly specified. In short, you have a second chance if one piece of the puzzle is mis-specified.
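To make the initial-fit-then-target logic concrete, here is a minimal sketch of a generic complete-data TMLE for an average treatment effect. It illustrates the mechanics described above, not the paper’s MNAR-A estimator, and every name and model in it is an illustrative choice:

```python
# Minimal TMLE sketch for E[Y^1] - E[Y^0] with fully observed data.
# A generic illustration of "initial fit + targeting", NOT the paper's
# MNAR-A estimator; simulated data and simple models stand in throughout.
import numpy as np
from scipy.special import expit, logit
from scipy.optimize import minimize_scalar
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
L = rng.normal(size=(n, 1))                           # confounder
A = rng.binomial(1, expit(0.6 * L[:, 0]))             # exposure
Y = rng.binomial(1, expit(0.4 * A + 0.8 * L[:, 0]))   # binary outcome

# Step 1: initial, data-driven nuisance fits.
g1 = np.clip(LogisticRegression().fit(L, A).predict_proba(L)[:, 1],
             0.01, 0.99)                               # P(A=1 | L)
Q_fit = LogisticRegression().fit(np.column_stack([A, L]), Y)
QA = Q_fit.predict_proba(np.column_stack([A, L]))[:, 1]           # E[Y | A, L]
Q1 = Q_fit.predict_proba(np.column_stack([np.ones(n), L]))[:, 1]  # set A=1
Q0 = Q_fit.predict_proba(np.column_stack([np.zeros(n), L]))[:, 1] # set A=0

# Step 2: targeting. The "clever covariate" reweights residuals by the
# inverse propensity; a one-parameter logistic fluctuation updates Q.
H = A / g1 - (1 - A) / (1 - g1)

def neg_loglik(eps):
    p = expit(logit(np.clip(QA, 1e-6, 1 - 1e-6)) + eps * H)
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

eps = minimize_scalar(neg_loglik, bounds=(-1, 1), method="bounded").x

# Step 3: updated predictions and the plug-in estimate.
Q1_star = expit(logit(np.clip(Q1, 1e-6, 1 - 1e-6)) + eps / g1)
Q0_star = expit(logit(np.clip(Q0, 1e-6, 1 - 1e-6)) - eps / (1 - g1))
print(f"TMLE estimate of the ATE: {np.mean(Q1_star - Q0_star):.3f}")
```

The targeting step is the distinctive move: the single fluctuation parameter eps nudges the initial outcome predictions just enough that the plug-in estimate also solves the influence-function estimating equation, which is what buys the robustness and efficiency properties.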
MNAR-B goes deeper into the confounder geometry. It introduces a sequential structure by imposing an ordering on the partially observed covariates L_1, …, L_M and allowing the missingness of one covariate to depend on the observed history of earlier covariates. This makes the missingness mechanism more flexible and better aligned with how real surveys are filled out. The identifying functional for MNAR-B looks similar in spirit to MNAR-A but carries a more elaborate weighting and cross-conditional prediction scheme that reflects the ordering. The corresponding TMLE-B estimator inherits the same double robustness property: if at least one of the two groups of nuisance models, the sequential outcome regressions or the sequential missingness models, is correctly specified, the estimator remains consistent, and it is efficient in a nonparametric sense when all are.
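Schematically, in our shorthand rather than the paper’s exact condition, with R_m the missingness indicator for the m-th covariate, the ordered mechanism factorizes so that each indicator may look back at what has already been revealed:

\[
P(R_1, \dots, R_M \mid \cdot) = \prod_{m=1}^{M} P\big(R_m \,\big|\, R_1, \dots, R_{m-1},\ \text{observed parts of } L_1, \dots, L_{m-1},\ \cdot\big).
\]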
Crucially, Wen and McGee do not pretend that one size fits all. The MNAR-A and MNAR-B frameworks provide a flexible menu for researchers facing different missing data realities. The estimators are designed to be compatible with modern machine learning tools, including the highly adaptive lasso and cross-fitting, which helps promote good finite-sample performance in complex data landscapes. The theoretical results come with practical guidance: in large samples with flexible nuisance function estimation, the estimators achieve asymptotic normality, and their variance can be estimated with the bootstrap or via influence-function-based formulas when the nuisances are well specified. In other words, the authors give you both the compass and the map for navigating the thickets of incomplete data.
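As a small illustration of what cross-fitting means in practice, here is a generic sketch in our own notation; a boosted-tree classifier stands in for whatever flexible learner one prefers, such as the highly adaptive lasso available in dedicated packages. Each observation’s nuisance prediction comes from models fit on the other folds, which is what keeps flexible learners from leaking overfitting bias into the causal estimate:

```python
# Generic cross-fitting sketch for one nuisance model (illustrative, not the
# paper's code): out-of-fold estimates of the propensity score P(A = 1 | L).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingClassifier

def crossfit_propensity(L, A, n_splits=5, seed=0):
    """Each fold's predictions come from a model trained on the other folds."""
    g1 = np.empty(len(A), dtype=float)
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(L):
        model = GradientBoostingClassifier().fit(L[train], A[train])
        g1[test] = model.predict_proba(L[test])[:, 1]
    return np.clip(g1, 0.01, 0.99)  # bound away from 0/1 for stable weights
```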
Opioids and mortality: a real world stress test
The paper’s motivating application leans on a classic public health dataset: the National Health and Nutrition Examination Survey (NHANES), linked to mortality records through the National Death Index. The researchers focus on the effect of prescription opioid use on five-year all-cause mortality among people aged 65 and older. The data set comprises thousands of individuals, a rich set of potential confounders, and, crucially, a nontrivial amount of missing information: about 6.5 percent reported opioid use, and roughly a quarter of observations were missing either the exposure or a confounder. It’s a perfect test bed for the new MNAR-TMLE toolkit because it embodies the very real tension between the desire to answer a causal question and the stubborn fact that data are incomplete in meaningful ways.
In their analysis, Wen and McGee compare several approaches. A plain complete-case analysis paired with the complete-data TMLE serves as a baseline, as does standard multiple imputation followed by the usual complete-data TMLE. Against this backdrop, they run their MNAR-aware TMLEs under the two assumption families, MNAR-A and MNAR-B. They also run a sensitivity analysis in which subjects who did not show their prescription containers to the interviewer are treated as missing exposure status. The results are telling rather than definitive: the MAR-style methods, complete-case analysis and multiple imputation, tend to yield very small estimated effects, consistent with an almost null causal effect within the studied window. The MNAR-based estimators tend to push the point estimates upward by roughly a factor of two to three, but with confidence intervals that still include zero in the main analysis.
What does that mean in plain terms? It means that when the data smell like missing not at random, the direction and size of the estimated causal effect can shift in meaningful ways. If missingness is related to unobserved factors that also influence mortality, then ignoring that MNAR structure can blunt or bias the apparent impact of opioid prescriptions. The MNAR analyses do not claim a large, slam-dunk effect; rather, they reveal a more cautious, nuanced picture: the data are compatible with a somewhat larger risk than MAR-based analyses would suggest, but the uncertainty remains substantial. The sensitivity analysis, which loosens one of the MNAR assumptions, nudges the results further, underscoring how much conclusions can hinge on how we model the missing pieces.
Stepping back, the paper’s data-analysis message is as important as its theory: when missing data are part of the causal story, you need a methodology that respects that story rather than sweeping it under the rug. The University of Waterloo team shows that with thoughtfully constructed MNAR assumptions and with estimators built to endure mild misspecification, you can still extract credible causal inferences from imperfect data. The numerical example with NHANES is not a final verdict on opioid mortality; it is a demonstration of a disciplined way to interrogate the data and to perform principled sensitivity analyses in the face of missingness that may be systematically informative.
The broader implication is practical and timely. Health researchers confront incomplete data in almost every project, from electronic health records to large-scale surveys. The Wen and McGee framework equips investigators with a structured way to ask, and answer, causal questions even when the data have not cooperated. It also invites a more honest conversation about uncertainty: when you introduce MNAR-based analyses, you acknowledge that the data carry clues about unmeasured realities, and you quantify how those clues could nudge conclusions. In an era where health policy increasingly hinges on observational evidence, having a toolkit that blends causal reasoning with robust handling of missing data is not just nice to have—it may be essential.
All of this stems from a straightforward fact: data collection is imperfect, but causal questions are too important to surrender. The work from the University of Waterloo, led by Lan Wen and Glen McGee, offers a thoughtful, technically rigorous way to keep asking these questions and to keep faith with the evidence, even when the evidence isn’t perfectly complete. It’s a reminder that in science, as in life, what you don’t see can still shape what you do know—and that the right methods can illuminate what the gaps are really telling you about the world.