Real-Time Inference for Social Programs That Shifts Power

Policy experiments and program evaluations have a stubborn habit of asking for patience. Before a conclusion can be drawn, researchers often need to wait for a pre-set batch of data to accumulate, then run a fixed analysis that pretends the clock stopped the moment data collection did. A new line of work changes that pace. In the paper Real-time Program Evaluation using Anytime-valid Rank Tests, Sam van Meer and Nick Koning of Erasmus University Rotterdam’s Econometric Institute and the Tinbergen Institute propose a way to tell, in real time, whether a social program is making a difference. Not in a theatrical sense, but with mathematically controlled confidence that adapts as new data arrive. In other words: the truth about a policy’s effect can be pursued like a live conversation, not a taped interview.

To a broad audience, this sounds technical, even abstract. But the core idea is surprisingly human-scale: we want to know as soon as possible whether a policy is working, without burning through resources chasing false alarms or waiting so long that the moment to act has passed. The method hinges on a creative twist on classical statistics called anytime-valid inference. It lets researchers peek at the data as they stream in and, importantly, it guards against the temptation to declare victory prematurely just because a random fluctuation happened to look favorable on a given day. The authors frame their contribution around two well-known workhorses in program evaluation—difference-in-differences (DiD) and synthetic control methods—and show how their real-time tests remain valid even when data arrive one by one rather than in a big, neatly timed batch.

Van Meer and Koning anchor their work in real institutions and real problems. The study is rooted in the tradition of counterfactual analysis—asking, what would have happened if a program hadn’t run?—and they explicitly connect their framework to classic design ideas like the interactive fixed effects model. The practical upshot is a toolkit that can, in principle, be dropped into ongoing policy experiments to decide, with defined statistical guarantees, when enough evidence has accrued to claim a treatment effect or to stop an evaluation early for futility. In short: a more nimble, more trustworthy way to learn from policy experiments as they unfold.

What it means to test in real time

Traditional statistical tests are built around a fixed sample size. You decide in advance how much data you’ll collect, you lock in a test statistic, and you protect the Type I error—the probability of a false positive—by design. If you later stare at the data and say, “Let’s just look again,” you risk inflating that error rate. The authors’ central move is to reframe what counts as evidence in a way that remains honest no matter when you stop looking.

To do this, they lean on a concept called an e-value: a nonnegative quantity whose expected value is at most 1 under the null hypothesis of no effect. When you multiply many e-values together across time, you get a test martingale—a kind of growing score that aggregates evidence over the course of a study. Ville’s inequality then bounds the probability that this score ever crosses a critical threshold, ensuring that the chance of a false alarm stays controlled at the desired level even if you check the data continuously. The practical payoff is a p-value that is valid at any data-dependent time, not just after a pre-specified final sample. The paper calls this an anytime-valid p-value, and it is the engine behind real-time inference in their framework.
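
To make the mechanics concrete, here is a minimal Python sketch of how this kind of evidence accumulation works in general, not a reproduction of the authors' procedure. It assumes a toy Gaussian setting with a hypothetical effect size delta; the e-value is a simple likelihood ratio, the running product is the test martingale, and the rejection threshold 1/alpha is the one licensed by Ville's inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05          # desired false-alarm rate
delta = 0.5           # hypothetical effect size under the alternative

wealth = 1.0          # test martingale, starts at 1
running_max = 1.0

for t in range(1, 201):
    # Simulated outcome; here the true effect equals delta, so evidence accumulates.
    x = rng.normal(loc=delta, scale=1.0)
    # Likelihood-ratio e-value for H0: N(0,1) vs H1: N(delta,1);
    # under H0 its expectation is exactly 1.
    e_value = np.exp(delta * x - 0.5 * delta**2)
    wealth *= e_value                              # chain e-values into a test martingale
    running_max = max(running_max, wealth)
    anytime_p = min(1.0, 1.0 / running_max)        # valid at any data-dependent stopping time
    if wealth >= 1.0 / alpha:                      # Ville's inequality: crossing prob <= alpha under H0
        print(f"Reject H0 at t={t}, anytime-valid p={anytime_p:.4f}")
        break
```

Because the product is monitored continuously, the loop can stop the moment the threshold is crossed without pushing the false-alarm rate above alpha.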

Crucially, the authors do not pretend the world is simple. Data can be noisy, and researchers rarely know the exact shape of the alternative (the way a program might actually affect outcomes). So they design their e-values to be robust: they can be built from ranks, which are less sensitive to the precise distribution of outcomes, or from Gaussian-style specifications when a researcher has a sensible guess about the effect size. This two-pronged approach—rank-based and model-based—lets the method adapt to what is plausibly known about the setting while keeping the real-time guarantee intact.

Two paths through the same door: post-exchangeable vs generic alternatives

One of the paper’s elegant ideas is to translate the problem of detecting an effect into the language of exchangeability. In a no-effect world, the order of observations should look random with respect to pre- and post-treatment periods. If the data depart from this exchangeability in predictable ways, that signals a treatment effect. The authors formalize two natural alternatives that frame how evidence might accumulate.

The first is the post-exchangeable alternative. Here, the post-treatment observations are considered exchangeable among themselves but not with the pre-treatment data. Intuitively, this is like saying: once the policy starts, the data settle into a new, stable pattern in which the exact ordering of the post-treatment observations no longer matters. The second, the generic alternative, is broader: it allows the entire sequence to be non-exchangeable. This captures scenarios where the treatment effect might drift, grow, or fluctuate over time. Naturally, the generic alternative is harder to gain power against, but it more closely mirrors many real-world processes in which policy effects aren’t a one-and-done shock but a changing tide.

To operationalize these ideas, the paper introduces sequential ranks and reduced sequential ranks as the raw material for the evidence. The sequential rank of a new observation is its position relative to all observations seen so far, while the reduced rank records only where the new observation lands among the pre-treatment data. This distinction matters: in the post-exchangeable world, the reduced ranks track how post-treatment data stack up against the pre-treatment baseline, while in the generic world the full sequential ranks carry the information, since the whole series may be shifting in unpredictable ways. The authors show how to build e-values tailored to each of these rank-based perspectives, and how to combine them into a single, anytime-valid test.
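
As a rough illustration of the two rank notions, here is a toy Python sketch; the exact conventions (tie handling, whether the new observation counts itself) are assumptions made for the example, not the paper's precise definitions.

```python
import numpy as np

def sequential_rank(history, new_obs):
    """Rank of new_obs among everything seen so far.

    Convention (an assumption here): rank = 1 + number of earlier values
    strictly below new_obs, so ranks run from 1 to len(history) + 1.
    """
    return 1 + int(np.sum(np.asarray(history) < new_obs))

def reduced_sequential_rank(pre_treatment, new_obs):
    """Rank of a post-treatment observation relative to the pre-treatment data only."""
    return 1 + int(np.sum(np.asarray(pre_treatment) < new_obs))

# Toy stream: 5 pre-treatment outcomes, then post-treatment outcomes arriving one by one.
pre = [0.2, -0.1, 0.4, 0.0, 0.3]
post_stream = [0.6, 0.8, 0.5]

history = list(pre)
for y in post_stream:
    sr = sequential_rank(history, y)         # position among all observations so far
    rr = reduced_sequential_rank(pre, y)     # position among the pre-treatment baseline
    print(f"obs={y:+.2f}  sequential rank={sr}  reduced rank={rr}")
    history.append(y)
```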

With a nod to the practicalities of analysis, the paper also discusses how to handle serial dependence and other forms of non-exchangeability. In real data, observations aren’t perfectly independent. The authors propose a block-structure approach: group observations into blocks where dependence is contained, then run the sequential testing procedure on the block means. It’s a pragmatic acknowledgment that clean theory must still play nicely with messy data.
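
A minimal sketch of that block idea, assuming a simple serially dependent series and an analyst-chosen block length, might look like this; the block means, rather than the raw observations, would then feed the sequential test.

```python
import numpy as np

def block_means(stream, block_size):
    """Collapse a serially dependent stream into (approximately independent) block means.

    The block_size is the analyst's call: it should be large enough that
    dependence is mostly contained within a block.
    """
    stream = np.asarray(stream)
    n_blocks = len(stream) // block_size          # drop an incomplete trailing block
    trimmed = stream[: n_blocks * block_size]
    return trimmed.reshape(n_blocks, block_size).mean(axis=1)

# Example: an AR(1)-style dependent series collapsed into blocks of 10.
rng = np.random.default_rng(1)
x = np.zeros(200)
for t in range(1, 200):
    x[t] = 0.6 * x[t - 1] + rng.normal()
print(block_means(x, block_size=10))              # feed these into the sequential test
```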

From theory to practice: DiD, SCM, and the IFE canvas

Difference-in-differences and synthetic control methods are the juggernauts of policy evaluation. They let researchers compare a treated unit to a counterfactual constructed from untreated peers, aiming to isolate the causal effect of a program. Van Meer and Koning show that, under typical assumptions that make the resulting treatment estimates exchangeable in the absence of a treatment effect, their anytime-valid testing framework can be applied to the estimators produced by DiD and SCM. In plain terms: you can monitor a DiD or SCM analysis as data flow in and still maintain a trustworthy error rate, whether you declare results before the study ends or keep collecting data while the signal is not yet strong enough.
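
To picture what "monitoring as data flow in" could look like, here is a deliberately simplified Python sketch of a per-period difference-in-differences contrast. It illustrates forming a stream of treatment estimates that a sequential test could consume; it is not the paper's exact estimator, and the variable names are hypothetical.

```python
import numpy as np

def per_period_did(y_treated_t, y_control_t, pre_treated, pre_control):
    """One-period difference-in-differences contrast, formed as data arrive.

    Simplified illustration: each post-treatment period yields one number,
    and the resulting stream could be fed into an anytime-valid test
    like the sketch earlier in this article.
    """
    return (y_treated_t - np.mean(pre_treated)) - (y_control_t - np.mean(pre_control))

# Hypothetical pre-treatment baselines and one incoming post-treatment period.
pre_treated = [2.0, 2.1, 1.9, 2.2]
pre_control = [1.0, 1.1, 0.9, 1.0]
print(per_period_did(y_treated_t=2.8, y_control_t=1.05,
                     pre_treated=pre_treated, pre_control=pre_control))
```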

The IFE, or interactive fixed effects model, provides the broader statistical backdrop for this connection. It lets researchers model unobserved, time-varying factors that affect each unit differently. When the pre-treatment patterns obey certain exchangeability properties, the treatment-effect estimators—whether from DiD, SCM, or the IFE’s own machinery—inherit the kind of probabilistic symmetry the authors need for their real-time tests. If those symmetry conditions fail, the authors propose practical remedies, such as block-structure adjustments, to curtail size distortions and preserve the integrity of inference as data flow in.

In simulations, the method demonstrates a delicate trade-off: the price of anytime-validity is a bit of power, especially early in the data stream. But as evidence accrues, the tests gain strength, and the option to reject early or late exists without inflating the risk of false positives. They also explore how a Gaussian, model-based alternative or a plug-in adaptive mixture over several plausible effect sizes can help sharpen power when you have a working belief about the likely magnitude of the impact. The overarching message is practical: real-time inferences are feasible, and they can be tuned to balance the costs of early stopping against the costs of waiting too long for a robust result.
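
As a hedged sketch of the mixture idea, the snippet below averages Gaussian likelihood-ratio e-values over a small grid of plausible effect sizes. The grid and weights are illustrative assumptions, but the key property is general: a convex combination of e-values is still an e-value, so the anytime-valid guarantee survives.

```python
import numpy as np

def mixture_e_value(x, deltas, weights=None):
    """Average Gaussian likelihood-ratio e-values over several plausible effect sizes.

    Each term has expectation 1 under H0: N(0,1), so the weighted average
    is itself a valid e-value. The grid of deltas is illustrative, not from the paper.
    """
    deltas = np.asarray(deltas, dtype=float)
    if weights is None:
        weights = np.full(len(deltas), 1.0 / len(deltas))
    e_values = np.exp(deltas * x - 0.5 * deltas**2)
    return float(np.dot(weights, e_values))

# Running product of mixture e-values plays the same role as `wealth` above.
rng = np.random.default_rng(2)
wealth = 1.0
for _ in range(50):
    x = rng.normal(loc=0.4)                        # simulated effect of unknown size
    wealth *= mixture_e_value(x, deltas=[0.2, 0.5, 1.0])
print(f"accumulated evidence: {wealth:.2f}")
```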

What this changes about policy testing and scientific humility

The stakes in program evaluation aren’t abstract. Governments, NGOs, and philanthropic organizations invest with a mix of idealism and pragmatism. Decisions often hinge on whether a program’s benefits justify continuing or scaling up. The new approach reframes how those decisions are made in real time. If early data already tell a convincing story, officials can accelerate deployment or expansion. If early data are equivocal, the method preserves the possibility of continuing to observe and learn, without sacrificing credibility when the moment for action arrives.

One of the paper’s most human angles is the explicit modeling of time and dynamics. Policy effects aren’t always sudden jumps; they can unfold gradually, evolve with the population, or respond to changing circumstances. The framework’s flexibility to handle dynamic treatment effects—whether a treatment’s impact grows, wanes, or shifts direction—addresses a perennial complaint about short-run studies: are you sure the effect you see now will persist? While no statistical method can erase uncertainty, anytime-valid inference offers a principled way to be honest about what we know at any point in the timeline.

Beyond the methodological elegance, there is a democratic thread. Real-time, valid inference lowers the barrier between data and decision-making. It enables policymakers to respond to evidence as it appears, not only after a distant audit. It also nudges researchers toward transparent reporting of when and why they stopped collecting data, and what the observed evidence looked like along the way. In a world where data streams are relentless and policy windows can close quickly, the ability to say, with rigor, that we know enough to act—or that we need more data—feels essential.

As with any advance, caveats matter. The guarantees hinge on the exchangeability conditions in the underlying treatment estimators. When serial dependence or misspecification runs strong, the authors propose practical workarounds, like block-structured testing, but the onus remains on researchers to diagnose the data-generating process carefully. The method shines when researchers have a credible pre-treatment batch of observations and a model that respects, at least approximately, exchangeability under the null. It then becomes a powerful ally in the ongoing quest to learn from policy in a way that is both rapid and responsible.

In their discussions of future directions, van Meer and Koning hint at the frontier: dynamic treatment effects, learning the alternative on the fly, and further refinements to power management without sacrificing safety. The mathematics are intricate, yes, but the ambition is refreshingly down-to-earth: give decision-makers a clearer, timelier read on whether a program is delivering on its promises, with honest boundaries about what remains uncertain.

Institutions and researchers who adopt this approach will likely see two shifts at once. First, a recalibration of the tempo of evidence: credible signs of effectiveness can arrive earlier, enabling faster iteration and course corrections. Second, a shift in the storytelling of science: results will be framed not as a single verdict after a fixed horizon, but as a narrative of evolving evidence, with the option to adjust, extend, or stop the study as the data dictate. The paper’s basic promise—truth guarded by a rigorous, time-uniform standard—feels at once exacting and humane, a necessary alignment of method with the way policy actually unfolds in the wild.

Ultimately, the Erasmus University Rotterdam team offers more than a new statistical gadget. They present a philosophy of learning under uncertainty that is continually useful as data streams multiply and policies press for faster feedback loops. If the world of social programs is a chorus of imperfect experiments, then the real-time, anytime-valid approach gives us a way to listen more closely to the music as it changes tempo—without mistaking a passing note for a chorus. It is a small, practical shift with the potential to reshape how we know what works, and how quickly we can act on that knowledge.

Lead authors: Sam van Meer and Nick Koning. Affiliations: Econometric Institute, Erasmus University Rotterdam; Tinbergen Institute. Context: Real-time inference for program evaluation using anytime-valid rank tests, bridging exchangeability concepts with practical counterfactual methods like difference-in-differences and synthetic control.