The world of causal science is full of tidy, bold promises: a number that captures the impact of a policy, a program, or a treatment, and a method that claims to reveal what would have happened if we had done something different. But real life rarely cooperates with tidy math. When researchers compress a bundle of multiple, interwoven sub-treatments into a single aggregated measure, they’re not just trimming data; they’re potentially scrubbing away crucial context. The same total dose of an intervention can come loaded with very different ingredients, and those ingredients can change outcomes in ways that a single number can’t reliably summarize. This is not a quibble about precision; it’s a fundamental design issue with how we pick and interpret our treatment variables.
The new paper by Carolina Caetano, Gregorio Caetano, Brantly Callaway, and Derek Dyal from the University of Georgia takes aim at this quiet but pervasive practice. They show that even in the best-case scenarios—random assignment, clean data—the marginal effect of an aggregated treatment is often a complicated, ambiguous blend of many sub-treatments. The weights that make up that blend aren’t unique, and they can even be negative. And as the number of sub-treatments grows, the chance of drifting into these interpretive thickets rises exponentially. It’s like trying to judge a symphony by listening to a single note from a random instrument—the note matters, but the full harmony matters more. The paper doesn’t just diagnose the problem; it offers practical paths to clearer inference, depending on whether you can observe the sub-treatments or not.
Why aggregated treatments can mislead
Imagine you’re studying the effect of enrichment activities on children’s noncognitive skills. Researchers rarely count hours in each activity separately; instead they combine homework time, music lessons, sports, volunteering, and after-school programs into a single score, a total “enrichment” dose D. Two kids could have D = 2, yet one might accumulate those two hours as one hour of homework plus one hour of music, while the other logs two hours of sports. The aggregated number hides who contributed what. If the underlying sub-treatments push outcomes in different directions, then the change in Y when D goes from 1 to 2 isn’t a clean, uniform mirror of a single underlying mechanism. That’s the heart of the problem Caetano and colleagues spotlight: ruling out hidden versions of the sub-treatments (SUTVA’s “no hidden versions” requirement, assumed at the sub-treatment level) doesn’t guarantee a well-defined intervention once you collapse them into D.
In the paper’s language, the same aggregated dose d can be achieved by many distinct sub-treatment vectors s ∈ S with A(s) = d. The mismatch between the world we observe (the sub-treatments) and the world we pretend to intervene on (the aggregated D) creates a potential-outcome function that isn’t well-defined for D alone. That is, Yi(d) may depend on which sub-treatments actually happened, not just on the total. When researchers then estimate E[Y | D = d] − E[Y | D = d − 1], they’re implicitly averaging over many different sub-treatment comparisons. Some of those comparisons are “congruent”—they move only one sub-component up by one unit, leaving the rest fixed. Others are “incongruent”—they nudge several sub-components at once in different directions. The difference between what you’re hoping to measure and what you actually average over is where trouble hides.
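To make the combinatorics concrete, here is a minimal sketch (the function names and the sum-based aggregator are illustrative choices, not the paper’s code) that enumerates the aggregation set for each dose when A(s) simply sums three sub-treatments, and counts how many adjacent-dose comparisons are congruent:

```python
from itertools import product

def aggregation_sets(max_hours, n_sub):
    """Group every sub-treatment vector s by its aggregated dose A(s) = sum(s)."""
    sets = {}
    for s in product(range(max_hours + 1), repeat=n_sub):
        sets.setdefault(sum(s), []).append(s)
    return sets

def is_congruent(s_high, s_low):
    """A comparison is 'congruent' if exactly one sub-treatment rises by one
    unit while the rest stay fixed; anything else is 'incongruent'."""
    diffs = [h - l for h, l in zip(s_high, s_low)]
    return sorted(diffs) == [0] * (len(diffs) - 1) + [1]

sets = aggregation_sets(max_hours=2, n_sub=3)
d = 2
pairs = [(hi, lo) for hi in sets[d] for lo in sets[d - 1]]
congruent = sum(is_congruent(hi, lo) for hi, lo in pairs)
print(len(sets[d]), len(pairs), congruent)  # prints: 6 18 9
```

Even in this tiny example, half of the 18 adjacent-dose comparisons are incongruent, and the share grows quickly as sub-treatments multiply.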
In practice, many social science papers routinely aggregate sub-treatments. The authors point to schooling, peer effects, crime measures, environmental exposures, and time-use studies as familiar culprits. When you read a regression that uses D as the treatment, you’re not just estimating a straightforward marginal effect; you’re effectively averaging a weighted mix of congruent and incongruent sub-treatment comparisons. The weights are not unique, and they can tilt the result in surprising directions, especially as the number of sub-treatments grows. This is not a theoretical footnote. It changes what we can claim about causality, and it invites us to rethink how we present and interpret such results.
The math behind the mischief
To parse what’s going on, the paper dives into a careful decomposition of the marginal effect of the aggregated treatment. The main result shows that the jump in the average outcome when D rises by one unit can be written as a weighted sum of marginal effects on the treated for the congruent sub-treatments plus a (potentially nonzero) contribution from incongruent sub-treatments. In plain terms: the overall change is a mashup of several smaller “micro-effects,” and those micro-effects don’t all look the same. Some come from clean, neighboring shifts in one sub-treatment; others come from juggling multiple sub-treatments in a more complex way. The weights that combine these micro-effects are not unique. Different mathematically valid weighting schemes can tell very different stories about the same aggregate change.
Why do incongruent comparisons matter so much? Because incongruent MATTs—the marginal average treatment effects on the treated for incongruent sub-treatments—can dominate the observed average difference, even when every sub-treatment effect is positive. Negative weights can crop up in the decomposition, effectively telling you that some parts of the sub-treatment landscape are pulling the average in the opposite direction of what you’d expect from a naïve reading. In other words, the math can flip the sign of an aggregate effect you’d imagined to be uniformly positive or negative, simply because of how the pieces are put together.
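The sign-flip danger can be seen in a toy simulation (the data-generating process, shares, and effect sizes below are invented for illustration, not taken from the paper): both sub-treatments have strictly positive effects, yet because the composition of sub-treatment vectors shifts between doses, the naive aggregate contrast E[Y | D = 2] − E[Y | D = 1] comes out negative.

```python
import random

random.seed(0)

def outcome(s1, s2):
    # True effects: each unit of s1 adds 1, each unit of s2 adds 10. Both positive.
    return s1 + 10 * s2 + random.gauss(0, 0.1)

# Composition differs across doses: units at D = 1 mostly hold the high-value
# sub-treatment (0, 1); units at D = 2 mostly hold two units of the low-value one.
dose1 = [(0, 1)] * 900 + [(1, 0)] * 100
dose2 = [(2, 0)] * 900 + [(1, 1)] * 100

y1 = sum(outcome(*s) for s in dose1) / len(dose1)
y2 = sum(outcome(*s) for s in dose2) / len(dose2)
print(round(y2 - y1, 1))  # negative, even though every sub-treatment effect is positive
```

The aggregate slope here is dominated by the incongruent comparison between (2, 0) and (0, 1); no congruent one-unit change in either sub-treatment has a negative effect.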
The authors formalize several key ideas with precise definitions: congruent versus incongruent sub-treatments, a marginal set M(d) of neighboring aggregation sets, and a decomposition of E[Y | D = d] − E[Y | D = d − 1] into MATT+ (congruent) and MATT− (incongruent) pieces. They also show that, in a world with many sub-treatments or with sub-treatments that have a wide range of possible values, incongruent comparisons proliferate. A central implication is stark: the aggregated marginal effect becomes harder to interpret as a true marginal effect of the sub-treatments when sub-treatments are numerous or highly heterogeneous.
Two big takeaways emerge from the math. First, the weights that translate a marginal change in D into sub-treatment terms are not unique, which means two researchers can report different marginal pictures of the same data. Second, incongruent comparisons can generate negative weights, which can flip the apparent direction of the effect. The result is not a theoretical curiosity; it’s a practical warning that the way we aggregate can distort our interpretation of causal effects, sometimes in subtle, sometimes in dramatic, ways.
What to do when sub-treatments matter
The paper doesn’t stop at diagnosis. It offers concrete, actionable paths for researchers who want to avoid the pitfalls of aggregation or to work with them in a principled way. There are two broad roads, depending on whether the sub-treatments are observed in the data.
The first road embraces non-marginal causal effects. Instead of asking how a one-unit increase in the aggregate dose changes outcomes (which invites incongruence), researchers can compare outcomes for a given dose d to the untreated group (D = 0). This baseline-to-d comparison yields a quantity called ATT(s), the average treatment effect on the treated for a specific sub-treatment vector s, and an aggregated version AATT(d) that averages those ATT(s) across all sub-treatment vectors that map to d. The key promise is simplicity and interpretability: the building blocks are the actual sub-treatments, the weights are positive and probabilistic (they reflect the share of each sub-treatment vector at that dose), and there’s no entanglement with incongruent comparisons. Identification remains possible even if the sub-treatments themselves aren’t observed, in which case AATT(d) is identified through the aggregated data and the assumed unconfoundedness of the sub-treatments with respect to the outcome.
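A minimal sketch of how this non-marginal estimand composes: AATT(d) is just a share-weighted average of ATT(s) over the vectors s that aggregate to dose d. The dictionaries and numbers below are illustrative assumptions, not estimates from the paper.

```python
# Hypothetical ATT(s) values for each sub-treatment vector mapping to d = 2,
# and the observed shares P(S = s | D = 2) of those vectors among treated units.
att = {(2, 0): 2.0, (1, 1): 11.0, (0, 2): 20.0}
share = {(2, 0): 0.5, (1, 1): 0.3, (0, 2): 0.2}

# AATT(d): weights are nonnegative, sum to one, and have a plain probabilistic
# meaning, so no incongruent comparisons sneak into the average.
aatt_d = sum(share[s] * att[s] for s in att)
print(round(aatt_d, 2))  # prints: 8.3
```
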
The second road is for researchers who do observe the sub-treatments. If you can observe the vector S of sub-treatments, you can construct AMATT+. This is a weighted average of the congruent MATT+ parameters, with weights designed to emphasize the more common sub-treatments and to stay faithful to the joint distribution of sub-treatments across adjacent doses. The authors even propose a practical, plug-in weighting scheme (normalized product weights) that tends to give more weight to the sub-treatment vectors that actually co-occur with a given aggregate level. This approach preserves a sense of marginal interpretation but confines it to the realm of congruent comparisons, sidestepping the incongruent comparisons that can plague the naive aggregate slope.
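The paper’s exact weighting formula is its own; purely as a hedged sketch, one way such a “normalized product” scheme could work is to weight each congruent pair (s, s − e_j) by the product of the share of s at dose d and the share of s − e_j at dose d − 1, renormalized to sum to one. The shares below are invented for illustration.

```python
# Illustrative shares of sub-treatment vectors at adjacent doses (assumed data).
p_d = {(2, 0): 0.6, (1, 1): 0.4}    # P(S = s | D = 2)
p_dm1 = {(1, 0): 0.7, (0, 1): 0.3}  # P(S = s | D = 1)

def congruent_pairs(high, low):
    """Pairs where exactly one sub-treatment rises by one unit."""
    pairs = []
    for s in high:
        for t in low:
            diffs = [a - b for a, b in zip(s, t)]
            if sorted(diffs) == [0] * (len(diffs) - 1) + [1]:
                pairs.append((s, t))
    return pairs

# Product weights, renormalized so they are positive and sum to one: common,
# co-occurring sub-treatment vectors get the most weight.
raw = {pair: p_d[pair[0]] * p_dm1[pair[1]] for pair in congruent_pairs(p_d, p_dm1)}
total = sum(raw.values())
weights = {pair: w / total for pair, w in raw.items()}
print(weights)
```

By construction the weights are strictly positive, so the sign of the resulting AMATT+-style average cannot be flipped by the weighting itself.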
In the end, the authors show that there’s a useful trade-off. If you want to keep a single scalar summary while staying close to causal interpretations, you can report AMATT+ or its aggregated cousin AMATT+(D). If you’re comfortable moving away from marginal interpretations, you can emphasize non-marginal quantities like E[Y | D = d] − E[Y | D = 0], which cleanly decompose into sub-treatments without hinging on tricky marginal weights. The paper also reminds us that reporting multiple targets can be valuable: a regression coefficient on D in a standard Y ~ D model might look convenient, but it hides a nontrivial weighting scheme and can be affected by incongruence in ways that surprise readers who assume a straightforward, single-parameter story.
In their empirical illustration using PSID’s Childhood Development Supplement data on enrichment activities and children’s noncognitive skills, the authors walk through a concrete demonstration of the ideas. They show how the same aggregate dose can be supported by very different sub-treatment profiles, and how the minimally incongruent weighting scheme still assigns meaningful weight to incongruent comparisons at several dose levels. The takeaway is not that enrichment programs are or aren’t effective; it’s that when you summarize a bundle of activities into a single hours-of-enrichment variable, you risk mixing together distinct causal stories. The point is to equip researchers with diagnostics and alternative estimands that reveal what aggregation hides, and to show when it’s safe to proceed with a simple marginal narrative versus when you should switch to a non-marginal or sub-treatment–aware summary.
What does this mean for researchers and readers? It means being deliberate about how we define “treatment” and what we’re willing to claim about causality. If sub-treatments are observable and plentiful, it’s wise to target congruent, interpretable quantities that respect the actual composition of the treatment. If sub-treatments aren’t observed, we can still learn meaningful things by focusing on non-marginal effects that aggregate sub-treatment effects in a way that remains robust to the aggregation. And regardless of the path, the paper argues for transparency: report the diagnostic checks that reveal incongruence, or report a suite of target parameters that together give a fuller, more honest picture of the causal landscape.
All four authors (Carolina Caetano, Gregorio Caetano, Brantly Callaway, and Derek Dyal) are affiliated with the John M. Godfrey, Sr. Department of Economics at the University of Georgia. The work is a thoughtful reminder that science advances not only through clever methods, but through careful attention to what our variables actually capture. When we collapse a complex, multi-faceted reality into a single number, we’re stacking the deck with assumptions. The more we name, test, and adjust for those assumptions, the closer we come to understanding what truly causes what, and why the same dose can taste like very different experiences depending on the ingredients.
Bottom line: aggregated treatment variables are a practical staple in social science, but they’re not neutral. The Caetano et al. framework helps us see the hidden geometry beneath a single slope and offers practical tools to either avoid or robustly interpret it. In a world where policy decisions ride on causal estimates, that clarity isn’t just mathematical—it’s essential for making wiser choices about which programs to fund, how to measure their value, and how to talk about their effects in a way that matches the messy, real world they aim to improve.