Does the mean hide bias in cell perturbation models?

In the fast-growing world of single-cell biology, researchers race to predict how a cell’s gene expression will respond when you tweak a gene or apply a drug. The dream is a tireless, scalable in silico lab: test thousands of perturbations in seconds, cut costs, and accelerate discovery. But a recent collaboration, centered at Shift Bioscience in Cambridge, UK, with collaborators at the Vector Institute and the University of Toronto, looks under the hood and uncovers a quiet, stubborn bias in how these models are judged. The lead authors, Gabriel M. Mejia and Henry E. Miller, argue that much of what passes for “better performance” on public benchmarks isn’t learning at all. It’s a statistical mirage born from how we measure things and what we compare against. They propose a different, more honest way to judge perturbation-predicting models: weighted metrics that prize genuine, perturbation-specific signals over rote averages. The result isn’t just a tweak to scoring; it’s a prescription for making virtual testing a real guide to biology rather than a mirror that reflects the dataset’s own quirks.

What this paper examines is deceptively simple: when you ask a model to predict how cells react to perturbations, the most boring predictor—the dataset mean—often looks stunning. But that apparent brilliance is not evidence of understanding. It’s a side effect of biased controls and a weakness in common metrics that reward predicting the same old baseline rather than the interesting, niche signals that actually differentiate one perturbation from another. The authors show this with both synthetic data and two real Perturb-seq datasets, revealing a pervasive problem: if you calibrate your evaluation against a biased reference, you’ll prize the wrong kind of accuracy. And that’s a risk not just for numbers, but for biology itself, because it nudges researchers toward models that do well for the wrong reasons and miss the subtle, high-value signals that could point to new therapies or deeper biological insights.

What broke in perturbation benchmarks

To understand the problem, think of a classroom where every student’s score is compared to a single, possibly biased, “control” class. If that control class isn’t perfectly balanced—if it carries its own quirks or systemic shifts—the student who merely copies the average of all perturbed students can look like the top performer. In the world of single-cell RNA sequencing, this manifests as a model that seems to predict perturbation responses simply by predicting the mean of all perturbed cells. That mean baseline can appear to beat advanced models on standard metrics, even though no model has truly learned to distinguish among perturbations.

The researchers highlight two entwined culprits. First is control bias: a systematic shift between the control cell population and the rest of the perturbations. When the control isn’t perfectly centered, the differences between perturbed and control cells (the perturbation deltas) begin to resemble the average effect of all perturbations. In other words, delta-based metrics can be gamed by a dataset whose control is off-kilter. The second culprit is signal dilution: the biological signal that truly distinguishes one perturbation from another can be sparse and high-dimensional, easily drowned out by all the noise and by the distributional pull of the broader dataset. When you evaluate with common loss functions and metrics that treat every gene equally, you reward a model that tracks the global distribution rather than one that captures the critical, perturbation-specific shifts.
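
To see how this plays out, here is a small, self-contained simulation (an illustration of the idea, not code from the paper). Perturbation effects are sparse, the control carries a systematic shift, and the question is how well the dataset-mean predictor scores on a Pearson correlation of deltas measured against that biased control.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_perts = 2000, 100

# True, sparse perturbation effects: each perturbation moves only ~1% of genes.
effects = np.zeros((n_perts, n_genes))
for i in range(n_perts):
    degs = rng.choice(n_genes, size=20, replace=False)
    effects[i, degs] = rng.normal(0.0, 2.0, size=20)

baseline = rng.normal(5.0, 1.0, size=n_genes)   # unperturbed expression profile
perturbed_means = baseline + effects            # mean expression profile of each perturbation

for bias_scale in [0.0, 0.5, 1.0, 2.0]:
    # Control bias: a systematic, gene-wise shift between the measured control and the true baseline.
    biased_control = baseline + rng.normal(0.0, bias_scale, size=n_genes)

    # Deltas the usual way: each perturbation's mean minus the (biased) control.
    true_deltas = perturbed_means - biased_control

    # The "boring" predictor: the mean of all perturbed cells, identical for every perturbation.
    mean_pred_delta = perturbed_means.mean(axis=0) - biased_control

    corrs = [np.corrcoef(mean_pred_delta, d)[0, 1] for d in true_deltas]
    print(f"control bias sd = {bias_scale:.1f} -> mean-baseline Pearson(delta) ~ {np.mean(corrs):.2f}")
```

As the bias grows, the shared offset from the shifted control dominates both vectors, and the mean baseline’s correlation climbs even though it carries no perturbation-specific information at all.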

These ideas aren’t abstract curiosities. The authors back them up with a mix of simulations and real data from Norman et al.’s 2019 Perturb-seq CRISPRa dataset and Replogle et al.’s 2022 CRISPRi dataset. They show that, under realistic levels of bias and signal sparsity, conventional metrics push you toward the mean baseline. In practical terms, that means you’re more likely to buy a model that says, “Here’s the average perturbation effect,” rather than a model that can tell you, “This particular perturbation causes a niche, gene-specific change that matters for a drug target.”

To make matters worse, the metrics most often used to rank models, MSE and Pearson correlation on deltas, are easy to satisfy: Pearson ignores the scale of the response, MSE treats every gene alike, and both are susceptible to control bias. A model that only replicates the average outcome can glide through these tests, while a truly predictive model that captures rare but important signals may be unfairly penalized. The authors’ instinct is to flip the problem: instead of evaluating against the control, evaluate against a reference rooted in the full, perturbed population. That simple switch changes the entire calibration of the scoring system and exposes whether a model is genuinely learning biology or just echoing a biased baseline.
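
A minimal sketch of that switch, with toy numbers and illustrative variable names: once deltas are referenced to the mean of all perturbed cells (µall) rather than the control, the mean baseline’s predicted delta is exactly zero, so it can no longer collect credit for changes it never predicted.

```python
import numpy as np

rng = np.random.default_rng(1)
pert_means = rng.normal(size=(50, 500))              # toy mean profiles, one row per perturbation
mu_all = pert_means.mean(axis=0)                     # mean of all perturbed cells
control = mu_all + rng.normal(0.0, 0.8, size=500)    # a control shifted away from the perturbed population

# Referenced to the biased control, the mean baseline's delta correlates with every true delta.
vs_control = np.mean([np.corrcoef(mu_all - control, p - control)[0, 1] for p in pert_means])
print(f"mean baseline, Pearson(delta) vs control reference: ~{vs_control:.2f}")

# Referenced to mu_all, the mean baseline predicts zero change for every gene.
print("mean baseline's delta vs mu_all reference:", np.max(np.abs(mu_all - mu_all)))
```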

A better way to measure a real signal

If you want to detect real perturbation-specific biology, you have to measure what actually changes in a cell relative to the average perturbation, not to an unbalanced control. Mejia and colleagues propose two DEG-aware metrics that do just that: Weighted MSE (WMSE) and a weighted version of R-squared on deltas, R²w(∆). The core idea is to tilt your lens toward differentially expressed genes (DEGs), the genes that truly move in response to perturbations, while keeping the analysis anchored to all perturbations, not just the control. The math matters less than what the weights do: they amplify the importance of niche signals and shrink the role of background noise that comes from genes that barely budge across perturbations.

In practice, WMSE is a straightforward twist on the classic MSE. Each gene’s contribution to the error is weighted by how strongly that gene responds to perturbations across the whole dataset, with weights derived from DEG statistics calculated against all other perturbations (not against control). The effect is dramatic: genes that carry the meaningful, perturbation-specific information get more weight, while ubiquitous or uninformative genes fade into the background. The authors also introduce R²w(∆), a weighted delta R² that measures how well a model recovers the true perturbation-induced changes when you reference the mean of all perturbed cells rather than the control. This choice matters because it removes a source of bias and emphasizes the scale of the response: whether a model captures both the direction and the magnitude of change across the most informative genes.
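
A sketch of what these two metrics could look like in code. The paper derives its weights from DEG statistics computed against all other perturbations; the version below uses a simple stand-in (the absolute leave-one-out change per gene, normalized per perturbation) and an R² referenced to µall, so the function names and the weight formula are illustrative assumptions rather than the authors’ implementation.

```python
import numpy as np

def deg_weights(pert_means: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Stand-in for DEG-derived weights: how far each gene moves in a perturbation
    relative to the mean of all *other* perturbations (not relative to control)."""
    n = pert_means.shape[0]
    mu_rest = (pert_means.sum(axis=0, keepdims=True) - pert_means) / (n - 1)
    score = np.abs(pert_means - mu_rest)
    return score / (score.sum(axis=1, keepdims=True) + eps)

def wmse(pred: np.ndarray, true: np.ndarray, w: np.ndarray) -> float:
    """Weighted MSE: squared errors are tilted toward DEG-like genes."""
    return float((w * (pred - true) ** 2).sum(axis=1).mean())

def r2_w_delta(pred: np.ndarray, true: np.ndarray, w: np.ndarray) -> float:
    """Weighted delta R^2 referenced to the mean of all perturbed profiles (mu_all),
    so a model that simply predicts mu_all scores zero."""
    mu_all = true.mean(axis=0)
    ss_res = (w * (true - pred) ** 2).sum()
    ss_tot = (w * (true - mu_all) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

# Toy check: the dataset-mean baseline lands at R2_w(delta) ~ 0 by construction.
rng = np.random.default_rng(0)
true_profiles = rng.normal(size=(40, 300))
w = deg_weights(true_profiles)
mean_baseline = np.tile(true_profiles.mean(axis=0), (40, 1))
print(r2_w_delta(mean_baseline, true_profiles, w))                        # ~0.0
print(wmse(mean_baseline, true_profiles, w), wmse(true_profiles, true_profiles, w))
```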

Crucially, these weighted metrics do more than shift rankings. They restore a kind of honesty to benchmarking by ensuring that a model’s score reflects its ability to detect high-value signals rather than to mimic the average effect. They also acknowledge that in complex biology, most perturbations don’t flip a vast swath of genes. A few DEGs are the signal; the rest is noise. The weighting scheme is built to respect that reality, nudging learning algorithms toward the sparse, high-variance predictions that actually matter for understanding cellular responses and, potentially, for predicting therapeutic effects.

Calibrating metrics with baselines you can trust

The authors don’t stop at redefining the metrics; they also propose a principled way to interpret them. They introduce three baselines to anchor what counts as good or poor performance: a negative baseline based on the control mean (µc), a null baseline based on the mean of all perturbed cells (µall), and a positive, empirical ceiling drawn from a technical duplicate baseline that simulates what performance would look like if you split a perturbation’s cells in half and predicted one half from the other. The last baseline is not a toy. It embodies the practical limit set by the intrinsic variance of the experiment. If a model can’t beat the technical duplicate, it’s not really learning anything beyond the data’s own noise floor.
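
A rough sketch of how those three anchors might be computed from a cells-by-genes matrix with perturbation labels. The splitting scheme and names here are assumptions for illustration; the paper’s exact procedure may differ in details such as how the halves are scored.

```python
import numpy as np

def anchor_predictions(X: np.ndarray, labels: np.ndarray, control_label: str, seed: int = 0):
    """For each perturbation, return the three anchoring predictions plus a held-out target:
    negative = control mean (mu_c), null = mean of all perturbed cells (mu_all),
    positive = technical duplicate (one random half of the cells predicts the other half)."""
    rng = np.random.default_rng(seed)
    mu_c = X[labels == control_label].mean(axis=0)
    mu_all = X[labels != control_label].mean(axis=0)
    anchors = {}
    for p in np.unique(labels):
        if p == control_label:
            continue
        cells = X[labels == p]
        idx = rng.permutation(len(cells))
        half_a, half_b = cells[idx[: len(cells) // 2]], cells[idx[len(cells) // 2 :]]
        anchors[p] = {
            "negative": mu_c,                  # predicts "no effect"
            "null": mu_all,                    # the degenerate dataset-mean baseline
            "positive": half_a.mean(axis=0),   # empirical ceiling set by technical variance
            "target": half_b.mean(axis=0),     # what every prediction is scored against
        }
    return anchors
```

A model’s score on any metric can then be placed between these anchors: the negative and null baselines mark what no learning looks like, and the technical duplicate marks the ceiling set by the data’s own noise.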

When these baselines are paired with the DEG-aware metrics, something interesting happens: the mean baseline, which previously could look dazzling on unweighted Pearson(∆), sinks to null performance. The technical duplicate baseline rises as the ceiling, revealing which models truly capture perturbation-specific signals. The contrast is not just a statistical curiosity; it’s a guardrail against overclaiming progress in a field where the stakes—new therapies, new understanding—are high. The authors even show that you can use WMSE as a training objective, not just as a post hoc score. Training with a biology-informed loss function reduces a dreaded problem called mode collapse, where the model learns to map many perturbations to a single, boring outcome. In their experiments, WMSE-driven training nudges models away from predicting the dataset mean toward richer, more meaningful perturbation responses.
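
For the training-objective use, here is a minimal PyTorch-style sketch of a WMSE loss; the model, data loader, and weight tensors are hypothetical placeholders, not the authors’ code.

```python
import torch

def wmse_loss(pred: torch.Tensor, target: torch.Tensor, gene_weights: torch.Tensor) -> torch.Tensor:
    """Weighted MSE training objective: up-weights DEG-like genes so that the easy
    strategy of predicting the dataset mean no longer minimizes the loss."""
    return (gene_weights * (pred - target) ** 2).sum(dim=-1).mean()

# Hypothetical training step: `model` maps a perturbation representation to an expression
# profile, and `weights` are precomputed DEG-derived gene weights for each batch example.
# for pert_emb, target_expr, weights in loader:
#     optimizer.zero_grad()
#     loss = wmse_loss(model(pert_emb), target_expr, weights)
#     loss.backward()
#     optimizer.step()
```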

What this means for biology and medicine

Beyond the neatness of cleaner metrics, the paper makes a practical case for how to accelerate real discovery. In silico perturbation screening—predicting how cells respond to a thousand or a million perturbations without running a lab experiment—holds enormous promise for identifying drug targets, predicting off-target effects, and understanding complex gene networks. If the metrics used to evaluate these models reward the wrong thing, the entire pipeline can misfire: expensive experimental work can be chasing the shadow of a statistical trick rather than a genuine biological signal. By re-calibrating how we measure success, Mejia and colleagues steer the field toward models that live up to the promise of virtual screening: they reveal real, perturbation-specific biology that could guide experiments, save time, and reduce the need for animal testing in early-stage research.

The broader implication is a reminder that science advances not just by bigger models or more data, but by better questions and better tests. The paper’s central insight—avoid rewarding a degenerate baseline and weight evaluation by where biology actually changes—could ripple through other domains of computational biology. It’s a nudge toward more thoughtful benchmarks, to ensure that the next generation of foundation models in single-cell biology learns something genuinely useful, rather than something that looks impressive on a slide but collapses in the clinic or the lab bench.

And there’s a humanizing takeaway here. The study was born out of a collaboration that crosses borders and disciplines, uniting rigorous statistics with a visceral desire to move biology forward. The Shift Bioscience team, anchored in Cambridge, partnered with researchers at the Vector Institute and the University of Toronto to ground their ideas in real data and real-world constraints. The authors, led by Gabriel Mejia and Henry Miller with contributions from Francis Leblanc, Bo Wang, and Lucas de Lima Camillo, show that careful design of both metrics and training objectives can unlock a more faithful map of cellular responses. It’s the scientific equivalent of replacing a flaky compass with a well-calibrated one: you may still head into uncharted territory, but at least you won’t walk in circles.

In the end, the message is hopeful and pragmatic: by removing reference bias, adopting DEG-aware metrics, and training with signals that truly matter, we can distinguish models that merely memorize from models that learn. The mean baseline no longer reigns supreme; the real biology—scattered, high-variance, and deceptively sparse—gets its due. If you’re building the next generation of in silico perturbation tools, this paper gives you a clearer, fairer yardstick and a sharper instrument to tune your models toward real, testable biology rather than the echo of a biased benchmark.

As a field, we’re learning to trust metrics that reward genuine understanding. That shift won’t just produce nicer graphs; it could accelerate the discovery of therapies, illuminate the architecture of gene networks, and shorten the path from bench to bedside. That’s not merely progress in a statistical sense—it’s progress in how we understand life at its most intricate, distributed, and wonderful level: the single cell.