Dyadic Deception Gets a Multimodal Second Opinion

The room is quiet, and the two strangers sit across from each other, one weaving a story, the other probing with questions. For a long time, humans have tried to read deception from tiny, flickering signs: a flick of the eye, a hesitant pause, a slight tremor in the voice. The new study from Stockholm University nudges us toward a different instinct: deception isn’t a solo performance. It’s a duet. And if you want to detect it, you’ll need to listen to both sides of the exchange—and you’ll want a few cameras and microphones to help you hear what humans often miss. The researchers, led by Franco Rugolon and Thomas Jack Samuels at Stockholm University, scrutinize deception in a Swedish cohort by pairing advanced machine learning with data that captures both the deceiver and the deceived. It’s a bold reminder that truth in conversation is a dynamic, two-person phenomenon, not a one-person cue sheet.

Amid the drumbeat of headlines about artificial intelligence and algorithmic truth-tellers, the study asks a deceptively simple question: can a machine do better at catching lies if it watches a dialogue from both participants, across both voice and face, and fuses those signals in a disciplined way? The answer, at least in this Scandinavian lab, is a cautious yes. The team shows that combining speech with facial signals, and crucially including data from both people in the interaction, yields more accurate deception detection than any single-signal approach. It’s not a magic bullet, and it’s not universally foolproof, but it is a meaningful step toward machines that understand the choreography of a real conversation rather than just isolated gestures or syllables.

Stockholm University researchers led by Franco Rugolon and Thomas Jack Samuels (with colleagues Stephan Hau and Lennart Högman) built and tested a framework on a newly collected dataset of native Swedish speakers asked to truthfully recount experiences or to fabricate them under controlled conditions. The goal wasn’t to stage a courtroom-ready lie detector, but to probe whether a multimodal, dyadic approach could reveal patterns that slip past human observers who rely on intuition or single cues. And the results point to a nuanced portrait: when you listen to both voices and watch both faces, the machine has a better chance of distinguishing truth from fiction than when you isolate one person or one kind of signal.

Two Voices, One Signal

Traditionally, deception research has leaned on either what a person says or what a person looks like, sometimes assuming that nonverbal cues—facial expressions, eye movements, gestures—are the telltale markers of deceit. The field has grown wiser about how tricky such cues can be. Some cues are consciously suppressed; others are spontaneously regulated; culture, stakes, and personality all color the display. The Stockholm study adds a crucial twist: it treats deception as a dyadic, interactive process. In other words, cue patterns aren’t just about the sender; they emerge from the interaction between sender and listener, and from how each participant responds in real time.

That shift has a practical snag: humans often rely on a single modality—speech or face—when forming quick judgments. The researchers push back against the idea that one signal rules deception. They argue, with psychological theory in mind, that cross-modality patterns—how voice and face coordinate, moment by moment—are essential to understanding if someone is lying. This aligns with Interpersonal Deception Theory, which emphasizes the dynamic nature of deception as an exchange between people, not a monologue performed in isolation. It’s a reminder that truth in conversation is a dance, and the dance is best understood by watching the whole floor, not just one dancer.

To explore this, the team runs a careful experiment with visual and vocal data streams. They extract facial action units—micro-adjustments of the lips, brows, and cheeks—and gaze information from video, and they pull a suite of acoustic features from speech. The idea is to capture both what a liar’s face does while they talk and how their voice modulates tone, pitch, and energy as they attempt to persuade or conceal. The core hypothesis is not that one cue will pop out as the universal tell, but that a constellation of cues across modalities, especially when you consider both participants, will provide a clearer signal than any single cue alone.

The Swedish Lab Experiment

The study recruited 44 native Swedish speakers, forming 22 dyads that interacted in a controlled lab setting. In each dyad, one person played the role of sender and was assigned to tell the truth or to lie about prepared prompts; the other person acted as the receiver and asked follow-up questions to foster a natural conversation. The sessions lasted about 15 minutes, with seven rounds of two-minute exchanges. The setup was deliberately intimate: two cameras directed at each participant, microphones aimed at the conversation, and a protocol to minimize artifacts—like removing facial coverings and keeping ambient noise low—to get clean reads of nonverbal behavior.

Data collection wasn’t about storytelling alone. It was about precision—peeling apart when and how certain signals arise in deception. The researchers used OpenFace to extract facial cues and eye-tracking data, and OpenSmile to pull out the GeMAPS acoustic features that describe voice quality, energy, and cadence. Each feature was tracked frame by frame, turning a conversation into a richly labeled time series. The team then organized the data into modalities and participants: facial cues from the sender, facial cues from both participants, voice cues from the sender, and voice cues from both participants. The big question was whether including both participants and both modalities would unlock patterns that single-stream analyses miss.
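To make that bookkeeping concrete, here is a minimal sketch of how per-frame exports might be collapsed into per-exchange feature vectors grouped by modality and participant. The column names, summary statistics, and helper function are invented for illustration under the assumption that the features arrive as frame-by-frame tables; this is not the study's actual feature engineering.

```python
# Sketch: collapse frame-by-frame feature tables (as OpenFace / openSMILE
# might export them) into per-exchange summaries, one group per modality
# and participant. All column names and stand-in data are illustrative.
import numpy as np
import pandas as pd

def summarize(frames: pd.DataFrame, prefix: str) -> pd.Series:
    """Reduce a frame-by-frame feature table to simple summary statistics."""
    numeric = frames.select_dtypes("number")
    stats = pd.concat([numeric.mean().add_suffix("_mean"),
                       numeric.std().add_suffix("_std")])
    return stats.add_prefix(f"{prefix}_")

# Stand-ins for per-frame exports; real exports would be read from disk.
rng = np.random.default_rng(0)
fake = lambda cols: pd.DataFrame(rng.normal(size=(500, len(cols))), columns=cols)

exchange_features = pd.concat([
    summarize(fake(["AU06_r", "AU12_r", "gaze_angle_x"]), "face_sender"),
    summarize(fake(["AU06_r", "AU12_r", "gaze_angle_x"]), "face_receiver"),
    summarize(fake(["F0_semitone", "loudness", "jitter"]), "voice_sender"),
    summarize(fake(["F0_semitone", "loudness", "jitter"]), "voice_receiver"),
])
print(exchange_features.head())
```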

Crucially, the study wasn’t a playground for the flashiest neural network or the most exotic time-series trick. Given the relatively small sample size, the researchers steered away from the most complex “joint fusion” architectures that risk overfitting. Instead, they tested three well-understood fusion strategies: unimodal (one modality at a time), early fusion (combining modalities into one representation before learning), and late fusion (learning separate models per modality and then merging their predictions with a meta-model). The choice to emphasize late fusion—where each modality is allowed to shine on its own before a simple decision layer fuses their opinions—proved pivotal in the results.
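As a rough illustration of what those three strategies look like in practice, the sketch below trains per-modality classifiers, concatenates features for early fusion, and stacks a simple meta-model for late fusion. The synthetic data, logistic-regression models, and train/test split are assumptions made for the example, not the authors' pipeline.

```python
# Minimal sketch of unimodal, early-fusion, and late-fusion setups.
# Data, model choices, and split are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 200                                    # hypothetical number of labeled exchanges
X_face = rng.normal(size=(n, 35))          # e.g., facial action units + gaze summaries
X_voice = rng.normal(size=(n, 88))         # e.g., GeMAPS-style acoustic functionals
y = rng.integers(0, 2, size=n)             # 1 = lie, 0 = truth

idx_train, idx_test = train_test_split(np.arange(n), test_size=0.3, random_state=0)

# Unimodal: one model per modality.
uni_face = LogisticRegression(max_iter=1000).fit(X_face[idx_train], y[idx_train])
uni_voice = LogisticRegression(max_iter=1000).fit(X_voice[idx_train], y[idx_train])

# Early fusion: concatenate features into one representation before learning.
X_early = np.hstack([X_face, X_voice])
early = LogisticRegression(max_iter=1000).fit(X_early[idx_train], y[idx_train])

# Late fusion: per-modality predictions feed a small meta-model.
# (In practice the meta-model would be fit on out-of-fold predictions
# to avoid leakage; this sketch skips that for brevity.)
p_face = uni_face.predict_proba(X_face[idx_train])[:, 1]
p_voice = uni_voice.predict_proba(X_voice[idx_train])[:, 1]
meta = LogisticRegression().fit(np.column_stack([p_face, p_voice]), y[idx_train])

p_face_test = uni_face.predict_proba(X_face[idx_test])[:, 1]
p_voice_test = uni_voice.predict_proba(X_voice[idx_test])[:, 1]
late_pred = meta.predict(np.column_stack([p_face_test, p_voice_test]))
print("late-fusion accuracy:", accuracy_score(y[idx_test], late_pred))
```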

What It Means to Read Deception as a Dialogue

The findings ride on a simple, powerful insight: the combination of speech and facial signals generally outperformed any single modality, and data from both participants consistently boosted performance. The study’s standout result sits in the late-fusion camp: the best-performing approach used data from both modalities and both participants, feeding the modality-specific predictions into a meta-model for the final call. The reported 71% accuracy under this setup marks a meaningful jump beyond unimodal baselines and beyond early-fusion configurations. And there’s a striking practical detail tucked in the numbers: the model achieved perfect precision for lies in the best case, meaning that every statement it labeled as a lie actually was one. That absence of false positives matters in sensitive contexts, where wrongful accusations can carry heavy consequences.
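For readers who want the "perfect precision" claim in concrete terms: precision for the lie class is the share of lie-labeled statements that really are lies, so perfect precision means zero false alarms, even if some lies slip through undetected. The toy numbers below are invented purely to show the arithmetic; they are not the study's confusion matrix.

```python
# Toy illustration of "perfect precision for lies"; counts are invented.
from sklearn.metrics import precision_score, recall_score, accuracy_score

# 1 = lie, 0 = truth. Every statement flagged as a lie really is one
# (no false positives), though some lies are missed (false negatives).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print("precision (lie):", precision_score(y_true, y_pred))  # 1.0 -> no false alarms
print("recall (lie):   ", recall_score(y_true, y_pred))     # 0.5 -> some lies missed
print("accuracy:       ", accuracy_score(y_true, y_pred))   # 0.8
```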

Several threads help explain why this works. First, the visual channel (faces, eye gaze) and the auditory channel (prosody, timbre, rhythm) respond to deception in different ways and under different cognitive pressures. When someone is trying to lie, they may intentionally suppress certain facial movements while still exhibiting changes in voice, or vice versa. Second, and perhaps more surprisingly, incorporating the receiver’s signals matters. The other person’s reactions—questions that probe consistency, the listener’s own micro-responses—shape the deception dynamics, and the machine benefits from seeing how the interaction unfolds in both directions. This dovetails with theories of deception that view it as an interactive, ongoing process rather than a static trait of one person.

These results also underscore a nuanced take on long-standing ideas about “leakage”—the notion that deception inevitably leaks through involuntary facial tells. The unimodal facial results in this study were only modestly informative, suggesting that nonverbal leakage is not a universal fingerprint of lying, at least not in the contexts tested. The cognitive-load story—deception as a demanding task that taxes working memory and narrative coherence—finds a stronger ally in vocal cues, which often reveal subtle shifts in energy and cadence as someone fabricates. When you defer to late fusion, the model can triangulate these modality-specific hints and temper them with cross-modal consistency, yielding a more robust verdict than any single channel could deliver.

What It Means for Our Notion of Truth

Beyond the numbers, the study offers a conceptual pivot. It lends weight to Interpersonal Deception Theory’s insistence that deception is a relational, evolving act. The fact that dyadic data—signals from both sides of the exchange—improves detection aligns with the idea that deception is not a solo performance but a joint negotiation of credibility. In other words, truth in conversation is not a static readout of one person’s nonverbal repertoire; it’s the emergent property of an interaction in which both parties contribute, respond, and adapt in real time. The results imply that to understand deception, we should study the entire conversational ecosystem, not just the most dramatic facial micro-movements or the most expressive syllable.

And yet the research also humbles the dream of a universal lie detector. Even with multimodal data from both participants, the authors stress that the best accuracy remains far from perfect across all contexts, and generalizability remains a live question. The Swedish cohort provides a crucial testbed, but deception differs across stakes, cultures, and individual differences. The authors argue for larger, more diverse datasets and for careful attention to ecological validity. They also remind readers of the ethical terrain: how such technology could be deployed in courts, clinics, or workplaces raises questions about privacy, consent, bias, and the risk of overreliance on algorithmic judgments in emotionally charged situations.

That ethical frame matters. The researchers explicitly call for transparency (explainability) and for guardrails that prevent misuse. In a GDPR world where decisions made by automated systems can affect people’s lives, understanding not just what the model predicts but why it reached that conclusion is essential. The potential for cultural or demographic biases in facial expression and vocal patterns invites a thoughtful, ongoing audit of datasets and models. In short, the paper is as much a compliance, fairness, and governance project as it is a scientific one.

From Lab to Real Life: The Road Ahead

What’s next, the authors suggest, is both technical and societal. On the technical front, more diverse and larger multimodal datasets are needed to train models that generalize beyond low-stakes laboratory tasks. High-stakes deception—think legal proceedings or security screenings—likely presents different nonverbal dynamics, and models trained only on casual, low-stakes dialogues could misfire in more consequential settings. The researchers also emphasize the value of dyadic data in psychotherapy and counseling contexts, where understanding how a patient and therapist co-create meaning could illuminate trust, transparency, and miscommunication in therapeutic rapport.

On the societal front, the study is a cautionary map: it shows what’s possible when we widen the lens from one speaker to two and from a single cue to a constellation of cues. It invites policymakers, ethicists, and practitioners to consider where and how such tools should be used, and to invest in infrastructure for data sharing, methodological transparency, and human-centered safeguards. The authors even advocate for open-source benchmarking datasets to foster reproducibility and cross-cultural comparisons—an invitation to the broader research community to test ideas across languages, contexts, and stakes.

In the end, the work from Stockholm University gives us a vivid image of deception as a choreography rather than a caricature. It reminds us that the best way to understand a lie might be to watch the dance between two people—the way their voices mingle with their faces, the way each reacts to the other’s questions, and the way the conversation as a whole reveals whether the story holds together. It’s not a final proof of a universal detector, but it is a meaningful advance that nudges both science and society toward listening more carefully to the two voices in every dialogue we navigate.

If you take away one idea from this study, let it be this: truth in social exchange lives in the interaction itself, not in any single cue. And when machines learn to read that interaction more holistically—by honoring both participants and by respecting the choreography of voice and face—they move a little closer to understanding one of humanity’s most enduring mysteries: when is a lie a lie, and when is it merely a passing tremor in a conversation?

Lead investigators and affiliations: The work was conducted at Stockholm University in Sweden, with Franco Rugolon, Thomas Jack Samuels, Stephan Hau, and Lennart Högman authoring the study. The departments behind the investigation are the Department of Computer and Systems Sciences and the Department of Psychology at Stockholm University, Stockholm, Sweden.