OmniEval tests AI that sees, hears, and thinks

In a world buzzing with screens that listen, watch, and read, the real magic is less about clever tricks and more about coherence. Can a system stitch together what it sees on the screen, what it hears in the soundtrack, and what it reads or hears as captions and dialogue to produce something that feels genuinely understanding rather than a bag of clever responses? A new benchmark named OmniEval dares to measure just that kind of all-in-one perception. It’s not a flashy gadget—it’s a rigorous testbed that asks a single, stubborn question: can a model truly fuse audio, video, and text into a coherent picture of what’s happening, in real time, across languages?

The work behind OmniEval comes from Huawei Noah’s Ark Lab, with collaboration from the University of Science and Technology of China. The team—led by Yiman Zhang and collaborators including Ziheng Luo, Qiangyu Yan, Wei He, Borui Jiang, Xinghao Chen, and Kai Han (the corresponding author)—built a benchmark that pushes multimodal systems beyond syllables and frames into genuine cross-modal reasoning. It’s a bit like asking someone to watch a movie with the sound off, turn on the subtitles, and still be able to explain not just what happened, but why it happened and when it happened. OmniEval is designed to test that deeper kind of understanding, not just surface recognition.

Why does this matter now? Because the hype around “multimodal” AI often treats seeing, hearing, and reading as separate skills that can be piled on top of one another. Real-world tasks—robotic assistants that must navigate a noisy kitchen, educational tools that can explain a video in multiple languages, or accessibility-powered apps that describe scenes for people who are blind—demand a more integrated sense of the world. OmniEval treats integration as the core challenge, examining how tightly the three information streams interact and whether models can reason across them in a way that mirrors human perception.

How OmniEval Changes the Way We Assess Machines

At the heart of OmniEval is the idea of full-modal collaboration. The benchmark is designed so that some questions can only be answered if the model truly combines the dynamic visuals, the soundtrack, and the accompanying text. It isn’t enough to recognize objects in a video or transcribe dialogue; the tests reward models that understand how sounds correlate with movements, how speech aligns with on-screen action, and how subtitles and spoken language jointly shape meaning. This is a step beyond previous tests that mostly looked at one or two modalities in isolation or that treated the streams as loosely linked data.

Then there’s the scale and the diversity. OmniEval assembles 810 audio-visual synchronized videos spanning Chinese and English content (285 Chinese and 525 English videos). Across these clips, there are 2,617 question–answer pairs: 1,412 open-ended prompts and 1,205 multiple-choice items. The questions aren’t just generic comprehension checks; they’re crafted to probe three broad kinds of tasks and 12 sub-types, ranging from counting objects and tracking actions to more nuanced grounding of events in time. And for the most exacting form of understanding, OmniEval introduces a granular grounding task: answering a question might require locating the precise moment an event unfolds or identifying the exact interval during which something happens. It’s a way to demand temporal precision, not just a correct conclusion.

Two features stand out. First, the benchmark deliberately emphasizes coherence across modalities rather than sequencing unimodal successes. Second, it embraces bilingual evaluation, enabling direct study of models that can cross language boundaries in real time. This is not a toy dataset; it’s a platform for diagnosing where current systems stumble when the world is not neatly partitioned into separate senses or languages.

How the Benchmark Was Built

Pulling off something this ambitious requires a careful dance between automation and human judgment. The OmniEval team started by gathering videos from a mix of established video-language benchmarks and public platforms such as YouTube, Youku, and bilibili to ensure a broad spectrum of topics and styles. They used a dedicated ASR system to generate accurate transcripts in both Chinese and English, while captions from existing sources were also incorporated when available. A key quality gate was a filtering step that weeded out clips with scant spoken content; the idea was to keep the audio track substantive enough to meaningfully interact with the visual content.
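
To make that filtering step concrete, here is a minimal sketch in Python. The word-count cutoff, the transcript file layout, and the Chinese-character handling are assumptions for illustration; the paper’s actual ASR system and threshold are not reproduced here.

```python
import json

MIN_UNITS = 30  # hypothetical cutoff for "substantive" spoken content

def transcript_length(text: str) -> int:
    """Count spoken units: whitespace-split words for English,
    individual characters for Chinese (which has no word spaces)."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return sum(1 for ch in text if not ch.isspace())
    return len(text.split())

def has_substantive_speech(transcript_path: str) -> bool:
    """Keep a clip only if its ASR transcript is long enough to
    meaningfully interact with the visual content."""
    with open(transcript_path, encoding="utf-8") as f:
        text = json.load(f)["text"]  # assumed schema: {"video_id": ..., "text": ...}
    return transcript_length(text) >= MIN_UNITS

# Hypothetical transcript files produced by the ASR stage
candidates = ["clip_0001.json", "clip_0002.json"]
kept = [p for p in candidates if has_substantive_speech(p)]
print(f"Retained {len(kept)} of {len(candidates)} clips")
```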

From there, the real work began: generating questions. The team used large language models to draft open-ended questions anchored by the video’s captions and transcripts. These open-ended prompts were then transformed into multiple-choice variants, complete with plausible distractors. Importantly, the pipeline included a rigorous curation stage with human annotators who refined wording, checked grounding, and ensured that answers truly depended on the video content, not biases in the model’s training data. This human-in-the-loop step is essential; it prevents the benchmark from drifting into traps that good models could beat simply by exploiting patterns in language alone.
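
The open-ended to multiple-choice conversion can be pictured with a short sketch. The data structure and the distractor prompt below are illustrative assumptions, and `generate` stands in for whatever LLM call the team actually used; nothing here reproduces their exact prompts or pipeline code.

```python
from dataclasses import dataclass, field

@dataclass
class QAPair:
    video_id: str
    question: str
    answer: str
    options: list[str] = field(default_factory=list)  # empty for open-ended items

# Illustrative prompt template; the paper's real prompts are not reproduced here.
DISTRACTOR_PROMPT = (
    "Given a question about a video and its correct answer, write three "
    "plausible but incorrect options that can only be ruled out by watching "
    "and listening to the video.\nQuestion: {q}\nCorrect answer: {a}"
)

def to_multiple_choice(oe: QAPair, generate) -> QAPair:
    """Turn an open-ended QA pair into a multiple-choice item.
    `generate` is a hypothetical LLM helper that returns a list of three
    distractor strings; human annotators still review every item afterwards."""
    distractors = generate(DISTRACTOR_PROMPT.format(q=oe.question, a=oe.answer))
    options = [oe.answer] + list(distractors)  # shuffled before annotation
    return QAPair(oe.video_id, oe.question, oe.answer, options)
```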

Another deliberate design choice was to categorize each QA pair into 12 cognitive skill types, such as Grounding, Object Counting, Action Perception, Causal Reasoning, and Emotion Recognition. The Grounding category, in particular, is designed to test whether a model can tie a given answer to a precise temporal moment or span in the video. And to keep the evaluation honest across languages, the dataset explicitly supports both English and Chinese, enabling researchers to study bilingual multi-modal understanding in a way that many benchmarks have not historically allowed.

Finally, OmniEval doesn’t rely on a single “gold standard” approach. The team quantifies performance with separate metrics for open-ended and multiple-choice questions, plus specialized evaluation rules for grounding tasks. For moment-level grounding, a frame-based threshold adapts to the video’s timing; for time-span grounding, IoU (intersection-over-union) thresholds judge how well a predicted interval matches the ground truth. These choices reflect an effort to measure the kind of precise, context-rich reasoning that real-world tasks demand.
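
To make those grounding rules concrete, here is a small sketch of both checks. The five-frame tolerance and the IoU@0.5 reading in the comments are hypothetical values chosen for illustration, not the paper’s exact thresholds.

```python
def moment_hit(pred_time: float, gt_time: float, fps: float, frame_tol: int = 5) -> bool:
    """Moment grounding: the predicted timestamp must fall within a few frames
    of the ground-truth moment (the tolerance scales with the video's frame rate)."""
    return abs(pred_time - gt_time) <= frame_tol / fps

def span_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Time-span grounding: temporal intersection-over-union of two intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted span of [12.0, 18.5] s against ground truth [11.0, 17.0] s
print(span_iou((12.0, 18.5), (11.0, 17.0)))  # ~0.67, which would pass an IoU@0.5 check
print(moment_hit(42.1, 42.0, fps=25.0))      # True within a 5-frame tolerance
```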

What the Results Hint About Real-World Progress

When the OmniEval team tested several prominent omni-modal systems on their new benchmark, a clear pattern emerged: the best performers are not the same across every kind of task. Gemini 2.5 Pro (in a preview version) topped the overall scores and delivered strong performance on bilingual tasks. It wasn’t just about raw numbers: Gemini demonstrated robust capabilities in both perception (how well the model senses the world) and reasoning (how well it connects those senses to questions and context). Other models—like Qwen 2.5 variants and Baichuan-Omni—showed solid bilingual performance but tended to pull ahead in some areas while lagging in others, underscoring that no single system dominates every dimension of omni-modal understanding yet.

One of the most striking findings is a kind of fragility in “multimodal” behavior when raw video is added alongside captions and transcripts. In several cases, simply feeding a model the raw video frames did not improve, and in some circumstances even reduced, performance. By contrast, captions—textual descriptions of what’s happening—consistently boosted accuracy across models. This doesn’t mean video is unimportant; it means that current systems often rely heavily on language channels to anchor their reasoning, and that raw visual data remains challenging to fuse in a fully coherent, cross-modal way. It’s a reminder that progress is not just about adding more inputs but about building better internal representations and alignment across modalities.

The results also highlight a language gap. Gemini’s bilingual success suggests that multilingual training helps a model build more flexible representations that transfer across languages. But even the strongest systems still struggle with precise temporal localization and nuanced understanding when the scene’s dynamics are complex or the cues are subtle. In other words, the benchmark isn’t just testing “smarter” machines; it’s exposing where perception, memory, and cross-modal reasoning still diverge from human-like understanding—and where those gaps matter most for real-life tasks.

From a broader perspective, the OmniEval findings offer a diagnostic rather than a verdict. They suggest where to push next: invest in better temporal grounding, improve multi-sensory alignment, and broaden multilingual coverage so systems become truly global in their understanding. They also reveal a pragmatic truth: simply throwing more data at a model isn’t the same as teaching it to reason across senses in a way that’s useful for people in the wild. The ability to ground a claim to a specific moment in a video, for instance, is not just academically satisfying—it’s crucial for education, journalism, and accessibility tools that need precise, verifiable references.

Why This Matters for the Real World

OmniEval is more than a clever academic exercise. It represents a compass for a future where machines need to operate in real time, under noisy conditions, and across languages. Imagine a multilingual robotic assistant that can watch a cooking video, hear the chef’s narration, and answer questions about both the steps and the timing of each action, all while matching the narrative to what’s happening onscreen. Or consider accessibility tools that describe a live performance in a language familiar to the user, synchronizing spectacle, sound, and transcripts so the experience feels seamless rather than fragmented. The benchmark’s emphasis on full-modal collaboration and temporal grounding brings those futures a few paces closer by pinning down what current systems can and cannot do in concrete, testable terms.

From an industry standpoint, OmniEval could accelerate how products are built and evaluated. Instead of chasing separate metrics for vision, audio, and language in isolation, developers can aim for integrated performance that better mirrors human perception. And because the dataset spans Chinese and English, researchers can probe whether systems truly understand content across linguistic boundaries—the kind of capability that will be essential for global apps and services in the coming years.

There are also important cautions. A benchmark is not a passport to perfect systems; it’s a snapshot of current capabilities under a particular design. Real-world deployments will have to grapple with issues the test bed can’t fully capture yet—privacy, bias, and the messy variability of real audio-visual data. OmniEval’s designers acknowledge these limits and propose a roadmap that invites the broader community to contribute more diverse content, languages, and task types. The aim is a continual cycle of measurement, improvement, and real-world validation that pushes omni-modal systems toward robust, humane understanding.

Where Do We Go From Here

The path forward for omni-modal systems will hinge on better alignment across senses and better generalization across languages and domains. OmniEval already sets a high bar by demanding strong audio–video coupling, precise temporal localization, and bilingual evaluation. If researchers rise to that challenge, we could see systems that not only describe what’s happening on a screen but also explain why it matters, in multiple tongues, with a level of situational awareness that feels closer to human perception.

In practical terms, the benchmark invites a few concrete lines of progress. First, improving how models interpret and fuse audio signals—so raw sound can contribute meaningfully beyond transcripts and captions. Second, strengthening temporal grounding so claims can be anchored to the exact moment or interval of interest, a capability that matters for everything from sports analytics to education. Third, expanding multilingual coverage so non-English content isn’t an afterthought but a first-class citizen. And finally, broadening the range of real-world tasks represented in the benchmark so progress translates into tools that help people, not just more impressive numbers on a leaderboard.

As OmniEval opens its doors to researchers worldwide, it offers a shared arena where teams from different backgrounds can compare notes, reproduce results, and push each other toward better, more human-like multi-sensory understanding. It’s not a final verdict on whether machines can truly “understand” the world, but it is a clear signal about what the next generation of omni-modal systems must prove they can do—and how we’ll know when they do it well.

In the end, OmniEval is a collaborative invitation to a more intelligent, multilingual, and multisensory future. It reminds us that the drama of artificial perception isn’t just in what machines can see or hear in isolation, but in how they weave those threads into knowledge that feels coherent, trustworthy, and useful in the messy, wonderful world we actually live in. The authors behind OmniEval—under the banner of Huawei Noah’s Ark Lab, with partners at the University of Science and Technology of China—hope this benchmark becomes a shared platform for progress. If the early results are any guide, we’re edging toward a richer, more capable era of omni-modal understanding, one grounded moment at a time.

OmniEval is more than a test. It is a lens on what it takes to build systems that can think with their eyes, ears, and words together—and a map for where we need to go next to get there.