When AI must keep track of its own tools across multi-turn conversations.

AI assistants have become the Swiss Army knives of the digital age, fetching weather from the web, booking flights, summarizing emails. But real life doesn’t happen in a single, tidy sentence. In a lively group chat, ideas and cues fracture across voices, and a tool-using AI must stitch those pieces into a single action. That challenge is at the heart of DICE-BENCH, a new benchmark for evaluating how well large language models can perform function calling in multi-round, multi-party dialogues.

Developed by researchers at Seoul National University’s IPAI and Department of Intelligence and Information, with collaborators from Korea University, AIGEN Sciences, and Cornell University, the project is led by Bongwon Suh with Kyochul Jang as the lead author. The team built DICE-BENCH to simulate realistic group interactions and to measure something scientists hadn’t formally quantified before: how dispersed tool-related details are across a conversation and how that dispersion affects an AI’s ability to perform the right function call.

Understanding DICE-BENCH and DICE-SCORE

Function calling, in this context, means an AI’s ability to translate natural-language intent into calls to external tools or APIs—think: fetch weather, reserve a restaurant, or schedule a flight—without requiring a single, perfectly formed instruction. Early benchmarks often treated this like a one-shot puzzle: one user utterance, all parameters present, one tool to call. Real life, of course, looks nothing like that. In a group chat, different people drop pieces of the required information over many turns, sometimes across days, and the same tool might need several parameters that arrive piecemeal.
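
To make the mechanics concrete, here is a minimal sketch of single-turn function calling, the setting those early benchmarks assumed. The tool schema, the JSON output format, and the dispatch helper are illustrative assumptions, not the interface used in the paper.

```python
import json

# Illustrative tool schema in the style many function-calling setups use.
# The tool name and parameters are hypothetical examples.
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {"city": str, "date": str},
}

def dispatch(model_output: str) -> dict:
    """Parse a model's JSON function call and check it against the schema."""
    call = json.loads(model_output)
    if call["name"] != WEATHER_TOOL["name"]:
        raise ValueError("unknown tool")
    for param, expected_type in WEATHER_TOOL["parameters"].items():
        if not isinstance(call["arguments"].get(param), expected_type):
            raise ValueError(f"missing or malformed argument: {param}")
    return call  # a real system would invoke the actual API here

# In a single-turn benchmark, the user states everything at once:
single_turn = '{"name": "get_weather", "arguments": {"city": "Seoul", "date": "2025-03-01"}}'
print(dispatch(single_turn))
```

DICE-BENCH asks what happens when the city, the date, and even the choice of tool surface in different messages from different people.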

To capture this messiness, the researchers built a framework called DICE-BENCH. A core idea is a Tool Graph, a directed network where each node represents a tool function and each edge encodes a dependency: one tool’s output or parameters feed into another. The graph here isn’t large or dense by design—124 tools and 270 edges—yet it is complex enough to model how a multi-round, multi-party dialogue might chain actions (for example, checking the weather before booking a hotel that depends on forecasted conditions).
The Tool Graph becomes the backbone of simulated dialogues, ensuring that tool dependencies persist across rounds rather than collapsing into a single, isolated decision point.
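
As a mental model, the Tool Graph can be pictured as a plain adjacency structure in which an edge means one tool’s output feeds another. The three tools below are hypothetical stand-ins, not entries from the actual 124-tool graph.

```python
# A toy directed tool graph: each edge means the target tool depends on
# output or parameters from the source tool. Tool names are hypothetical.
TOOL_GRAPH = {
    "get_weather":    ["book_hotel"],      # forecast feeds the hotel choice
    "book_hotel":     ["reserve_dinner"],  # hotel location feeds the restaurant
    "reserve_dinner": [],
}

def tool_chain(start: str, graph: dict) -> list:
    """Walk dependencies depth-first to recover one valid calling order."""
    order, stack = [], [start]
    while stack:
        tool = stack.pop()
        if tool not in order:
            order.append(tool)
            stack.extend(graph[tool])
    return order

print(tool_chain("get_weather", TOOL_GRAPH))
# ['get_weather', 'book_hotel', 'reserve_dinner']
```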

In practice, the data pipeline proceeds in stages: constructing the graph, configuring scenarios that sample tool chains and assign personas, and generating dialogues across several rounds. A group of agents, each with a distinct persona, participates in a controlled, multi-turn conversation, while an orchestrator (a separate role in the system) determines who speaks when, mirroring the way a real group chat ebbs and flows. The goal is to produce conversations in which the exact function name and the precise parameter values must be inferred not from a single line, but from a tapestry of turns and voices.
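
A rough sketch of that generation loop might look like the following; the personas, the turn policy, and the speak() placeholder are invented for illustration rather than taken from the authors’ pipeline.

```python
import random

# Hypothetical personas for the simulated group chat.
PERSONAS = ["budget-conscious planner", "spontaneous foodie", "detail-oriented organizer"]

def speak(persona: str, history: list) -> str:
    """Placeholder for an LLM call that produces one utterance for a persona."""
    return f"[{persona}] adds a detail after {len(history)} prior turns."

def simulate_round(history: list, turns_per_round: int = 3) -> list:
    """A toy orchestrator: it decides who speaks next, here uniformly at random."""
    for _ in range(turns_per_round):
        speaker = random.choice(PERSONAS)
        history.append((speaker, speak(speaker, history)))
    return history

dialogue = []
for round_idx in range(2):          # multi-round: e.g., two separate sessions of chat
    dialogue = simulate_round(dialogue)
for speaker, utterance in dialogue:
    print(speaker, "->", utterance)
```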

The DICE-SCORE: measuring how hard it is to pull the right strings

Central to the study is a novel metric called DICE-SCORE, short for Dialogue Information Coverage Evaluation Score. It answers a deceptively simple question: how scattered are the tool-related details across the dialogue? A high DICE-SCORE means the critical information—tool names, arguments, and the sequence in which tools must be used—is spread out across multiple turns and speakers. A low score means the necessary details live in a tight bundle, easy to grab in one go.

Technically, DICE-SCORE weighs two things at once: how many distinct tool items must be identified (T) and how many dialogue turns actually mention any tool-related item (the S sequence). The score rewards dispersion but also dampens redundancy; mentioning the same item repeatedly across turns won’t linearly inflate the score thanks to a logarithmic penalty. Normalization factors adjust for dialogue length and the total number of required items, so the score reflects difficulty independent of fluff or mere length. In short, if a model has to piece together a complicated, distributed cue set to complete a task, DICE-SCORE climbs.
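
The exact definition lives in the paper; as a toy approximation of the intuition described above, one can compute a dispersion score that counts informative turns, damps redundancy logarithmically, and normalizes by the number of required items. The sketch below illustrates the behavior, not the published formula.

```python
import math

def dispersion_score(turn_mentions: list[set], required_items: set) -> float:
    """Toy illustration of the DICE-SCORE intuition (not the paper's formula).

    turn_mentions:   for each dialogue turn, the required tool items
                     (tool names / argument values) that the turn mentions.
    required_items:  everything the final function call needs (T items).
    """
    T = len(required_items)
    if T == 0:
        return 0.0
    # Turns that carry at least one required item.
    informative_turns = sum(1 for turn in turn_mentions if turn & required_items)
    # The log damps redundancy: extra turns that merely repeat items add sub-linearly.
    # Dividing by log1p(T) normalizes for how many items are required; the published
    # metric also adjusts for overall dialogue length, omitted here for brevity.
    return math.log1p(informative_turns) / math.log1p(T)

items     = {"book_hotel", "Seoul", "2025-03-01"}
tight     = [{"book_hotel", "Seoul", "2025-03-01"}, set(), set()]   # one dense turn
scattered = [{"book_hotel"}, {"Seoul"}, {"2025-03-01"}]             # spread across turns
print(round(dispersion_score(tight, items), 2))      # 0.5 -> easier
print(round(dispersion_score(scattered, items), 2))  # 1.0 -> harder
```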

The authors also ground DICE-SCORE in human judgment. They compared model performance against human accuracy on a carefully sampled subset and found a striking correlation: higher DICE-SCORE generally aligned with lower human accuracy, reinforcing that the score tracks genuine task difficulty. One published figure shows a near-perfect negative correlation in their validation, underscoring that the dispersion of information is not a cosmetic detail but a core driver of success or failure in tool use during dialogue.

Why this matters for real-world AI assistants

The practical upshot is both sobering and exciting. The study reveals a gap between how current benchmarks test AI tool use and how people actually solve problems in the wild. In a group chat, you don’t hand a machine a neat instruction like “call this hotel with this date and price.” You discuss constraints, thread together partial updates, and rely on a shared, evolving context. DICE-BENCH shows that when tool-relevant information is dispersed across turns and voices, many state-of-the-art models stumble.

Across 19 different large language models with at least an 8,000-token context window, the researchers observed a consistent pattern: performance dipped as DICE-SCORE rose. It wasn’t simply a matter of longer inputs wearing out the model’s memory; it was the fundamental difficulty of stitching together scattered cues from multiple speakers. That distinction matters because it points to where improvement should happen. It’s not enough to enlarge the model’s memory; you also need robust dialogue-state tracking, better cross-turn reasoning, and a more sophisticated sense of who said what and when it matters for the next tool call.

Another striking result concerns how we think about “tool-specialized” models. Some tool-focused variants, fine-tuned to handle a single instruction at a time, performed worse on DICE-BENCH’s multi-round, multi-party tasks than broader, conversation-tuned models. In their experiments, general-purpose, multi-turn-oriented models often outperformed those tailored for single-shot tool usage. The lesson: real-world utility may hinge on training regimes that expose models to messy, multi-party discourse, not just the mechanics of calling an API in isolation.

What this means for the future of AI tools and dialogue

If DICE-BENCH is a wake-up call, its implications are practical as well as philosophical. For developers building AI assistants intended to operate in real-world environments—whether in consumer apps, enterprise productivity suites, or proactive personal assistants—the study suggests a few concrete directions. First, better long-context dialogue management is essential. The AI must not only remember what each participant said but also how those statements map to a sequence of tool calls across rounds. Second, models need to infer intent and plan steps in a way that respects dependencies encoded in tool graphs. Third, datasets used to train and evaluate these models should mirror the complexity of real conversations: multi-party, multi-round, with diverse dialogue styles and personas.
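
One concrete ingredient of that first direction is cross-turn state tracking: the assistant accumulates tool parameters as different speakers supply them and only fires the call once every required slot is filled. The sketch below is a generic illustration of that pattern, with hypothetical slot names and a naive extraction rule standing in for a real model; it is not a component of DICE-BENCH.

```python
import re

REQUIRED_SLOTS = {"city", "checkin", "budget"}   # hypothetical hotel-booking slots

def extract_slots(utterance: str) -> dict:
    """Stand-in for an LLM or NLU step; here, naive pattern matching."""
    found = {}
    if m := re.search(r"in (\w+)", utterance):
        found["city"] = m.group(1)
    if m := re.search(r"(\d{4}-\d{2}-\d{2})", utterance):
        found["checkin"] = m.group(1)
    if m := re.search(r"under \$?(\d+)", utterance):
        found["budget"] = int(m.group(1))
    return found

# Parameters arrive piecemeal from different speakers across the chat.
state: dict = {}
chat = [
    ("Mina", "Let's meet in Busan next month."),
    ("Leo",  "I can only do 2025-04-12 onwards."),
    ("Ari",  "Fine, but keep the hotel under $120 a night."),
]
for speaker, utterance in chat:
    state.update(extract_slots(utterance))
    if REQUIRED_SLOTS <= state.keys():
        print("ready to call book_hotel with:", state)
        break
    print(f"after {speaker}: still missing {REQUIRED_SLOTS - state.keys()}")
```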

One of the paper’s intriguing findings is that the difficulty of the task scales with dispersion, but can be mitigated by training regimes that emphasize multi-turn reasoning and real-world dialogue patterns. This hints at a broader takeaway: if we want AI that can coordinate tools in a real social setting—meeting planning among friends, triaging customer support, or orchestrating a team’s calendar during a sprint—we may need to embrace and simulate messy, distributed communication as a fundamental part of learning, not an afterthought or an add-on.

In terms of benchmarks, DICE-BENCH fills a crucial niche. It explicitly tests multi-round and multi-party dynamics, something many existing datasets only hint at. Its metric, DICE-SCORE, provides a quantitative lens on how hard a given dialogue is for an AI to navigate, independent of sheer token counts. Taken together, the dataset and the score create a feedback loop: as models improve, researchers can design more challenging, more realistic dialogues that push the state of the art toward tools that collaborate with humans in authentic, context-rich ways.

What’s surprising and what’s next

Several surprises surface in the study. For one, the distribution of information across voices and turns is a potent predictor of success, sometimes more telling than total input length. For another, even models with enormous context windows can falter when the essential cues are not co-located in a single turn. That reframes what we should value in model design: not just longer memories, but smarter memory architectures that track dependencies across participants and rounds.

Beyond model architecture, the paper nudges us to rethink how we construct tool ecosystems. If a tool graph underpins multi-round logic, then future AI systems could benefit from explicit, human-readable representations of tool dependencies. This could enable better debugging, explainability, and even human oversight when the AI sequences complex steps across several tools. The authors’ approach—synthetic dialogues guided by a tool graph, enriched by personas, and validated by humans—also suggests a practical blueprint: test AI in scenarios that resemble the messy human world, not sanitized lab tasks.

There are limits, as the authors acknowledge. Long conversations still strain the available token budgets, and some tool-specific models struggle with JSON-format outputs or dynamic turn-taking when the orchestrator’s behavior isn’t perfectly predictable. Still, they’ve opened a path forward: broaden domain coverage, refine evaluation strategies beyond strict formatting, and experiment with orchestrator protocols that better mimic human conversation. The work is already public, with code and data available for researchers who want to push this frontier even further.

In a sense, DICE-BENCH frames a new challenge for AI designers: build systems that aren’t just clever at one-shot prompts or isolated API calls, but that can weave together ideas, voices, and dependencies into coherent, reliable actions after many turns. The researchers at Seoul National University, along with their collaborators, have given the field a map for that journey and a compass to gauge progress as we push into largely uncharted conversational territory.

The big takeaway is that realism in AI tool use isn’t a one-turn problem; it’s a choreography problem. If your assistant can learn to listen across a group, track what each speaker implies, and plan a sequence of interdependent tool calls, it will feel less like a calculator and more like a capable teammate. DICE-BENCH shows us where the choreography fails today and, more importantly, where it can begin to succeed tomorrow.