The moment you ask a smart assistant to draft something, you’re stepping into a tiny social experiment. The model replies, you point out what’s off, and suddenly the exchange isn’t a single push toward an answer but a back-and-forth dance toward a better one. Real conversations with AI rarely land in a neat one-shot victory; they unfold in loops, with feedback shaping what comes next. A new study from Meituan’s AI research team tries to capture that lived experience, not just the isolated moment of compliance. It’s a shift in how we test machines, a move from one-shot accuracy to a process that looks and learns like human collaboration.
The work, conducted within Meituan’s AI research ecosystem, centers on a framework called Meeseeks. At its heart is a simple truth: if you want an AI to behave as a dependable agent in the messy real world, you need to test how well it handles iterative feedback. This isn’t just about making a model memorize a rule; it’s about watching it adapt when a requirement is missed, when a constraint changes, or when a word count slips out of bounds. The authors, led by Jiaming Wang and Yunke Zhao, argue that the true test of instruction-following isn’t a single perfect answer but a sequence of corrections that reveals the model’s underlying capability to align with human intent across turns.
What makes Meeseeks provocative is its patient attention to the mechanics of following instructions. Instead of evaluating a model with a single prompt and a single rubric, the researchers simulate a realistic user–AI loop: a prompt, a response, pointed feedback, and a revised answer. The goal isn't to punish models for mistakes in a single breath but to understand how they learn to correct themselves when the ground shifts under their feet. That framing, the authors argue, more accurately reflects how people actually use AI in professional settings, from customer support to content creation to code-assisted tasks, and it has big implications for how we design, deploy, and evaluate these systems.
The Meeseeks project is anchored in a real research context. It is built by a team at Meituan, a major technology company known for online delivery and a growing portfolio of AI tools. The study’s authors include Jiaming Wang, Yunke Zhao, Peng Ding, Jun Kuang, Zongyu Wang, Xuezhi Cao, and Xunliang Cai, with Wang and Zhao listed as lead authors. Framing the work around a three-turn default interaction, the team lays out a broader vision: to map the full arc of an LLM’s instruction-following ability as it unfolds across dialogue, multiple feedback cycles, and self-correction. That is a design philosophy, not just a test protocol, and it nudges us to rethink how we measure competence in AI agents that operate in the real world.
A benchmark that mirrors real conversations
Meeseeks isn’t a single-number quiz. It’s a dynamic, iterative benchmark designed to mirror how people actually interact with AI assistants. The process begins with a user prompt and an initial model answer, then moves into a feedback phase where specific requirements are flagged as unmet. After that, the model is asked to try again, incorporating the feedback, and the cycle can repeat. This multi-turn loop is purpose-built to stress the model’s ability to self-correct and to reveal where misalignment tends to creep in as turns accumulate.
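For readers who think in code, a rough sketch of that loop helps make it concrete. The Python below is not the paper's implementation; the helper names (call_model, check_requirements, build_feedback) are placeholders standing in for the model API and the benchmark's own checkers, and the three-turn default simply mirrors the setup described above.

```python
# Minimal sketch of a Meeseeks-style multi-turn evaluation loop.
# call_model, check_requirements, and build_feedback are hypothetical
# placeholders, not the benchmark's actual code.

def evaluate_multi_turn(prompt, requirements, call_model, check_requirements,
                        build_feedback, max_turns=3):
    """Run up to max_turns rounds of answer -> check -> feedback -> retry."""
    messages = [{"role": "user", "content": prompt}]
    history = []

    for turn in range(1, max_turns + 1):
        answer = call_model(messages)                      # the model's attempt this turn
        failed = check_requirements(answer, requirements)  # list of unmet requirements
        history.append({"turn": turn, "answer": answer, "failed": failed})

        if not failed:  # every requirement satisfied, stop early
            break

        # Feed the specific unmet requirements back as the next user turn.
        messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user", "content": build_feedback(failed)})

    return history
```

The real harness is more elaborate, but the shape of the loop, answer, check, targeted feedback, retry, is the part that matters for everything that follows.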
To organize the evaluation, the Meeseeks framework relies on 38 capability tags distributed across three dimensions: Intent Recognition, Granular Content Validation, and Output Structure Validation. It’s a cognitive-inspired taxonomy rather than a clutter of isolated checks. Intent Recognition asks whether the model truly understands what the user wants, even when the prompt is noisy or layered. Granular Content Validation checks that each piece of the instruction—such as exact word counts, inclusion of specific terms, or required languages—meets its mark. Output Structure Validation ensures the response is not only correct in content but also well organized, with the right headings, formatting, and overall flow.
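One way to picture that taxonomy is as a simple mapping from the three dimensions to individual capability tags. The sketch below is illustrative only: the tag names are invented stand-ins drawn from the kinds of constraints described in this article, not the benchmark's actual list of 38.

```python
# Illustrative slice of the three-dimension taxonomy. The real benchmark
# defines 38 capability tags; these names are invented examples.
CAPABILITY_TAXONOMY = {
    "Intent Recognition": [
        "multi_task_decomposition",
        "noisy_prompt_understanding",
    ],
    "Granular Content Validation": [
        "exact_word_count",
        "required_keywords",
        "language_requirements",
    ],
    "Output Structure Validation": [
        "required_headings",
        "section_ordering",
        "output_format",
    ],
}
```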
Beyond the taxonomy, Meeseeks introduces a practical engineering layer: data parameterization. Researchers generate large swaths of synthetic prompts whose backgrounds, lengths, and constraints can be tweaked at will. This isn't random fluff; it's a deliberate effort to stress-test models under a range of plausible real-world constraints. Nor is the engineering confined to the generation side: the evaluation itself combines rule-based checks with LLM-driven extraction. In short, the framework tries to keep the test faithful to human use while staying scalable enough to compare dozens of models.
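In spirit, data parameterization looks something like the following: a prompt template whose topic, length band, keywords, and language are filled in programmatically, with the chosen parameters kept alongside the prompt so later checks know exactly what to verify. The template, topics, and parameter values here are invented for illustration, not drawn from the benchmark itself.

```python
import random

# Hypothetical parameterized prompt generator in the spirit of data
# parameterization; the template and all parameter values are invented.
TEMPLATE = (
    "You are writing a {doc_type} about {topic}. "
    "Keep it between {min_words} and {max_words} words, "
    "include the keywords {keywords}, and write it in {language}."
)

def synthesize_prompt(rng=random):
    min_words, max_words = rng.choice([(80, 150), (120, 250), (200, 400)])
    params = {
        "doc_type": rng.choice(["product blurb", "market summary", "FAQ entry"]),
        "topic": rng.choice(["food delivery", "ride hailing", "hotel booking"]),
        "min_words": min_words,
        "max_words": max_words,
        "keywords": rng.sample(["discount", "delivery time", "refund"], 2),
        "language": rng.choice(["Chinese", "English", "a Chinese-English mix"]),
    }
    prompt = TEMPLATE.format(**{**params, "keywords": ", ".join(params["keywords"])})
    # Keeping the parameters alongside the prompt lets rule-based checks
    # verify exactly the constraints this prompt was generated with.
    return prompt, params
```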
Two metrics anchor the Meeseeks evaluation: Utility Rate and Meeseeks Score. Utility Rate measures how often a model’s response satisfies all prompt requirements—essentially, how usable the output is in a professional setting. Meeseeks Score aggregates how well a model demonstrates proficiency across the top-level capability tags, giving researchers a sense of where a model’s instruction-following strengths and gaps lie. Put another way, Utility Rate is about practical usability on a task-by-task basis, while Meeseeks Score looks at broader capability coverage across turns. The pairing is designed to disentangle exact correctness from the steadiness and versatility of instruction-following behavior across a dialogue.
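Read loosely, the two metrics might be computed along these lines. The paper's exact aggregation may differ, so treat this as an approximation of the ideas rather than the official formulas.

```python
# Simplified reading of the two metrics; the paper's exact aggregation may
# differ, so treat these as approximations of the ideas, not the spec.

def utility_rate(results):
    """Fraction of responses that satisfy *all* of their prompt's requirements."""
    usable = sum(1 for r in results if all(req["passed"] for req in r["requirements"]))
    return usable / len(results)

def meeseeks_score(results):
    """Pass rate per capability tag, averaged across tags (one possible aggregation)."""
    per_tag = {}
    for r in results:
        for req in r["requirements"]:
            per_tag.setdefault(req["tag"], []).append(req["passed"])
    return sum(sum(passes) / len(passes) for passes in per_tag.values()) / len(per_tag)
```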
The three cognitive dimensions that guide instruction-following
At the core of Meeseeks are three guiding dimensions that map closely to how people think through instructions. Intent Recognition is the initial filter: can the model parse the user’s goal amid distractions, multi-step requests, or rephrasings? It’s not just about understanding a single sentence; it’s about grasping the intent behind a composite prompt that might juggle multiple tasks at once.
Granular Content Validation is the workhorse of the test. It checks each individual constraint, from exact word counts to required keywords to the precise language mix, and even how specific pieces of information must be organized. This level of granularity matters because many real-world tasks hinge on precisely measured outputs: a market report with a word limit, a legal-style brief with mandated terms, or a marketing blurb with exact keyword usage. If a model can't hit those micro-constraints, even if the overall message is correct, it can still fail the task in a compliance-sensitive setting.
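Those micro-constraints are exactly the kind of thing that deterministic code can verify, which is part of why Meeseeks can fold rule-based checks into its evaluation. The checks below are illustrative sketches, not the benchmark's own checkers.

```python
# Illustrative micro-constraint checks in the spirit of Granular Content
# Validation; the real benchmark's checkers are broader and more precise.

def check_word_count(text, min_words, max_words):
    """Does the response land inside the required length band?"""
    return min_words <= len(text.split()) <= max_words

def missing_keywords(text, keywords):
    """Return any mandated terms that never appear in the response."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() not in lowered]

def check_constraints(text, params):
    """Collect human-readable failure messages for unmet requirements."""
    failures = []
    if not check_word_count(text, params["min_words"], params["max_words"]):
        failures.append("word count out of bounds")
    missing = missing_keywords(text, params["keywords"])
    if missing:
        failures.append("missing keywords: " + ", ".join(missing))
    return failures
```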
Output Structure Validation completes the triad by asking whether the model’s reply is assembled in the right form. The test considers whether the content respects format constraints, sequence, and the logical flow needed for downstream use. A perfectly factual paragraph that arrives in a tangled, unstructured block isn’t ready for handoff to a human reviewer or a downstream system that expects machine-readable outputs.
The results the team reports illuminate a striking pattern. Reasoning-enabled models, those trained or tuned to produce step-by-step explanations or to structure their reasoning before answering, start out with an edge on the first turn. They tend to catch more of the requirements in the initial pass, thanks in part to a habit of checking their responses against the instructions like a checklist. But as the turns accumulate, that edge tends to erode. In many cases, non-reasoning models close the gap and even surpass their more deliberate peers by the third turn. The implication is provocative: the capacity to reason out loud may not guarantee a sustained advantage when the goal is to deliver a perfectly aligned, multi-turn instruction-following output.
Another revealing thread concerns how models handle distractions and prompt structure. The study suggests that anti-distraction capabilities—staying on task in the face of complex prompts and potential derailments—are shaped more by training approaches than by any inherent cognitive property. In other words, it’s not simply that a bigger or smarter model can resist noise; it’s that the way the model is trained, tuned, and prompted influences how well it remains focused when the instruction evolves across turns.
Within the Meeseeks analysis, two hard problems persist across models: Language Requirements and Word Count Requirements. Language Requirements test the model’s ability to meet tricky linguistic constraints, such as preserving a balanced English-Chinese mix or following hybrid language rules. Word Count Requirements probe whether the model can hit precise word counts or operate within strict length bands. The study finds these are stubborn, recurrent bottlenecks even for state-of-the-art systems. It’s a reminder that even as we push models to be clever, we also ask them to be precise, and those two demands don’t always align naturally.
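A toy example shows why language-mix constraints are easy to state yet easy to miss: even a crude check of the Chinese-to-English character ratio draws a hard pass/fail boundary that a model must land inside. The 40-60% band below is invented for illustration; the benchmark's real language requirements are richer.

```python
# Toy check for a mixed English-Chinese requirement; the 40-60% band is
# invented, and the benchmark's real language constraints are richer.

def cjk_ratio(text):
    """Fraction of non-space characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)

def check_language_mix(text, lo=0.4, hi=0.6):
    """Pass only if Chinese characters make up roughly 40-60% of the text."""
    return lo <= cjk_ratio(text) <= hi
```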
In fact, the authors observe three broad behavioral trajectories for models as turns accumulate: divergence, convergence, and performance reversal. Divergence is when two models with similar first-turn performance drift apart over turns. Convergence is when initially different models end up performing similarly after iterative corrections. Performance reversal is the surprising case where the early leader loses its edge and becomes, later on, a laggard. These patterns challenge the idea that a single-turn snapshot tells the full story of an AI’s capabilities. They invite researchers to look at how a model’s competence evolves when it’s kept in dialogue with a user, a developer, or an evaluator that keeps pushing for refinement.
Why this rethinks how we build reliable AI agents
The Meeseeks framework is not just a clever benchmarking trick; it’s a design philosophy shift. In many everyday uses, AI agents work best when they can take feedback, adjust on the fly, and demonstrate steady improvement across a conversation. That’s exactly what Meeseeks is designed to expose: whether a model can buckle down, reset its prior assumptions, and shape a response that finally satisfies a human’s multi-turn instruction. And because the benchmark maps directly to multi-turn interactions, it helps researchers and product teams understand how these systems will actually behave in customer-service chats, content workflows, or coding assistants where a user repeatedly nudges the model toward a desired outcome.
Two practical engineering moves in the study deserve emphasis. First, code-guided, rule-augmented evaluation is a clever way to tame the cost of multi-turn evaluation. Instead of regenerating all the context to extract and check content, Meeseeks provides the evaluator with code-guided prompts that extract the relevant parts and then check them against the rules. In their tests, this approach pushed end-to-end accuracy up dramatically, from 78.7% to 98.4% on their quick-start dataset, and slashed the token burn during extraction. It's a reminder that clever tooling in evaluation can tilt the scales toward scalable, trustworthy testing as models grow bigger and more capable.
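A minimal sketch of that idea, assuming a hypothetical llm_extract call standing in for the judge model: the LLM is asked only to pull out the relevant fragment, and deterministic code performs the actual rule check.

```python
# Sketch of code-guided, rule-augmented evaluation: the LLM is used only for
# a narrow extraction step, and deterministic code performs the rule check.
# llm_extract is a hypothetical stand-in for the judge-model call.

EXTRACTION_PROMPT = (
    "From the response below, return only the section titled '{section}', "
    "with no commentary.\n\nResponse:\n{response}"
)

def evaluate_requirement(response, requirement, llm_extract):
    # 1) Cheap, targeted extraction instead of re-reading the full context.
    fragment = llm_extract(
        EXTRACTION_PROMPT.format(section=requirement["section"], response=response)
    )
    # 2) Deterministic rule check on the extracted fragment.
    word_count = len(fragment.split())
    return requirement["min_words"] <= word_count <= requirement["max_words"]
```

The savings come from step one: the judge model sees a short, focused extraction prompt rather than the whole accumulated dialogue, and the verdict itself never depends on the judge's arithmetic.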
Second, the introduction of a large, structured dataset with 38 capability tags—paired with a cognitive-inspired three-dimension evaluation framework—gives researchers a more granular map of where instruction-following succeeds and where it stumbles. This isn’t just about a single score; it’s about diagnosing specific weaknesses, from language juggling to exact word counts, and using that diagnosis to guide future training and prompting strategies. The Meeseeks dataset, with its 700-plus pre-synthesized entries, offers a broad and adaptable playground for probing multi-turn instruction-following in a controlled yet realistically varied setting. And while the current quick-start dataset is Chinese-first, the authors signal plans to expand into English, widening the benchmarking net for a global research and development ecosystem.
So what does this mean for developers who actually ship AI assistants? It means embracing multi-turn feedback loops as a core design element, not an afterthought. It means designing prompts and system messages that explicitly solicit and structure user feedback, so the model can iteratively converge toward a fully compliant answer. It means recognizing that a model’s ability to follow instructions isn’t a single test score but a trajectory—where early advantages may fade or even flip as the dialogue deepens. And it means investing in evaluation frameworks that reflect lived usage, with measurable goals for self-correction, not just polished first impressions.
The Meeseeks study doesn't pretend to solve every challenge in instruction-following or to fix every corner case in LLM behavior. But it does offer a concrete, scalable way to study how models behave when a user asks for what they want and then keeps nudging the model toward it, turn after turn. In that sense, Meeseeks is less a single product feature than a mindset shift: the best AI agents are not judged by one perfect answer, but by how gracefully they adapt, iterate, and ultimately align with human intent across a conversation.
The researchers are clear about limits. The current dataset is primarily Chinese, with English versions on the horizon, and the evaluation setup requires substantial computational heft to run the multi-turn cycles. Still, the core insight stands: multi-turn instruction-following is where the rubber meets the road for real-world AI agents, and a benchmark that mirrors that reality can illuminate both what models can do and where they still stumble. As AI systems become more embedded in workflows and daily life, the ability to learn from feedback—quickly, reliably, and transparently—may prove to be as crucial as raw cleverness or sheer scale.
In the end, Meeseeks is a thoughtful invitation to look beyond a single-turn miracle and toward the patient, iterative craft of reliable AI agents. It’s a reminder that follow-through matters as much as follow-up—that a model’s value in a professional setting lies as much in what it corrects as in what it first creates. And it’s a nudge to align our benchmarks with the rhythms of real human use, so the machines we build today grow into the partners we’ll rely on tomorrow.