Across the modern enterprise, the dream of an AI assistant that can answer a question by stitching together clues from Slack threads, meeting transcripts, PRs, documents, and even customer notes is no longer a sci‑fi fantasy. It’s a living, breathing ambition that tech teams chase as eagerly as product managers chase a roadmap. But the waters are murky. Information isn’t neatly filed in one place; it’s scattered, messy, and unmistakably human. The reality is not a clean library of linked articles but a sprawling ecosystem where signals wander from chat to code to calendar invite, sometimes circling back to a prior version or a stray URL. The question is whether our AI can swim through this coral reef of data and surface trustworthy answers, not just plausible ones.
That challenge sits at the heart of HERB, a new benchmark designed to test what researchers call Deep Search in enterprise contexts. Instead of evaluating how well a model recovers a single fact from a tidy corpus, HERB asks: can an AI navigate a web of heterogeneous sources—Slack messages, meetings, GitHub PRs, URLs, and customer profiles—and reason across multiple steps to ground its conclusions in evidence? The benchmark is not an abstract exercise. It’s engineered to mimic the actual work of software teams as they plan, build, and support products, with realistic noise, multi‑hop questions, and ground‑truth answers anchored in intricate workflows. And it’s not just a test of what a model can recall; it’s a test of what a model can search for, what it can integrate, and how it handles missing information or conflicting signals.
The team behind HERB comes from Salesforce AI Research, and the project is led by researchers including Prafulla Kumar Choubey, Xiangyu Peng, Shilpa Bhagavath, Kung-Hsiang Huang, Caiming Xiong, and Chien-Sheng Wu. They built HERB to push past the familiar pitfall of synthetic QA that feels safe but unreal, toward a benchmark that actually resembles how knowledge works inside a real company. The data pipeline they built simulates a three‑stage software lifecycle—planning, development, and deployment—producing thousands of artifacts that reflect how people really talk, decide, and document work. It’s a provocative reminder that the best AI isn’t just clever at answering questions; it’s disciplined about where those answers come from and how they’re supported by evidence.
What HERB really tests: deep search across a human‑shaped data world
HERB is built around a deceptively simple idea: you start with a complex question, then you trace the chain of evidence across many kinds of artifacts. The benchmark uses a synthetic enterprise environment that includes six organizational hubs, three product lines, and a workforce of 530 employees. There are Slack channels buzzing with planning chatter, meeting transcripts that capture real-time decision making, a library of internal documents, GitHub pull requests, shared URLs, and even customer profiles. Altogether, HERB yields 39,190 distinct artifacts and 815 concrete, answerable queries. It also deliberately includes 699 unanswerable questions to test whether a system can recognize when there’s no valid grounding—an essential habit in enterprise governance where a wrong answer can mislead decisions as effectively as no answer at all.
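To make the shape of the task concrete, here is a minimal sketch, in Python, of how a query and its evidence might be represented. The field names, artifact types, and ID formats are illustrative assumptions for this article, not HERB’s actual schema.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

# Hypothetical artifact types loosely mirroring the sources HERB draws on.
ArtifactType = Literal["slack_message", "meeting_transcript", "document",
                       "github_pr", "url", "customer_profile"]

@dataclass
class Artifact:
    artifact_id: str   # e.g. "slack-0042" (illustrative ID format)
    type: ArtifactType
    product: str       # which product line the artifact belongs to
    text: str          # free text, or serialized metadata for structured items

@dataclass
class Query:
    question: str
    answerable: bool                      # 815 answerable vs. 699 unanswerable
    gold_answer: Optional[str] = None     # e.g. an employee ID or a PR link
    product: Optional[str] = None         # set for product-specific questions
    gold_evidence: list[str] = field(default_factory=list)  # artifact IDs that ground the answer
```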
This is where the benchmark departs from many multi‑hop QA datasets. Prior setups tended to stitch together a few related documents via explicit links, creating an artificial ladder that well‑drilled models could climb with shallow reasoning. HERB, by contrast, requires models to decide not just what to search for, but where to search for it. It’s a test of search strategy as much as of reasoning: given a question about a product’s PRD or a bug in a customer’s account, the model has to figure out which data sources are likely to contain the answer, how to extract the relevant bits from Slack lines or meeting notes, and how to map names to IDs across a noisy landscape of organizational breadcrumbs. And because the data are generated with realistic noise (overlapping tasks, partial information, distractors), the task demands a more robust, end‑to‑end grounding than most benchmarks encourage.
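A toy example makes the “where to search” problem concrete. The routing heuristics, source names, and name‑to‑ID lookup below are assumptions for illustration; a production system would more likely use an LLM planner or a learned router, but the division of labor is the same: pick sources first, then extract and normalize.

```python
# Keyword hints for routing a question to likely sources (illustrative only).
ROUTING_HINTS = {
    "slack_message": ["thread", "channel", "discussed", "mentioned"],
    "github_pr": ["pull request", "merged", "commit", "diff"],
    "meeting_transcript": ["meeting", "standup", "sync", "decided"],
    "customer_profile": ["customer", "account", "ticket"],
    "document": ["prd", "spec", "roadmap", "design doc"],
}

def route_sources(question: str) -> list[str]:
    """Pick candidate sources to search; fall back to searching everything."""
    q = question.lower()
    hits = [src for src, cues in ROUTING_HINTS.items() if any(cue in q for cue in cues)]
    return hits or list(ROUTING_HINTS)

def resolve_employee_id(display_name: str, directory: dict[str, str]) -> str | None:
    """Map a casual name from chat (e.g. 'Maya P.') to a canonical employee ID
    via a lowercase lookup in a hypothetical directory."""
    return directory.get(display_name.strip().lower().rstrip("."))
```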
One of HERB’s most revealing design choices is the inclusion of both structured and free‑text sources. You might have a GitHub PR with metadata that’s machine‑readable, or a Slack message with a casual reference to a decision. A modern enterprise AI needs to bridge these formats and connect a feature discussion in a Slack thread to a PR in GitHub, to a customer ticket, and back to a road‑map document. HERB makes that bridging explicit. It also emphasizes a crucial practical constraint: in real life, you rarely have the full context in one place. The benchmark includes scenarios where the best evidence is spread across separate sources created at different times, requiring the model to reason with a sense of chronology and provenance.
To keep the challenge honest, HERB doesn’t hand you a single “answer key” that’s trivial to locate. Instead, it ties each answer to a ground‑truth set of evidence. The evaluation framework asks a model to extract precise pieces—employee IDs, company names, or exact PR links—so you can measure precision, recall, and resilience to distractors. In practice, that means a system must produce not just a final sentence but a chain of justification: which sources were consulted, what evidence was used, and how the pieces connect. The goal is a grounded, auditable answer, not a glib guess that could be backed up by plausible but false signals.
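In spirit, the scoring reduces to comparing the set of items a system extracts against the ground‑truth set. The function below is a minimal sketch of that idea; HERB’s official scorer may normalize strings or weight items differently, and the example IDs are made up.

```python
def evidence_precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Set-based precision/recall over extracted items such as employee IDs,
    company names, or PR links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Made-up example: one correct item and one spurious item against two gold items.
p, r = evidence_precision_recall({"emp-0042", "emp-0099"}, {"emp-0042", "PR-1287"})
assert (p, r) == (0.5, 0.5)
```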
HERB’s long‑context experiments push the envelope even further. In a “product‑specific” setting, an AI can be given a curated slice of the enterprise, only the data about a given product, and asked to reason through extended documents, code changes, and discussion threads. In the oracle setting, models are fed only the exact ground‑truth evidence attached to a question, stripping away retrieval challenges. Across both modes, the results reveal a stubborn truth: even the best large language models stumble when the reasoning path is long, the sources are diverse, and the evidence is distributed across human conversations and structured records. The takeaway is not that models are useless, but that retrieval and grounding remain the bottlenecks, the place where the rubber meets the road in enterprise AI.
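The difference between these settings comes down to what ends up in the model’s context window. Continuing the illustrative Query and Artifact sketch from earlier, the selection logic might look something like this; the mode names and the retriever interface are assumptions, not HERB’s actual harness.

```python
def build_context(query: Query, corpus: list[Artifact], mode: str,
                  retriever=None) -> list[Artifact]:
    """Assemble the evidence a model sees under each evaluation setting."""
    if mode == "oracle":
        # Only the ground-truth evidence attached to this question.
        return [a for a in corpus if a.artifact_id in query.gold_evidence]
    if mode == "product_long_context":
        # Everything about one product, packed into a single long prompt.
        return [a for a in corpus if a.product == query.product]
    if mode == "full_rag":
        # The realistic case: search the entire heterogeneous corpus.
        return retriever.search(query.question, top_k=20)
    raise ValueError(f"unknown mode: {mode}")
```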
Why this matters: a wake‑up call for enterprise AI design
The implications of HERB land squarely in the everyday concerns of teams building AI copilots for business. For years, researchers and product leaders have celebrated leaps in language models, but HERB asks a humbler, more urgent question: can those models actually work when the data aren’t neatly organized? The answer so far is sobering. In the full RAG (retrieval‑augmented generation) setting, even sophisticated agentic configurations that employ planning and tools reach only mid‑range performance. The best results in standard configurations hovered in the low 30s on average, while older or less capable models lagged far behind. This isn’t a minor gap; it’s a reminder that in the messy, mixed reality of enterprise data, retrieval quality and the ability to ground reasoning in real evidence are the real frontiers.
Another striking takeaway is the stubborn difference between “long‑context” reasoning and retrieval‑augmented reasoning. When researchers give a model access to product‑specific data in one long block, some language models perform surprisingly well. But when that same model must search across a sprawling dataset with the usual enterprise noise, performance drops sharply. In other words, scale and clever prompting aren’t enough; you need robust retrieval architectures that can navigate multi‑hop, cross‑format questions without losing track of provenance. HERB’s long‑context results also highlight a gap: even when you reduce input length, the model’s ability to chain evidence across many sources remains fragile. Grounded reasoning is harder than it looks on paper, and the enterprise environment makes it feel almost artisanal—every detour, every mislink, every distractor matters.
That insight matters beyond academic curiosity. It reframes how we should measure progress in RAG systems for workplaces. It isn’t enough to chase higher scores on a neat benchmark; developers must reckon with retrieval latency, source diversity, and the risk of ungrounded or unanswerable responses. HERB’s explicit inclusion of unanswerable queries is a welcome advance: it rewards models for recognizing when no robust answer exists and for declining to hallucinate grounding that isn’t there. In real life, such honesty can be priceless, preventing decisions based on faulty extrapolation or mistaken assumptions about context. These design choices—emphasizing retrieval reliability, multi‑format grounding, and the ability to say “I don’t know”—could become the default guardrails for enterprise AI systems rather than afterthought features.
Finally, HERB signals a new quality bar for the field: benchmarks should resemble how knowledge actually travels in a company, not how analysts wish it would travel. The synthetic data pipeline, crafted from workflow stories of planning, development, and support, offers a blueprint for building future benchmarks that matter to practitioners. If a benchmark can capture the friction, noise, and human context of real work, models trained against it will be better prepared to handle live environments. In that sense, HERB does more than benchmark a capability; it codifies a philosophy for evaluating AI as a collaborator in the workplace, someone who must locate, cite, and justify the evidence that underpins every recommendation.
What the results actually reveal about today’s AI capabilities
Across a suite of configurations, the study compares zero‑shot prompting, pure vector retrieval, hybrid retrieval, and several graph‑based methods, all paired with either standard prompting or the agentic ReAct framework. The numbers aren’t pretty, but they’re instructive. The zero‑shot baseline—asking a model to answer without any retrieval—performs near zero on most query types. Retrieval matters. The best plain RAG baselines improve on that baseline but still struggle with multi‑hop, cross‑source questions. Agentic RAG, which couples reasoning with tool use, shows the strongest gains, but even then the average remains in the low 30s when evaluated with GPT‑4o as the backbone. In plain terms: the problem isn’t solved by throwing more data at the model; it’s solved by smarter ways to search and reason across that data, and by acknowledging the limits of what a model can actually ground in the available evidence.
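For readers unfamiliar with the term, “agentic RAG” here means the model interleaves reasoning with tool calls rather than answering from a single retrieval pass. The loop below is a bare‑bones sketch of that ReAct pattern; the llm and search callables, the action syntax, and the step budget are placeholders, not the configuration used in the paper.

```python
def react_answer(question: str, llm, search, max_steps: int = 6) -> str:
    """Interleave model 'Thoughts' with Search tool calls until a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")     # model decides what to do next
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Search[" in step:                   # e.g. Search[PRD owner for product X]
            query = step.split("Search[", 1)[1].split("]", 1)[0]
            transcript += f"Observation: {search(query)}\n"  # tool call into the corpus
    return "unanswerable"                       # stop rather than guess once the budget runs out
```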
Another important thread is unanswerability. A large fraction of queries are designed to be unsolvable given the data, and models differ dramatically in how often they correctly mark a question as unanswerable. Some approaches do a reasonable job at risk management, but others drift into speculative territory, especially when the model is pushed to extract every last bit of information. The takeaway is not merely a safety concern; it’s a design signal. If your enterprise AI must live in a world where not every question has a solid grounding, you need mechanisms to detect when you’re venturing into uncertain territory and to gracefully refuse or request more context.
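One pragmatic way to build that habit is an explicit abstention check before generation: if the strongest retrieved evidence is weak, decline instead of answering. The snippet below sketches that policy; the threshold, the retriever’s score and text fields, and the prompt wording are illustrative assumptions, not part of HERB.

```python
def answer_or_abstain(question: str, retriever, llm, min_score: float = 0.35) -> str:
    """Refuse to answer when retrieval returns nothing sufficiently relevant."""
    hits = retriever.search(question, top_k=5)
    if not hits or max(h.score for h in hits) < min_score:
        return "I can't answer this from the available sources."
    context = "\n\n".join(h.text for h in hits)
    return llm(
        "Answer only from the context below; reply 'unanswerable' if the "
        f"context does not support an answer.\n\n{context}\n\nQ: {question}"
    )
```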
Looking at long‑context versus retrieval, HERB shows that even the most capable modern models can do impressive things when given a curated, product‑specific slice of data. But once you widen the lens to the full, heterogeneous pool of enterprise artifacts, retrieval becomes the rate‑limiting step. It’s a humbling reminder that memory isn’t just a matter of how many tokens a model can hold; it’s about how well a system can fetch, filter, and fuse the right bits from the right places at the right times. In practice, this argues for tighter integration between retrieval systems, data governance, and reasoning modules—an ecosystem where the model’s power is matched by the rigor of how it locates and constrains its sources.
From a human‑factors perspective, HERB’s error analyses are revealing. Even in oracle settings, where the evidence is precisely the ground truth, models stumble. Some of the failures come from incomplete use of context, others from flawed reasoning about which artifacts matter for a given question, and a few from overambitious extrapolations that ignore the careful mapping between user names and employee IDs. The researchers dissected ReAct trajectories to understand tool use, noting that many runs relied on surface signals rather than iterative, multi‑step searches. The message is that even with powerful tools, real‑world tasks require disciplined search strategies, not just clever prompts or brute force computing power.
What this means for the future of enterprise AI
HERB is more than a dataset; it’s a manifesto for how to build AI that actually helps people in large organizations. It nudges the field toward retrieval systems that can perform genuine deep searches across diverse formats, with a built‑in respect for provenance and an awareness of when evidence is missing. It also invites tool builders to design interfaces that let agents reason across both unstructured content and structured metadata—what you might think of as a hybrid mind that reads emails and PR metadata with equal facility. The practical upshot is clear: if you’re designing an enterprise AI assistant today, you should invest not only in bigger models but in smarter search architectures, better grounding strategies, and safeguards for unanswerable questions.
In the near term, HERB’s findings suggest several concrete priorities. First, retrieval quality must catch up with model sophistication. Models can’t reliably reason across 39,000 artifacts if they don’t retrieve the right handful of them at the right time. Second, tool use needs to become more nuanced. The ReAct framework is a strong start, but real effectiveness will require more disciplined planning, better disambiguation of names and IDs, and deeper, multi‑step interactions with structured data sources. Third, long context isn’t a panacea. Even with 131K tokens at hand, models struggle to reproduce the deductive chains a human would naturally follow. The lesson is not to abandon long context but to pair it with targeted, efficient retrieval that respects the workflow and the limits of verification.
There’s a broader, philosophical takeaway as well. In a world where AI surfaces answers by stitching together pieces of text, the value of evidence that is grounded, traceable, and contestable becomes paramount. HERB nudges us to design systems that don’t pretend to know everything, but instead anchor their answers in verified sources. That is not a concession to humility; it’s a commitment to reliability in the professional ecosystems where decisions matter. The authors’ emphasis on both evidence grounding and the reality of noisy, interwoven data makes HERB a useful compass for researchers and practitioners alike as they navigate the tricky, beautiful waters of enterprise knowledge.
Ultimately, HERB is a reminder that the world inside a company is not a tidy library but a living, evolving reef. And if AI is going to play a meaningful role there, it has to learn to swim through the currents of human communication, to map the species of data across Slack, meetings, and code, and to surface not just a quick answer but a credible trail of evidence. That is the frontier HERB highlights—and it’s a frontier worth crossing, brick by brick, byte by byte.
Note on the source: HERB was developed by Salesforce AI Research, with Prafulla Kumar Choubey as the lead author and a team including Xiangyu Peng, Shilpa Bhagavath, Kung-Hsiang Huang, Caiming Xiong, and Chien-Sheng Wu. The benchmark embodies a synthetic but carefully grounded enterprise environment designed to reflect real‑world data workflows and to stress test long‑context LLMs and retrieval‑augmented systems in a way that maps onto actual software product pipelines.