Language is a library with many dialects. The way we talk about people, places, and events in a CNN headline is very different from how we talk about diseases in a biomedical paper or how a tweet frames a political moment. Named Entity Recognition, a core task in information extraction, has to pick out people, organizations, and places from text. But a model trained on one kind of text often stumbles on another. It’s the classic: one size fits some shelves, but not all shelves. A team at Huazhong University of Science and Technology led by Zhuojun Ding, Wei Wei, and Chenghao Fan has proposed a crafty workaround that plays like a committee rather than a single expert: select a handful of domain specialists and merge their know‑how at inference time to tailor a model to a target domain without retraining from scratch.
Think of it as assembling a panel of translators who specialize in different domains. When a new document arrives, the system asks, in effect, which panelists are most likely to understand this text and what would happen if we let their opinions converge. The result is a system light enough in labeling and training demands that it can scale to many domains while preserving sharp, domain-specific nuance. The researchers call this SaM: Select and Merge. The big leap is not just picking the right experts but blending their knowledge on the fly to create a domain‑tailored instrument for NER, all without paying the price of fully retraining a gigantic model for every new domain.
There is a human story here: the challenge of teaching a single generalist model to understand everything well enough to be trusted across contexts. The SaM idea leans into specialization, then orchestrates specialization at the moment of use. It acknowledges that data distributions differ from domain to domain and that a one‑size‑fits‑all approach often squanders valuable signal. And it does this with a practical eye toward scalability: experts can be added or removed as needs change, with only light overhead in storage and inference. That’s crucial in a field where the data landscape is always shifting, and where institutions must adapt quickly without breaking the bank.
The study behind SaM evaluated its approach on two well-known NER benchmarks, CrossNER and MIT, which cover a spectrum of domains from AI and literature to science and restaurant reviews. The researchers trained dozens of domain experts, each tuned on data from a specific domain, and then, at test time, used two complementary strategies to pick the most helpful experts for a given target domain. The results were robust: on average, SaM outperformed a single unified model by about 10 percent in entity recognition accuracy, with certain domains seeing gains as high as 20 percent. It’s not just a nice bump; it’s a rethinking of how we deploy knowledge in real time, not by forcing a single model to swallow every domain, but by letting a small, curated set of specialists speak when their expertise matters most.
In the pages that follow, we’ll walk through what the SaM framework actually does, why it matters for the future of adaptable AI systems, and where this kind of domain‑aware collaboration could take us next. A note of clarity: this approach does not require re‑labeling target-domain data or retraining from scratch for every new domain. Instead, it leverages existing domain experts and fuses their strengths at inference time, a design choice that could dramatically reshape how we think about building scalable, robust information extractors in a world full of diverse text.
A Patchwork of Domain Expertise
The problem SaM tackles is deceptively simple in appearance: how do you build a single system that can accurately extract entities across many domains? The researchers started by collecting a broad suite of 17 NER datasets that span six domains: Biomedical, Legal, News, Social media, STEM, and Traffic. They chiseled each dataset down to a useful, high‑signal portion (removing noisy data) and then trained domain‑specific experts on each domain’s data using instruction tuning. In other words, they created a cabinet of domain‑specialist models, each finely tuned to recognize the kinds of entities that tend to appear in its own field: medical entities in AnatEM, legal entities in E‑NER, news entities in conllpp, or social media entities in the BroadTweet and WNUT datasets, among others.
Crucially, SaM does not attempt to train a single model that pretends to know everything. Instead, it builds a portfolio of experts and then, at test time, asks a practical question: which experts should contribute to a target domain? The framework proposes two complementary signals to answer this: domain similarity and sampling evaluation. The first looks at data distributions—how close is the target domain to the data distributions each expert was trained on? The second uses a kind of data‑driven peer review: it samples a small set of target‑domain instances, lets all experts generate predictions, and then uses a majority vote to create pseudo labels. Each expert’s performance on these pseudo labels helps decide which experts to merge. It’s a filter-and‑merge workflow that tries to preserve domain nuance while avoiding wasteful cross‑domain cross‑talk.
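To make the sampling‑evaluation step concrete, here is a minimal sketch in Python. It assumes each expert exposes a predict method that returns a set of (span, entity type) pairs; the helper names and the simple majority threshold are illustrative choices, not the authors' code.

```python
from collections import Counter

def pseudo_label(texts, experts, min_votes=None):
    """Build pseudo labels for sampled target-domain texts by majority vote."""
    min_votes = min_votes or (len(experts) // 2 + 1)
    labels = []
    for text in texts:
        votes = Counter()
        for expert in experts:
            votes.update(expert.predict(text))  # each prediction is a (span, type) pair
        labels.append({ent for ent, n in votes.items() if n >= min_votes})
    return labels

def score_expert(expert, texts, pseudo_labels):
    """Micro-F1 of one expert measured against the ensemble's pseudo labels."""
    tp = fp = fn = 0
    for text, gold in zip(texts, pseudo_labels):
        pred = set(expert.predict(text))
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Rank experts by agreement with the pseudo labels and keep the top few to merge:
# ranked = sorted(experts, key=lambda e: score_expert(e, sample, labels), reverse=True)
```

Experts that agree most with the ensemble's pseudo labels earn a seat at the merging table.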
On the data side, the team gathered six principal domains and organized them into a structured training regime. The domain embedding, which serves as a compact representation of a domain’s data, is computed by averaging the embeddings of its data points. Ranking experts then becomes a matter of cosine similarity between these domain centroids. In practice, that means a biomedical domain expert whose training data sit in a cluster far from the target domain should, on purely distribution grounds, get a lower priority than one whose data sit closer in the embedding space. But the sampling evaluation step adds a practical check: even if a domain looks similar on paper, how does it perform on real target‑domain instances when their predictions are compared against the ensemble’s pseudo labels? The answer, the researchers found, is that the two strategies complement each other beautifully.
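The domain‑similarity signal is just as easy to sketch. Assuming we already have instance embeddings from some off‑the‑shelf encoder, the recipe below averages them into per‑domain centroids and ranks source domains by cosine similarity to the target; the function names are hypothetical.

```python
import numpy as np

def domain_centroid(embeddings: np.ndarray) -> np.ndarray:
    """Represent a domain by the mean of its instance embeddings."""
    return embeddings.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def rank_experts_by_similarity(source_embeddings: dict, target_embeddings: np.ndarray):
    """Score each source-domain expert by how close its centroid is to the target's."""
    target = domain_centroid(target_embeddings)
    scores = {name: cosine_similarity(domain_centroid(embs), target)
              for name, embs in source_embeddings.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```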
There’s a simple but powerful intuition here: domain similarity gives you a principled starting point, a kind of theoretical nudge toward useful experts. Sampling evaluation, by contrast, gives you ground‑truth discipline, albeit with pseudo labels, to sort the wheat from the chaff in the most stubborn real‑world cases. Together, they create a short list of the most helpful experts to merge for a given target domain. It’s like picking a jury where every member has a track record in some closely related field, then letting the collective verdict reflect the target domain’s quirks rather than forcing a universal rule across all domains.
SaM in Action: Merging Expertise Without Re‑Training
At the heart of SaM is a careful dance around what to merge and how to merge. The researchers treat each domain‑specific expert as a delta that adds new capabilities to a base model. In their setup they rely on parameter‑efficient fine‑tuning. The delta, or task vector, represents what the expert has learned beyond the base. When several experts are selected, SaM merges these deltas to form a task‑specific model. The merging leverages a technique called TIES‑Merging, designed to address parameter redundancy and sign inconsistencies that can otherwise undermine multi‑expert fusion. The upshot is a single, cohesive model tailored to the target domain, built not by training anew, but by recombining existing specialized pieces.
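For readers who want to see what that merging step might look like, here is a rough, single‑tensor sketch of a TIES‑style merge of expert deltas. It walks through the trim, sign‑election, and disjoint‑merge stages the technique is known for, but it is a simplified illustration under the assumption that each expert's weights can be written as base plus delta, not the authors' implementation.

```python
import torch

def ties_merge(deltas: list[torch.Tensor], density: float = 0.2, lam: float = 1.0) -> torch.Tensor:
    trimmed = []
    for d in deltas:
        # 1) Trim: keep only the largest-magnitude entries of each task vector.
        k = max(1, int(density * d.numel()))
        threshold = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))
    stacked = torch.stack(trimmed)                      # [num_experts, *param_shape]
    # 2) Elect sign: per parameter, follow the sign with the larger total mass.
    elected = torch.sign(stacked.sum(dim=0))
    # 3) Disjoint merge: average only the entries that agree with the elected sign.
    agree = (torch.sign(stacked) == elected) & (stacked != 0)
    merged = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
    return lam * merged                                 # add this to the base weights

# merged_delta = ties_merge([expert_a_delta, expert_b_delta, expert_c_delta])
# task_specific_weight = base_weight + merged_delta
```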
In fact, SaM yields two task‑specific models for a target domain: one merged from the experts chosen by domain similarity (call it M_DS) and one from the experts chosen by sampling evaluation (M_SE). Each model reflects a different perspective on what the target domain needs. When inference arrives, both task‑specific models generate predictions independently, and SaM takes the union of their outputs to form the final set of named entities. The union isn’t just a redundancy play; it leverages the fact that the two models can capture complementary strengths. Where one model might miss a rare entity type present in a new domain, the other might pick it up, and together they cover more ground with better reliability.
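The final inference step is simple enough to show in a couple of lines. Assuming each task‑specific model returns its entities as a set of (span, type) pairs, the combination is literally a set union.

```python
def extract_entities(text, model_ds, model_se):
    """Union of the two task-specific models' predictions (M_DS and M_SE)."""
    return set(model_ds.predict(text)) | set(model_se.predict(text))
```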
The team also explored a practical simplification they call SaMeco, which bundles the two strategies into an economical variant that still preserves most of the performance benefits. The idea is to keep inference costs in check while reaping the advantages of combining domain‑expert insights. In other words, you get a scalable, adaptable tool that can be deployed across an organization with limited hardware overhead, rather than a laboratory curiosity that requires a fleet of specialized GPUs to run in production.
Beyond the core idea, the experiments drill into how many experts to merge, which merging method to use, and how many source domains to draw from. The results show a sweet spot: merging about two to four experts often yields the best average improvements, though the optimal number varies by target domain. More sources can help, up to a point, but too many can dilute the benefits if the domains don’t align well with the target. It’s a reminder that even in a modular, inference‑time system, balance and curation matter as much as raw breadth.
Why This Matters: A Practical Path to Flexible AI
The SaM framework is more than a clever trick for NER. It gestures toward a broader philosophy of building adaptable AI systems: instead of forcing a single, monolithic model to cover every corner of a complex world, we assemble a library of domain specialists and then compose them at the moment we need them. This approach mirrors how skilled teams work in human domains: a tax attorney, a healthcare consultant, and a data scientist might each weigh in on a problem, and the final decision reflects the right mix of perspectives for the situation at hand. SaM translates that mindset into the realm of machine reading and text understanding.
The practical payoffs are notable. First, adaptability scales gracefully. New domains can be added by training a small number of new experts and injecting them into the system without retraining the entire apparatus. Second, the framework preserves domain nuance. A single, global model often settles for a middle ground that pleases no one; SaM’s expert pool preserves domain‑specific cues that matter for precise NER, such as how scientific papers encode entities or how legal documents name statutes. Third, the system is designed with real‑world deployment in mind. It accommodates addition and removal of experts as the data landscape evolves, which is essential for enterprise environments where text sources shift with regulatory, technological, or social changes.
In their analysis, the authors also show the framework’s versatility across different architectural backbones and even touch on multilingual and non‑strict domain scenarios. The takeaway is not a one‑off trick but a blueprint for building modular knowledge systems that can flex and adapt without being rewritten from the ground up. It’s a blueprint that could influence a wide range of tasks beyond NER, from relation extraction to event detection, wherever the texture of the data matters as much as its content.
There are caveats, of course. SaM does introduce some storage overhead because you keep multiple domain experts, even if you avoid full retraining. Inference can be slightly more expensive when you run two task‑specific models, though economically minded variants like SaMeco show you can keep costs in the same ballpark as a traditional unified approach while gaining performance. The experiments focus on NER; extending the idea to other information extraction tasks will require careful empirical work. And, as with any data‑driven method, the quality and representativeness of the source domains influence the effectiveness of the domain similarity signals and the sampling evaluation.
Still, the core idea shines: knowledge can be modular, adaptable, and scalable without sacrificing domain fidelity. If you imagine the future of AI systems as a growing cabinet of specialized tools, SaM offers a practical way to assemble the right tools for the right job at the right time. The result is not just smarter NER; it’s a blueprint for smarter, more human‑sensitive AI systems that can learn to speak the language of any field by listening to the right local experts and making their voices heard when they matter most.
As the authors conclude, SaM represents a meaningful step toward adaptable and scalable information extraction. It is built on real data, tested across meaningful domains, and grounded in a simple but powerful insight: specialization, when orchestrated intelligently, can outperform universalism without breaking the bank. In the language of science and engineering, it’s a practical demonstration of how modularity and emergence can go hand in hand: a few well‑chosen experts, working together at inference time, can read the world with greater fidelity than a single, one‑size‑fits‑all system.
Institutional note: The work was conducted at the School of Computer Science & Technology, Huazhong University of Science and Technology, with Zhuojun Ding, Wei Wei, and Chenghao Fan as lead authors. The SaM framework stands as a testament to what thoughtful collaboration across domains can achieve when design priorities are adaptability, scalability, and real‑world impact.