A Universal AI for Medical Imaging Across Specialties

Medical imaging has become the nervous system of modern medicine. From a patient’s chest x-ray to a biopsy’s tissue slide, doctors build a map of what’s happening inside the body. Yet the tools that help interpret these images are often siloed by modality (the kind of image) and by specialty (radiology, ophthalmology, pathology, dermatology, and beyond). A new kind of AI foundation model aims to change that by learning from millions of images across many modalities and diseases, all within one architecture. The goal is not just to be smart at one thing, but to be broadly competent across the entire imaging landscape. The study behind MerMED-FM—short for Multimodal, Multi-Disease Medical Imaging Foundation Model—embodies this ambition and tests whether a single model can read CTs, ultrasounds, pathology slides, and fundus photographs with the same eye for diagnosis as a portfolio of specialized tools.

The work is a cross-institutional effort led by researchers at the Institute of High Performance Computing (IHPC), A*STAR, and the Singapore Eye Research Institute, with collaborators spanning Singapore General Hospital, Duke-NUS Medical School, and several global partners. The lead authors include Yang Zhou and Chrystie Wan Ning Quek, with a large team of engineers and clinicians contributing. In short, this is an attempt to capture the way clinicians actually think: they bounce between imaging streams, compare patterns across organs, and develop a gestalt sense of illness that isn’t tied to one snapshot or one test. MerMED-FM is built to mirror that kind of cross-disciplinary reasoning in an AI system, and to do it efficiently, even when labeled data are scarce.

Why does this matter? Because in real clinics, a single patient might undergo multiple imaging tests across specialties. A true multispecialty foundation model could streamline workflows, reduce software fragmentation, and offer a single, consistent interpretive framework that scales from ophthalmology to oncology. If MerMED-FM can perform robustly across seven imaging modalities with millions of images and still learn from limited labeled data, it could reshape how hospitals deploy AI—moving from a patchwork of specialist tools to a single, reliable assistant capable of cross-checking findings in one pass.

In the pages ahead, we’ll explore what MerMED-FM does, why the results matter, and what surprises lie behind the numbers. We’ll also connect the technology to the kind of decisions doctors actually make in busy clinics and how patients might benefit when AI understands the body from many angles, not just one.

What MerMED-FM Is and How It Learns

MerMED-FM is built on a vision of a shared visual language. It uses Vision Transformers as its backbone and a teacher–student framework that learns from many “views” of the same image. But the clever centerpiece is a memory-augmented, self-supervised learning approach. The model is trained on roughly 3.3 million images drawn from more than ten medical specialties and seven imaging modalities—CT, chest X-ray, ultrasound, color fundus photography, optical coherence tomography, pathology patches, and dermoscopic images. That breadth is the point: it isn’t a model trained to recognize one disease in one kind of image; it’s meant to extract patterns that recur across different kinds of medical pictures and different diseases.

How does it manage such diversity without drowning in data? The training mix is driven by self-supervised learning, which means the model builds meaningful representations without requiring every image to be painstakingly labeled. A memory module acts like a living notebook: representations from past training samples across modalities are stored and consulted when new images come in. This memory encourages consistency, helps the model relate findings across different image types, and guards against forgetting information as the model encounters new data. The system also uses a teacher–student setup, where the teacher slowly guides the student’s learning to stabilize training and improve generalization across tasks.
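For readers who want to see the mechanics, here is a minimal, hypothetical sketch of the two ideas just described: a teacher that is updated as a slow moving average of the student, and a memory bank that stores representations for later consistency checks. It uses PyTorch with toy stand-in encoders; the names, sizes, and hyperparameters are illustrative and do not come from the MerMED-FM paper.

```python
# Minimal sketch of an EMA teacher-student update with a simple memory bank.
# Everything here (module sizes, momentum, bank size) is illustrative.
import copy
import torch
import torch.nn.functional as F

class MemoryBank:
    """Stores past feature vectors so they can be consulted in later batches."""
    def __init__(self, dim, size=4096):
        self.features = torch.zeros(size, dim)
        self.ptr = 0
        self.size = size

    def update(self, feats):
        n = feats.shape[0]
        idx = torch.arange(self.ptr, self.ptr + n) % self.size
        self.features[idx] = feats.detach()
        self.ptr = (self.ptr + n) % self.size

def ema_update(teacher, student, momentum=0.996):
    """Teacher weights drift slowly toward the student (exponential moving average)."""
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(momentum).add_(s, alpha=1.0 - momentum)

# Toy encoders standing in for the Vision Transformer backbone.
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 256))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

bank = MemoryBank(dim=256)
view_a = torch.randn(8, 3, 224, 224)   # two augmented "views" of the same batch of images
view_b = torch.randn(8, 3, 224, 224)

z_student = student(view_a)
with torch.no_grad():
    z_teacher = teacher(view_b)

# Consistency loss: the student should agree with the slowly moving teacher.
loss = 1.0 - F.cosine_similarity(z_student, z_teacher).mean()
loss.backward()
ema_update(teacher, student)

# Store the teacher's features; in a fuller version, entries already in the bank
# would also enter the loss, pulling new embeddings toward related past ones.
bank.update(z_teacher)
```

In the real system the encoders are Vision Transformers and the objective is richer, but the slow-moving teacher and the replayable memory are the heart of the recipe described above.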

Crucially, MerMED-FM avoids leaning on language inputs to guide interpretation. It’s a vision-only model that learns from visual patterns themselves, which matters in clinical settings where imaging data arrive as images first and text descriptions come later. The memory is organized in blocks and updated with a first-in, first-out scheme, ensuring that newer clinical realities remain relevant while still retaining historical context. The researchers also implemented modality- and specialty-aware sampling to prevent any single image type from dominating learning. The result is a model that tries to be equally credible across radiology, ophthalmology, pathology, and dermatology tasks, rather than excelling only where data or labeling were abundant.
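To make those two housekeeping ideas concrete, here is a small illustrative sketch of a per-modality first-in, first-out memory and an inverse-frequency sampler. The modality names follow the paper’s list, but the block sizes, dataset counts, and function names are invented for illustration and are not the authors’ implementation.

```python
# Sketch of (1) a memory organized as fixed-size FIFO blocks per modality, and
# (2) modality-aware sampling so no single image type dominates a training batch.
# All sizes and counts below are made up for illustration.
import random
from collections import deque

MODALITIES = ["ct", "cxr", "ultrasound", "fundus", "oct", "pathology", "dermoscopy"]

# (1) One FIFO block per modality: when a block is full, the oldest entry is evicted.
memory_blocks = {m: deque(maxlen=512) for m in MODALITIES}

def store(modality, feature_vector):
    memory_blocks[modality].append(feature_vector)  # deque drops the oldest item automatically

# (2) Inverse-frequency sampling: rarer modalities are drawn more often per batch.
def modality_aware_sample(dataset_sizes, batch_size):
    weights = [1.0 / max(dataset_sizes[m], 1) for m in MODALITIES]
    return random.choices(MODALITIES, weights=weights, k=batch_size)

sizes = {"ct": 900_000, "cxr": 800_000, "ultrasound": 120_000, "fundus": 400_000,
         "oct": 600_000, "pathology": 300_000, "dermoscopy": 80_000}  # fictional counts
store("ct", [0.1, 0.2, 0.3])                      # pretend feature vector for one CT slice
print(modality_aware_sample(sizes, batch_size=8))  # a batch tilted toward rarer modalities
```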

To adapt MerMED-FM to downstream tasks, only the student encoder is finetuned for each new disease–modality pair. In their experiments, the team tested it across seven imaging modalities and 25 public datasets while also evaluating performance on local patient data from a Singapore clinical cluster. The headline claim is not just that MerMED-FM performs well overall, but that it can match or exceed the specialized models that dominate particular niches, sometimes with far less labeled data.
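The adaptation step can be pictured with a short, hypothetical sketch: a pretrained encoder (standing in for the student) receives a small classification head, and both are fine-tuned on a labeled batch for a single disease–modality pair. The code uses PyTorch with toy modules and is not the authors’ pipeline.

```python
# Hypothetical sketch of adapting a pretrained student encoder to one task
# by attaching a small classification head and fine-tuning end to end.
import torch
import torch.nn as nn

# Stand-in for the self-supervised student encoder (in reality a Vision Transformer).
pretrained_student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))

class DownstreamClassifier(nn.Module):
    def __init__(self, encoder, feature_dim=256, num_classes=2):
        super().__init__()
        self.encoder = encoder               # weights initialized from pretraining
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, x):
        return self.head(self.encoder(x))

model = DownstreamClassifier(pretrained_student)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # fine-tunes encoder and head together
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)         # a tiny labeled batch for one modality
labels = torch.tensor([0, 1, 1, 0])          # e.g. disease absent / present

logits = model(images)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```

Each new disease–modality pair would get its own copy of this fine-tuning step, while the pretrained weights supply the shared cross-specialty starting point.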

Behind the scenes, the training ran on a powerful hardware stack—eight NVIDIA H100 GPUs—and the authors report encouraging data efficiency: with only half of the typical fine-tuning data, MerMED-FM remained highly accurate on several tasks. This is not a magic trick; it’s a carefully engineered combination of self-supervised representations, a cross-modality embedding space, and a memory mechanism that anchors what the model has learned so far while it absorbs new patterns.
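One simple way to probe that kind of data efficiency, sketched below under the assumption of a standard scikit-learn workflow, is to fine-tune on progressively smaller, label-stratified fractions of the training set and watch how performance holds up. The file names and fractions are placeholders.

```python
# Illustrative data-efficiency probe: keep only a fraction of the labeled data
# (preserving class balance), fine-tune, and compare results across fractions.
from sklearn.model_selection import train_test_split

def subsample(image_paths, labels, fraction, seed=0):
    """Keep `fraction` of the labeled data while preserving class balance."""
    if fraction >= 1.0:
        return image_paths, labels
    kept_paths, _, kept_labels, _ = train_test_split(
        image_paths, labels, train_size=fraction, stratify=labels, random_state=seed
    )
    return kept_paths, kept_labels

paths = [f"img_{i}.png" for i in range(1000)]   # placeholder file names
labels = [i % 2 for i in range(1000)]           # balanced toy labels
for frac in (1.0, 0.5, 0.25, 0.1):
    p, y = subsample(paths, labels, frac)
    print(frac, len(p))   # fine-tune and evaluate at each fraction (training loop omitted)
```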

Across Modalities and Diseases

When you tune a model to a single modality, you optimize for a narrow slice of the world. MerMED-FM deliberately broadens the scope: CT slices for lung cancer and COVID-19, chest radiographs for pneumonia and pneumothorax, ultrasounds for breast cancer, histopathology patches for colorectal and breast cancers, color fundus photographs and OCT for eye diseases, and dermoscopic skin images for a range of dermatologic conditions. Across these seven modalities and 25 public datasets, MerMED-FM demonstrated high diagnostic performance. A striking summary from the study is a mean AUROC (area under the receiver operating characteristic curve, a standard measure of diagnostic accuracy) of 0.935 across tasks, with performance often at or near the top compared to both multispecialty and single-modality models.
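Because AUROC anchors most of the numbers that follow, a tiny worked example may help: the metric rewards a model for ranking diseased cases above healthy ones, independent of any single decision threshold. The scores below are invented purely to illustrate the calculation.

```python
# Toy AUROC calculation: how well do predicted scores rank positives above negatives?
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                        # ground-truth labels (toy data)
y_score = [0.1, 0.65, 0.85, 0.7, 0.95, 0.3, 0.6, 0.2]    # model-predicted probabilities

print(roc_auc_score(y_true, y_score))  # 0.9375 here; 1.0 is perfect ranking, 0.5 is chance
```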

On chest CT, MerMED-FM achieved a mean AUROC around 0.975 for identifying lung carcinomas and COVID-19 pneumonia, with extremely high sensitivity and specificity in some datasets. On chest radiographs, it reached near-top performance for pneumonia and pneumothorax identification, and on ultrasound it delivered robust results in detecting breast cancer. Histopathology tasks—where the stakes feel highest because a single misread can alter a patient’s treatment path—also showed excellent results, with the model approaching the accuracy of specialized pathology systems on several benchmarks.

Ophthalmology tasks were a particular highlight. On OCT images—crucial for detecting diabetic retinopathy and other retinal diseases—the model reached AUROCs close to 1.0 in several datasets, and it performed very well on color fundus photographs, including glaucoma detection where the margin of improvement over some baselines was statistically significant. The dermatology tasks—ranging from pigmented skin lesions to broader skin diseases—also saw MerMED-FM performing on par with or closely behind specialist dermatology models in several benchmarks, even though the model was trained without a language-based grounding.

One of the study’s core takeaways is data efficiency. MerMED-FM maintained strong performance as the researchers reduced the amount of fine-tuning data for downstream tasks, illustrating the model’s ability to generalize from a shared, cross-specialty representation. In other words, the model can learn how to interpret a chest CT, a fundus photo, and a pathology slide using a common set of features that translate across domains—much as a clinician might notice a pattern in an X-ray that also appears in a pathology slide, even though the two images look nothing alike at first glance.

Beyond numbers, the results hint at practical realities. In clinical settings, a single AI system that can handle multiple image types could streamline workflows, reduce the maintenance burden of running separate models, and provide a consistent interpretive frame across departments. The authors emphasize that MerMED-FM’s architecture is designed with deployment in mind: a unified model can be integrated into hospital AI infrastructure and imaging pipelines in a way that scales with patient volume and hospital complexity.

In addition to demonstrating superior average performance, the researchers stress data efficiency and low-shot adaptability. For several tasks, MerMED-FM outperformed strong specialty-focused models even when trained with relatively small amounts of labeled data. That attribute matters in medicine, where high-quality annotated data can be scarce, expensive to obtain, or sensitive to local practice patterns. The idea is not to replace clinicians but to act as a first-pass, cross-checking partner that can flag inconsistencies or highlight patterns that might be overlooked when reading a single image in isolation.

From Triage to Treatment: Why It Matters in the Real World

Think of MerMED-FM as a universal translator for medical imaging. If a patient presents with a constellation of symptoms suggesting a systemic problem, doctors may need to pull signals from a CT scan, an eye OCT, a pathology slide, and even a skin image to build a coherent story. Today, that often means juggling several specialized tools and coordinating multiple teams. A unified model could streamline triage, help standardize diagnoses across departments, and reduce the lag between initial imaging and final conclusions.

In the Singapore cluster where some of the evaluation occurred, MerMED-FM demonstrated encouraging performance on local datasets drawn from real-world clinical practice. That matters because public benchmarks, while valuable, don’t always capture the complexity and variability of everyday hospital imaging—differences in scanners, patient populations, and labeling practices can all tilt results. The researchers argue that a multispecialty model with a robust memory mechanism is better poised to navigate such variability than a suite of siloed, separately optimized tools.

For patients, the potential benefits are tangible but nuanced. A more capable, single-model system could speed up diagnostic workflows, reduce the chance that a subtle cue in one image type is overlooked because it lives in a different tool, and support more consistent decision-making across clinicians who rely on different imaging modalities. It could also lower the cognitive load on junior staff who must learn to interpret multiple AI assistants, each with its own quirks and limitations. In short, a truly cross-specialty AI could cut through the fog that often surrounds multi-modal diagnoses and help clinicians focus on what matters: the patient in front of them.

Of course, there are important caveats. MerMED-FM did not consistently beat every single-modality model on every task—some specialized systems still hold advantages in narrow domains. It is also a vision-only model that currently reasons over two-dimensional inputs; volumetric data and full 3D reasoning across image stacks, such as a CT volume read alongside a pathology slide, will require further development. The ethical, regulatory, and privacy dimensions of deploying a cross-specialty system are nontrivial. The authors acknowledge the need for careful governance, ongoing evaluation, and transparent reporting to ensure safety, reliability, and accountability in real-world care.

In addition, the work highlights a broader design choice in AI: do we build one grand, generalist model that tries to do everything, or a suite of specialists that collaborate? MerMED-FM leans toward the former, arguing that a shared representation space can unlock cross-domain insights and reduce duplicated effort. Whether hospitals will embrace a single, memory-rich, multispecialty AI depends on how well the system integrates with existing workflows, how confidently clinicians can trust its recommendations, and how regulators evaluate such a broad tool in practice.

Limits, Caveats, and the Road Ahead

No study is a final verdict, and MerMED-FM is no exception. The model’s strongest gains appear in certain lung and ocular tasks, reflecting the breadth of data and the particularities of those domains. In some dermatology and pathology tasks, MerMED-FM trails the very best specialist systems. The team is candid about the fact that the model does not yet handle volumetric (3D) imaging, focusing instead on slice-based inputs. That leaves room for extension into 3D contexts, which would be essential for many radiology and oncology workflows where tissue and organ structure must be assessed in three dimensions.

Another caveat is the interplay between data diversity and model bias. A model trained on large, diverse datasets can still learn biases present in the source data, and multimodal learning adds new axes along which biases can creep in. The researchers address this with careful sampling strategies and memory-based regularization, but real-world deployment will require ongoing auditing, calibration, and stakeholder oversight to ensure equitable performance across patient groups and imaging centers.

Finally, MerMED-FM raises questions about the degree to which AI should interpret medical images autonomously versus augmenting clinician judgment. The authors frame the model as a tool to enhance diagnostic accuracy, streamline workflows, and support a patient-centered, cross-specialty approach—not as a replacement for clinicians. The human-in-the-loop remains essential: AI can surface patterns, propose probabilities, and co-pilot decisions, but clinicians must still integrate imaging findings with history, examination, and judgment that goes beyond what pixels alone can reveal.

The next steps proposed by the team include extending the model to segmentation and prognostication, incorporating longitudinal data to track disease progression, and exploring ways to synthesize data from multiple modalities into a unified patient-level perspective. They also point to regulatory and ethical considerations as central to real-world adoption, from privacy protections to transparent performance reporting and robust quality assurance. If future work can extend the memory framework to three-dimensional data, integrate richer clinical histories, and maintain reliability across diverse clinical settings, the dream of a truly universal medical imaging AI inches closer to fruition.

In the end, MerMED-FM isn’t just an algorithmic trophy case. It’s a blueprint for how AI could align with the messy, multidimensional reality of patient care, where images—from the chest to the retina to the biopsy slide—are not isolated data points but parts of a single, evolving story. The study, anchored in the Singaporean scientific ecosystem—with institutions like IHPC, the Singapore Eye Research Institute, and partners across Duke-NUS and Stanford-affiliated centers—shows that a memory-augmented, self-supervised, vision-only foundation model can learn from a forest of images, not just a single tree. If that forest can be nurtured responsibly, it could help clinicians see more clearly across disciplines and deliver care that’s quicker, steadier, and more attuned to the patient as a whole.

Institutional note: The MerMED-FM work is led by researchers at the Institute of High Performance Computing (IHPC), A*STAR, and the Singapore Eye Research Institute, with collaborators from Duke-NUS Medical School and partner hospitals, reflecting a concerted national effort to push medical AI toward practical, cross-disciplinary clinical utility.