In the quiet corners of a Greek lab, a handful of researchers introduced something that sounds like science fiction: a single AI that can plan, run, and audit an entire research workflow—end to end—without its users writing a line of code. This is not a rumor about a far-off dream, but a concrete open‑source project called mAIstro, built to orchestrate the messy, multi-step world of medical imaging research. It promises a future where clinicians and researchers can describe what they want in plain language and watch a team of small software agents assemble the right data analyses, radiomic features, segmentations, and predictive models all in one go. The study behind it wasn’t a niche proof of concept; it tested real-life tasks across tabular data and medical images, on multiple datasets and modalities, with several large language models (LLMs) at the core of its decision making.
The work comes from the Artificial Intelligence and Translational Imaging (ATI) Lab at the University of Crete’s School of Medicine, in collaboration with the Computational Biomedicine Laboratory at FORTH and the Karolinska Institute in Sweden. The lead researchers are Eleftherios Tzanis and Michail E. Klontzas, who frame mAIstro as a first-of-its-kind, open-source multi‑agent framework capable of unifying data analysis, radiomics, model development, and inference across varied medical tasks. In other words, this is not just a smarter calculator; it’s a programmable, self-coordinating research assistant that translates natural language into concrete data science steps while keeping everything transparent and reproducible.
A Digital Lab Team That Thinks and Acts
Imagine a small team in a single room, each person a specialist: one handles exploratory data analysis, another weighs which features matter most, a third pulls out radiomic fingerprints from images, a fourth trains segmentation networks, and others build classification or regression models. In mAIstro, that team exists as eight task-specific agents, all supervised by a master agent that acts like a conductor. They operate in cycles of thinking, acting, and observing, guided by a language-enabled brain—the large language model—that translates prompts into plans and then evaluates what happened, feeds the results back, and adjusts the next move.
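To make that cycle concrete, here is a minimal sketch of a think-act-observe loop in Python. It is illustrative only: the `llm.plan` interface, the tool dictionary, and the state object are assumptions made for the sketch, not mAIstro's actual internals.

```python
# Minimal, hypothetical sketch of a think-act-observe agent loop.
# The llm.plan interface and tool dictionary are assumptions, not mAIstro's code.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str                                      # natural-language task from the user
    history: list = field(default_factory=list)    # past actions and their observations

def run_agent(llm, tools, state: AgentState, max_steps: int = 10):
    """Drive one task-specific agent until the language model declares the task done."""
    for _ in range(max_steps):
        # THINK: ask the model what to do next, given the goal and what has happened so far.
        # (Assumed to return a dict such as {"tool_name": ..., "arguments": {...}, "done": False}.)
        plan = llm.plan(goal=state.goal, history=state.history)
        if plan.get("done"):
            return plan.get("answer")
        # ACT: execute the chosen tool with the arguments the model proposed.
        tool = tools[plan["tool_name"]]
        observation = tool(**plan["arguments"])
        # OBSERVE: feed the result back so the next plan can build on it.
        state.history.append({"action": plan, "observation": observation})
    return None  # step budget exhausted without a final answer
```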
Crucially, the system is designed to work with a variety of LLMs, from the giants like GPT-4o and GPT-4.x to other capable open‑source options. The architecture itself rests on a modular pool of tools—16 in total—each with metadata that helps the agents decide when and how to use them. The tools are not mere plug-ins; they encode standardized input/output conventions so that one agent’s output can become another’s input without fragile handoffs. The result is a reproducible, auditable chain of actions that can be inspected, adjusted, and extended by researchers who aren’t necessarily programmers.
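The paper describes each tool as carrying metadata that tells the agents when and how to use it. A registry along these lines might look like the following sketch; the field names and the example entry are hypothetical, not mAIstro's actual schema.

```python
# Hypothetical tool registry with metadata; field names and the sample entry
# are illustrative, not mAIstro's actual schema.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ToolSpec:
    name: str                       # identifier the agents use to select the tool
    description: str                # natural-language summary the LLM reads when planning
    input_schema: Dict[str, str]    # expected arguments and their types
    output_schema: Dict[str, str]   # what the tool returns, so one output can feed the next tool
    run: Callable                   # the callable that does the actual work

REGISTRY: Dict[str, ToolSpec] = {}

def register(spec: ToolSpec) -> None:
    REGISTRY[spec.name] = spec

register(ToolSpec(
    name="radiomics_extraction",
    description="Extract quantitative radiomic features from an image/mask pair.",
    input_schema={"image_path": "str", "mask_path": "str"},
    output_schema={"features_csv": "str"},
    run=lambda image_path, mask_path: {"features_csv": "features.csv"},  # placeholder body
))
```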
Eight Agents Under One Brain
At the heart of mAIstro is a roster of eight task‑specific agents. Each has a clearly delimited role, yet they share a common language for describing tasks and outcomes. The Exploratory Data Analysis (EDA) Agent profiles tabular datasets, generates descriptive statistics and visuals, and can produce textual summaries. The Feature Importance and Selection Agent tests different ways to rank features and chooses the most informative subsets for downstream models. Then there’s the Radiomics Feature Extraction Agent, which taps into PyRadiomics to harvest quantitative features from medical images, accommodating numerous image types, filters, and labels.
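For readers curious what the underlying extraction step looks like, here is a minimal, stand-alone PyRadiomics call. The file names are placeholders, and this shows the library itself rather than mAIstro's wrapper around it.

```python
# Stand-alone PyRadiomics example; file paths are placeholders, and this is the
# underlying library call, not mAIstro's own agent code.
from radiomics import featureextractor

# With default settings the extractor computes shape, first-order, and texture features;
# individual feature classes, image filters, and settings can be tuned per study.
extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.enableAllFeatures()

# Each case needs an image and a matching segmentation mask (e.g. NIfTI files).
features = extractor.execute("case_001_image.nii.gz", "case_001_mask.nii.gz")

for name, value in features.items():
    if not name.startswith("diagnostics_"):  # skip the metadata entries
        print(name, value)
```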
On the imaging side, two segmentation powerhouses are automated: the nnU‑Net Developer and Implementer Agent builds and validates segmentation models with the nnU‑Net framework, while the TotalSegmentator Agent handles large-scale organ segmentation across hundreds of structures in CT or MRI. For predictive modeling on tabular data, the Classifier and Regressor Agents automate model development, evaluation, and deployment using PyCaret’s classification and regression capabilities. Finally, the Image Classifier Agent brings deep learning to 2D and 3D medical images, orchestrating architectures like ResNet, VGG, and Inception through dedicated tools.
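The PyCaret pattern that the Classifier Agent automates is itself only a few lines. The sketch below uses a placeholder CSV and target column, and it shows the library's workflow rather than the agent's own code.

```python
# Core PyCaret classification workflow; the file name and target column are
# placeholders, and this is the library pattern, not mAIstro's agent code.
import pandas as pd
from pycaret.classification import setup, compare_models, finalize_model, save_model

data = pd.read_csv("clinical_table.csv")          # tabular dataset with an outcome column

# setup() handles preprocessing, the train/test split, and cross-validation folds
setup(data=data, target="outcome", session_id=42)

best = compare_models()          # trains and ranks a library of candidate classifiers
final = finalize_model(best)     # refits the best model on the full training data
save_model(final, "best_classifier")
```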
What makes this more than a long list of parts is how the system assembles them. The Master Agent parses a user’s natural‑language request, selects the right specialized agent, launches its toolchain, monitors progress, and loops the results back to drive the next step. In the paper’s experiments, this process proved remarkably capable across diverse data types—from standard clinical tables to complex imaging datasets like BraTS for brain tumors and KiTS for kidney tumors.
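A stripped-down picture of that routing step might look like the sketch below. The agent names, descriptions, and the `llm.complete` interface are illustrative assumptions rather than mAIstro's actual implementation.

```python
# Hypothetical routing sketch; agent names and the llm interface are illustrative.
AGENT_DESCRIPTIONS = {
    "eda_agent": "Profile a tabular dataset: statistics, plots, textual summary.",
    "radiomics_agent": "Extract radiomic features from images and segmentation masks.",
    "nnunet_agent": "Train and validate an nnU-Net segmentation model.",
    "classifier_agent": "Develop and evaluate classification models on tabular data.",
}

def master_agent(llm, agents, user_prompt: str):
    """Pick the specialized agent whose description best matches the request,
    run it, and return its outputs so follow-up steps can chain on the results."""
    menu = "\n".join(f"- {name}: {desc}" for name, desc in AGENT_DESCRIPTIONS.items())
    # Assumed text-completion interface: the model replies with a single agent name.
    choice = llm.complete(
        f"User request: {user_prompt}\nAvailable agents:\n{menu}\n"
        "Reply with the single best agent name."
    ).strip()
    result = agents[choice](user_prompt)   # launch the chosen agent's toolchain
    return choice, result                  # results loop back to drive the next step
```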
Why This Is a Big Bet on Open, End‑to‑End AI Workflows
The appeal of mAIstro is not just the novelty of eight bots in a single system. It’s the “end-to-end” ambition: an autonomous, auditable pipeline that can take a raw dataset, run through feature extraction, model training, and validation, and return fully documented outputs with interpretable results. The system is deliberately modular, so researchers can swap in new tools or replace components as methods evolve. It’s also designed to be LLM-agnostic, meaning labs can run it with whatever language model they trust or have access to—an important feature for environments with strict data governance or restricted internet access.
Five essential ideas underpin this promise. First, accessibility: by letting non‑programmers issue natural language prompts to launch complex AI workflows, the barrier to entry drops dramatically. Second, reproducibility: the inputs, tool configurations, and intermediate outputs are saved in a structured way, enabling independent replication. Third, standardization: with a common framework, disparate imaging datasets and analysis tasks can be approached with uniform rigor. Fourth, extensibility: researchers can contribute new tools or pipelines as medical imaging techniques evolve. And fifth, safety through transparency: because the framework emphasizes explicit provenance and interpretable outputs, clinicians and researchers can audit how a model came to its conclusions, a crucial feature for medical settings.
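On the reproducibility point, one concrete way to save inputs, tool configurations, and intermediate outputs in a structured way is an append-only provenance log per run. The sketch below assumes a simple JSON-lines format chosen for illustration; it is not the framework's actual on-disk layout.

```python
# Illustrative provenance record; the JSON layout is an assumption for the sketch,
# not mAIstro's actual format.
import datetime
import hashlib
import json

def _sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def log_step(run_dir, step_name, tool_name, config, input_files, output_files):
    """Append one auditable entry describing a single tool invocation."""
    record = {
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "step": step_name,
        "tool": tool_name,
        "config": config,                                   # exact parameters passed to the tool
        "inputs": {p: _sha256(p) for p in input_files},     # content hashes pin the data version
        "outputs": output_files,
    }
    with open(f"{run_dir}/provenance.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```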
What Surprises and What It Could Change
One striking result in the study is the performance variability across language models. The researchers ran numerous prompts across a spectrum of LLMs and found that high‑performing models—GPT‑4o, GPT‑4.1, Claude 3.7 Sonnet, DeepSeek variants, and a strong 70‑billion‑parameter Llama—achieved a 100% task success rate across the tested tasks. In contrast, smaller models lagged, with success rates dwindling to the 10%–55% range. This isn’t just a cautionary note about model size; it underscores that the “brain” of an autonomous medical workflow still matters a lot. The architecture is only as good as the reasoning engine that plans and interprets steps.
Beyond the numbers, the project demonstrates that it’s possible to unify seemingly disparate parts of medical AI—radiomics, segmentation, and both classification and regression on tabular data—into a single, coherent process guided by natural language prompts. The Radiomics Feature Extraction Agent, for instance, can produce high‑dimensional feature sets from CT and MRI data, including 3D and 2D analyses, then merge those features with clinical predictors when available. The experience of running end‑to‑end workflows on datasets like BraTS and KiTS without manual intervention hints at a future where researchers can test new hypotheses more quickly, with consistent experimental pipelines that others can reproduce.
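The merging step itself is ordinary tabular bookkeeping. A sketch, with hypothetical file and column names, might look like this:

```python
# Hypothetical merge of radiomic features with clinical predictors;
# file names and the "case_id" column are placeholders.
import pandas as pd

radiomic = pd.read_csv("radiomic_features.csv")    # one row per case: shape, texture, intensity features
clinical = pd.read_csv("clinical_predictors.csv")  # one row per case: age, sex, lab values, outcome

# Inner join on the shared case identifier, then hand off to a tabular modelling agent.
merged = radiomic.merge(clinical, on="case_id", how="inner")
merged.to_csv("combined_features.csv", index=False)
```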
There’s a deeper shift here as well: the democratization of AI research in medicine. The authors frame mAIstro as a foundation—an extensible platform that can be used by researchers who may not be seasoned programmers to build, evaluate, and deploy AI models. It also offers a modular scaffold for experienced teams to drop in new tools, new model types, or new data modalities. In short, mAIstro could become the common language and toolkit for a generation of clinically oriented AI experiments.
Limitations and What Stays Unsettled
Smart automation does not mean instant clinical maturity. The paper is careful to acknowledge that real‑world deployment would require navigating regulatory, ethical, and privacy constraints that go beyond what a research prototype can address. The authors also note that while the system’s reasoning is driven by probabilistic language models, tool execution is deterministic; the quality of the output hinges on the prompts and the model’s ability to reason through a task description. In other words, the “thinking” is probabilistic, but the “doing” can be made reliably repeatable with careful engineering—yet that balance is still delicate in medicine.
Another reality check comes from data diversity. The study tested a wide set of datasets, but medical data is famously heterogeneous: imaging protocols vary, patient populations differ, and institutional practices change. The authors show promising results across several datasets, but the path to widespread clinical adoption will demand even more extensive validation across institutions, scanner types, and patient cohorts. Moreover, while mAIstro can operate offline with local models, healthcare systems must still grapple with governance, version control, and the thorny issue of interpretability when patients’ lives depend on it.
A Modular, Open-Source Path Forward
One of the most important takeaways is not a single model, but a blueprint. By releasing mAIstro as an open‑source framework, the authors invite researchers to build, critique, and improve a shared platform for end‑to‑end medical AI research. It’s a design philosophy: a modular, reproducible scaffold that can be extended with new datasets, new imaging modalities, and new solvers without reinventing the wheel each time. The framework’s emphasis on natural‑language prompts lowers the technical barrier while preserving the rigor of a formal pipeline with traceable inputs, configurations, and outputs.
Looking ahead, the authors suggest several exciting directions. Labs could replace or augment the internal language model with institutionally trusted models to keep data in-house while still offering powerful automation. New tools could be developed to tackle emerging imaging techniques or to fuse imaging data with other omics layers. And because the system is architecture‑driven, researchers could tailor mAIstro to domains beyond radiology—any field that wrestles with high‑dimensional data, complex preprocessing, and multi-step modeling might eventually benefit from a similar agentic framework.
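Because many local model servers expose OpenAI-compatible endpoints, pointing an LLM-agnostic framework at an in-house deployment can be as simple as changing a base URL. The sketch below assumes a local server address and model name chosen for illustration; it is not mAIstro's actual configuration mechanism.

```python
# Hedged sketch: pointing an OpenAI-compatible client at a locally hosted model.
# The URL and model name are placeholders, not mAIstro's configuration API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # in-house server (e.g. vLLM or llama.cpp) with an OpenAI-style API
    api_key="not-needed-for-local",        # local deployments typically ignore the key
)

response = client.chat.completions.create(
    model="local-llama-70b",               # whatever model the institution hosts
    messages=[{"role": "user", "content": "Plan an exploratory analysis of this dataset."}],
)
print(response.choices[0].message.content)
```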
Where This Leaves Patients, Clinicians, and Researchers
For patients and clinicians, the immediate takeaway is a cautious optimism: better tools to organize and evaluate AI in medicine may accelerate discovery and improve transparency. For researchers, mAIstro offers a way to test ideas quickly, requiring fewer lines of code to set up experiments, while preserving a clear record of what was tried, what worked, and why. For the broader scientific community, the open‑source nature means the work can be scrutinized, improved, and extended—an antidote to the reproducibility anxieties that sometimes dog AI in health care.
In naming the institutions behind the effort—the University of Crete’s ATI Lab, FORTH in Greece, and Karolinska Institute in Sweden—the authors remind us that this kind of project thrives when cross‑pollination happens across disciplines and borders. The contribution isn’t a single breakthrough so much as a practical, scalable approach to building, testing, and sharing AI in medicine. And at the helm, Eleftherios Tzanis and Michail E. Klontzas anchor the work in a real, research‑driven culture where curiosity, collaboration, and careful validation matter as much as speed.
Conclusion: A New Kind of Scientific Labor, Shared and Open
mAIstro isn’t a completed solution to every medical imaging challenge. It is, instead, a provocative invitation to reimagine how we design, run, and learn from AI experiments in medicine. By assembling a small, diverse team of autonomous agents under a language‑driven brain, the project points toward a future where researchers can sculpt complex workflows with natural language, track every step, and share results in a way that makes replication and critique not inconveniences but the default. The open‑source nature of the project makes it less a novel gadget and more a communal toolkit—one that could help bend the arc of medical AI toward faster discovery, clearer interpretation, and more equitable access to powerful computational methods.
That future isn’t guaranteed, and it isn’t imminent. But with mAIstro, the question moves from whether AI can do medical data work at all to how we design, govern, and improve systems that can do it—from end to end and in a way that invites everyone to participate. In that sense, the project is less about a single breakthrough and more about a reproducible, extensible way to teach the AI to teach itself how to do science.