Clinical notes are the whispered history books of medicine. They sit in the margins of electronic health records, scribbled in the moment between test results and treatment plans, rich with nuance but almost invisible to traditional structured-data analysis. When researchers talk about menstrual health, they often stumble over a stubborn truth: the most informative clues aren’t in neatly coded fields, but in the long, winding narratives clinicians write during patient visits. This paper, a collaboration between the Hasso Plattner Institute at the University of Potsdam and the Icahn School of Medicine at Mount Sinai, aims to teach computers to read those narratives and extract consistent, clinically useful facts about menstruation. The stakes aren’t academic trivia. Detailed menstrual data helps illuminate risks for cardiometabolic disease, guides gynecological care, and could finally unlock large-scale insights about women’s health that have been hiding in plain sight.
Led by Anna Shopova, with colleagues Christoph Lippert, Leslee J. Shaw, Eugenia Alleva, and others, the study tests a focused question through a clever mix of language modeling and information retrieval. The core idea is simple in spirit: can a machine learn to spot specific menstrual attributes—dysmenorrhea, regularity, flow, and intermenstrual bleeding—from prose that can easily ramble for pages? The answer, at least in their data, is yes, and with a precision that surprises many who assume medical text is too noisy for reliable automation. The authors emphasize a practical aim: build a pipeline that works well even when you have only a small number of painstakingly annotated notes, which is a common reality in specialized medical domains.
To put it plainly, this work is as much about who is reading the notes as it is about what is in them. The Mount Sinai notes come from a dataset of gynecological well-woman visits, yet the project spans international ground—Germany’s Hasso Plattner Institute brings deep learning expertise to a problem rooted in everyday clinical care in the United States. The result is a demonstration that modern natural language processing can turn unstructured clinical narratives into reliable, structured phenotypes that researchers can analyze at scale. In a field where underdocumentation has long been a bottleneck, the study offers a blueprint for surfacing the personal, lived data of menstrual health from the narratives doctors actually write.
What the study is really doing
At its core, the paper builds and tests an NLP pipeline designed to extract five menstrual attributes from free-text clinical notes: presence of dysmenorrhea, dysmenorrhea severity, regularity of the menstrual cycle, flow (whether the bleeding is scanty, normal, or abundant), and intermenstrual bleeding. The researchers don’t just throw a single model at the problem; they run a careful comparison across several modern approaches to see which combination of training strategy and prompting yields the most reliable results. Their experiments pit supervised fine-tuning on a domain-specific language model against prompt-based techniques and a hybrid multi-task approach, all while testing the impact of a retrieval step that narrows the focus to the most relevant passages in a note.
The dataset is small by machine-learning standards: 140 notes from gynecological visits, annotated by clinicians for the five attributes, split into 91 for training and 49 for testing. The distribution is telling. Dysmenorrhea appears in about half of the notes, regularity is documented in about three-quarters, flow in roughly two-thirds, and intermenstrual bleeding in a minority. The authors don’t pretend that this is a representative cross-section of all clinics or all patient populations. Instead, they use what they have to explore how far current NLP methods can push performance when data are scarce and notes are long and messy—a common situation in medical subfields where detailed phenotyping isn’t routinely codified in structured fields.
To address the practical problem of long notes, the team adds a retrieval preprocessing step. Long clinical notes often exceed the token limits of language models, so the pipeline first selects the ten most relevant text segments. They do this with a hybrid approach: a fast lexical search (BM25) plus a semantic similarity measure (MedEmbed). The idea is to emulate a clinician’s habit of skimming for the most informative passages, but at machine scale. This retrieval step is not a cosmetic add-on; it consistently improves performance across models and tasks, ensuring that the model doesn’t have to read through hundreds of irrelevant sentences to find a few critical phrases about a patient’s period pattern or bleeding.
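The hybrid retrieval idea described above can be sketched in a few lines. This is a hedged, stdlib-only illustration, not the paper's implementation: the simplified BM25-style scorer, the `embed` placeholder (standing in for a sentence-embedding model such as MedEmbed), and the `alpha` weighting are all assumptions made for clarity.

```python
import math
from collections import Counter

def lexical_score(query_tokens, segment_tokens, corpus, k1=1.5, b=0.75):
    """Simplified BM25-style score: term frequency with length normalization
    and inverse document frequency computed over the note's segments."""
    avg_len = sum(len(s) for s in corpus) / len(corpus)
    tf = Counter(segment_tokens)
    score = 0.0
    for term in query_tokens:
        df = sum(1 for s in corpus if term in s)
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        score += idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(segment_tokens) / avg_len))
    return score

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query, segments, embed, top_k=10, alpha=0.5):
    """Rank note segments by a weighted mix of lexical and semantic scores,
    keeping the top_k (the paper selects the ten most relevant segments)."""
    q_tokens = query.lower().split()
    tokenized = [s.lower().split() for s in segments]
    q_vec = embed(query)
    scored = []
    for seg, seg_tokens in zip(segments, tokenized):
        lex = lexical_score(q_tokens, seg_tokens, tokenized)
        sem = cosine(q_vec, embed(seg))
        scored.append((alpha * lex + (1 - alpha) * sem, seg))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for _, seg in scored[:top_k]]
```

In the real pipeline the semantic side would come from a medical embedding model; here any function mapping text to a vector will do for experimentation.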
How the pipeline works and why multi-task learning helps
The methodological heart of the paper is a comparison across several approaches to text classification, all grounded in modern transformer-based language models. The baseline methods include Supervised Fine-Tuning (SFT) on a domain-specific model called GatorTron, and In-Context Learning (ICL) using a clinically fine-tuned version of LLaMA-3. A third family, Prompt-Based Learning (PBL), operates by crafting prompts that frame the extraction as a masked-language task, with a verbalizer mapping model outputs to the target label categories. Each of these approaches is tested both with and without the retrieval step, so the authors can quantify the incremental value of intelligent segment selection.
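The verbalizer mechanism in prompt-based learning can be made concrete with a small sketch. Everything here is illustrative rather than taken from the paper: the template wording, the label words, and the attribute shown (cycle regularity) are assumptions; in the actual pipeline the token distribution would come from a masked-language model's head at the [MASK] position.

```python
# Hypothetical prompt template for one attribute (cycle regularity).
TEMPLATE = "{segment} The menstrual cycle is [MASK]."

# Hypothetical verbalizer: each label is tied to vocabulary words whose
# probability at the [MASK] position counts as evidence for that label.
VERBALIZER = {
    "regular": ["regular", "normal"],
    "irregular": ["irregular", "erratic"],
    "not_reported": ["unmentioned", "undocumented"],
}

def verbalize(token_probs, verbalizer=VERBALIZER):
    """Map a masked-LM distribution over vocabulary tokens to class scores
    by summing the probability mass of each label's verbalizer words, then
    return the best label together with the full score dict."""
    scores = {
        label: sum(token_probs.get(word, 0.0) for word in words)
        for label, words in verbalizer.items()
    }
    return max(scores, key=scores.get), scores
```

The appeal of this framing is that the classification head is replaced by the model's existing vocabulary, which is why it tends to work with very few labeled examples.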
Where the study earns its creative leverage is in Multi-Task Prompt-Based Learning (MTPBL). Instead of training separate models for each attribute, MTPBL trains a single model to handle all five tasks in a unified, multi-task setup. The intuition is that these attributes share context within the same notes and even within the same patient narratives. By exposing the model to all tasks in a shuffled order, the approach encourages the model to learn cross-task commonalities and to avoid overfitting to any one label schema. The retrieved segments feed this shared model, which then produces a set of predictions—one per attribute—for each note segment. A single backpropagation step updates the model with a combined loss across tasks, optimizing the system for generalization rather than peak per-task performance in isolation.
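The shuffled multi-task update described above can be sketched schematically. This toy version uses a shared linear model and mean-squared-error losses purely to show the shape of the procedure (shuffle tasks, accumulate per-task gradients, apply one combined update); the real system fine-tunes a transformer with cross-entropy losses over the five attributes.

```python
import random

def predict(w, x):
    """Toy shared model: a dot product of parameters and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def task_grad(w, examples):
    """Gradient of mean squared error for one task's mini-batch."""
    g = [0.0] * len(w)
    for x, y in examples:
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            g[i] += 2 * err * xi / len(examples)
    return g

def multitask_step(w, task_batches, lr=0.05, seed=0):
    """One schematic MTPBL-style update: visit the attribute tasks in a
    shuffled order, sum their gradients against the shared parameters,
    and apply a single update driven by the combined loss."""
    tasks = list(task_batches)
    random.Random(seed).shuffle(tasks)      # shuffled task order, as in the paper
    total = [0.0] * len(w)
    for t in tasks:
        g = task_grad(w, task_batches[t])   # every task shares the same w
        total = [a + b for a, b in zip(total, g)]
    return [wi - lr * gi for wi, gi in zip(w, total)]
```

The point of the single combined update is that no one attribute dominates training; each step nudges the shared parameters toward representations useful across all five tasks.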
In practice, the results are striking. On validation, MTPBL paired with the retrieval step yields the strongest overall performance across four of the five attributes, and it holds up well on the test set. The most dramatic gains appear in flow and regularity: retrieval bumps flow from around 0.64 to 0.90 F1, and regularity from about 0.77 to 0.92. Dysmenorrhea presence also benefits when paired with retrieval in some configurations, though a single-task prompt-based approach without retrieval occasionally edges it out on the test set. The overarching takeaway is not just that a multi-task approach helps, but that coupling it with a retrieval strategy makes a robust difference in the messy real world of clinical text.
In parallel, the authors explore ClinicalLongformer, a long-sequence transformer designed for clinical text. Even here, the retrieval preprocessing proves valuable, driving large gains, especially for dysmenorrhea in the test data. Yet even with these gains, the multi-task GatorTron-based approach with retrieval remains the strongest overall performer among the tested configurations. The comparison matters because it reinforces a practical message: when you’re dealing with long, imperfect data, a strategy that combines strong cross-task learning with deliberate focus on relevant text segments tends to generalize better to new notes and new patients.
Why this matters for health and for research
The paper’s results land in a moment when women’s health data are increasingly recognized as a key lever for improving public health. Menstrual characteristics—how often cycles are regular, how heavy the flow is, whether there is intermenstrual bleeding or painful cramps—have links to cardiometabolic risk, fertility, and quality of life. Yet these attributes tend to be underdocumented in structured medical records. The authors cite literature showing associations between heavy menstrual bleeding and cardiovascular disease, and between dysmenorrhea and risks of ischemic heart disease and stroke. If researchers can reliably extract these variables from the narrative text where clinicians actually discuss them, the field gains a much richer dataset to explore long-standing questions about how menstrual health intersects with other health outcomes across a person’s life course.
From an epidemiological standpoint, the capacity to scale menstrual phenotyping across thousands of notes could transform how we study risk factors, treatments, and health trajectories. The current study is a focused proof of concept, but it points toward a practical future where researchers can assemble larger cohorts without labor-intensive manual annotation. That matters because, as the authors note, certain attributes are simply underreported in notes, such as intermenstrual bleeding. The paper’s explicit acknowledgment of documentation gaps is not a critique but a clarion call: if we want NLP to unlock health insights, clinicians and health systems will need to improve how menstrual information is captured and stored in both notes and structured fields.
The collaboration itself is a sign of where biomedical NLP is headed. It blends high-performance language models with clinical pragmatism, showing that sophisticated AI can thrive in small-data regimes when the problem is well-scoped and the data thoughtfully curated. The Mount Sinai side of the work grounds the study in real-world clinical practice, while the Potsdam team provides a rigorous methodological backbone. The lead author, Anna Shopova, and her co-authors demonstrate a pathway toward tools that could eventually live in research pipelines and even in clinical decision support systems, helping researchers and clinicians understand menstrual health trends with clarity and speed.
Limitations, caveats, and a road ahead
As with any study of its kind, there are important caveats. The dataset is modest in size and drawn from a single medical center, which raises questions about how well the approach will generalize across different patient populations, note-writing styles, and EHR systems. The authors are candid about underdocumentation: certain attributes are rarely mentioned, and even when described, inconsistent wording can confuse a model. This is not a flaw of the models alone but a reflection of the data ecosystem they try to read. If clinicians don’t consistently capture menstrual characteristics in notes, even the most sophisticated NLP system will struggle to extract them reliably from every note.
Another practical challenge is the labor involved in building the prompts, verbalizers, and retrieval queries that drive PBL and MTPBL. The paper openly discusses the manual engineering required to create templates and mappings from tokens to labels. While the authors point toward automated prompt generation and adaptive retrieval as future work, the current setup still requires substantial effort to deploy in new settings. The segmentation strategy—splitting notes by double spaces—also surfaced as a point of fragility. Real-world notes vary widely in structure, and imperfect segmentation can lead to missed information, underscoring the need for more robust text processing that can handle templated forms as well as unstructured narratives.
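The double-space segmentation heuristic mentioned above is simple to sketch, which also makes its fragility easy to see: anything not separated by the expected delimiters ends up fused into one segment. This is a hedged illustration; the paper's exact delimiter handling may differ (treating blank lines as boundaries alongside double spaces is an assumption here).

```python
import re

def segment_note(note):
    """Split a clinical note on runs of two or more spaces or on blank
    lines, mirroring the double-space heuristic; empty fragments are
    dropped and surrounding whitespace is trimmed."""
    parts = re.split(r"(?: {2,}|\n\s*\n)", note)
    return [p.strip() for p in parts if p.strip()]
```

A templated form with single-spaced fields would survive this splitter as one long segment, which is exactly the failure mode the authors flag.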
Finally, beyond the technicalities, there’s a human dimension to the work that deserves emphasis. The goal is not to supplant clinicians’ judgments but to augment them, to turn the wisdom tucked away in narrative notes into scalable data that can inform research and care. That requires careful attention to data privacy, consent, and the risk that automated extraction could misrepresent a patient’s experience if the context is lost. The study implicitly invites a broader conversation about how to balance the speed and reach of AI-assisted data extraction with the nuance and sensitivity that menstrual health, and health in general, deserve.
In the end, the paper offers a pragmatic, encouraging message: when you combine multi-task learning with strategic retrieval, you can extract meaningful, clinically relevant menstrual attributes from notes that were previously too unruly to parse at scale. It’s not a silver bullet, but it’s a purposeful step toward turning a thorny data problem into usable knowledge. And it’s a reminder that even in the clutter of real-world records, there are patterns worth discovering—patterns that can help doctors tailor care, help researchers map risk, and help the broader public understand how menstrual health intersects with overall well-being.
As the authors close, the path forward is clear: test these methods on larger, multi-institutional datasets; continue to refine segmentation and prompt design; and push toward closer integration with clinical workflows. If future work can reduce the manual scaffolding while preserving performance, this approach could become part of routine research pipelines, a quiet, steady amplifier of what the notes already hint at—the intimate story of menstrual health written into the fabric of patient care.
Institutional provenance: The study was conducted by researchers from the Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, in collaboration with Icahn School of Medicine at Mount Sinai, including the Mount Sinai Data Warehouse and the Mount Sinai AI Ready platform. The lead author is Anna Shopova; senior contributors include Christoph Lippert, Leslee J. Shaw, and Eugenia Alleva.
What to take away: This work shows that smart combinations of retrieval, multi-task learning, and prompt-based NLP can reveal meaningful, clinically useful menstrual health attributes from notes that previously felt too tangled to mine. The approach isn’t a final product yet, but it’s a blueprint for turning narrative data into scalable insights—an essential step if we’re to illuminate the quiet but consequential ways menstrual health intersects with overall health across populations.