Can AI Read Doctor’s Notes to Detect Disease?

The notes inside electronic health records resemble a bustling city at noon: patient stories, test results, medication lists, and the quiet whispers of clinicians’ judgments. They’re essential for understanding a patient’s health, but they’re also messy, unstructured, and enormous in volume. That combination has kept researchers from turning those notes into scalable, real-time health signals—until now. A team at the University of Calgary, led by Jie Pan, tested a new approach that marries a powerful language model with real clinical guidance to read through thousands of notes and flag three common conditions central to cardiovascular care: acute myocardial infarction (AMI), diabetes, and hypertension. Their aim wasn’t to replace doctors or ICD codes but to augment surveillance with a flexible, multi-condition lens that can work across hospital systems without manually labelling every case.

What makes this study feel timely and a little audacious is the recipe: push a generative language model to interpret clinical prose, but anchor its thinking with human expertise embedded in prompts and clinical rules. The researchers built a four-part pipeline. First, they preprocess the notes to filter out noise. Second, they pose prompts that steer the AI toward evaluating whether a condition is present. Third, they run the model to infer disease status. Fourth, they apply rule-based checks against clinical guidelines. The result is not a single diagnosis engine but a scalable, human-guided lens that can read the language of care and translate it into disease signals—without requiring a hand-labelled training set for every condition.

In a dataset drawn from Alberta’s CREATE/APPROACH cardiac registry linked with electronic records, the team analyzed 3,088 inpatients admitted in 2015 and 551,095 clinical notes spanning a wide range of document types. They focused on AMI, diabetes, and hypertension—conditions that touch millions of patients and drive urgent, long-term care decisions. The researchers compared the AI-driven detections against clinician-validated diagnoses and a conventional ICD-10 code approach. The headline takeaway isn’t that AI wins every metric, but that a carefully guided, multi-disease analysis can achieve robust sensitivity and useful specificity at scale, with the potential to reshape how public health surveillance and real-time monitoring operate inside hospitals. The work centers on a practical question: can we unlock the silent signals buried in note-laden patient records without drowning in manual labelling? The answer, for now, appears to be yes, with caveats and a path forward that invites broader collaboration.

How the pipeline works in plain language

The core idea is to treat the hospital’s notes as a living text corpus that contains clues about whether a patient truly has a condition, not just mentions of the condition. The researchers designed a pipeline that combines four pieces, each guided by clinical knowledge and human judgment. First comes a lightweight, AI-driven preprocessing that filters out irrelevant documents and focuses on the parts of notes most likely to contain diagnoses, treatments, or test results. This is not a blunt mass-scan; it’s like a smart sieve that aims to keep the signal while discarding the noise. Precision at this stage matters because it makes the subsequent AI reasoning more reliable and less expensive to run.
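To make that sieve a little more concrete, here is a minimal Python sketch of what a keyword-based sentence filter could look like. It is an illustration under assumptions, not the study's preprocessing code: the signal terms, the sentence splitting, and the sample note are all invented for the example.

```python
import re

# Illustrative preprocessing sieve (an assumption, not the authors' implementation):
# keep only sentences that plausibly carry diagnoses, medications, or lab results.
SIGNAL_TERMS = re.compile(
    r"diagnos|glucose|hba1c|blood pressure|troponin|metformin|insulin|hypertens",
    re.IGNORECASE,
)

def filter_note(note_text: str) -> list[str]:
    """Return only the sentences of a note that match a clinical signal term."""
    sentences = re.split(r"(?<=[.!?])\s+", note_text)
    return [s for s in sentences if SIGNAL_TERMS.search(s)]

note = ("Patient resting comfortably. Family visited today. "
        "Random glucose 13.8 mmol/L; continue metformin 500 mg BID.")
print(filter_note(note))  # only the glucose/metformin sentence survives
```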

Second, clinicians help shape prompts that instruct the language model how to think about each condition. Instead of asking the model to spit out a diagnosis in a vacuum, the prompts are anchored in diagnosis criteria, management plans, and laboratory thresholds. For diabetes, for example, the model looks for mentions of elevated glucose values or anti-diabetic medications and then uses a clinical rule to decide if the text supports a diagnosis. The prompts also include instructions to extract key laboratory values (like glucose or blood pressure readings) that can be cross-checked against guidelines. This explicit blend of AI reasoning with human clinical logic is what gives the approach its explainability—the model is expected to point to the evidence in the notes that supports its conclusion, not merely declare a verdict in a black box.
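As a rough illustration of what such a prompt might look like, here is a hypothetical diabetes template in Python. The thresholds reflect standard diagnostic criteria, and the wording is an assumption rather than the study's actual prompt.

```python
# Hypothetical prompt template for diabetes (illustrative only; thresholds reflect
# standard diagnostic criteria, not necessarily the study's exact rules or wording).
DIABETES_PROMPT = """You are reviewing an excerpt from a clinical note.
Decide whether it supports a diagnosis of diabetes mellitus, using these cues:
- a documented diagnosis of type 1 or type 2 diabetes
- anti-diabetic medication (e.g., metformin, insulin)
- fasting glucose >= 7.0 mmol/L, random glucose >= 11.1 mmol/L, or HbA1c >= 6.5%

Answer with one word: Yes, No, or "No mention".
Then list any glucose or HbA1c values you find as key: value pairs.

Excerpt:
{excerpt}
"""

def build_diabetes_prompt(excerpt: str) -> str:
    """Fill the clinician-shaped template with one preprocessed note excerpt."""
    return DIABETES_PROMPT.format(excerpt=excerpt)

print(build_diabetes_prompt("Random glucose 13.8 mmol/L; continue metformin."))
```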

Third, the model performs text inference. The study used Mistral-7B-OpenOrca, a seven-billion-parameter model chosen for its balance of accuracy and computational practicality behind a health-system firewall. The model reads the preprocessed notes and responds with an inference (Yes/No/No mention) about whether a disease is present, along with “key-value” lab results when applicable. This two-stream output—an inference plus a structured extract—lets the pipeline couple natural-language reasoning with objective measurements. The final step is post-processing: a set of clinical rules converts the model’s outputs into a patient-level decision, rolling up individual documents into a single “present” or “absent” label for each condition per patient. In short, the system is a collaboration between a language model and a clinician-curated decision framework rather than a one-shot AI verdict.
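Below is a hedged sketch of that roll-up in Python. The response format, the parsing, and the glucose rule are assumptions made for illustration rather than the study's actual post-processing logic.

```python
# Assumed post-processing sketch: parse per-document model outputs and roll them
# up to one patient-level label. The output format and rules are illustrative only.

def parse_output(raw: str) -> dict:
    """Split a model response into an inference plus any extracted lab values."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    inference = lines[0] if lines else "No mention"
    labs = dict(line.split(":", 1) for line in lines[1:] if ":" in line)
    return {"inference": inference,
            "labs": {k.strip().lower(): v.strip() for k, v in labs.items()}}

def patient_label(doc_outputs: list[str], glucose_threshold: float = 11.1) -> str:
    """Simple rule: any 'Yes', or any glucose above threshold, marks the disease present."""
    for raw in doc_outputs:
        parsed = parse_output(raw)
        if parsed["inference"].lower().startswith("yes"):
            return "present"
        value = parsed["labs"].get("glucose")
        if value and float(value.split()[0]) >= glucose_threshold:
            return "present"
    return "absent"

outputs = ["No mention", "Yes\nGlucose: 13.8 mmol/L"]
print(patient_label(outputs))  # -> present
```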

Fourth, and perhaps most important for real-world use, the pipeline emphasizes explainability. The researchers showed that the model can highlight the exact sentences in the notes that support a positive diagnosis, creating a transparent trail from evidence to conclusion. They also explored different prompts and a few practical design choices—like a document-type filtering threshold (they used a 25th percentile cutoff in one experiment) to balance the amount of text fed to the AI with the desire to keep true positives intact. The upshot is a system that can justify its conclusions with concrete textual evidence, a feature health-care teams have long asked for when AI systems are employed in decision support or surveillance contexts.
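The percentile cutoff is easiest to picture with a toy example. The sketch below shows one plausible way such a document-type filter could work, scoring each type by an assumed relevance measure and keeping the types above the 25th percentile; it is an interpretation for illustration, not a reconstruction of the authors' method.

```python
import numpy as np

# One plausible (assumed) reading of a 25th-percentile document-type filter:
# score each document type by some relevance measure, then keep only the types
# whose score sits at or above the 25th percentile of all scores.
relevance_by_type = {
    "discharge summary": 0.92,   # invented scores for illustration
    "cardiology consult": 0.88,
    "progress note": 0.74,
    "nursing note": 0.35,
    "nutrition note": 0.12,
}

cutoff = np.percentile(list(relevance_by_type.values()), 25)
kept_types = [t for t, score in relevance_by_type.items() if score >= cutoff]
print(f"cutoff = {cutoff:.2f}; kept: {kept_types}")
```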

Why this matters for public health and hospital practice

The most immediate promise of this approach is real-time, multi-disease surveillance that scales with data volume. Traditional methods often rely on administrative codes or curated labels, which can lag behind the movement of real patients through a health system and may miss subtler presentations or comorbidities. The Calgary pipeline, by contrast, is designed to run on unlabelled EHR notes and still deliver timely signals about disease status. That’s a big difference when epidemiologists want to track how AMI, diabetes, or hypertension trends shift month to month across a population. It also matters for hospital quality and performance monitoring, where timely detection and reporting can influence everything from staffing to resource allocation to patient safety initiatives.

On the performance side, the study presents a nuanced picture. For diabetes, the pipeline achieved high sensitivity (about 91%) and strong specificity (around 86%), along with a reassuring negative predictive value. In plain terms: it was good at catching true diabetes cases and reasonably conservative about false alarms. AMI and hypertension were more uneven. AMI showed solid sensitivity (around 88%) but weaker specificity (roughly 63%), meaning a sizeable share of patients without AMI were still flagged. Hypertension stood out with very high sensitivity (about 94%) but surprisingly low specificity (roughly 32%), a pattern that reflects how hard it is to interpret blood pressure measurements and their documentation for a chronic condition in real-world notes. The researchers also ran comparisons against ICD-10-based detection, noting that the LLM-based approach often offered higher sensitivity and negative predictive value, which is valuable when you want to minimize missed true cases in surveillance or screening contexts.
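For readers who want to connect those percentages back to patient counts, the small sketch below computes sensitivity, specificity, and negative predictive value from a confusion matrix. The counts are invented for illustration; they are not the study's data.

```python
# Standard surveillance metrics from a confusion matrix:
# sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), NPV = TN / (TN + FN).
def surveillance_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
    }

# Invented counts for a diabetes-like scenario: few missed cases, moderate false alarms.
print(surveillance_metrics(tp=455, fp=350, tn=2150, fn=45))
# -> sensitivity ~0.91, specificity ~0.86, npv ~0.98
```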

Another practical takeaway is the trade-off between different configurations. A two-prompt or “merged” approach—combining results from multiple prompts and cross-checking with ICD-10 data—tended to lift sensitivity, sometimes at the cost of specificity. In disease surveillance, that trade-off can be a feature, not a bug: in settings where missing a true case is particularly costly (for example, initiating a cascade of preventive measures for high-risk patients), higher sensitivity with a tolerable number of false positives can be desirable. The Calgary study suggests that public health teams could tune such a pipeline to balance the cost of follow-up testing against the benefit of catching more true cases, and that the same framework could be expanded to additional conditions beyond AMI, diabetes, and hypertension.
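The merged configuration is easy to express as a rule. A minimal sketch, assuming a simple union of sources, is below: flag the patient if any prompt or the ICD-10 code says yes, which is exactly why sensitivity tends to rise while specificity slips.

```python
# Assumed "merged" rule: a patient is flagged when any source says yes.
# Taking the union gives more chances to catch a true case (higher sensitivity)
# and more chances to be fooled (lower specificity).
def merged_flag(prompt_a: bool, prompt_b: bool, icd10: bool) -> bool:
    return prompt_a or prompt_b or icd10

# A patient missed by ICD-10 coding but caught by one prompt is still flagged.
print(merged_flag(prompt_a=False, prompt_b=True, icd10=False))  # True
```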

Crucially, the study demonstrates a practical, privacy-conscious path to scaling AI in health systems. The pipeline operates within a secure, institution-controlled environment and does not require distributing raw patient notes to external cloud services. That alignment with data privacy and local compute resources matters as health systems wrestle with regulatory constraints and the cultural shift needed to adopt AI tools responsibly. It’s not a sleek sci-fi vision of AI replacing clinicians; it’s a collaborative construct that respects clinical judgment, patient privacy, and the real-world constraints of hospitals.

What surprised the researchers and what’s next

Several findings stood out in ways that challenge simple narratives about AI in medicine. One surprise was the balance between prompt complexity and performance. The team experimented with very detailed prompts that wove clinical guidelines directly into the instruction, but a more moderate prompt, one that pointed the model to the essential evidence with a few well-chosen instructions, often yielded better detection. In other words, more doesn’t always mean better when you’re asking an LLM to reason about nuanced medical text. This insight matters for practitioners who worry about prompt engineering turning into a brittle art; it suggests that the most robust pipelines may emerge from a combination of strategic prompts and rigorous clinical rules, rather than ever more elaborate language-model prompts alone.

Another striking detail is the value of preprocessing and compression. The researchers reported that, on average, the preprocessing steps trimmed about 75% of words from the notes before the model read them, without sacrificing essential information. That isn’t just a happy side effect; it’s a practical optimization that makes scalable deployment feasible in hospital IT environments where compute cycles and costs matter. The ability to derive meaningful signals from hundreds of thousands of notes with a fraction of the text is a reminder that good data hygiene—identifying relevant document types and sentences—can dramatically improve AI performance in the real world.
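The arithmetic behind that figure is simple bookkeeping; here is a minimal sketch with invented word counts:

```python
# Track how much text the preprocessing removes before the model ever sees it.
# The word counts here are invented for illustration, not the study's numbers.
def word_reduction(original_words: int, kept_words: int) -> float:
    """Fraction of words trimmed away by preprocessing."""
    return 1 - kept_words / original_words

print(f"{word_reduction(original_words=1_200_000, kept_words=300_000):.0%} of words trimmed")
```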

Explainability isn’t an afterthought here either. As noted above, the researchers showed the model can highlight the exact sentences it used to justify a positive diagnosis, providing a transparent link from evidence to conclusion. This kind of “textual justification” is a rare but powerful feature for clinical settings, where teams need to understand why an AI flagged a case and decide whether to trust or challenge the result. The study’s authors argue that this kind of traceability is essential for responsible adoption, especially when AI systems operate within a health system’s firewall and must earn clinicians’ and patients’ trust.

Despite these strengths, the authors are careful about the limits. The pipeline is validated on a cardiac cohort from Calgary and would benefit from external testing across different regions and health systems. False positives for certain conditions were more common, and some discrepancies arose from how reference labels were captured in the registry. The authors also acknowledge that prompts won’t fix fundamental gaps in model reasoning or clinical knowledge, and that larger, more capable models could offer improvements but also come with greater computational demands and privacy considerations. The path forward, then, is not a single silver bullet but a collaborative, iterative process: refine prompts, test across diverse datasets, and weave in additional clinical insights as the models scale to more diseases and settings.

So what does the near future look like? If institutions embrace this hybrid approach—human-guided prompts, rule-based post-processing, and secure, on-site inference—then we could see real-time, multi-condition surveillance becoming routine in hospitals. Think of it as extending the clinician’s observational reach, not replacing it: a scalable, AI-assisted lens that helps public health officials watch trends, identify emerging clusters, and allocate resources more responsively. The Calgary study offers a blueprint for that path, one that respects patient privacy, centers clinical expertise, and invites collaboration across researchers, clinicians, and data scientists to broaden the scope beyond a handful of conditions.

In the end, the study reminds us of a core truth about modern AI in medicine: the most powerful systems may not be those that claim to see everything on their own, but those that thoughtfully pair machine-scale pattern recognition with human-guided reasoning. When you give the AI the right prompts, the right checks, and the right guardrails, you don’t just compress the noise—you amplify the value of the thousands of clinician notes that have already shaped patient care for decades. The question isn’t whether AI can read hospital notes; it’s whether we can build systems that read with the care, accountability, and context that patients deserve. This Calgary project suggests we can, with careful design, layered expertise, and a willingness to bridge disciplines in the service of public health.