A new study shows that off-the-shelf language models can reconstruct complex patient histories from messy clinical notes without manual human labeling.
How do we track a disease that develops over decades? In chronic liver disease, the crucial clues are buried deep inside thousands of unstructured clinical notes. Standard algorithms fail to read these messy narratives, leaving valuable real-world evidence trapped in hospital databases.
This bottleneck forces researchers to manually read and label thousands of charts. It is slow, expensive, and limits the scale of clinical trials. A new preprint challenges the assumption that we need highly specialized, expensive medical AI to solve this. Instead, the researchers proved that general-purpose, open-source models can reconstruct complex patient histories with remarkable precision.
The power of constraints
To test this, the team built a calibration dataset of 507 clinical reports from 30 liver transplant patients. This set included 414 radiology scans, 65 pathology reports, and 28 transplant assessments. They ran these documents through four open-source models across 73 distinct data extraction tasks.
The researchers compared constrained decoding—using software to force the AI to output data in a specific format—against standard prompting across 5,590 prompt-output pairs. Constrained decoding achieved a near-perfect 99.9% format adherence. Standard prompting failed frequently, managing only 87.4%. This proves that raw intelligence matters less than putting strict guardrails on how the AI outputs its answers.
General models beat medical AI
The performance breakdown reveals a surprising trend for the future of clinical AI.
- Llama 3.3 70B was the top performer, achieving over 90% accuracy on 59 out of 73 tasks.
- It easily beat Open-BioLLM 70B, a model specifically fine-tuned on medical literature.
- It also outperformed smaller models like Llama 3.1 8B and DeepSeek R1 8B.
This suggests that general reasoning ability is more important than specialized medical pre-training. It aligns with broader research in medical data processing, which shows that structured prompting often outweighs domain-specific tuning.
Scaling to real patients
The team scaled the finalized pipeline to analyze 22,493 reports from 835 patients over a ten-year period. Without any human annotation, the AI extracted 37,125 datapoints across 45 clinical variables.
This structured dataset successfully mapped out longitudinal disease timelines. It confirmed established liver cancer risk factors, including male sex, smoking, diabetes, and viral hepatitis.
However, the study has limitations. The pipeline was developed on data from a single medical center. Hospital formatting varies widely, and the model might struggle with notes from other institutions.
Even so, the implications are clear. We do not need proprietary, closed-source models to parse complex medical histories. Open-source models, when properly constrained, can turn messy clinical text into structured research databases.
Read the full study in medRxiv.
