🧑🏼‍💻 Research - June 17, 2026

AI Survival Models Fail to Beat 1995 Formula

A new reanalysis reveals that highly praised machine learning survival models underperform both human doctors and a thirty-year-old statistical formula at predicting patient death at critical clinical milestones.

Why do we optimize clinical AI for abstract rankings when doctors must make decisions at specific time horizons? A physician deciding on intensive care needs to know whether a patient will survive the next 60 days, not where that patient ranks within a cohort over its entire follow-up. This mismatch is a quiet crisis in clinical machine learning.

For years, computer scientists have chased high concordance indexes to prove their survival models work. This new study suggests that optimizing for ranking across an entire timeline degrades performance at the exact moments that matter most to clinicians. It challenges the assumption that newer, more complex algorithms are inherently superior to simpler, time-tested tools.

The Bedside Performance Gap

Researchers reanalyzed the SUPPORT2 cohort of 9,105 critically ill adults across five United States centers. They used a stratified 70/15/15 split to compare a modern gradient-boosted survival model against two baselines: the attending physician’s prognosis and the original 1995 SUPPORT logistic regression model. The modern survival model achieved a respectable overall ranking concordance of 0.705. Yet, when evaluated at the critical 60-day mark, it stumbled.
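
As a rough illustration of a stratified 70/15/15 split, here is a minimal sketch in plain Python. The article does not say which variable the researchers stratified on, so stratifying on the binary 60-day outcome is an assumption, and the function name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists that preserve label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)  # shuffle within each stratum only
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical toy labels (assumption): 1 = died within 60 days, 0 = survived.
labels = [1] * 40 + [0] * 60
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Each partition keeps the same 40% event rate as the full toy cohort, which is the point of stratifying before comparing models on a held-out set.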

The gradient-boosted model managed an area under the ROC curve (AUROC) of just 0.750. Meanwhile, human physicians scored 0.808 on the matched sample, and the 1995 model reached 0.827. This performance gap remained stable across eight independent data splits and was not simply a calibration error.
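
To see how one set of risk scores can earn a passable concordance index and still discriminate poorly at a fixed horizon, consider the following sketch. It implements Harrell's concordance index and a pairwise AUROC in plain Python. The six-patient cohort and its risk scores are invented for illustration and have nothing to do with SUPPORT2.

```python
def harrell_c_index(times, events, risks):
    """Share of comparable pairs in which the patient who died first has the higher risk."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed death before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied scores count half
    return concordant / comparable

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy cohort (assumption): follow-up in days, event flag
# (1 = death observed, 0 = censored), and a risk score where higher is worse.
times = [10, 30, 50, 200, 400, 800]
events = [1, 1, 1, 1, 1, 0]
risks = [0.5, 0.4, 0.3, 0.9, 0.8, 0.1]

HORIZON_DAYS = 60
# Nobody in this toy cohort is censored before day 60, so every label is known.
died_by_horizon = [1 if (e == 1 and t <= HORIZON_DAYS) else 0
                   for t, e in zip(times, events)]

c_index = harrell_c_index(times, events, risks)
auroc_60 = auroc(died_by_horizon, risks)
print(f"C-index: {c_index:.3f}, 60-day AUROC: {auroc_60:.3f}")
```

On this toy cohort the sketch prints a C-index of 0.600 and a 60-day AUROC of 0.333. The scores order many pairs correctly across the whole timeline, yet they place the three earliest deaths below two patients who survive past day 60, which is exactly the comparison a fixed-horizon AUROC penalizes.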

The Math Mismatch

Advanced neural networks, deep ranking models, and discrete-time models all failed to close the gap, which points away from model capacity and toward the training objective itself. When researchers replaced the ranking objective with timepoint-matched binary training, that is, training on the binary outcome at the evaluation horizon, they recovered roughly half of the performance gap. This suggests that optimizing for overall survival ranking costs the model accuracy at specific clinical horizons, although the objective does not account for the entire shortfall. The result echoes broader challenges in survival modeling, where complex neural architectures often struggle to outperform simpler baselines, as seen in dementia prediction research and in survival analysis on microarray datasets.
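
The idea of timepoint-matched binary training can be sketched as follows: turn each (time, event) pair into a yes/no label at the 60-day horizon, then fit an ordinary classifier on those labels. This is a minimal illustration and not the study's pipeline. The single `severity` feature, the toy cohort, and the choice to drop patients censored before the horizon are all assumptions. Dropping such patients is the simplest option, and inverse-probability-of-censoring weighting is a common alternative.

```python
import math

HORIZON_DAYS = 60

def binary_targets(times, events, horizon=HORIZON_DAYS):
    """Keep patients whose status at the horizon is known; label death by then as 1."""
    kept, labels = [], []
    for idx, (t, e) in enumerate(zip(times, events)):
        if e == 1 and t <= horizon:
            kept.append(idx)
            labels.append(1)
        elif t > horizon:
            kept.append(idx)
            labels.append(0)
        # Censored at or before the horizon: outcome unknown, so skipped here.
    return kept, labels

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """One-feature logistic regression fitted by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict_risk(x, w, b):
    """Predicted probability of death by the horizon."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# Hypothetical toy cohort (assumption): days of follow-up, event flag, one feature.
times = [12, 45, 30, 90, 400, 700, 20]
events = [1, 1, 0, 1, 1, 0, 1]
severity = [2.0, 1.5, 1.0, 0.5, -1.0, -1.5, 1.8]

kept, labels = binary_targets(times, events)
w, b = fit_logistic([severity[i] for i in kept], labels)
print("kept:", kept, "labels:", labels)
print("risk at severity 2.0:", round(predict_risk(2.0, w, b), 3))
```

The patient censored at day 30 is excluded because their 60-day status is unknown, while the patient who died on day 90 counts as a survivor at the horizon. The loss now rewards only the distinction that is later scored by the 60-day AUROC.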

The study also exposes crucial limitations for both humans and machines. The physician advantage held only for patients on whom the doctor actually chose to provide an estimate. Meanwhile, the AI showed severe vulnerabilities. While discrimination was equitable across sex, race, and age, the machine learning models failed catastrophically during leave-one-disease-out validation. If a disease group was missing from the training data, the AI could not generalize to it.
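
Leave-one-disease-out validation holds out every patient from one disease group, trains on the rest, and tests on the held-out group. A minimal sketch of the splitting step, assuming each patient carries a single group label, might look like this. The function name and the placeholder group labels "A", "B", and "C" are hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test

# Hypothetical disease-group label per patient (assumption, not SUPPORT2 categories).
disease_groups = ["A", "B", "A", "C", "B", "C", "A"]

folds = list(leave_one_group_out(disease_groups))
for name, train_idx, test_idx in folds:
    print(f"held out {name}: train on {train_idx}, test on {test_idx}")
```

Because the model never sees the held-out group during training, a sharp drop in a fold's score indicates that the model relies on disease-specific patterns and does not transfer to unseen conditions.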

Key Evaluation Metrics

  • The gradient-boosted model scored a 0.750 AUROC at 60 days, compared to 0.827 for the 1995 model.
  • Physicians outperformed the modern AI at 60 days with an AUROC of 0.808.
  • Replacing the ranking objective with timepoint-matched training recovered roughly half the performance gap.
  • Severe performance drops occurred when the AI faced disease groups absent from training.

This finding challenges the industry’s obsession with general concordance indexes. If an AI cannot beat a 1995 logistic regression model at predicting 60-day mortality, it has no business guiding bedside decisions. Developers must stop training models to rank patients and start training them to answer the specific, time-bound questions that doctors actually ask.

Read the full preprint on medRxiv.
