A new machine learning model can help hospitals predict which heart surgery patients will get stuck in the ICU, but its performance drop in external testing highlights a persistent hurdle for clinical AI.
When a patient undergoes heart bypass surgery, the first 24 hours in the ICU are a logistical and clinical pressure cooker. Doctors must decide who needs extended monitoring beyond 72 hours and who can safely step down. Making the wrong call wastes scarce ICU beds or puts patients at risk.
A new study attempts to solve this with an interpretable machine learning model. But its performance drop when moving from one hospital database to another reveals a hard truth. Algorithms trained on clean, single-center data often struggle in the messy real world.
The Performance Gap
Researchers pulled data from two massive databases to build and test their tool. They used 6,919 adult coronary artery bypass grafting (CABG) patients from the MIMIC-IV 3.1 database for development, splitting the data in a 7:3 ratio. For external validation, they tested the model on 5,972 patients from the eICU-CRD 2.0 database.
The model, built with the CatBoost gradient-boosting algorithm, uses just eight bedside features collected during the first 24 hours: 24-hour fluid intake, Charlson Comorbidity Index (CCI), Sequential Organ Failure Assessment (SOFA) score, Simplified Acute Physiology Score II (SAPS-II), Glasgow Coma Scale (GCS), vasopressor use, congestive heart failure, and atrial fibrillation. It aims to predict a prolonged ICU stay of more than 3 days (72 hours).
The model performed well on its home turf but degraded when tested externally:
- Internal test set (n = 2,076): Area under the receiver-operating-characteristic curve (AUC) of 0.7739 (95% CI 0.7379–0.8099), calibration slope of 0.973, and Hosmer-Lemeshow p-value of 0.224.
- External validation cohort (n = 5,972): AUC dropped to 0.6452 (95% CI 0.6311–0.6602), with a calibration slope of 0.998 and an Integrated Calibration Index (ICI) of 0.023.
- At a threshold of 0.30: Sensitivity was 0.55, specificity was 0.65, positive predictive value was 0.40, and negative predictive value was 0.77.
This drop in discrimination from 0.77 to 0.65 is the real story here. The model remains well calibrated, meaning its predicted probabilities align well with actual outcomes, but its ability to distinguish between patients who will stay and those who will leave is modest at best in new environments.
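To make the discrimination numbers concrete: an AUC can be read as the probability that the model gives a randomly chosen prolonged-stay patient a higher risk score than a randomly chosen short-stay patient. The minimal Python sketch below computes it that way. The function name `pairwise_auc` and the toy scores are illustrative assumptions, not data or code from the study.

```python
from itertools import product


def pairwise_auc(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A pair counts as 1 when the positive case has the higher score,
    and as 0.5 when the two scores are tied.
    """
    wins = 0.0
    for pos, neg in product(pos_scores, neg_scores):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted risks, invented purely for illustration.
prolonged_stay = [0.9, 0.6, 0.4]       # patients who stayed more than 3 days
short_stay = [0.7, 0.5, 0.3, 0.2]      # patients who left within 3 days

toy_auc = pairwise_auc(prolonged_stay, short_stay)
print(f"toy AUC: {toy_auc:.2f}")
```

On this toy data the prolonged-stay patient outranks the short-stay patient in 9 of 12 pairs, giving 0.75. Read the same way, the external AUC of 0.6452 means the model orders such a pair correctly about 65% of the time, where 50% is a coin flip.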
Why This Matters
We need to look at what actually drives these predictions. The top three features identified by SHapley Additive exPlanations (SHAP) were 24-hour fluid intake, CCI, and atrial fibrillation. This aligns with broader cardiovascular research.
For instance, post-operative complications like arrhythmias frequently dictate recovery trajectories, a reality also seen in studies on permanent pacemaker implantation risk after complex cardiac procedures. Similarly, managing multi-morbid patients requires balancing complex risk factors, as explored in comparative analyses of double valve replacement outcomes.
By focusing on just eight easily accessible bedside variables and explaining them with SHAP, the model sidesteps much of the “black box” problem. Clinicians can see which inputs push a patient's risk score up or down. However, the modest sensitivity of 0.55 means the tool will miss 45% of the patients who actually require prolonged stays.
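It works better as a rule-out aid than as an alarm. With a negative predictive value of 0.77, roughly three in four patients it labels low-risk do leave the ICU within 3 days, though nearly one in four still stay longer.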
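The reported sensitivity, specificity, and positive predictive value together pin down how common prolonged stays must be in the evaluated cohort, which helps put the 0.77 figure in context. The sketch below backs that prevalence out with Bayes' rule and turns the metrics into counts per 1,000 patients. The function names are illustrative, and the prevalence is an inference from rounded published metrics, not a number stated in the article.

```python
# Metrics reported in the article at a probability threshold of 0.30.
SENSITIVITY = 0.55
SPECIFICITY = 0.65
PPV = 0.40


def implied_prevalence(sens, spec, ppv):
    """Solve PPV = sens*p / (sens*p + (1 - spec)*(1 - p)) for prevalence p."""
    false_pos_rate = 1.0 - spec
    return ppv * false_pos_rate / (sens * (1.0 - ppv) + ppv * false_pos_rate)


def npv(sens, spec, prev):
    """Negative predictive value implied by sensitivity, specificity, prevalence."""
    true_neg = spec * (1.0 - prev)
    false_neg = (1.0 - sens) * prev
    return true_neg / (true_neg + false_neg)


def counts_per_1000(sens, spec, prev):
    """Expected confusion-matrix counts for 1,000 patients."""
    positives = 1000.0 * prev
    negatives = 1000.0 - positives
    return {
        "true_pos": sens * positives,
        "false_neg": (1.0 - sens) * positives,
        "false_pos": (1.0 - spec) * negatives,
        "true_neg": spec * negatives,
    }


prevalence = implied_prevalence(SENSITIVITY, SPECIFICITY, PPV)
implied_npv = npv(SENSITIVITY, SPECIFICITY, prevalence)
counts = counts_per_1000(SENSITIVITY, SPECIFICITY, prevalence)

print(f"implied prevalence: {prevalence:.3f}")
print(f"implied NPV: {implied_npv:.3f}")
for name, value in counts.items():
    print(f"{name}: {value:.0f}")
```

Under these assumptions the implied prevalence comes out close to 30%, and the implied negative predictive value lands near the reported 0.77, so the published figures are internally consistent. It also means that labeling every patient low-risk would already be right about 70% of the time, and that roughly 134 of about 298 prolonged-stay patients per 1,000 would be missed at this threshold. Because the inputs are rounded to two decimals, treat these counts as approximate.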
The Path Forward
Hospital administrators should not expect a plug-and-play solution. The positive net benefit shown in Decision Curve Analysis across thresholds of 0.20–0.40 suggests the tool adds value, but prospective validation is still missing. Until we see how clinicians interact with these risk scores in real time, the algorithm remains a promising calculator rather than an active clinical partner.
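For readers unfamiliar with Decision Curve Analysis, net benefit weighs true positives against false positives, with each false positive discounted by the odds of the chosen threshold. The sketch below evaluates that standard formula at the single threshold of 0.30, the only point for which the article reports sensitivity and specificity. The prevalence of 0.30 is an assumption (roughly what the reported positive predictive value implies), and the function names are illustrative. It does not reproduce the study's full curve.

```python
def net_benefit(sens, spec, prev, threshold):
    """Net benefit per patient: TP rate minus FP rate weighted by threshold odds."""
    true_pos_rate = sens * prev
    false_pos_rate = (1.0 - spec) * (1.0 - prev)
    odds = threshold / (1.0 - threshold)
    return true_pos_rate - false_pos_rate * odds


def net_benefit_treat_all(prev, threshold):
    """Net benefit of treating every patient as high-risk."""
    odds = threshold / (1.0 - threshold)
    return prev - (1.0 - prev) * odds


THRESHOLD = 0.30
ASSUMED_PREVALENCE = 0.30  # assumption, not a figure reported in the article

nb_model = net_benefit(0.55, 0.65, ASSUMED_PREVALENCE, THRESHOLD)
nb_all = net_benefit_treat_all(ASSUMED_PREVALENCE, THRESHOLD)
nb_none = 0.0  # flagging nobody yields no true or false positives

print(f"model:      {nb_model:.3f}")
print(f"treat all:  {nb_all:.3f}")
print(f"treat none: {nb_none:.3f}")
```

With these inputs the model's net benefit is about 0.06, while both default strategies sit at essentially zero. That is consistent with the positive net benefit the article describes, but only at this one threshold and only under the assumed prevalence.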
Read the full study in BMC Medical Informatics and Decision Making.
