Standard AI matches foundation models in clinical tests
A new benchmark shows that expensive tabular foundation models offer almost no performance advantage over classic machine learning for predicting patient outcomes.
Why are health systems rushing to adopt massive, power-hungry foundation models when simpler algorithms can do the same job? A new study suggests the hype around tabular foundation models in medicine is running ahead of the evidence. For routine clinical predictions on structured data, older and simpler machine learning methods performed about as well.
This finding challenges the industry assumption that bigger AI is always better. It suggests that for structured patient data, established methods already capture most of the available predictive signal. Clinical decision-making does not need a massive neural network to estimate survival or disease status when a well-tuned random forest can do it in a fraction of the time. The industry has fallen in love with the promise of universal models, but this benchmark is a bucket of cold water.
The clinical benchmark
Researchers set up a rigorous benchmark to compare a leading tabular foundation model, TabPFN, against 12 established machine learning methods. They tested the models across 12 binary clinical tasks using patient cohorts ranging from 788 to 139,528 individuals. The tasks covered critical clinical outcomes, including survival, metastasis, and disease status.
By using standardized preprocessing and bootstrapping, the researchers put every model on the same footing and could estimate the uncertainty in each comparison. The results were remarkably flat. The expensive foundation model showed no meaningful advantage over classic, optimized algorithms, which suggests that added complexity does not translate into better predictions on this kind of data.
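To make the bootstrapping idea concrete, here is a minimal Python sketch of a paired bootstrap comparison between two models' AUROC on the same patients. It is an illustration under stated assumptions, not the study's code: the function names, the 1,000 resamples, the percentile interval, and the synthetic demo data are all choices made for this example.

```python
import random


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation.

    Assumes labels are 0/1 integers. Tied scores share an average rank.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def bootstrap_auroc_diff(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Approximate 95% percentile interval for AUROC(a) - AUROC(b).

    Both models are scored on the same resampled patients (paired bootstrap).
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[k] for k in idx]
        if 0 < sum(y) < n:  # skip resamples that contain only one class
            a = auroc(y, [scores_a[k] for k in idx])
            b = auroc(y, [scores_b[k] for k in idx])
            diffs.append(a - b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


if __name__ == "__main__":
    # Synthetic demo data, not patient data: two noisy scores for one outcome.
    rng = random.Random(42)
    y = [rng.randint(0, 1) for _ in range(300)]
    model_a = [label + rng.gauss(0, 1.0) for label in y]
    model_b = [label + rng.gauss(0, 1.0) for label in y]
    print(bootstrap_auroc_diff(y, model_a, model_b))
```

If the resulting interval sits close to zero, the two models are hard to tell apart on that cohort, which is the kind of flat result the benchmark describes.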
The performance gap
The data reveals that the new model is mostly a lateral move, not a step forward. Here is how the performance broke down:
- TabPFN outperformed the best classic machine learning model in only 16.7% of the clinical tasks (2 of 12).
- Most differences in discrimination, measured by AUROC (area under the receiver operating characteristic curve), fell within a negligible margin of ±0.02.
- The foundation model required a median runtime 5.5 times longer than traditional machine learning.
- Practical deployment of the foundation model relied heavily on expensive GPU acceleration.
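The first two figures in this list are simple arithmetic over per-task AUROC values. The Python sketch below shows one way such a summary could be computed. The function name `summarize` is hypothetical, the six AUROC pairs are placeholders invented for illustration (they are not the study's numbers), and counting any positive difference as a "win" is an assumption, since the study may have applied a stricter statistical criterion.

```python
def summarize(new_auroc, best_classic_auroc, margin=0.02):
    """Share of tasks the new model wins, and share within +/- margin.

    Assumption: a "win" is any positive AUROC difference.
    """
    diffs = [n - c for n, c in zip(new_auroc, best_classic_auroc)]
    total = len(diffs)
    wins = sum(d > 0 for d in diffs)
    within = sum(abs(d) <= margin for d in diffs)
    return {
        "win_rate_pct": round(100 * wins / total, 1),
        "within_margin_pct": round(100 * within / total, 1),
    }


# Placeholder values for six imaginary tasks, NOT the study's results.
foundation = [0.81, 0.74, 0.90, 0.66, 0.85, 0.70]
classic = [0.80, 0.75, 0.91, 0.70, 0.86, 0.71]

print(summarize(foundation, classic))
```

With these placeholders the new model wins one task in six, the same proportion as two in twelve, which rounds to 16.7%.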
This lack of clear superiority is not an isolated finding. Other recent analyses, such as a study on perioperative classification and risk prediction, similarly show that tabular foundation models face severe practical and computational limitations in surgical settings. The trade-off between speed and accuracy simply does not favor the newer models.
Why this matters
This finding matters for clinical IT departments weighing costly hardware upgrades. If a hospital can get comparable predictive performance for cancer metastasis using a standard CPU running XGBoost as it can using a power-hungry GPU cluster, the choice is obvious. Choosing the heavier model wastes energy, money, and valuable processing time. It also adds unnecessary technical debt to hospital systems that are already struggling with engineering overhead.
However, we must be precise about these limits. This study only evaluated structured, tabular data. Foundation models still hold immense value for unstructured data like clinical notes or medical imaging, where traditional machine learning struggles. But for the structured spreadsheets that make up the bulk of electronic health records, the old tools remain king.
This analysis is based on research published in BMC Medical Informatics and Decision Making.
