Quick Summary
This study explores how data augmentation techniques affect the performance of machine learning models for cardiovascular disease (CVD) prediction. Notably, the SMOTE-augmented model achieved an accuracy and AUC of 1.0 on the test set, and augmentation substantially changed the feature importance hierarchy.
Key Details
- Dataset: Public CVD dataset
- Technology: Extreme Gradient Boosting (XGBoost)
- Models compared: Baseline, SMOTE-augmented, and WGAN-GP-augmented
- Performance metrics: Accuracy, F1-score, AUC
Key Takeaways
- Data augmentation can significantly improve model performance in CVD prediction.
- The SMOTE-augmented model achieved an accuracy and AUC of 1.0 on the test set.
- Feature importance was notably altered by data augmentation strategies.
- ‘Slope’ emerged as the most critical predictor in augmented models.
- For high-quality datasets, the main effect of augmentation may be to re-prioritize predictive features rather than to improve accuracy.
- Understanding feature importance is crucial for clinical applications of machine learning models.
- This research highlights the need for careful evaluation of synthetic data impacts on model interpretability.

Background
Cardiovascular disease (CVD) remains a leading cause of morbidity and mortality worldwide. Machine learning models have emerged as powerful tools for predicting CVD, yet their effectiveness is often hampered by limited dataset sizes and class imbalances. Data augmentation techniques, such as SMOTE and generative adversarial networks, offer potential solutions to these challenges, but their effects on model interpretability and feature importance are not well understood.
Study
The study conducted an ablation analysis using a public CVD dataset to investigate the impact of various data augmentation strategies on the performance and feature importance of XGBoost models. Three models were developed: a baseline model trained on original data, a model augmented with SMOTE, and another using a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP).
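To make the SMOTE step concrete, here is a minimal sketch of the core idea: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is an illustration only and not the study's implementation, which this summary does not describe. The function name `smote`, its parameters, and the toy data are all assumptions, and the sketch picks base points at random rather than looping over every minority sample.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours.

    Simplified illustration of SMOTE; names and defaults are hypothetical.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # the new point lies on the segment between base and neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority-class samples (made up, not from the CVD dataset)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point is an interpolation of existing minority samples, it always stays inside the region those samples already span. That is one reason oversampling can change which features a tree model splits on without adding new information.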
Results
All models demonstrated high predictive performance on an independent test set. The SMOTE-augmented model achieved a remarkable accuracy and AUC of 1.0. In terms of feature importance, the baseline model identified ‘oldpeak’ and ‘slope’ as top predictors, while ‘slope’ became the dominant feature in both the SMOTE and WGAN-GP models, with Gains of 27.49 and 36.68, respectively.
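For readers unfamiliar with the Gain metric, the sketch below assumes XGBoost's standard definition, where a feature's gain importance is the average loss reduction over all tree splits that use the feature. This summary does not say exactly how the study extracted its Gain values, so that is an assumption. The helper names `split_gain` and `average_gain` and all numbers are hypothetical. The per-split formula is the usual second-order one, with `lam` as the L2 regularisation term and `gamma` as the split penalty.

```python
from collections import defaultdict


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Loss reduction of one split from the gradient (g) and hessian (h) sums of its children."""
    def score(g, h):
        return g * g / (h + lam)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def average_gain(splits):
    """Average gain per feature, given (feature_name, gain) pairs, one per tree split."""
    totals, counts = defaultdict(float), defaultdict(int)
    for feature, gain in splits:
        totals[feature] += gain
        counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}


# Hypothetical gradient/hessian sums and gains, not values from the study.
g = split_gain(g_left=-4.0, h_left=2.0, g_right=3.0, h_right=1.0)
ranking = average_gain([("slope", g), ("slope", 2.0), ("oldpeak", 1.5)])
```

Under this definition a feature's Gain rises when its splits separate the classes more cleanly on the training data. Since synthetic samples change the training data, they can shift the ranking even when test accuracy barely moves.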
Impact and Implications
The findings from this study underscore the transformative potential of data augmentation in reshaping predictive strategies for machine learning models in healthcare. By altering the hierarchy of feature importance, data augmentation can lead to more effective CVD prediction models. This emphasizes the importance of evaluating synthetic data’s impact on model interpretability before clinical application, ensuring that healthcare professionals can trust the insights derived from these advanced models.
Conclusion
This research highlights the significant role of data augmentation in enhancing machine learning models for CVD prediction. The ability to reshape feature importance can lead to improved predictive strategies, ultimately benefiting patient outcomes. As we continue to explore the integration of AI in healthcare, understanding the implications of synthetic data on model interpretability will be crucial for successful clinical applications.
Data augmentation alters feature importance in XGBoost for CVD prediction.
Abstract
Machine learning models are powerful tools for cardiovascular disease (CVD) prediction, but their performance is often limited by dataset size and class imbalance. While data augmentation techniques can address these issues, their impact on model interpretability and the relative importance of clinical predictors remains poorly understood. This study investigates how different data augmentation strategies affect the performance and feature importance hierarchy of an Extreme Gradient Boosting (XGBoost) model for CVD prediction. An ablation study was conducted using a public CVD dataset. Three XGBoost models were developed and compared: a baseline model trained on original data, a model trained with data augmented by the Synthetic Minority Over-sampling Technique (SMOTE), and a model using a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP). Model performance was evaluated using accuracy, F1-score, and AUC. Feature importance was quantified and compared across models using the Gain metric. All models demonstrated high predictive performance on the independent test set, with the SMOTE-augmented model achieving an accuracy and AUC of 1.0. Data augmentation fundamentally altered the model’s feature importance. In the baseline model, ‘oldpeak’ (Gain: 8.25) and ‘slope’ (Gain: 7.01) were the top predictors. In contrast, ‘slope’ became the single most dominant feature in both the SMOTE (Gain: 27.49) and WGAN-GP (Gain: 36.68) augmented models. Data augmentation can significantly reshape the predictive strategy of a high-performance machine learning model. For high-quality datasets, the primary effect of augmentation may be the re-prioritization of predictive features rather than a direct improvement in classification accuracy. These findings underscore the critical need to evaluate the impact of synthetic data on model interpretability before clinical application.
Authors: Chang S, Wang X, Luo Y, Jia L
Journal: Sci Rep
Citation: Chang S, et al. Data augmentation alters feature importance in XGBoost for CVD prediction. Sci Rep. 2025; 15:41754. doi: 10.1038/s41598-025-26228-1