For this capstone project, I investigated how well machine learning models can predict heart disease, how a patient's gender affects these predictions, and how well the same model performs across different regions. The project uses two clinical datasets from the "Heart Disease" collection in the publicly accessible UCI Machine Learning Repository. The collection comprises patient records from four distinct locations, capturing varying levels of heart disease severity, and is open to anyone looking to conduct empirical analysis.
For this study, I focused on two of the four available datasets: "processed.cleveland.data," which contains patient records from Cleveland, and "processed.va.data," which includes data from the Veterans Administration hospital in Long Beach. These datasets provided the foundation for analyzing heart disease risk factors and evaluating predictive models. The project evaluated the predictive accuracy of machine learning algorithms and compared their performance to a traditional cardiovascular risk score, the American College of Cardiology's Atherosclerotic Cardiovascular Disease (ASCVD) risk estimate. In particular, I investigated whether machine learning models could surpass the predictive accuracy of the established ASCVD risk score. I also examined whether significant differences arose when evaluating a model's predictive performance for men versus women, and when applying the same model to datasets from two different cities. This dual focus aims to uncover both gender-specific variations and location-based disparities in model effectiveness.
For my research, I applied the following machine learning algorithms: Random Forest, XGBoost, and an ensemble method combining the two. These algorithms were applied to the heart disease datasets using tailored preprocessing and feature engineering.
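To make the setup concrete, here is a minimal sketch of how the three models could be instantiated with scikit-learn and xgboost. The hyperparameter values are illustrative assumptions, not the exact settings used in the project.

```python
# Sketch of the three models; hyperparameter values here are assumptions.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42)
xgb = XGBClassifier(n_estimators=200, eval_metric="logloss", random_state=42)

# Soft voting averages the two models' predicted class probabilities.
ensemble = VotingClassifier(estimators=[("rf", rf), ("xgb", xgb)], voting="soft")
```

Any of the three estimators can then be trained with `.fit(X_train, y_train)` and evaluated with `.predict(X_test)`.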
Three distinct experiments were conducted, labeled Transformation 1, Transformation 2, and Transformation 3. Transformation 1 normalized skewed data and applied squared transformations to capture non-linear relationships. Transformation 2 introduced engineered features, such as the combination of ST depression (Oldpeak) with the slope of the ST segment. Transformation 3 incorporated a custom feature to account for gender differences. Exploratory analysis revealed that chest pain type, ST depression (Oldpeak), and exercise-induced angina were key predictors, while cholesterol and fasting blood sugar contributed little predictive value. The best experiment combined elements from all three transformations and significantly improved predictive accuracy.
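The sketch below illustrates what these transformations might look like on a pandas DataFrame with UCI-style column names ("oldpeak", "slope", "sex"); the exact column names and formulas are assumptions based on the descriptions above, not the project's verbatim code.

```python
import numpy as np
import pandas as pd

def transformation_1(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize a skewed feature and add a squared term for non-linearity.
    out = df.copy()
    out["oldpeak_log"] = np.log1p(out["oldpeak"])
    out["oldpeak_sq"] = out["oldpeak"] ** 2
    return out

def transformation_2(df: pd.DataFrame) -> pd.DataFrame:
    # Combine ST depression (Oldpeak) with the slope of the ST segment.
    out = df.copy()
    out["oldpeak_x_slope"] = out["oldpeak"] * out["slope"]
    return out

def transformation_3(df: pd.DataFrame) -> pd.DataFrame:
    # A custom gender-aware feature: sex interacted with a key predictor
    # (an assumed formulation of the "gender differences" feature).
    out = df.copy()
    out["sex_x_oldpeak"] = out["sex"] * out["oldpeak"]
    return out
```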
The ASCVD risk score was also evaluated on both datasets as a benchmark for the machine learning models. On the Cleveland dataset, the ASCVD score achieved an accuracy of 69.64%, a precision of 63.58%, a recall of 79.14%, and an F1 score of 70.51%. Notably, the score behaved differently for females than for males, with higher recall (88.89% vs. 77.19%) but lower precision (48.88% vs. 68.75%). This variability highlighted the challenge of achieving consistent results for both men and women. On the VA Long Beach dataset, the ASCVD score achieved a higher accuracy of 76.82%, a precision of 78.95%, and a recall of 93.75%, yielding an F1 score of 85.71%. Because the VA Long Beach dataset contained too few female patients, a sex-specific analysis was not feasible there.
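For reference, the sketch below shows how such benchmark metrics can be computed with scikit-learn, assuming the ASCVD 10-year risk estimates are thresholded into binary predictions. The 7.5% cutoff and the dummy arrays are illustrative assumptions, not the study's actual data or threshold.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Dummy stand-ins for true labels and ASCVD 10-year risk estimates in [0, 1].
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
ascvd_risk = np.array([0.12, 0.05, 0.31, 0.02, 0.09, 0.04, 0.22, 0.11])

# Threshold risk into binary predictions (7.5% is a common clinical cutoff).
y_pred = (ascvd_risk >= 0.075).astype(int)

print(f"accuracy:  {accuracy_score(y_true, y_pred):.2%}")
print(f"precision: {precision_score(y_true, y_pred):.2%}")
print(f"recall:    {recall_score(y_true, y_pred):.2%}")
print(f"f1:        {f1_score(y_true, y_pred):.2%}")
```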
The Random Forest model using Transformation 2 demonstrated the best overall predictive accuracy, outperforming the ASCVD score with 83.33% accuracy on the Cleveland dataset and 82.5% on the Long Beach dataset. However, gender-based disparities emerged in the model's performance. The Random Forest model performed better overall for the female subgroup in the Cleveland dataset, with higher overall precision and recall, but its recall among female patients with heart disease was only 67%, meaning it missed roughly a third of the true heart disease cases in that subgroup. In contrast, the model's predictive accuracy was much more balanced for the male subgroup.
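The gender-specific numbers above come from evaluating the same predictions separately within each subgroup. A minimal sketch, assuming arrays of labels and predictions plus the UCI sex encoding (1 = male, 0 = female); the arrays themselves are hypothetical stand-ins:

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical stand-ins for true labels, model predictions, and sex.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
sex    = np.array([1, 1, 0, 0, 1, 0, 1, 0])

# Per-class precision and recall within each sex subgroup via boolean masks.
for name, mask in [("male", sex == 1), ("female", sex == 0)]:
    print(f"--- {name} subgroup ---")
    print(classification_report(y_true[mask], y_pred[mask], zero_division=0))
```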
The Transformation 2 experiment with the Random Forest model performed well overall but revealed additional biases when evaluated on the Long Beach dataset. In the Cleveland dataset, which had a relatively balanced distribution of heart disease and non-heart disease cases, the model accurately identified both classes. In the VA Long Beach dataset, where heart disease cases were more prevalent, the model showed poor recall for the minority class (patients without heart disease) while showing high precision and recall for the majority class (heart disease). It achieved a recall of 97% for heart disease cases, indicating that it was highly effective at identifying individuals with heart disease, and a precision of 83% for that class, meaning 17% of those classified as having heart disease were false positives.
For non-heart disease cases, the model's recall was only 40%, highlighting its inability to correctly identify most individuals without heart disease. This trade-off, favoring high recall for the majority class at the expense of accurately identifying minority-class cases, illustrates the challenge of generalizing performance across datasets that include patients from different regions. The class imbalance further reduces the generalizability of these results.
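One common mitigation for this kind of majority-class bias is to reweight classes inversely to their frequency during training. The sketch below shows that option in scikit-learn as an illustration only; it was not a technique used in the original experiments.

```python
from sklearn.ensemble import RandomForestClassifier

# class_weight="balanced" upweights the minority class ("no heart disease"
# in the VA Long Beach data) so it contributes more to each split.
rf_balanced = RandomForestClassifier(
    n_estimators=200,
    class_weight="balanced",
    random_state=42,
)
```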
These findings suggest that while my machine learning models hold promise for improving heart disease prediction, the gender imbalances and regional variations in these datasets limit the general utility of the results. Given the class imbalance, future research using these datasets should explore multi-class classification across the varying disease severities and address the sampling bias arising from the low representation of women. Doing so would enhance both fairness and the models' applicability across populations.