Introduction
The central focus of my capstone project is to explore the effectiveness of machine learning models in predicting heart disease and to assess their ability to generalize across different cities and biological sexes. This research highlights the importance of building models that not only achieve high accuracy within a specific dataset or geographic location but also remain reliable when applied to men and women separately and to more than one city in the United States. Traditional cardiovascular risk scores, while reliable, exhibit limitations that impact their performance (Talha, Elkhoudri, and Hilali, 2024). Previous research has shown that machine learning models can achieve better predictive accuracy than some cardiovascular risk scores (Cho, Kim, Kang, et al., 2021).
Studies have shown that traditional cardiovascular risk models frequently lack validation across diverse populations, leading to miscalculations and reduced sensitivity when applied to groups outside their original development context (Talha, Elkhoudri, and Hilali, 2024). This inability to generalize poses a challenge when providing accurate predictions for underrepresented groups, particularly women. Cardiovascular disease is the leading cause of death globally, responsible for 17.7 million deaths in 2015, a number expected to rise to over 23.6 million annually by 2030. Despite the prevalence of the disease, women tend to be undertreated, with their symptoms frequently misdiagnosed or dismissed as non-cardiac issues (Woodward, 2019).
This disparity underscores the urgent need to develop more inclusive and accurate predictive models. Women’s risk factors are often underestimated (Abdullah, Beckett, Wilson, et al., 2024). To address these challenges, my project uses machine learning models trained on data from both men and women, with performance tested separately for each gender. By addressing the limitations of traditional risk scores and accounting for differences in predictive accuracy for each gender, my project aims to evaluate the accuracy and fairness of heart disease prediction more holistically. As part of this evaluation, I compared the accuracy of the heart disease machine learning models to that of the ASCVD risk scores. This evaluation can contribute to more equitable healthcare outcomes by acting as an additional point of consideration.
Literature review
Heart disease prediction methods (risk models)
Multiple cardiovascular risk models have been developed over the decades to predict the risk of cardiovascular disease in the patients they are applied to. Understanding the strengths and limitations of each of these models is crucial for effective risk stratification and prevention strategies.
Over the past three decades, numerous risk prediction models have been developed to estimate an individual’s likelihood of developing cardiovascular disease (CVD). Among these, the multivariate risk prediction model from the Framingham Study has been particularly influential in estimating future CVD risk (Cui, 2009). Additional models have been created in the United States, such as the Reynolds Risk Score for women, derived from data collected in the Women’s Health Study. Other efforts include a “multi-marker” risk model that incorporates 10 biomarkers, including C-reactive protein and B-type natriuretic peptide (Cui, 2009).
In Europe, separate cardiovascular risk prediction models were developed due to the limited applicability of the Framingham risk scores to the general European population without recalibration (Cui, 2009). For example, the SCORE equation, endorsed by the Third Joint European Task Force on cardiovascular prevention, has been validated in Spain. In the UK, the QRISK algorithm was created using a population-based clinical research database. Germany contributed both a simple PROCAM score and a more complex neural network model, with the PROCAM score recently updated. Scotland developed the ASSIGN risk score, which includes family history of CVD, based on data from the Scottish Heart Health Extended Cohort. Italy introduced the CUORE equation tailored for populations with a low incidence of coronary events (Cui, 2009). These diverse models reflect efforts to address regional differences in CVD risk and provide tailored tools for prevention.
Major limitations of cardiovascular risk scores
A 2024 study by Talha, Elkhoudri, and Hilali summarized the best-known limitations of current cardiovascular risk models. Critical analysis revealed numerous limitations that impact performance. Each calculator demonstrates distinct advantages for one population while potentially encountering limitations with another. Some scores lack validation from external cohorts, while others seem to miscalculate risk when applied to populations outside their origin, which limits their sensitivity and leaves them unable to explain all cardiac events (Talha, Elkhoudri, and Hilali, 2024).
Numerous cardiovascular risk assessment tools have been developed from large population studies, but only a few have undergone essential external validation. The most common models, American and European scores, have distinct characteristics and limitations. Understanding these limitations is crucial for improving the effectiveness of these tools (Talha, Elkhoudri, and Hilali, 2024).
Comparisons of established risk prediction models for cardiovascular disease
A review of 74 previous research articles evaluated and compared established cardiovascular risk prediction models to assess their performance. It covered articles in which investigators evaluated two or more risk prediction models in the same population. The review extracted information on study design, the models assessed, and outcomes, and examined model performance in terms of discrimination, calibration, and reclassification, while also considering biases favoring newer or author-developed models (Siontis, Tzoulaki, Siontis, et al., 2012).
The review included 74 articles, covering 56 pairwise comparisons of eight models: two variants of the Framingham risk score, the ASSIGN score, SCORE, the PROCAM score, the QRISK1 and QRISK2 algorithms, and the Reynolds risk score. Only 10 of the 56 comparisons showed more than a 5% relative difference in predictive performance based on the area under the receiver operating characteristic curve (AUC). Because most comparisons did not exceed this 5% threshold, the models appear to perform comparably in distinguishing between individuals at high and low risk of cardiovascular events. Reporting of other statistical measures of discrimination, calibration, and reclassification was inconsistent (Siontis, Tzoulaki, Siontis, et al., 2012). Outcome selection bias was evident in 32 comparisons: 78% of the time, the model originally developed using the selected outcome performed better. Additionally, authors tended to report better AUCs for models they developed, highlighting potential optimism bias (Siontis, Tzoulaki, Siontis, et al., 2012).
The conclusions suggest that while multiple cardiovascular risk prediction models exist, their comparisons would benefit from standardized reporting and consistent statistical evaluation. The reporting and evaluation of these risk models appear to be affected by outcome selection and optimism biases, emphasizing the need for more rigorous and unbiased comparisons (Siontis, Tzoulaki, Siontis, et al., 2012).
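To make the AUC comparison concrete, the following is a minimal sketch of how an AUC can be computed for two models on the same patients and how a relative difference between them might be checked against a 5% threshold. The labels and scores are made-up toy values, and the relative-difference formula (absolute difference divided by the smaller AUC) is an assumption for illustration; the review may define it differently.

```python
from itertools import product


def auc(labels, scores):
    """AUC as the chance that a random positive case is scored above a
    random negative case (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


def relative_difference(auc_a, auc_b):
    """Assumed definition: absolute difference relative to the smaller AUC."""
    return abs(auc_a - auc_b) / min(auc_a, auc_b)


# Hypothetical toy data: 1 = cardiovascular event, 0 = no event.
labels = [1, 0, 1, 0, 1, 0, 0, 1]
scores_a = [0.9, 0.2, 0.7, 0.6, 0.8, 0.1, 0.3, 0.4]  # risk scores from model A
scores_b = [0.9, 0.5, 0.7, 0.6, 0.8, 0.1, 0.3, 0.4]  # risk scores from model B

auc_a = auc(labels, scores_a)
auc_b = auc(labels, scores_b)
rel_diff = relative_difference(auc_a, auc_b)
print(f"AUC A = {auc_a:.4f}, AUC B = {auc_b:.4f}, relative difference = {rel_diff:.1%}")
print("exceeds 5% threshold:", rel_diff > 0.05)
```

With only eight patients, a single reordered pair moves the AUC noticeably, which is one reason comparisons on small samples should be read with caution.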
Gender bias in assessing cardiovascular disease
Cardiovascular disease (CVD) rates are higher in males, which has led to it being seen as primarily a men’s issue. However, CVD is the leading cause of death and a major cause of disability for women globally. Women are often under-recognized and undertreated for CVD compared to men, and their symptoms can differ, leading to worse outcomes. Female patients treated by male cardiologists fare worse than male patients, while no such difference exists for female cardiologists. Clinical trials often focus on men, despite some drugs having different effects in women. Risk factors like diabetes and smoking increase CVD risk more in women, and factors related to pregnancy and reproductive health add to their vulnerability. Women’s health research is often focused on mother and child health and breast cancer, neglecting CVD and other non-communicable diseases. There is a need to broaden the definition of women’s health to include the entire lifecycle and emphasize CVD, with sex-specific research analyses becoming standard (Abdullah, Beckett, Wilson, et al., 2024).
Additionally, a study reviewed the evidence on gender bias in CVD diagnosis, prevention, and treatment. Following PRISMA guidelines, several databases were searched and 19 studies were analyzed. The findings showed that CVD is less reported in women, who often have milder symptoms or are misdiagnosed with gastrointestinal or anxiety issues. As a result, women’s risk factors are often overlooked, especially by male doctors. Women are given fewer diagnostic tests and are less likely to be referred to cardiologists or hospitalized. Even when hospitalized, women receive fewer coronary interventions and are prescribed fewer cardiovascular medications, except for antihypertensive and anti-anginal drugs. Women also tend to perceive themselves at lower risk for CVD than men. This review highlights that women receive fewer diagnostic tests and treatments for CVD, which affects their health outcomes, likely due to a lack of awareness about gender differences in CVD symptoms (Abdullah, Beckett, Wilson, et al., 2024).
Most of this research is based on a gender binary frame, which assumes the existence of only two distinct and opposite genders—male and female—often neglecting the experiences of non-binary and gender-diverse individuals. This binary approach reinforces systemic gaps in understanding the intersection of gender and health outcomes, as it fails to account for how non-binary individuals experience, report, and are treated for CVD. The focus on a binary framework not only limits the inclusivity of cardiovascular research but also perpetuates disparities by oversimplifying the complex interplay of biological sex and gender identity. Expanding research to include non-binary and gender-diverse populations is crucial to developing a more comprehensive understanding of cardiovascular health and ensuring equitable healthcare practices for all individuals.
Pre-existing machine learning methods for predicting cardiovascular disease
Using machine learning methods to predict cardiovascular disease has been an ongoing point of research in the last decade. One study focused on improving risk prediction using machine learning on healthcare data from 222,998 Korean adults aged 40-79 without prior cardiovascular disease or lipid-lowering therapy. Traditional risk models showed moderate to good performance (C-statistics 0.70–0.80), with the pooled cohort equation (PCE) achieving a C-statistic of 0.738 (Cho, Kim, Kang, et al., 2021). Among the various machine learning models tested, the neural network model performed best, with a C-statistic of 0.751, which was higher than that of the PCE. It also showed better agreement between predicted and actual outcomes. Improvements were also noted compared to other models such as the Framingham risk score, systematic coronary risk evaluation, and QRISK3 (Cho, Kim, Kang, et al., 2021). The study concluded that machine learning algorithms could enhance cardiovascular risk prediction beyond existing models, making them valuable tools for risk assessment and clinical decision-making in healthy Korean adults (Cho, Kim, Kang, et al., 2021).
Additionally, in a study by Stephen F. Weng and colleagues, machine learning was assessed for improving cardiovascular risk prediction using data from 378,256 UK patients. Four algorithms (random forest, logistic regression, gradient boosting, neural networks) were compared to the American College of Cardiology guidelines for evaluating CVD risk. The best-performing algorithm, neural networks, had an AUC of 0.764, improving prediction accuracy by 3.6% over the established method. This approach identified more patients who could benefit from preventive treatment and reduced unnecessary interventions (Weng, Reps, Kai, et al., 2017).
Gender-based approach for diagnosing coronary heart disease
In 2019, Hogo published an article titled “A proposed gender‑based approach for diagnosis of coronary artery disease.” In this research article, a separate model was trained and evaluated for each gender to determine whether the patient’s gender affects the structure and performance of a diagnosis model for coronary artery disease. The male diagnosis model achieved an accuracy of 95%, with a sensitivity of 96% and a specificity of 100%, while the female diagnosis model performed slightly better, with an accuracy of 96%, a sensitivity of 97%, and a specificity of 96%. Overall, these high-performance results highlight the success of the proposed gender-based approach for diagnosing coronary artery disease. The datasets used in that study are from the UCI Machine Learning Repository; specifically, the “Heart Disease Database” and the “Z-Alizadeh Sani Dataset,” which comprises records for 270 patients, each with 75 attributes (Hogo, 2020).
Supervised machine learning
My project uses multiple supervised machine learning algorithms on a clinical dataset to predict the presence of cardiovascular disease in patients. Supervised learning is a type of machine learning where the algorithm is trained on labeled data to make predictions or decisions (Mueller and Guido, 2016). It learns to map input data to the correct output. For this project, I conducted a classification task to predict whether a patient has heart disease. Classification is one of the two primary types of supervised learning problems in machine learning (Mueller and Guido, 2016). Specifically, this project focused on binary classification, which involves distinguishing between two distinct classes (Mueller and Guido, 2016). The results of different machine learning models, such as a random forest classifier and an XGBoost classifier, were compared using cross-validation, which repeatedly holds out part of the data for testing so that the performance estimates are more reliable.
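As a rough illustration of the cross-validation idea, the sketch below splits a dataset into k folds, trains on k-1 of them, and scores on the held-out fold. It uses only the Python standard library, and a deliberately trivial majority-class baseline stands in for the random forest and XGBoost classifiers; the function names, the toy labels, and the choice of k = 5 are all assumptions for illustration, not the project's actual code.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_majority(train_x, train_y):
    """Baseline 'model': remember the most common training label."""
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, x):
    """The baseline ignores the features and always predicts one label."""
    return model


def cross_validate(features, labels, fit, predict, k=5, seed=0):
    """Return the held-out accuracy for each of the k folds."""
    scores = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(features) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        model = fit(train_x, train_y)
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


# Hypothetical toy data: 20 patients, 13 with heart disease (1) and 7 without (0).
features = [[i] for i in range(20)]  # placeholder feature rows
labels = [1] * 13 + [0] * 7

fold_scores = cross_validate(features, labels, fit_majority, predict_majority, k=5)
print("accuracy per fold:", fold_scores)
print("mean accuracy:", sum(fold_scores) / len(fold_scores))
```

Any model with a fit and a predict step could be passed in place of the baseline. Comparing the spread of the per-fold scores, and not just their mean, gives a sense of how stable each model's estimate is.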
A random forest classifier is a type of ensemble method that combines multiple decision trees to create a more powerful model. Decision trees are widely used for classification tasks. Each decision tree is built from a hierarchy of “if/else” questions that leads to a decision. In a random forest model, each decision tree is slightly different from the others. The concept behind random forests is that while each individual tree can provide reasonably accurate predictions, it is prone to overfitting specific portions of the data. One decision tree might overfit in one way, and another might overfit in a different way. By averaging their results, one can retain the predictive power of decision trees while reducing overfitting. Random forests incorporate two levels of randomness: first, by selecting a random subset of data points to construct each tree, and second, by randomly selecting a subset of features to evaluate at each split (Mueller and Guido, 2016).
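The two levels of randomness can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions, not a real random forest implementation: each "tree" is a one-split stump, so choosing a feature subset per tree coincides with choosing it per split, and the feature names (age, cholesterol) and all data values are hypothetical.

```python
import random
from collections import Counter


def majority(labels, default=0):
    """Most common label, or a default when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def bootstrap_sample(rows, rng):
    """Randomness level 1: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows, feature_ids):
    """One if/else question: the (feature, threshold) among the candidate
    features that misclassifies the fewest rows in this sample."""
    best = None
    for f in feature_ids:
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= threshold]
            right = [y for x, y in rows if x[f] > threshold]
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_lab, r_lab)
    return best[1:]


def predict_stump(stump, x):
    f, threshold, l_lab, r_lab = stump
    return l_lab if x[f] <= threshold else r_lab


def fit_forest(rows, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    all_features = range(len(rows[0][0]))
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(rows, rng)
        # Randomness level 2: only a random subset of features is considered.
        candidates = rng.sample(all_features, n_features)
        forest.append(fit_stump(sample, candidates))
    return forest


def predict_forest(forest, x):
    """Combine the trees by majority vote."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical rows: ((age, cholesterol), label), 1 = heart disease.
rows = [((35, 180), 0), ((40, 190), 0), ((45, 200), 0), ((50, 210), 0),
        ((60, 250), 1), ((65, 260), 1), ((70, 270), 1), ((75, 280), 1)]

forest = fit_forest(rows)
print("younger, low cholesterol ->", predict_forest(forest, (30, 150)))
print("older, high cholesterol  ->", predict_forest(forest, (80, 320)))
```

Because every stump sees a different bootstrap sample and a different feature, the individual thresholds vary from tree to tree; the vote across trees smooths out those differences.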
XGBoost, which stands for “Extreme Gradient Boosting”, is a highly efficient and scalable library for gradient boosting, specifically designed to optimize the training process of machine learning models. The XGBoost classifier is based on gradient boosted regression trees, another ensemble method that combines multiple decision trees to create a more powerful model. Unlike the random forest method, gradient boosting constructs trees sequentially, with each tree aiming to correct the errors made by the preceding ones (Mueller and Guido, 2016). The gradient boosted trees are typically shallow, with a maximum depth of 1 to 5. Such trees are considered “weak learners”, as each captures only a small portion of the data’s complexity. The main idea is to combine many of these models. Each tree can only provide good predictions on part of the data, and so more and more trees are added iteratively to improve performance (Mueller and Guido, 2016).
Gradient boosting models are generally more sensitive to parameter settings compared to random forest classifiers. This increased sensitivity means that the performance of gradient boosting can vary significantly depending on how parameters are tuned. However, when parameters are properly optimized, gradient boosting has the potential to achieve higher accuracy than random forests (Mueller and Guido, 2016).
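The sequential idea behind gradient boosting can be sketched with depth-1 regression trees fitted to the residuals of the current ensemble. This is a simplified illustration under squared-error loss on made-up one-feature data; XGBoost itself adds regularization, second-order gradient information, and a logistic loss for classification, none of which are shown here, and the parameter names `n_trees` and `learning_rate` are chosen for this sketch only.

```python
def fit_stump(xs, residuals):
    """Depth-1 regression tree: the threshold whose two leaf means give the
    smallest squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right leaf non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        l_mean = sum(left) / len(left)
        r_mean = sum(right) / len(right)
        sse = (sum((r - l_mean) ** 2 for r in left)
               + sum((r - r_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, l_mean, r_mean)
    return best[1:]


def predict(model, x):
    """Start from the base value and add each tree's scaled correction."""
    base, learning_rate, stumps = model
    total = base
    for threshold, l_mean, r_mean in stumps:
        total += learning_rate * (l_mean if x <= threshold else r_mean)
    return total


def fit_boosted(xs, ys, n_trees=20, learning_rate=0.5):
    """Add trees one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    model = (base, learning_rate, stumps)
    history = []  # training mean squared error before each new tree
    for _ in range(n_trees):
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        history.append(sum(r * r for r in residuals) / len(ys))
        stumps.append(fit_stump(xs, residuals))
    return model, history


# Hypothetical toy data with a step-like pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0, 4.9, 7.0]

model, history = fit_boosted(xs, ys)
print("training error before first tree:", round(history[0], 4))
print("training error before last tree: ", round(history[-1], 4))
```

Rerunning the sketch with a different `learning_rate` or `n_trees` changes how quickly the training error falls, which is a small-scale view of the parameter sensitivity described above. Training error alone says nothing about overfitting, so in practice such settings would be compared on held-out data.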
Machine learning models were evaluated by calculating the accuracy, precision, recall, and F1-score. Accuracy measures the overall correctness of the model, that is, the share of all predictions that were correct. The precision score measures how many of the predicted positive instances were actually positive. The recall score measures how many of the actual positive instances the model identified. For this dataset, a positive instance is a patient diagnosed with cardiovascular disease, and a negative instance is a patient who is not diagnosed at all with cardiovascular disease. The F1-score combines precision and recall into a single number, balancing the trade-off between them. It is the harmonic mean of precision and recall, which places greater emphasis on the smaller of the two values. The score ranges from 0 to 1 (0 to 100%), with higher values indicating better performance. It is particularly useful for imbalanced datasets because it considers both false positives and false negatives (Mueller and Guido, 2016).
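A small worked example may help tie the four metrics to the counts they are built from. The sketch below computes them from true and predicted labels using the standard definitions; the ten toy labels are invented for illustration and are not results from this project.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels: 1 = heart disease, 0 = no heart disease.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here the model finds 2 of the 4 actual positives (recall 0.5) and 2 of its 3 positive predictions are correct (precision about 0.667). The F1-score of about 0.571 sits below the simple average of the two (about 0.583), showing how the harmonic mean leans toward the smaller value.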