List of Tables

Table 1. Basic descriptives of the Cleveland training data

Table 2. Variable descriptives based on heart disease presence

Table 3. Correlation statistics between individual variables and heart disease presence

Table 4. Model results

Figure 1

Data Management Plan Overview

Our data management plan ensures the organized, secure, and ethical handling of all project data. We will acquire datasets from the UCI Machine Learning Repository and follow their terms of use. The data will be stored securely on a personal computer. We will document all data processing steps, including cleaning, transformation, and analysis, ensuring transparency and reproducibility. The data is already anonymized for individual privacy. Access to the data will be restricted to authorized project members only. Upon project completion, we will submit our data and final project documentation to the CUNY Graduate Center Library’s digital repository, adhering to their guidelines for online digital deposits. This submission will ensure long-term preservation and accessibility of our work. For detailed guidance on data management and submission, we will refer to the library’s resources available on their website.

Digital References

Sex-Specific and Regional Analysis of Heart Disease Prediction Using Machine Learning Algorithms: Insights from the UCI Irvine Public Heart Disease Datasets (Cleveland and Long Beach)
Jonathan Asanjarani
City University of New York Graduate Center
DATA 79000: Capstone Project and Thesis
Advisor: Johanna Devaney
November 25th, 2024


Software and Tools Used

  1. Google Colab
    • Description: Cloud-based Python environment with GPU access for accelerated computation.
    • URL: https://colab.research.google.com
    • Accessed: November 2024
  2. Python
    • Version: 3.8
    • Description: High-level programming language used for data analysis, modeling, and visualization.
    • URL: https://www.python.org
    • Accessed: November 2024
  3. Scikit-learn
    • Version: 1.2.0
    • Description: Library for machine learning algorithms, preprocessing, and evaluation.
    • URL: https://scikit-learn.org/stable/
    • Accessed: November 2024
  4. XGBoost
    • Version: 1.6.0
    • Description: Gradient boosting library optimized for supervised learning tasks.
    • URL: https://xgboost.ai
    • Accessed: November 2024
  5. Pandas
    • Version: 1.4.3
    • Description: Data manipulation and analysis library for structured data.
    • URL: https://pandas.pydata.org
    • Accessed: November 2024
  6. NumPy
    • Version: 1.23.0
    • Description: Library for numerical computations and array processing.
    • URL: https://numpy.org
    • Accessed: November 2024
  7. Matplotlib
    • Version: 3.6.0
    • Description: Visualization library for static and interactive graphics.
    • URL: https://matplotlib.org
    • Accessed: November 2024
  8. Seaborn
    • Version: 0.12.2
    • Description: Statistical data visualization library built on Matplotlib.
    • URL: https://seaborn.pydata.org
    • Accessed: November 2024
  9. ASCVD Risk Calculator
    • Description: Open-source Python implementation of the 2013 ASCVD 10-year risk calculator, used to benchmark the machine learning models.
    • URL: https://github.com/brandones/ascvd/tree/master
    • Accessed: November 2024

Datasets

  1. Cleveland Heart Disease Dataset
  2. VA Long Beach Heart Disease Dataset

Guidelines and Methodological References

  1. Mueller, Andreas C., & Guido, Sarah
  2. Software Sustainability Institute

Additional Resources for Citing Software and Data

  1. Digital Curation Centre
  2. DataCite

A Note on Technical Specifications

This project used Google Colab as the development environment. Google Colab is a cloud-based Python platform that provides access to GPUs for accelerated computation. Python (version 3.8) was used within the Colab environment, together with additional libraries and frameworks such as Scikit-learn, XGBoost, Pandas, NumPy, Matplotlib, and Seaborn, as detailed in the References section. The datasets were obtained from the “Heart Disease” database in the UCI Machine Learning Repository; two datasets from this database were used: the Cleveland and VA Long Beach datasets. Data cleaning and preprocessing were conducted within Google Colab notebooks using Python-based libraries, with datasets and code files stored in CSV, Python (.py), and Jupyter Notebook (.ipynb) formats.

Version control was maintained through a GitHub repository that hosted the project’s source code, processed datasets, and supplementary materials. The repository, accessible at [https://github.com/Jdasanja/masters_thesis_final], was updated regularly with a detailed commit history to ensure reproducibility. External tools included the ASCVD Risk Calculator, implemented via an open-source Python package available at [https://github.com/brandones/ascvd/tree/master].

Data Dictionary

Significant Variables

  1. Age
    o Type: Integer
    o Description: Patient’s age in years.
  2. Sex
    o Type: Binary (0 for Female, 1 for Male)
    o Description: Biological sex of the patient.
  3. Cp (Chest Pain Type)
    o Type: Categorical (0–4)
    o Description: Chest pain severity levels, where higher values indicate more severe
    pain.
  4. Trestbps (Resting Blood Pressure)
    o Type: Continuous (mmHg)
    o Description: Resting blood pressure in millimeters of mercury. Transformed using
    logarithmic scaling to reduce skewness.
  5. Chol (Serum Cholesterol)
    o Type: Continuous (mg/dL)
    o Description: Serum cholesterol level in milligrams per deciliter. Transformed using
    logarithmic scaling to reduce skewness.
  6. Fbs (Fasting Blood Sugar)
    o Type: Binary (0 for <120 mg/dL, 1 for ≥120 mg/dL)
    o Description: Indicator of whether fasting blood sugar exceeds 120 mg/dL.
  7. Restecg (Resting ECG Results)
    o Type: Categorical (0–2)
    o Description: Results of resting electrocardiographic tests (e.g., normal, ST-T wave
    abnormality, left ventricular hypertrophy).
  8. Thalach (Maximum Heart Rate Achieved)
    o Type: Continuous (bpm)
    o Description: Maximum heart rate achieved during exercise. Transformed using a
    squared transformation to emphasize non-linear relationships.
  9. Exang (Exercise-Induced Angina)
    o Type: Binary (0 for No, 1 for Yes)
    o Description: Presence of exercise-induced angina (chest pain).
  10. Oldpeak
    o Type: Continuous
    o Description: ST depression induced by exercise relative to rest (ECG measure).
    Transformed using logarithmic scaling to reduce skewness.
  11. Slope (ST Segment Slope)
    o Type: Categorical (1 for Upsloping, 2 for Flat, 3 for Downsloping)
    o Description: The slope of the peak exercise ST segment.
  12. Ca (Number of Major Vessels)
    o Type: Integer (0–3)
    o Description: Number of major vessels (0–3) colored by fluoroscopy. Transformed
    using one-hot encoding.
  13. Thal (Thallium Stress Test Results)
    o Type: Categorical (3 for Normal, 6 for Fixed Defect, 7 for Reversible Defect)
    o Description: Results of thallium stress tests. Transformed using one-hot encoding.
  14. Oldpeak_Slope_Combined
    o Type: Continuous
    o Description: A derived feature combining Oldpeak (ST depression) and Slope (ECG
    segment pattern during peak exercise).
  15. Gender-Based Interaction Terms
    o Type: Continuous
    o Description: Interaction features created by multiplying the “Sex” feature with key
    variables like Chol and Trestbps to account for demographic-specific variations.

Critical Functions

A code sketch illustrating these functions follows this list.

  16. Log Transformer
    o Purpose: Reduces skewness in variables like Chol, Trestbps, and Oldpeak.
    o Inputs: Skewed numerical features.
    o Outputs: Log-transformed features.
  17. Squared Transformation
    o Purpose: Captures non-linear relationships in features like Thalach.
    o Inputs: Thalach feature.
    o Outputs: Squared-transformed feature.
  18. Combine Oldpeak and Slope
    o Purpose: Creates a new feature to enhance model accuracy.
    o Inputs: Oldpeak and Slope features.
    o Outputs: Combined feature reflecting ST segment depression and slope interaction.
  19. Gender-Based Interaction Creation
    o Purpose: Generates gender-specific interaction terms to capture the influence of
    demographic variations on key features.
    o Inputs: Sex feature and numerical features such as Chol and Trestbps.
    o Outputs: Interaction features highlighting gender-based relevance.

Classifiers Used

  20. Random Forest Classifier
    o Purpose: Constructs an ensemble of decision trees for binary classification.
    o Features: Robust against overfitting, useful for datasets with imbalanced classes.
    o Implementation: Optimized using GridSearchCV to select parameters like the
    number of estimators, maximum depth, and feature importance.
  21. XGBoost Classifier
    o Purpose: Gradient boosting algorithm designed for efficiency and performance in
    binary classification tasks.
    o Features: Focuses on minimizing loss functions with parallelized tree construction.
  22. Ensemble Method
    o Purpose: Combines predictions from Random Forest, XGBoost, and Logistic
    Regression to improve robustness.
    o Features: Weighted averaging of classifiers to leverage strengths of individual
    models.
  23. Logistic Regression
    o Purpose: Serves as a baseline model to compare linear relationships between
    features and outcomes.
    o Features: Interpretable and effective for datasets with linear separability.
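
The critical functions above can be expressed as small pandas helpers. The sketch below is illustrative only: the function names are hypothetical, the column names follow the UCI feature names listed in this dictionary, log1p is assumed as the logarithmic form (it tolerates zeros in oldpeak), and multiplication is assumed as the way Oldpeak and Slope, and Sex and the key numeric features, were combined.

```python
import numpy as np
import pandas as pd

def log_transform(df: pd.DataFrame, cols=("chol", "trestbps", "oldpeak")) -> pd.DataFrame:
    """Reduce right skew in the listed numeric features (log1p tolerates zeros)."""
    out = df.copy()
    for col in cols:
        out[f"{col}_log"] = np.log1p(out[col])
    return out

def square_thalach(df: pd.DataFrame) -> pd.DataFrame:
    """Emphasize non-linear effects of maximum heart rate."""
    out = df.copy()
    out["thalach_sq"] = out["thalach"] ** 2
    return out

def combine_oldpeak_slope(df: pd.DataFrame) -> pd.DataFrame:
    """Derived feature capturing the ST-depression / slope interaction (product assumed)."""
    out = df.copy()
    out["oldpeak_slope_combined"] = out["oldpeak"] * out["slope"]
    return out

def gender_interactions(df: pd.DataFrame, cols=("chol", "trestbps")) -> pd.DataFrame:
    """Interaction terms multiplying the sex indicator with key numeric features."""
    out = df.copy()
    for col in cols:
        out[f"sex_x_{col}"] = out["sex"] * out[col]
    return out
```

Applied in sequence, these helpers would approximate the Transformation 2 preprocessing described in the Discussion.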

Digital Manifest



Project Components

1. Capstone Report (Print and Digital)

  • File Name: Project_Write_up12.30.24.docx.pdf
  • File Type: PDF
  • Description: Full written report detailing research objectives, methodology, results, and discussions.
  • URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/Project_Write_up12.30.24.docx.pdf

2. Exploratory Data Analysis (EDA) Notebook

  • File Name: EDA_4_binary_classification.ipynb
  • File Type: Google Colab Notebook (.ipynb)
  • Description: Python notebook detailing data cleaning, univariate, bivariate, and multivariate analyses, including visualization and statistical tests.
  • URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/EDA_4_binary_classification.ipynb

3. Machine Learning Model Implementation for Cleveland

  • File Name: ML_Algo_4_binary_classification.ipynb
  • File Type: Google Colab Notebook (.ipynb)
  • Description: Google Colab Notebook containing code for implementing and evaluating machine learning models (Random Forest, XGBoost, and ensemble methods) using the Cleveland dataset.
  • URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/ML_Algo_4_binary_classification.ipynb

4. Machine Learning Model Implementation for VA Long Beach

  • File Name: ML_Algo_4_bin_classification_va_longbeach.ipynb
  • File Type: Google Colab Notebook (.ipynb)
  • Description: Google Colab Notebook containing code for implementing and evaluating machine learning models (Random Forest, XGBoost, and ensemble methods) using the VA Long Beach dataset.
  • URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/ML_Algo_4_bin_classification_va_longbeach.ipynb

 5. Cleveland Processed Dataset

  • File Name: processed.cleveland.data
  • File Type: Data file (.data, comma-separated values)
  • Description: Includes cleaned and transformed versions of the Cleveland dataset used in the study.
  • URL: https://github.com/Jdasanja/masters_thesis/blob/main/processed.cleveland.data

6. VA Long Beach Processed Datasets

  • File Name: processed.va.data
  • File Type: Data file (.data, comma-separated values)
  • Description: Includes cleaned and transformed versions of the VA Long Beach dataset used in the study.
  • URL: https://github.com/Jdasanja/masters_thesis/blob/main/processed.va.data

7. Data Transformation Script Cleveland

  • File Name: ML_Algo_4_binary_classification.ipynb
  • File Type: Google Colab Notebook (.ipynb)
  • Description: Custom Python scripts for data preprocessing and feature engineering, including transformations applied to the Cleveland dataset.
  • URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/ML_Algo_4_binary_classification.ipynb

8. Data Transformation Script VA Long Beach

  • File Name: ML_Algo_4_bin_classification_va_longbeach.ipynb
  • File Type: Google Colab Notebook (.ipynb)
  • Description: Custom Python scripts for data preprocessing and feature engineering, including transformations applied to the VA Long Beach dataset.
  • URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/ML_Algo_4_bin_classification_va_longbeach.ipynb

9. ASCVD Risk Score Implementation Cleveland

  • File Name: ACSVD_calculation_of_Cleveland.ipynb
  • File Type: Jupyter Notebook (.ipynb)
  • Description: Python notebook implementing the ASCVD Risk Calculator for the Cleveland dataset.
  • URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/ACSVD_calculation_of_Cleveland.ipynb

10. ASCVD Risk Score Implementation VA Long Beach

  • File Name: ACSDV_Calculation_4_va_longbeach.ipynb
  • File Type: Jupyter Notebook (.ipynb)
  • Description: Python notebook implementing the ASCVD Risk Calculator for the VA Long Beach dataset.
  • URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/ACSDV_Calculation_4_va_longbeach.ipynb

11. A Note on Technical Specifications

  • File Name: A Note on Technical Specifications.pdf
  • File Type: PDF
  • Description: PDF that provides an overview of the project’s development environment, data sources, processing methods, file formats, version control, and external tools used to ensure reproducibility and transparency.
  • URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/A%20Note%20on%20Technical%20Specifications.pdf

12. Data Dictionary

  • File Name: Data Dictionary.pdf
  • File Type: PDF
  • Description: PDF that outlines key variables, transformations, critical functions, and classifiers used in the project, providing detailed descriptions to ensure clarity and reproducibility.
  • URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/Data%20Dictionary.pdf

13. Digital References

  • File Name: Digital References.pdf
  • File Type: PDF
  • Description: PDF that provides detailed citations for all software, tools, datasets, and external resources used in the project, ensuring transparency and enabling reproducibility.
  • URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/Digital%20References.pdf

14. Data Management Plan

  • File Name: Data Management Plan Overview.pdf
  • File Type: PDF
  • Description: Comprehensive plan outlining data handling, storage, and ethical considerations.
  • URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/Data%20Management%20Plan%20Overview.pdf

Discussion & Findings

Discussion

Key findings:

My project leveraged the Cleveland and VA Long Beach datasets from the “Heart Disease” database donated to the UCI Machine Learning Repository to explore binary classification of heart disease presence using the available demographic and clinical features. Through exploratory data analysis (EDA), data cleaning, transformation experiments, and model evaluation, several critical insights emerged. Transformation 2, which included logarithmic transformations for skewed features, a squared transformation for maximum heart rate, and the creation of a combined feature (Oldpeak and Slope), was identified as the most effective preprocessing strategy. This approach enhanced feature stability and predictive accuracy on the Cleveland dataset and was subsequently applied to the VA Long Beach dataset to assess regional generalizability.

The Random Forest Classifier with Transformation 2 consistently outperformed other models in prediction accuracy and robustness across multiple train-test splits. On the Cleveland dataset, the Random Forest Classifier achieved an accuracy of 83.33%, with balanced precision (85.56%) and recall (83.33%), supported by consistent cross-validation performance. Applied to the VA Long Beach dataset, the unoptimized Random Forest Classifier achieved an accuracy of 82.5%, with a precision of 80% and a recall of 40% for the minority class. Unlike its optimized counterpart, this configuration avoided neglecting minority-class predictions entirely, striking a better balance between the classes.

While the Random Forest Classifier outperformed the ASCVD risk score on average across both datasets, achieving higher accuracy and F1 scores, it did not resolve variability in female subgroup predictions. On the Cleveland dataset, the ASCVD score achieved an AUC-ROC of 78.08% for females compared to 68.87% for males, and the Random Forest model exhibited similar imbalances. Female recall for the ASCVD score was higher (88.89%), but precision was considerably lower (48.88%), resulting in a less balanced F1 score (62.87%) compared to males (72.73%). On the VA Long Beach dataset, the ASCVD score achieved strong recall (93.75%) but limited discriminatory power, with an AUC-ROC of 60.98%. The lack of sufficient female representation in the VA Long Beach dataset precluded any sex-specific analysis, underscoring the importance of diverse and balanced datasets.

Gender-based analysis on the Cleveland dataset revealed clear disparities in machine learning model performance with the Random Forest Classifier. While the model achieved high overall accuracy and precision, it demonstrated considerably lower recall for female patients with heart disease (0.67). This variability indicates that the model, while improving overall performance compared to the ASCVD score, did not effectively address gender-specific prediction inconsistencies. Features such as Oldpeak and Slope emerged as strong indicators of heart disease presence, whereas weaker relationships were observed for cholesterol and fasting blood sugar, suggesting limited predictive value for these variables within these datasets.

In conclusion, the Random Forest Classifier demonstrated superior average performance compared to the ASCVD risk score but fell short of addressing variability in female subgroup predictions. These findings highlight the importance of future work to explore gender-specific adjustments and strategies for achieving equitable performance across demographic groups, while also emphasizing the need for diverse datasets to enhance generalizability.

Regional generalizability

VA Long Beach models with optimized parameters highlight a few possible implications. Parameter optimization of the Random Forest Classifier, while improving overall accuracy, had significant drawbacks when applied to a dataset with an imbalanced class distribution. The optimized model prioritized the majority class, resulting in a complete inability to predict any instances of the minority class (those without heart disease). This led to a recall of 0% for the minority class, effectively excluding it from the model’s predictions. Although overall accuracy increased, this came at the cost of fairness and utility, as the model failed to capture critical instances of the minority class. These findings show that, for the Random Forest Classifier, parameter optimization compromised the model’s balance, favoring majority-class performance while neglecting the minority class.

            The comparison between the Cleveland and VA Long Beach models without optimized parameters further highlights two possible implications: optimized parameters can overgeneralize for the majority class, or they may reduce the model’s ability to classify effectively across regions. These potential drawbacks are particularly obscured by the small number of patients in the minority class (those without heart disease) in the VA Long Beach dataset, which makes it challenging to draw definitive conclusions.

Without optimization, the Random Forest Classifier achieved balanced performance for the minority class on the Cleveland dataset and maintained reasonable performance on the VA Long Beach dataset. With optimized parameters, however, the VA Long Beach model completely failed to predict any instances of the minority class, suggesting that the optimization may have tailored the model too closely to the majority class, resulting in overgeneralization.

These outcomes suggest that the small sample size of the minority class amplifies the difficulty in determining whether the reduced performance is due to overgeneralization for the majority class or a lack of adaptability across regions. This underscores the importance of balancing datasets and carefully evaluating the impact of optimization on both regional performance and minority class predictions.

Sex-specific performance summary

The model assessing the male population achieved an accuracy of 78.05% and an F1-score of 78.31%. For class 0 (no heart disease), it recorded a precision of 68%, recall of 81%, and an F1-score of 74%. For class 1 (heart disease), the model demonstrated a higher precision of 86% but a lower recall of 76%, resulting in an F1-score of 81%. The overall weighted averages for precision, recall, and F1-score were 80%, 78%, and 78%, respectively, reflecting moderately balanced performance on the male population. In contrast, the model applied to the female population exhibited higher overall accuracy (94.74%) and F1-score (94%). The precision for class 0 was higher (94% compared to 67% in the male model), and the recall for class 0 was also significantly higher, at 100%, surpassing the male model’s recall of 81%. For class 1, the female model achieved perfect precision (100%) but a lower recall of 67%, compared to the male model’s recall of 76%.

These results highlight distinct performance differences between the male and female subpopulations. The model demonstrated better overall accuracy and precision for the female population, but its ability to detect heart disease cases (class 1) was slightly lower in recall compared to the male population. Conversely, the male model exhibited a more balanced trade-off between precision and recall for class 1 but at the cost of a higher false positive rate. These findings underscore the need for additional tuning to ensure the model performs consistently across gender-specific groups, avoiding potential biases in prediction outcomes.

Data limitations

            Due to the limitations of the Cleveland data, with the male test set comprising 41 samples and the female test set comprising only 19 samples, I am unable to perform cross-validation for these subsets. This restriction limits the ability to thoroughly assess model generalizability and robustness across gender-specific groups. As a result, the interpretation of the results must be approached with caution, as the insights drawn may not fully capture the broader performance trends for male and female populations.

The VA Long Beach dataset had significant limitations that must be addressed. One key issue is the severe gender imbalance, with only approximately 6 females included in the dataset. This small number makes it impossible to test the model by gender, as there are insufficient entries to draw any meaningful conclusions for female patients.

Additionally, the dataset suffers from a pronounced class imbalance, with most instances representing individuals with heart disease. This imbalance introduces challenges for the model, as there is limited data available to effectively train the minority class (individuals without heart disease). As a result, there is a strong expectation of underfitting for the minority class, where the model may fail to accurately predict or generalize for these cases.

Another critical issue is the missing data. Columns such as “ca” (99% missing values) and “thal” (83% missing values) have so few valid entries that they needed to be removed from the analysis.

To ensure a fair comparison when evaluating models, I created modified Cleveland models by removing the “ca” and “thal” columns, aligning the feature set with the limitations of this dataset. This allowed a slightly more balanced evaluation when comparing the models trained on the Cleveland dataset to the Long Beach models, as sketched below.
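
For illustration, the column alignment described above might look like the following; `va_df` and `cleveland_df` are assumed to be the already-cleaned dataframes, and the helper name is hypothetical.

```python
import pandas as pd

SPARSE_COLS = ["ca", "thal"]  # 99% and 83% missing in the VA Long Beach data

def align_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the sparse columns so both datasets share one feature set."""
    return df.drop(columns=SPARSE_COLS, errors="ignore")

# Applied to both datasets before modelling (dataframes assumed already loaded):
# va_aligned = align_features(va_df)
# cleveland_modified = align_features(cleveland_df)
```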

Lastly, the ASCVD (Atherosclerotic Cardiovascular Disease) Risk Calculator was applied to both the Cleveland and VA Long Beach datasets to estimate the 10-year cardiovascular risk for individual patients. However, due to missing or unavailable data, proxies were used to ensure compatibility with the ASCVD model. HDL cholesterol was assigned a placeholder value of 50, and ethnicity was uniformly set to non-Black (isBlack = False) because explicit data on this characteristic was not available. Hypertension status was derived from systolic blood pressure (SBP) values, with readings of 130 or higher classified as hypertensive. Diabetes status was inferred from fasting blood sugar (fbs), with values over 120 converted into a boolean indicator (diabetic = True). Smoking status was approximated using exercise-induced angina (exang), with the absence of angina interpreted as non-smoking status (False).

While these proxies enabled the datasets to be used for ASCVD risk estimation, they introduced approximations that deviate from the precise inputs required by the ASCVD model. Consequently, the calculated risk scores are not pure ASCVD scores, but rather adapted estimates. This reliance on proxies adds a layer of uncertainty to the analysis and necessitates cautious interpretation of the results.
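
As a sketch of how these proxies might be assembled per patient, the mapping below mirrors the rules described above. The dictionary keys are illustrative only and are not the argument names of the ASCVD package used in the project; the row is assumed to use the UCI column names.

```python
import pandas as pd

HDL_PLACEHOLDER = 50  # mg/dL; fixed value used where HDL is not recorded

def ascvd_proxy_inputs(row: pd.Series) -> dict:
    """Map one UCI heart-disease record onto the proxy inputs described above."""
    return {
        "age": int(row["age"]),
        "is_male": bool(row["sex"]),                    # 1 = male in the UCI coding
        "is_black": False,                              # ethnicity unavailable, uniform proxy
        "total_cholesterol": float(row["chol"]),
        "hdl_cholesterol": HDL_PLACEHOLDER,
        "systolic_bp": float(row["trestbps"]),
        "hypertensive": float(row["trestbps"]) >= 130,  # SBP >= 130 treated as hypertensive
        "diabetic": bool(row["fbs"]),                   # fbs flags fasting sugar > 120 mg/dL
        "smoker": bool(row["exang"]),                   # exercise-induced angina as smoking proxy
    }

# Example (dataframe assumed already cleaned):
# proxy_inputs = cleaned_df.apply(lambda r: pd.Series(ascvd_proxy_inputs(r)), axis=1)
```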

Future directions

My research highlighted several areas for improvement and exploration in future research. After evaluating the datasets and outcomes, it became evident that a more effective approach might involve transitioning from binary classification to multilabel classification. Specifically, this approach could predict varying levels of heart disease severity rather than focusing solely on the binary presence or absence of the condition. This shift in focus is motivated by the observation that, aside from the Cleveland dataset, the other cities’ datasets predominantly consist of patients with some degree of heart disease. The relative scarcity of individuals without heart disease in these datasets diminishes the utility of binary classification and underscores the potential for a more nuanced multilabel approach.

Additionally, my research revealed a significant gender imbalance in the datasets, with fewer females represented compared to males. This raises critical questions about whether this disparity reflects sampling bias or is indicative of real-world clinical trends. Considering that cardiovascular disease is a leading cause of death among women, it is essential to investigate why females may be underrepresented in these datasets. Future research should aim to address this imbalance, ensuring equitable representation to enhance the generalizability and fairness of predictive models.

Incorporating these changes, future studies could develop more accurate and reproducible models that account for demographic disparities and focus on the varying levels of heart disease severity. This approach would not only provide richer clinical insights but also foster more inclusive and accurate models capable of addressing the diverse needs of populations affected by cardiovascular disease.

ASCVD (Atherosclerotic Cardiovascular Disease) Risk Score (Cleveland And VA Long Beach)

Atherosclerotic Cardiovascular Disease Risk Calculation on Cleveland Dataset

The 2013 ASCVD (Atherosclerotic Cardiovascular Disease) risk score was evaluated on the Cleveland dataset, yielding key performance metrics. The score achieved an accuracy of 69.64%, indicating that approximately 70% of predictions matched actual outcomes. Precision was 63.58%, reflecting the proportion of correctly identified positive cases among all predicted positives, while recall was 79.14%, demonstrating the model’s ability to capture most actual positive cases. The F1 score, balancing precision and recall, was 70.51%, signifying a moderate trade-off between the two. Additionally, the AUC-ROC score of 70.36% suggests a fair level of discriminatory ability between positive and negative cases. These results indicate that the 2013 ASCVD risk score provides reasonable predictive performance for the Cleveland dataset, with notable strengths in recall but areas for improvement in precision and overall accuracy.
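
For reference, these metrics can be reproduced with scikit-learn. The sketch below assumes `y_true` holds the observed heart disease labels, `y_pred` is the ASCVD prediction binarized against a chosen risk cutoff (the cutoff itself is not specified here), and `y_risk` is the continuous 10-year risk estimate used for the AUC-ROC.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def summarize_scores(y_true, y_pred, y_risk) -> dict:
    """Compute the five metrics reported for the ASCVD risk score."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc_roc": roc_auc_score(y_true, y_risk),
    }
```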

Performance Comparison

The performance comparison between male and female subgroups reveals notable differences, particularly in the variability of scores for the female group. For females, the model achieved an accuracy of 73.19%, slightly higher than the accuracy of 69.67% observed for males. However, the precision for females was lower at 48.88% compared to 68.75% for males, indicating that the model was less reliable in identifying true positives among predicted positives for the female subgroup. Conversely, the recall for females was significantly higher at 88.89%, compared to 77.19% for males, suggesting that the model was more effective in identifying actual positive cases among females.

The F1 score, which balances precision and recall, highlights the disparity between the subgroups. Females achieved an F1 score of 62.87%, reflecting the impact of lower precision despite high recall, whereas males had a more balanced F1 score of 72.73%. The AUC-ROC scores further emphasize this difference, with females achieving 78.08% compared to 68.87% for males, indicating better overall discrimination for the female subgroup.

The wide variability in the scores for the female group, particularly the sharp contrast between high recall and low precision, underscores potential challenges in the model’s consistency when applied to different demographic subgroups. This variability suggests that further optimization or subgroup-specific adjustments may be necessary to ensure more balanced and equitable performance across genders.

Atherosclerotic Cardiovascular Disease Risk Calculation on VA Long Beach Dataset

The 2013 ASCVD (Atherosclerotic Cardiovascular Disease) risk score was evaluated on the VA Long Beach dataset, achieving an accuracy of 76.82%, indicating that over three-quarters of the predictions aligned with the actual outcomes. The precision of 78.95% highlights the model’s reliability in identifying true positives among predicted positives, while the recall of 93.75% demonstrates its ability to effectively detect the majority of actual positive cases. The F1 score, balancing precision and recall, was 85.71%, reflecting robust overall performance. However, the AUC-ROC score of 60.98% suggests limited discriminatory power between positive and negative classes, indicating room for improvement in distinguishing cases.

It is important to note that due to the small number of females in the VA Long Beach dataset, a sex-specific analysis was not feasible. This limitation restricts the ability to assess the model’s performance across different demographic subgroups and emphasizes the need for more diverse and balanced datasets in future analyses.

Male Vs. Female

Are the Best-Performing Models More Effective for the Male or Female Population?

The highest-performing models were identified in Experiment 2, showcasing robust predictive capabilities. The Random Forest classifier emerged as the top performer, achieving a mean accuracy of 88.33%, a mean precision of 91.79%, a mean recall of 82.00%, and a mean F1-score of 83.67%. This model was tested separately on male and female populations to analyze gender-specific performance variations.

Random Forest Classifier test with optimized parameters for the female population: Transformation 2.

The model achieved an overall accuracy of 94.74% on the female test set, with a macro-average F1-score of 83% and a weighted average F1-score of 94%. It demonstrated strong performance in identifying individuals without heart disease (class 0), achieving a precision of 94%, a recall of 100%, and an F1-score of 97%. However, the performance for individuals with heart disease (class 1) was considerably weaker, with a recall of only 67%, indicating that the model struggled to identify all positive cases accurately. While the precision for class 1 was 100%, the low recall highlights a significant imbalance in the model’s ability to predict outcomes for this group.

Due to the limited sample size of the female test set (19 samples) and the inability to perform cross-validation, it is not possible to directly assess overfitting. However, the high accuracy and F1-scores, combined with the disproportionately low recall for class 1, suggest that the model may be overfitting to the test data, particularly favoring class 0. These results underscore the need for further data and evaluation to ensure the model’s generalizability and balanced performance across all classes.

Random Forest Classifier test with optimized parameters for the male population: Transformation 2.

The model assessing the male population achieved an accuracy of 78.05% and an F1-score of 78.31%. For class 0 (no heart disease), it recorded a precision of 67%, recall of 88%, and an F1-score of 76%. For class 1 (heart disease), the model demonstrated a higher precision of 90% but a lower recall of 72%, resulting in an F1-score of 80%. The overall weighted averages for precision, recall, and F1-score were 81%, 78%, and 78%, respectively, reflecting moderately balanced performance on the male population.


Comparison of Performance.

Compared to the male model described above (accuracy of 78.05% and F1-score of 78.31%), the model applied to the female population exhibited higher overall accuracy (94.74%) and F1-score (94%). The precision for class 0 was higher (94% compared to 67% in the male model), and the recall for class 0 reached 100%, surpassing the male model’s recall of 88%. For class 1, the female model achieved perfect precision (100%) but a lower recall of 67%, compared to the male model’s recall of 72%.

These results highlight distinct performance differences between the male and female subpopulations. The model demonstrated better overall accuracy and precision for the female population, but its ability to detect heart disease cases (class 1) was slightly lower in recall compared to the male population. Conversely, the male model exhibited a more balanced trade-off between precision and recall for class 1 but at the cost of a higher false positive rate. These findings underscore the need for additional tuning to ensure the model performs consistently across gender-specific groups, avoiding potential biases in prediction outcomes.

Transformation 3: Cleveland Only

Optimizing Feature Engineering

In this third experiment, the focus is on enhancing the feature engineering component to improve model performance through targeted transformations. The following transformations were applied: (1) a logarithmic transformation for Resting Blood Pressure (trestbps) and Cholesterol (chol) to reduce skewness and stabilize variance; (2) a squared transformation of Maximum Heart Rate (thalach), which emphasizes the importance of higher values in this feature; and (3) the creation of a combined feature using Oldpeak and Slope, aimed at capturing their interaction and improving predictive power. Additionally, gender-based feature engineering was introduced to account for potential differences across genders.

Gender-Based Feature Engineering:

This process introduced six gender-specific features to capture gender-related patterns within the dataset, enhancing the model’s ability to discern variations between male and female groups. The first feature, `thalach_norm_gender`, normalizes the maximum heart rate achieved (`thalach`) within each gender group by dividing individual values by the mean `thalach` for that group. Similarly, the second feature, `chol_norm_gender`, normalizes serum cholesterol (`chol`) values by dividing them by the mean cholesterol value for the respective gender.

Additionally, gender-specific indicators were introduced to flag deviations from the typical range within each gender group. The feature `thalach_above_median_gender` is a binary indicator (0 or 1) that signals whether a person’s `thalach` exceeds the median `thalach` value for their gender. In parallel, `chol_above_median_gender` functions as a binary indicator, identifying individuals whose `chol` values surpass the median cholesterol level within their gender group. Together, these features provide a nuanced representation of gender-specific patterns, potentially improving the predictive performance of models trained on the dataset.
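
A minimal pandas sketch of these four features is shown below; it assumes the dataframe uses the UCI column names `sex`, `thalach`, and `chol`, and the helper name is hypothetical.

```python
import pandas as pd

def add_gender_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the gender-normalized and above-median indicator features."""
    out = df.copy()
    grouped = out.groupby("sex")

    # Normalize thalach and chol by the mean value within each sex group
    out["thalach_norm_gender"] = out["thalach"] / grouped["thalach"].transform("mean")
    out["chol_norm_gender"] = out["chol"] / grouped["chol"].transform("mean")

    # Binary flags: does the value exceed the within-sex median?
    out["thalach_above_median_gender"] = (
        out["thalach"] > grouped["thalach"].transform("median")
    ).astype(int)
    out["chol_above_median_gender"] = (
        out["chol"] > grouped["chol"].transform("median")
    ).astype(int)
    return out
```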

Random Forest Classifier.

The Random Forest classifier, implemented without parameter optimization, demonstrated an accuracy of 83.33%, indicating that the majority of predictions were correct. The precision of 83.33% suggests the model was effective in minimizing false positives, while the recall, also at 83.33%, indicates a balanced ability to identify true positives. The F1-score of 83.28%, the harmonic mean of precision and recall, reflects strong overall model performance. The confusion matrix shows the model correctly classified 28 true negatives and 22 true positives, with 4 false positives and 6 false negatives.

Cross-validation results further evaluated the model’s consistency, with a mean accuracy of 78.33%, precision of 75.19%, recall of 72.29%, and F1-score of 72.15%.

Grid Search Overview.

The grid search process evaluated 80 different hyperparameter combinations across three folds, resulting in 240 total model fits. The best parameters identified were a `max_depth` of 3, `n_estimators` of 50, and a `random_state` of 2024. With these hyperparameters, the model achieved a best cross-validated F1-weighted score of 0.8488, indicating improved balance between precision and recall compared to the default settings. This highlights the positive impact of hyperparameter tuning on the Random Forest model’s performance.
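
A grid search of this kind could be set up as sketched below. The grid shown is illustrative (the actual search covered 80 combinations), but the scoring metric, three-fold cross-validation, and the best parameters reported above are taken from the results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the actual search covered 80 parameter combinations.
param_grid = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [3, 5, 7, None],
    "random_state": [2024],
}

rf_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    scoring="f1_weighted",  # metric behind the reported best score of 0.8488
    cv=3,                   # three folds, as described above
)
# rf_search.fit(X_train, y_train)
# rf_search.best_params_  -> e.g. {"max_depth": 3, "n_estimators": 50, "random_state": 2024}
```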

Application of Optimized Parameters for Random Forest Classifier.

The application of optimized parameters to the Random Forest classifier resulted in an accuracy of 83.33%, consistent with the unoptimized model. The precision improved slightly to 84.38%, while recall remained at 83.33%. The F1-score of 83.18% reflects balanced performance but shows a minor decrease compared to previous results. The confusion matrix indicates 29 true negatives, 21 true positives, 3 false positives, and 7 false negatives.

Cross-validation results reveal a mean accuracy of 78.33%, mean precision of 74.17%, mean recall of 76.14%, and mean F1-score of 72.42%. While these results are reasonable, there is a noticeable overall drop in performance metrics compared to prior experiments. This suggests that the optimized parameters, though improving precision, might have introduced trade-offs that slightly reduced the model’s ability to generalize consistently across folds. Further fine-tuning or exploration of different parameter combinations may be needed to enhance performance.

XGBoost Classifier.

The XGBoost classifier, run without optimized parameters, achieved an accuracy of 86.67%, outperforming the second experiment (Random Forest with optimized parameters). The precision for class 0 was 85% and for class 1 was 88%, giving a macro-average precision of 87%. Recall values were 91% for class 0 and 82% for class 1, resulting in a macro-average recall of 86%. The F1-score was 88% for class 0 and 85% for class 1, yielding an overall F1-score of 87%.

Cross-validation results show a mean accuracy of 78.33%, a mean precision of 78.33%, a mean recall of 70.43%, and a mean F1-score of 73%. While the cross-validated metrics are similar to those of the second experiment, the XGBoost classifier demonstrates slightly better performance when applied to the test set. These results highlight XGBoost’s potential effectiveness even without parameter optimization, though further tuning could still improve its performance.

Grid Search Overview.

The grid search process evaluated 108 hyperparameter combinations across three folds, resulting in a total of 324 model fits. The optimal parameters identified were a `colsample_bytree` of 1.0, a `learning_rate` of 0.2, a `max_depth` of 4, `n_estimators` of 150, and a `subsample` of 0.8. Using these parameters, the model achieved a best cross-validation accuracy of 82.28%. These results highlight the effectiveness of hyperparameter tuning in enhancing the model’s performance and demonstrate a solid balance between model complexity and predictive accuracy.
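
An equivalent sketch for the XGBoost search is shown below; the grid values are assumptions chosen to yield 108 combinations, arranged around the best parameters reported above.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative grid: 2 * 3 * 3 * 3 * 2 = 108 combinations, matching the count reported.
param_grid = {
    "colsample_bytree": [0.8, 1.0],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 4, 6],
    "n_estimators": [100, 150, 200],
    "subsample": [0.8, 1.0],
}

xgb_search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="accuracy",  # the best score reported above is a cross-validation accuracy
    cv=3,
)
# xgb_search.fit(X_train, y_train)
```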

Application of Optimized Parameters for XGBoost Classifier

The XGBoost classifier achieved an accuracy of 86.67% on the test data, demonstrating strong performance in predicting outcomes. The precision was 85% for class 0 and 88% for class 1, while the recall was 91% for class 0 and 82% for class 1. The overall F1-score was 87%, reflecting a balanced performance between precision and recall.

Cross-validation results, however, show a slight decline in performance. The mean accuracy across folds was 76.67%, with a mean precision of 75.16%, a mean recall of 70.43%, and a mean F1-score of 71.55%. This divergence suggests that the model is achieving better performance on the test data at the expense of slightly reduced generalization during cross-validation. This pattern implies a degree of overfitting, as the model may be tailored too closely to the training and test sets, potentially limiting its ability to generalize effectively to new, unseen data.

Ensemble Method.

The ensemble method produced strong results on the test set, with an accuracy of 85.00%, an F1-score of 84.91%, a precision of 85.26%, and a recall of 85.00%. The classification report indicates consistent and balanced performance across both classes, with macro and weighted averages of 85%, showcasing the model’s reliability on the test data.
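
One common way to implement the weighted averaging described here is scikit-learn's soft-voting ensemble, sketched below. The three estimators match those named in the report; the weights are hypothetical, as the exact values are not stated.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, max_depth=3, random_state=2024)),
        ("xgb", XGBClassifier(eval_metric="logloss")),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",      # average predicted probabilities across models
    weights=[2, 2, 1],  # hypothetical weighting favouring the tree-based models
)
# ensemble.fit(X_train, y_train)
# test_predictions = ensemble.predict(X_test)
```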

However, the cross-validation results showed a significant drop in all metrics compared to the test set and were notably lower than those observed in previous experiments. The mean accuracy was 76.67%, with a mean precision of 78.93%, a mean recall of 76.67%, and a mean F1-score of 76.09%.

While the test set performance was better than previous experiments, this substantial drop in cross-validation scores suggests overfitting, as the model performs better on the test set at the expense of generalizability across unseen data. This highlights a potential imbalance that needs addressing to improve the model’s robustness.