List of Tables
Table 1
Basic Descriptives of the Cleveland training data

Table 2
Variable descriptives based on heart disease presence

Table 3
Correlation Statistic between individual variables and heart disease presence

Table 4
Model Results

Figure 1

Data Management Plan Overview
Our data management plan ensures the organized, secure, and ethical handling of all project data. We will acquire datasets from the UCI Machine Learning Repository and follow its terms of use. The data will be stored securely on a personal computer, and access will be restricted to authorized project members. The datasets are already anonymized, which protects individual privacy. We will document all data processing steps, including cleaning, transformation, and analysis, to ensure transparency and reproducibility. Upon project completion, we will submit our data and final project documentation to the CUNY Graduate Center Library’s digital repository, following its guidelines for online digital deposits. This submission will ensure long-term preservation and accessibility of our work. For detailed guidance on data management and submission, we will refer to the resources available on the library’s website.
Digital References
Sex-Specific and Regional Analysis of Heart Disease Prediction Using Machine Learning Algorithms: Insights from the UCI Irvine Public Heart Disease Datasets (Cleveland and Long Beach)
Jonathan Asanjarani
City University of New York Graduate Center
DATA 79000: Capstone Project and Thesis
Advisor: Johanna Devaney
November 25th, 2024
Software and Tools Used
- Google Colab
- Description: Cloud-based Python environment with GPU access for accelerated computation.
- URL: https://colab.research.google.com
- Accessed: November 2024
- Python
- Version: 3.8
- Description: High-level programming language used for data analysis, modeling, and visualization.
- URL: https://www.python.org
- Accessed: November 2024
- Scikit-learn
- Version: 1.2.0
- Description: Library for machine learning algorithms, preprocessing, and evaluation.
- URL: https://scikit-learn.org/stable/
- Accessed: November 2024
- XGBoost
- Version: 1.6.0
- Description: Gradient boosting library optimized for supervised learning tasks.
- URL: https://xgboost.ai
- Accessed: November 2024
- Pandas
- Version: 1.4.3
- Description: Data manipulation and analysis library for structured data.
- URL: https://pandas.pydata.org
- Accessed: November 2024
- NumPy
- Version: 1.23.0
- Description: Library for numerical computations and array processing.
- URL: https://numpy.org
- Accessed: November 2024
- Matplotlib
- Version: 3.6.0
- Description: Visualization library for static and interactive graphics.
- URL: https://matplotlib.org
- Accessed: November 2024
- Seaborn
- Version: 0.12.2
- Description: Statistical data visualization library built on Matplotlib.
- URL: https://seaborn.pydata.org
- Accessed: November 2024
- ASCVD Risk Calculator
- Version: GitHub Repository
- Description: Python implementation of the ASCVD Risk Calculator for cardiovascular risk prediction.
- URL: https://github.com/brandones/ascvd/tree/master
- Accessed: November 2024
Datasets
- Cleveland Heart Disease Dataset
- Source: UCI Machine Learning Repository
- Description: Dataset used for binary classification of heart disease presence.
- URL: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
- Accessed: November 2024
- VA Long Beach Heart Disease Dataset
- Source: UCI Machine Learning Repository
- Description: Dataset used for regional generalization of machine learning models.
- URL: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
- Accessed: November 2024
Guidelines and Methodological References
- Mueller, Andreas C., & Guido, Sarah
- Title: Introduction to Machine Learning with Python
- Publisher: O’Reilly Media
- Publication Date: 2016
- URL: https://github.com/dlsucomet/MLResources/blob/master/books/[ML]%20Introduction%20to%20Machine%20Learning%20with%20Python%20(2017).pdf
- Software Sustainability Institute
- Title: How to Cite and Describe Software
- URL: https://www.software.ac.uk/how-cite-and-describe-software
- Accessed: November 2024
Additional Resources for Citing Software and Data
- Digital Curation Centre
- Title: How to Cite Datasets and Link to Publications
- Authors: Ball, A., & Duke, M.
- Publisher: Digital Curation Centre
- Publication Date: 2011
- URL: http://www.dcc.ac.uk/resources/how-guides/cite-datasets
- Accessed: November 2024
- DataCite
- Title: Why Cite Data?
- URL: https://www.datacite.org/
- Accessed: November 2024
A Note on Technical Specifications
This project used Google Colab as the development environment. Google Colab is a cloud-based Python platform providing access to GPUs for accelerated computation. Python (version 3.8) was used in the Google Colab environment, together with the libraries Scikit-learn, XGBoost, Pandas, NumPy, Matplotlib, and Seaborn, as detailed in the References section. The data came from the “Heart Disease” database in the UCI Machine Learning Repository. Two datasets from this database were used: Cleveland and VA Long Beach. Data cleaning and preprocessing were conducted within Google Colab notebooks using Python-based libraries, with datasets and code files stored in CSV, Python (.py), and Jupyter Notebook (.ipynb) formats.
Version control was maintained through a GitHub repository that hosted the project’s source code, processed datasets, and supplementary materials. The repository, accessible at [https://github.com/Jdasanja/masters_thesis_final], was updated regularly with a detailed commit history to ensure reproducibility. External tools included the ASCVD Risk Calculator, implemented via an open-source Python package available at [https://github.com/brandones/ascvd/tree/master].
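As a rough illustration of the loading step described above, the following sketch reads a file in the layout of the UCI "processed" heart disease files and derives a binary target. The function name `load_heart_data`, the `target` column, and the two sample rows are hypothetical and are not taken from the project notebooks. The column order and the "?" missing-value marker follow the UCI documentation for these files.

```python
import io

import pandas as pd

# Column order documented for the UCI "processed" heart disease files.
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]


def load_heart_data(source):
    """Read a processed UCI heart disease file and add a binary target."""
    # The files have no header row and mark missing values with "?".
    df = pd.read_csv(source, header=None, names=COLUMNS, na_values="?")
    # "num" ranges from 0 (no disease) to 4; any value above 0 counts as disease.
    df["target"] = (df["num"] > 0).astype(int)
    return df


# Made-up rows in the file's format, used here instead of a real download.
sample = io.StringIO(
    "60.0,1.0,2.0,140.0,240.0,0.0,0.0,155.0,0.0,1.0,1.0,0.0,3.0,0\n"
    "58.0,0.0,4.0,130.0,?,0.0,0.0,140.0,1.0,1.2,2.0,?,7.0,3\n"
)
df = load_heart_data(sample)
print(df[["age", "chol", "ca", "num", "target"]])
```

In practice, `source` would be a path to `processed.cleveland.data` or `processed.va.data`. How the project notebooks treat the resulting missing values is not shown here.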
Data Dictionary
Significant Variables
- Age
o Type: Integer
o Description: Patient’s age in years.
- Sex
o Type: Binary (0 for Female, 1 for Male)
o Description: Biological sex of the patient.
- Cp (Chest Pain Type)
o Type: Categorical (0–4)
o Description: Chest pain severity levels, where higher values indicate more severe pain.
- Trestbps (Resting Blood Pressure)
o Type: Continuous (mmHg)
o Description: Resting blood pressure in millimeters of mercury. Transformed using logarithmic scaling to reduce skewness.
- Chol (Serum Cholesterol)
o Type: Continuous (mg/dL)
o Description: Serum cholesterol level in milligrams per deciliter. Transformed using logarithmic scaling to reduce skewness.
- Fbs (Fasting Blood Sugar)
o Type: Binary (0 for <120 mg/dL, 1 for ≥120 mg/dL)
o Description: Indicator of whether fasting blood sugar exceeds 120 mg/dL.
- Restecg (Resting ECG Results)
o Type: Categorical (0–2)
o Description: Results of resting electrocardiographic tests (e.g., normal, ST-T wave abnormality, left ventricular hypertrophy).
- Thalach (Maximum Heart Rate Achieved)
o Type: Continuous (bpm)
o Description: Maximum heart rate achieved during exercise. Transformed using a squared transformation to emphasize non-linear relationships.
- Exang (Exercise-Induced Angina)
o Type: Binary (0 for No, 1 for Yes)
o Description: Presence of exercise-induced angina (chest pain).
- Oldpeak
o Type: Continuous
o Description: ST depression induced by exercise relative to rest (ECG measure). Transformed using logarithmic scaling to reduce skewness.
- Slope (ST Segment Slope)
o Type: Categorical (1 for Upsloping, 2 for Flat, 3 for Downsloping)
o Description: The slope of the peak exercise ST segment.
- Ca (Number of Major Vessels)
o Type: Integer (0–3)
o Description: Number of major vessels (0–3) colored by fluoroscopy. Transformed using one-hot encoding.
- Thal (Thallium Stress Test Results)
o Type: Categorical (3 for Normal, 6 for Fixed Defect, 7 for Reversible Defect)
o Description: Results of thallium stress tests. Transformed using one-hot encoding.
- Oldpeak_Slope_Combined
o Type: Continuous
o Description: A derived feature combining Oldpeak (ST depression) and Slope (ECG segment pattern during peak exercise).
- Gender-Based Interaction Terms
o Type: Continuous
o Description: Interaction features created by multiplying the “Sex” feature with key variables like Chol and Trestbps to account for demographic-specific variations.
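The one-hot encoding noted for Ca and Thal can be sketched with pandas as follows. This is a minimal illustration on hypothetical rows, assuming `pd.get_dummies` as the encoder. The project notebooks may use a different encoder or different column names.

```python
import pandas as pd

# Hypothetical rows using the codes listed in the data dictionary.
df = pd.DataFrame({
    "ca": [0, 3, 1],
    "thal": [3, 7, 6],
})

# Expand each categorical code into its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["ca", "thal"], prefix=["ca", "thal"], dtype=int)
print(encoded.columns.tolist())
```

Because `get_dummies` only creates columns for codes that occur in the data, a second dataset encoded the same way may need its columns realigned (for example with `DataFrame.reindex`) before it is passed to a model trained on the first.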
Critical Functions
- Log Transformer
o Purpose: Reduces skewness in variables like Chol, Trestbps, and Oldpeak.
o Inputs: Skewed numerical features.
o Outputs: Log-transformed features.
- Squared Transformation
o Purpose: Captures non-linear relationships in features like Thalach.
o Inputs: Thalach feature.
o Outputs: Squared-transformed feature.
- Combine Oldpeak and Slope
o Purpose: Creates a new feature to enhance model accuracy.
o Inputs: Oldpeak and Slope features.
o Outputs: Combined feature reflecting ST segment depression and slope interaction.
- Gender-Based Interaction Creation
o Purpose: Generates gender-specific interaction terms to capture the influence of demographic variations on key features.
o Inputs: Sex feature and numerical features such as Chol and Trestbps.
o Outputs: Interaction features highlighting gender-based relevance.
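The four functions above could look roughly like the sketch below. Several details are assumptions rather than the project's confirmed implementation: `np.log1p` is used for the log transform so that an Oldpeak of 0 stays defined, the Oldpeak and Slope combination is taken to be a simple product, and the function name `engineer_features` and all output column names are hypothetical.

```python
import numpy as np
import pandas as pd


def engineer_features(df):
    """Return a copy of df with the transformed and derived columns added."""
    out = df.copy()
    # Log transform; log1p is an assumed choice so that oldpeak = 0 stays defined.
    for col in ["chol", "trestbps", "oldpeak"]:
        out[f"{col}_log"] = np.log1p(out[col])
    # Squared transformation of maximum heart rate.
    out["thalach_sq"] = out["thalach"] ** 2
    # Assumed combination rule: product of oldpeak and slope.
    out["oldpeak_slope_combined"] = out["oldpeak"] * out["slope"]
    # Interaction terms: sex (0/1) multiplied by selected numeric features.
    for col in ["chol", "trestbps"]:
        out[f"sex_x_{col}"] = out["sex"] * out[col]
    return out


# Hypothetical patients, for illustration only.
patients = pd.DataFrame({
    "sex": [1, 0],
    "chol": [240.0, 200.0],
    "trestbps": [140.0, 120.0],
    "thalach": [150.0, 160.0],
    "oldpeak": [1.5, 0.0],
    "slope": [2, 1],
})
features = engineer_features(patients)
print(features.round(3).T)
```

Since Sex is coded 0 for Female and 1 for Male, each interaction term is zero for female patients and equals the original feature for male patients.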
Classifiers Used
- Random Forest Classifier
o Purpose: Constructs an ensemble of decision trees for binary classification.
o Features: Robust against overfitting, useful for datasets with imbalanced classes.
o Implementation: Optimized using GridSearchCV to select parameters like the number of estimators, maximum depth, and feature importance.
- XGBoost Classifier
o Purpose: Gradient boosting algorithm designed for efficiency and performance in binary classification tasks.
o Features: Focuses on minimizing loss functions with parallelized tree construction.
- Ensemble Method
o Purpose: Combines predictions from Random Forest, XGBoost, and Logistic Regression to improve robustness.
o Features: Weighted averaging of classifiers to leverage strengths of individual models.
- Logistic Regression
o Purpose: Serves as a baseline model to compare linear relationships between features and outcomes.
o Features: Interpretable and effective for datasets with linear separability.
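One way to combine a grid-searched Random Forest with a weighted-averaging ensemble is sketched below with scikit-learn. It rests on several assumptions: the data is synthetic, the parameter grid and the ensemble weights are illustrative, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the example depends on a single library. It does not reproduce the project's actual configuration or results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature matrix and binary target.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Tune the Random Forest over a small, illustrative grid.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Soft voting averages predicted probabilities using the given weights.
ensemble = VotingClassifier(
    estimators=[
        ("rf", search.best_estimator_),
        # Stand-in for XGBoost so the sketch needs only scikit-learn.
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    voting="soft",
    weights=[2, 2, 1],  # assumed weights, not the project's values
)
ensemble.fit(X_train, y_train)

print("Best RF parameters:", search.best_params_)
print("Ensemble test accuracy:", round(ensemble.score(X_test, y_test), 3))
```

With XGBoost installed, `xgboost.XGBClassifier` could replace the gradient boosting stand-in, since it follows the same scikit-learn estimator interface.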
Digital Manifest
Project Components
1. Capstone Report (Print and Digital)
- File Name: Project_Write_up12.30.24.docx.pdf
- File Type: PDF
- Description: Full written report detailing research objectives, methodology, results, and discussions.
- URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/Project_Write_up12.30.24.docx.pdf
2. Exploratory Data Analysis (EDA) Notebook
- File Name: EDA_4_binary_classification.ipynb
- File Type: Google Colab Notebook (.ipynb)
- Description: Python notebook detailing data cleaning, univariate, bivariate, and multivariate analyses, including visualization and statistical tests.
- URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/EDA_4_binary_classification.ipynb
3. Machine Learning Model Implementation for Cleveland
- File Name: ML_Algo_4_binary_classification.ipynb
- File Type: Google Colab Notebook (.ipynb)
- Description: Google Colab Notebook containing code for implementing and evaluating machine learning models (Random Forest, XGBoost, and ensemble methods) using the Cleveland dataset.
- URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/ML_Algo_4_binary_classification.ipynb
4. Machine Learning Model Implementation for VA Long Beach
- File Name: ML_Algo_4_bin_classification_va_longbeach.ipynb
- File Type: Google Colab Notebook (.ipynb)
- Description: Google Colab Notebook containing code for implementing and evaluating machine learning models (Random Forest, XGBoost, and ensemble methods) using the VA Long Beach dataset.
- URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/ML_Algo_4_bin_classification_va_longbeach.ipynb
5. Cleveland Processed Dataset
- File Name: processed.cleveland.data
- File Type: ZIP archive (contains .data files)
- Description: Includes cleaned and transformed versions of the Cleveland dataset used in the study.
- URL: https://github.com/Jdasanja/masters_thesis/blob/main/processed.cleveland.data
6. VA Long Beach Processed Datasets
- File Name: processed.va.data
- File Type: ZIP archive (contains .data files)
- Description: Includes cleaned and transformed versions of the VA Long Beach dataset used in the study.
- URL: https://github.com/Jdasanja/masters_thesis/blob/main/processed.va.data
7. Data Transformation Script Cleveland
- File Name: ML_Algo_4_binary_classification.ipynb
- File Type: Google Colab Notebook (.ipynb)
- Description: Custom Python scripts for data preprocessing and feature engineering, including transformations applied to the Cleveland dataset.
- URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/ML_Algo_4_binary_classification.ipynb
8. Data Transformation Script VA Long Beach
- File Name: ML_Algo_4_bin_classification_va_longbeach.ipynb
- File Type: Google Colab Notebook (.ipynb)
- Description: Custom Python scripts for data preprocessing and feature engineering, including transformations applied to the VA Long Beach dataset.
- URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/ML_Algo_4_bin_classification_va_longbeach.ipynb
9. ASCVD Risk Score Implementation Cleveland
- File Name: ACSVD_calculation_of_Cleveland.ipynb
- File Type: Jupyter Notebook (.ipynb)
- Description: Python notebook implementing the ASCVD Risk Calculator for the Cleveland dataset.
- URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/ACSVD_calculation_of_Cleveland.ipynb
10. ASCVD Risk Score Implementation VA Long Beach
- File Name: ACSDV_Calculation_4_va_longbeach.ipynb
- File Type: Jupyter Notebook (.ipynb)
- Description: Python notebook implementing the ASCVD Risk Calculator for the VA Long Beach dataset.
- URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/ACSDV_Calculation_4_va_longbeach.ipynb
11. A Note on Technical Specifications
- File Name: A Note on Technical Specifications.pdf
- File Type: PDF
- Description: PDF that provides an overview of the project’s development environment, data sources, processing methods, file formats, version control, and external tools used to ensure reproducibility and transparency.
- URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/A%20Note%20on%20Technical%20Specifications.pdf
12. Data Dictionary
- File Name: Data Dictionary.pdf
- File Type: PDF
- Description: PDF that outlines key variables, transformations, critical functions, and classifiers used in the project, providing detailed descriptions to ensure clarity and reproducibility.
- URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/Data%20Dictionary.pdf
13. Digital References
- File Name: Digital References.pdf
- File Type: PDF
- Description: PDF that provides detailed citations for all software, tools, datasets, and external resources used in the project, ensuring transparency and enabling reproducibility.
- URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/Digital%20References.pdf
14. Data Management Plan
- File Name: Data Management Plan Overview.pdf
- File Type: PDF
- Description: Comprehensive plan outlining data handling, storage, and ethical considerations.
- URL: https://github.com/Jdasanja/masters_thesis_final/blob/main/Data%20Management%20Plan%20Overview.pdf