Sex-Specific and Regional Analysis of Heart Disease Prediction Using Machine Learning
Algorithms: Insights from the UCI Irvine Public Heart Disease Datasets (Cleveland and Long
Beach)
Jonathan Asanjarani
City University of New York Graduate Center
DATA 79000: Capstone Project and Thesis
Advisor: Johanna Devaney
Significant Variables
- Age
o Type: Integer
o Description: Patient’s age in years.
- Sex
o Type: Binary (0 for Female, 1 for Male)
o Description: Biological sex of the patient.
- Cp (Chest Pain Type)
o Type: Categorical (0–4)
o Description: Chest pain category (typical angina, atypical angina, non-anginal pain, or asymptomatic in the UCI coding). The codes label pain types rather than a strict severity scale.
- Trestbps (Resting Blood Pressure)
o Type: Continuous (mmHg)
o Description: Resting blood pressure in millimeters of mercury. Transformed using logarithmic scaling to reduce skewness.
- Chol (Serum Cholesterol)
o Type: Continuous (mg/dL)
o Description: Serum cholesterol level in milligrams per deciliter. Transformed using logarithmic scaling to reduce skewness.
- Fbs (Fasting Blood Sugar)
o Type: Binary (0 for ≤120 mg/dL, 1 for >120 mg/dL)
o Description: Indicator of whether fasting blood sugar exceeds 120 mg/dL.
- Restecg (Resting ECG Results)
o Type: Categorical (0–2)
o Description: Results of resting electrocardiographic tests (e.g., normal, ST-T wave abnormality, left ventricular hypertrophy).
- Thalach (Maximum Heart Rate Achieved)
o Type: Continuous (bpm)
o Description: Maximum heart rate achieved during exercise. Transformed by squaring to emphasize non-linear relationships.
- Exang (Exercise-Induced Angina)
o Type: Binary (0 for No, 1 for Yes)
o Description: Presence of exercise-induced angina (chest pain).
- Oldpeak
o Type: Continuous
o Description: ST depression induced by exercise relative to rest (ECG measure). Transformed using logarithmic scaling to reduce skewness.
- Slope (ST Segment Slope)
o Type: Categorical (1 for Upsloping, 2 for Flat, 3 for Downsloping)
o Description: The slope of the peak exercise ST segment.
- Ca (Number of Major Vessels)
o Type: Integer (0–3)
o Description: Number of major vessels colored by fluoroscopy. Transformed using one-hot encoding.
- Thal (Thallium Stress Test Results)
o Type: Categorical (3 for Normal, 6 for Fixed Defect, 7 for Reversible Defect)
o Description: Results of thallium stress tests. Transformed using one-hot encoding.
- Oldpeak_Slope_Combined
o Type: Continuous
o Description: A derived feature combining Oldpeak (ST depression) and Slope (ECG segment pattern during peak exercise).
- Gender-Based Interaction Terms
o Type: Continuous
o Description: Interaction features created by multiplying the “Sex” feature with key variables such as Chol and Trestbps to account for sex-specific variation.
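The one-hot encoding noted for Ca and Thal can be sketched as follows. This is only an illustration under stated assumptions: the lowercase column names `ca` and `thal` and the use of `pandas.get_dummies` are choices made here, and the rows are toy values rather than patient records.

```python
import pandas as pd

# Toy rows; the values follow the codes listed above, not real patient data.
df = pd.DataFrame({
    "ca": [0, 1, 3, 0],
    "thal": [3, 6, 7, 3],
})

# One indicator column per category that actually occurs in the data,
# named <column>_<code>, e.g. ca_0 or thal_7.
encoded = pd.get_dummies(df, columns=["ca", "thal"], dtype=int)
print(encoded.columns.tolist())
```

Note that a code absent from the data (here, `ca` = 2) gets no column, so training and test sets may need to be aligned to the same set of indicator columns.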
Critical Functions
- Log Transformer
o Purpose: Reduces skewness in variables like Chol, Trestbps, and Oldpeak.
o Inputs: Skewed numerical features.
o Outputs: Log-transformed features.
- Squared Transformation
o Purpose: Captures non-linear relationships in features like Thalach.
o Inputs: Thalach feature.
o Outputs: Squared-transformed feature.
- Combine Oldpeak and Slope
o Purpose: Creates a new feature to enhance model accuracy.
o Inputs: Oldpeak and Slope features.
o Outputs: Combined feature reflecting ST segment depression and slope interaction.
- Gender-Based Interaction Creation
o Purpose: Generates gender-specific interaction terms to capture the influence of demographic variations on key features.
o Inputs: Sex feature and numerical features such as Chol and Trestbps.
o Outputs: Interaction features highlighting gender-based relevance.
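A minimal sketch of the four functions above, gathered into one helper, might look like the following. Several details are assumptions rather than the project's actual code: the helper name `add_engineered_features`, the output column names, the use of `log1p` (so that an Oldpeak of 0 stays finite), and the choice of a simple product for the Oldpeak and Slope combination, since the text does not give the formula.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the transformed and derived columns added."""
    out = df.copy()

    # Log transform. log1p(x) = log(1 + x) is an assumption made here so that
    # an oldpeak of 0 stays finite; the text only says "logarithmic scaling".
    for col in ["chol", "trestbps", "oldpeak"]:
        out[f"{col}_log"] = np.log1p(out[col])

    # Squared transformation of maximum heart rate.
    out["thalach_sq"] = out["thalach"] ** 2

    # Combined feature. A product is assumed; the exact formula is not stated.
    out["oldpeak_slope_combined"] = out["oldpeak"] * out["slope"]

    # Sex interactions: zero for female (sex = 0), the raw value for male (sex = 1).
    for col in ["chol", "trestbps"]:
        out[f"sex_x_{col}"] = out["sex"] * out[col]

    return out


# Toy input with hypothetical lowercase column names; not real patient data.
toy = pd.DataFrame({
    "sex": [0, 1],
    "chol": [200.0, 250.0],
    "trestbps": [120.0, 140.0],
    "thalach": [150.0, 160.0],
    "oldpeak": [0.0, 2.0],
    "slope": [1, 2],
})
features = add_engineered_features(toy)
print(features[["oldpeak_log", "thalach_sq", "oldpeak_slope_combined", "sex_x_chol"]])
```

With 0/1 coding, multiplying by Sex leaves the interaction columns at zero for female patients, so a model can fit a separate slope for male patients on top of the shared one.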
Classifiers Used
- Random Forest Classifier
o Purpose: Constructs an ensemble of decision trees for binary classification.
o Features: Relatively robust to overfitting and usable on datasets with imbalanced classes.
o Implementation: Tuned with GridSearchCV over hyperparameters such as the number of estimators and maximum depth; feature importances are read from the fitted model rather than tuned.
- XGBoost Classifier
o Purpose: Gradient boosting algorithm designed for efficiency and performance in binary classification tasks.
o Features: Minimizes a loss function through boosting, with parallelized tree construction.
- Ensemble Method
o Purpose: Combines predictions from Random Forest, XGBoost, and Logistic Regression to improve robustness.
o Features: Weighted averaging of the classifiers’ predictions to leverage the strengths of the individual models.
- Logistic Regression
o Purpose: Serves as a baseline that models a linear relationship between the features and the outcome.
o Features: Interpretable and effective when the classes are approximately linearly separable.
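The tuning and weighted-averaging steps described above could be sketched roughly as follows. This is an illustration under several assumptions, not the project's pipeline: the data are synthetic, the parameter grid and the weights `[2, 2, 1]` are placeholders, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so that the sketch depends only on scikit-learn (`xgboost.XGBClassifier` could be swapped in).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (13 features, binary target); not the UCI records.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Grid search over a deliberately small, illustrative grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)
best_forest = search.best_estimator_

# Soft voting takes a weighted average of the predicted class probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("rf", best_forest),
        ("gb", GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    voting="soft",
    weights=[2, 2, 1],  # placeholder weights, not tuned values
)
ensemble.fit(X_train, y_train)

print("Best forest parameters:", search.best_params_)
print("Ensemble test accuracy:", ensemble.score(X_test, y_test))
```

Soft voting is used here because it matches the idea of averaging model outputs; hard (majority) voting would be the alternative if only class labels were combined.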