Discussion & Findings
Discussion Key findings: My project used the Cleveland and VA Long Beach datasets from the “Heart Disease” database, which was donated to the UCI Machine Learning Repository, to explore binary classification of heart disease presence using the available demographic and clinical features. Through exploratory data analysis (EDA), data cleaning, transformation experiments, and model […]
ASCVD (Atherosclerotic Cardiovascular Disease) Risk Score (Cleveland And VA Long Beach)
Atherosclerotic Cardiovascular Disease Risk Calculation on Cleveland Dataset The 2013 ASCVD (Atherosclerotic Cardiovascular Disease) risk score was evaluated on the Cleveland dataset, yielding key performance metrics. The score achieved an accuracy of 69.64%, indicating that approximately 70% of predictions matched actual outcomes. Precision was 63.58%, reflecting the proportion of correctly identified positive cases among all […]
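As a reminder of how the metrics quoted in this excerpt are defined, here is a minimal sketch that computes accuracy, precision, recall, and F1 from binary labels. The function name `classification_metrics` and the toy labels are illustrative assumptions; they are not the project's code or data, and the numbers they produce are unrelated to the ASCVD results.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, made up purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Applied to a risk score, `y_pred` would come from thresholding the score, so the reported accuracy and precision depend on the chosen threshold.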
Male Vs. Female
Are the Best-Performing Models More Effective for the Male vs. Female Population? The highest-performing models were identified in Experiment 2, showcasing robust predictive capabilities. The Random Forest classifier emerged as the top performer, achieving a mean accuracy of 88.33%, a mean precision of 91.79%, a mean recall of 82.00%, and a mean F1-score of 83.67%. […]
Transformation 3: Cleveland Only
Optimizing Feature Engineering In this third experiment, the focus is on enhancing the feature engineering component to improve model performance through targeted transformations. The following transformations were applied: (1) a logarithmic transformation for Resting Blood Pressure (trestbps) and Cholesterol (chol) to reduce skewness and stabilize variance; (2) a squared transformation of Maximum Heart Rate (thalach), […]
Transformation 2: Cleveland and VA Long Beach
Optimizing Feature Engineering In this second experiment, the focus is on enhancing the feature engineering component to improve model performance through targeted transformations. The following transformations will be applied: (1) a logarithmic transformation for Resting Blood Pressure and Cholesterol to reduce skewness and stabilize variance; (2) a squared transformation of Maximum Heart Rate, which emphasizes […]
Transformation 1: Cleveland Only
Optimizing Feature Engineering Two custom transformers are applied to the first transformation. The custom transformers, “Log Transformer” and “Square Transformer”, are defined using BaseEstimator and TransformerMixin to enable specialized transformations within a preprocessing pipeline. The “Log Transformer” applies a logarithmic transformation (log1p, which calculates log(x+1)) to specified columns, helping to reduce skewness and handle wide-ranging […]
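The two transformers described in this excerpt could look roughly like the sketch below. It assumes scikit-learn and NumPy are installed; the class names `LogTransformer` and `SquareTransformer`, the `columns` parameter, and the column order in the toy array are assumptions made for illustration, not the post's actual implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class LogTransformer(BaseEstimator, TransformerMixin):
    """Apply log1p, i.e. log(x + 1), to the given column indices."""

    def __init__(self, columns=None):
        self.columns = columns  # None means "all columns" (assumed convention)

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        cols = self.columns if self.columns is not None else range(X.shape[1])
        for c in cols:
            X[:, c] = np.log1p(X[:, c])
        return X


class SquareTransformer(BaseEstimator, TransformerMixin):
    """Square the values in the given column indices."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        cols = self.columns if self.columns is not None else range(X.shape[1])
        for c in cols:
            X[:, c] = X[:, c] ** 2
        return X


# Toy rows; the column order (trestbps, chol, thalach) is hypothetical.
X = np.array([[130.0, 250.0, 150.0],
              [120.0, 200.0, 170.0]])
X_logged = LogTransformer(columns=[0, 1]).fit_transform(X)
X_squared = SquareTransformer(columns=[2]).fit_transform(X)
```

Because both classes inherit from `BaseEstimator` and `TransformerMixin`, they get `fit_transform` for free and can be placed inside a scikit-learn `Pipeline` ahead of a classifier.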
Exploratory Data Analysis (EDA on X_train, Cleveland Only)
Basic Descriptives of the training set: Univariate analysis of the training set: The dataset reveals several key patterns about the participants and their heart health indicators. Most participants are middle-aged, falling between 55 and 65 years old, with males making up roughly two-thirds of the dataset. When it comes to those who experience chest pain, […]
Materials and Methods
Data The UCI Machine Learning Repository is a comprehensive resource that provides databases, domain theories, and data generators widely used by the machine learning community for evaluating models. For this project, I used the database titled “Heart Disease”, available in the UCI Machine Learning Repository. The “Heart Disease” database from the UCI Machine Learning […]
Literature Review
Introduction The central focus of my capstone project is to explore the effectiveness of machine learning models in predicting heart disease and to assess their ability to generalize across different cities and biological sexes. This research highlights the importance of building models that not only achieve high accuracy within a specific dataset or geographic location but […]
Abstract
For this capstone project, I investigated how well machine learning models can predict heart disease, while also studying how the patient’s gender affects these predictions, as well as determining how well the same model performs across different regions. This project utilizes two clinical datasets from the publicly accessible UCI Machine Learning Repository under the collection […]


