Data Dictionary

Sex-Specific and Regional Analysis of Heart Disease Prediction Using Machine Learning
Algorithms: Insights from the UCI Irvine Public Heart Disease Datasets (Cleveland and Long
Beach)
Jonathan Asanjarani
City University of New York Graduate Center
DATA 79000: Capstone Project and Thesis
Advisor: Johanna Devaney
Significant Variables

  1. Age
    o Type: Integer
    o Description: Patient’s age in years.
  2. Sex
    o Type: Binary (0 for Female, 1 for Male)
    o Description: Biological sex of the patient.
  3. Cp (Chest Pain Type)
    o Type: Categorical (0–4)
    o Description: Chest pain category. In the UCI documentation the codes are nominal
    (typical angina, atypical angina, non-anginal pain, asymptomatic), not an ordered
    severity scale.
  4. Trestbps (Resting Blood Pressure)
    o Type: Continuous (mmHg)
    o Description: Resting blood pressure in millimeters of mercury. Transformed using
    logarithmic scaling to reduce skewness.
  5. Chol (Serum Cholesterol)
    o Type: Continuous (mg/dL)
    o Description: Serum cholesterol level in milligrams per deciliter. Transformed using
    logarithmic scaling to reduce skewness.
  6. Fbs (Fasting Blood Sugar)
    o Type: Binary (0 for ≤120 mg/dL, 1 for >120 mg/dL)
    o Description: Indicator of whether fasting blood sugar exceeds 120 mg/dL.
  7. Restecg (Resting ECG Results)
    o Type: Categorical (0–2)
    o Description: Results of resting electrocardiographic tests (e.g., normal, ST-T wave
    abnormality, left ventricular hypertrophy).
  8. Thalach (Maximum Heart Rate Achieved)
    o Type: Continuous (bpm)
    o Description: Maximum heart rate achieved during exercise. Transformed using a
    squared transformation to emphasize non-linear relationships.
  9. Exang (Exercise-Induced Angina)
    o Type: Binary (0 for No, 1 for Yes)
    o Description: Presence of exercise-induced angina (chest pain).
  10. Oldpeak
    o Type: Continuous
    o Description: ST depression induced by exercise relative to rest (ECG measure).
    Transformed using logarithmic scaling to reduce skewness.
  11. Slope (ST Segment Slope)
    o Type: Categorical (1 for Upsloping, 2 for Flat, 3 for Downsloping)
    o Description: The slope of the peak exercise ST segment.
  12. Ca (Number of Major Vessels)
    o Type: Integer (0–3)
    o Description: Number of major vessels (0–3) colored by fluoroscopy. Transformed
    using one-hot encoding.
  13. Thal (Thallium Stress Test Results)
    o Type: Categorical (3 for Normal, 6 for Fixed Defect, 7 for Reversible Defect)
    o Description: Results of thallium stress tests. Transformed using one-hot encoding.
  14. Oldpeak_Slope_Combined
    o Type: Continuous
    o Description: A derived feature combining Oldpeak (ST depression) and Slope (ECG
    segment pattern during peak exercise).
  15. Gender-Based Interaction Terms
    o Type: Continuous
    o Description: Interaction features created by multiplying the “Sex” feature with key
    variables like Chol and Trestbps to account for demographic-specific variations.
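
The one-hot encoding noted for Ca and Thal can be sketched with pandas. This is a minimal illustration, not the project's own code: the lowercase column names and the three sample rows are made up for the example, and the category lists come from the value ranges given in the dictionary above.

```python
import pandas as pd

# Hypothetical rows; column names and codes follow the dictionary above.
df = pd.DataFrame({
    "age": [63, 67, 41],
    "ca": [0, 3, 1],
    "thal": [6, 3, 7],
})

# Declare the full set of codes so every indicator column is created,
# even when a code does not occur in a particular sample.
df["ca"] = pd.Categorical(df["ca"], categories=[0, 1, 2, 3])
df["thal"] = pd.Categorical(df["thal"], categories=[3, 6, 7])

# One indicator column per code, e.g. ca_0 ... ca_3 and thal_3, thal_6, thal_7.
encoded = pd.get_dummies(df, columns=["ca", "thal"], dtype=int)
print(encoded.columns.tolist())
```

Fixing the category list up front keeps the encoded columns identical across the Cleveland and Long Beach subsets, which matters when a model trained on one is evaluated on the other.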

Critical Functions

  16. Log Transformer
    o Purpose: Reduces skewness in variables like Chol, Trestbps, and Oldpeak.
    o Inputs: Skewed numerical features.
    o Outputs: Log-transformed features.
  17. Squared Transformation
    o Purpose: Captures non-linear relationships in features like Thalach.
    o Inputs: Thalach feature.
    o Outputs: Squared-transformed feature.
  18. Combine Oldpeak and Slope
    o Purpose: Creates a new feature to enhance model accuracy.
    o Inputs: Oldpeak and Slope features.
    o Outputs: Combined feature reflecting ST segment depression and slope interaction.
  19. Gender-Based Interaction Creation
    o Purpose: Generates gender-specific interaction terms to capture the influence of
    demographic variations on key features.
    o Inputs: Sex feature and numerical features such as Chol and Trestbps.
    o Outputs: Interaction features highlighting gender-based relevance.
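
The four functions above could look roughly like the following sketch. Several details are assumptions, since the dictionary does not spell them out: the function and column names are hypothetical, log1p is used so that Oldpeak values of zero stay defined, the squared value is stored in a new column, the Oldpeak and Slope combination is taken to be a simple product, and derived features are built from the raw values before the log transform is applied.

```python
import numpy as np
import pandas as pd


def log_transform(df, columns):
    """Return a copy with log(1 + x) applied to the given columns."""
    out = df.copy()
    for col in columns:
        out[col] = np.log1p(out[col])
    return out


def square_feature(df, column):
    """Return a copy with a '<column>_squared' column added."""
    out = df.copy()
    out[f"{column}_squared"] = out[column] ** 2
    return out


def combine_oldpeak_slope(df):
    """Return a copy with oldpeak * slope added (assumed form of the combination)."""
    out = df.copy()
    out["oldpeak_slope_combined"] = out["oldpeak"] * out["slope"]
    return out


def add_sex_interactions(df, columns):
    """Return a copy with 'sex_<column>' = sex * column for each given column."""
    out = df.copy()
    for col in columns:
        out[f"sex_{col}"] = out["sex"] * out[col]
    return out


# Hypothetical two-row sample, for illustration only.
df = pd.DataFrame({
    "sex": [1, 0],
    "trestbps": [145.0, 130.0],
    "chol": [233.0, 250.0],
    "thalach": [150.0, 172.0],
    "oldpeak": [2.3, 0.0],
    "slope": [3, 1],
})

features = combine_oldpeak_slope(df)
features = add_sex_interactions(features, ["chol", "trestbps"])
features = square_feature(features, "thalach")
features = log_transform(features, ["chol", "trestbps", "oldpeak"])
print(features.round(3))
```

Because Sex is coded 0 for female and 1 for male, each interaction column is zero for female patients and equal to the underlying value for male patients.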

Classifiers Used

  20. Random Forest Classifier
    o Purpose: Constructs an ensemble of decision trees for binary classification.
    o Features: Averages many decorrelated trees, which makes it less prone to overfitting
    than a single decision tree; supports class weighting for imbalanced classes.
    o Implementation: Tuned with GridSearchCV over hyperparameters such as the number of
    estimators and maximum depth. Feature importances are an output of the fitted model,
    not a tunable parameter.
  21. XGBoost Classifier
    o Purpose: Gradient boosting algorithm designed for efficiency and performance in
    binary classification tasks.
    o Features: Builds trees sequentially, each fitted to reduce the remaining loss, and
    parallelizes the construction of each tree.
  22. Ensemble Method
    o Purpose: Combines predictions from Random Forest, XGBoost, and Logistic
    Regression to improve robustness.
    o Features: Weighted averaging of classifiers to leverage strengths of individual
    models.
  23. Logistic Regression
    o Purpose: Serves as a baseline model that assumes a linear relationship between the
    features and the log-odds of the outcome.
    o Features: Interpretable, and effective when the classes are close to linearly separable.
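
The tuning and weighted-ensemble steps can be sketched with scikit-learn. This is an illustration under stated assumptions rather than the project's actual configuration: the data are synthetic (not the Cleveland or Long Beach records), the parameter grid and the weights 2, 2, 1 are arbitrary, and scikit-learn's GradientBoostingClassifier stands in for XGBoost so that the example needs no extra dependency. With the xgboost package installed, xgboost.XGBClassifier could take its place in the estimator list.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 13 features and a binary target.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative grid: number of estimators and maximum depth only.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)
best_rf = search.best_estimator_

# Feature importances are read from the fitted forest after tuning.
importances = best_rf.feature_importances_

# Soft voting averages predicted probabilities using the given weights.
ensemble = VotingClassifier(
    estimators=[
        ("rf", best_rf),
        ("gb", GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    voting="soft",
    weights=[2, 2, 1],  # arbitrary example weights
)
ensemble.fit(X_train, y_train)
proba = ensemble.predict_proba(X_test)
accuracy = ensemble.score(X_test, y_test)

print(search.best_params_)
print(round(accuracy, 3))
```

Logistic Regression is wrapped in a scaling pipeline because its solver converges more reliably on standardized inputs; the tree-based models do not need scaling.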
