Exploratory Data Analysis (EDA on X_train, Cleveland Only)

Basic Descriptives of the training set:

  • Age: The average age is approximately 54.77 years, ranging from 34 to 77 years. The interquartile range (IQR) is 48 to 62 years.
  • Sex: The proportion of males in the dataset is higher, with a mean value of 0.675 (coded as 1 for males and 0 for females).
  • Chest Pain Type (cp): The average chest pain type is 3.186, with a range from 1 to 4.
  • Resting Blood Pressure (trestbps): The average resting blood pressure is 132.27 mmHg, ranging from 94 to 200 mmHg. The IQR is 120 to 140 mmHg.
  • Cholesterol (chol): The mean cholesterol level is 249.41 mg/dL, ranging from 126 to 564 mg/dL. The IQR is 212 to 277 mg/dL.
  • Fasting Blood Sugar (fbs): The mean fasting blood sugar is 0.169, indicating a low prevalence of high fasting blood sugar (coded as 1 for high and 0 for normal).
  • Resting Electrocardiographic Results (restecg): The average is 0.983, with values ranging from 0 to 2.
  • Maximum Heart Rate Achieved (thalach): The mean maximum heart rate is 149.84 bpm, ranging from 88 to 195 bpm. The IQR is 136 to 166 bpm.
  • Exercise-Induced Angina (exang): The mean value is 0.346, suggesting less frequent occurrence of angina (likely coded as 1 for yes and 0 for no).
  • ST Depression (oldpeak): The mean value is 1.062, with a range from 0 to 6.2. The IQR is 0 to 1.8.
  • Slope of the Peak Exercise ST Segment (slope): The average slope is 1.586, ranging from 1 to 3.
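Summary figures like those above can be produced in a few lines. The following is a minimal sketch, assuming the training features live in a pandas DataFrame; the name `X_train` matches the title, but the values below are made up for illustration and are not the Cleveland data.

```python
import pandas as pd

# Hypothetical stand-in for X_train; values are invented for illustration only.
X_train = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44, 52, 67],
    "trestbps": [145, 130, 130, 120, 140, 120, 172, 160],
    "chol": [233, 250, 204, 236, 192, 263, 199, 286],
})

# describe() reports count, mean, std, min, the quartiles (25%, 50%, 75%) and max.
summary = X_train.describe().T

# The interquartile range is the distance between the 75th and 25th percentiles.
summary["iqr"] = summary["75%"] - summary["25%"]

print(summary[["mean", "min", "25%", "75%", "max", "iqr"]].round(2))
```

For binary columns such as sex or fbs, the mean is simply the share of rows coded 1, which is how the 0.675 and 0.169 figures above can be read.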

Univariate analysis of the training set:

The dataset reveals several key patterns about the participants and their heart health indicators. Most participants are middle-aged, with ages concentrated between 55 and 65 years, and males make up roughly two-thirds of the dataset. For chest pain, the most common category is type 4 (asymptomatic); cp is a categorical code, so a higher number does not by itself mean more severe pain. Both resting blood pressure and cholesterol levels show right-skewed distributions: most values are moderate, but some individuals have substantially higher levels. Only a small proportion of participants (about 17%) have elevated fasting blood sugar, so this issue is less common in the dataset. Resting electrocardiogram results are also categorical, taking values from 0 to 2, so they are better summarized by category counts than by the shape of a distribution.

The exang variable, which indicates the presence of exercise-induced angina (1 for presence, 0 for absence), reveals that most participants do not experience angina during exercise. This suggests that exercise-induced chest pain is less common in this dataset. However, the subset of individuals with a positive value for exang may represent those at higher risk for underlying heart issues, as angina during exercise is often a significant indicator of coronary artery disease.

The maximum heart rate distribution has a strong left skew, meaning most participants have a maximum heart rate above 140 beats per minute. The oldpeak feature, which measures ST segment depression on the ECG during exercise, indicates that most participants have values between 0 and 2, reflecting minimal ST depression and better heart health. However, a smaller group with higher values (up to 6.2) may be at greater risk for heart issues.

The slope feature, which describes the shape of the ST segment during peak exercise, shows that most participants have a flat slope (category 2), often associated with underlying heart problems. A smaller number have an upsloping slope (category 1), typically linked to better heart health, or a down sloping slope (category 3), which is more often connected to severe heart disease. Together, these features provide valuable insights into the varying levels of heart health risk within the dataset.

Data transformation.

To extract meaningful results from our machine learning models, it is important to account for outliers. An outlier is a value that differs substantially from the rest of the dataset. Such values can reduce how well a machine learning model generalizes, because they affect its performance and accuracy (Mueller and Guido, 2016). Outliers can be removed or handled through data transformation. A logarithmic transformation, for example, can reduce the impact of large outlier values. Alternatively, a square root transformation is suitable for positively skewed data (Mueller and Guido, 2016).

For the following experiment, trestbps (resting blood pressure), chol (cholesterol), and oldpeak will be transformed using a logarithmic scale because their distributions are right-skewed.

The visualization presents side-by-side histograms showing the original and log-transformed distributions for resting blood pressure, cholesterol, and oldpeak (ST depression). The histograms on the left, representing the original data, reveal a strong right skew for all three features. On the right, the histograms display the distributions after log transformation, where we observe a clear reduction in skewness. For resting blood pressure and cholesterol, the transformed data shows a significantly more symmetrical shape, indicating that the log transformation effectively normalized these distributions. For oldpeak, while the transformation reduces the skewness, the distribution remains slightly asymmetrical due to the heavy concentration of values near zero. These visual comparisons illustrate how log transformation reshapes the data, making it better suited for modeling and statistical analysis.
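The effect described here can be checked numerically with a skewness statistic. The sketch below uses synthetic right-skewed data, not the Cleveland values. It also assumes `np.log1p` (the log of 1 + x) for oldpeak, because oldpeak contains zeros and a plain logarithm of zero is undefined; the post does not say how zeros were handled, so treat that choice as an assumption.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic right-skewed stand-ins for the real columns (not the Cleveland data).
chol = rng.lognormal(mean=5.5, sigma=0.25, size=2000)
oldpeak = np.round(rng.exponential(scale=1.0, size=2000), 1)
oldpeak[:200] = 0.0  # force a cluster of exact zeros, as in the real feature

# Strictly positive feature: a plain natural log is safe.
chol_log = np.log(chol)

# Feature with zeros: log1p computes log(1 + x), which is defined at x = 0.
# Assumption: the post does not state which variant was used.
oldpeak_log = np.log1p(oldpeak)

print("chol skew:    raw", round(skew(chol), 2), "log", round(skew(chol_log), 2))
print("oldpeak skew: raw", round(skew(oldpeak), 2), "log1p", round(skew(oldpeak_log), 2))
```

A skewness near zero indicates a roughly symmetric distribution. For a feature with many values piled at zero, the transformed skewness typically stays positive, which matches the remaining asymmetry noted for oldpeak.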

Thalach (maximum heart rate achieved) has a left-skewed distribution. Therefore, I will apply a square transformation to see whether it better normalizes the distribution. The visual comparison of the original and squared data highlights the impact of the transformation on the data’s structure and on the feature’s suitability for modeling and statistical analysis.

In the first histogram, we can see a slightly longer tail extending toward the lower values, indicating a left skew. The outlying values seem to be patients with a low maximum heart rate. The histogram on the right represents the squared-transformed data. Squaring amplifies the range of values, particularly for higher maximum heart rates, while slightly smoothing the irregularities in the original data. This results in a slightly more normal distribution of values, while the overall shape of the distribution is preserved.
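Squaring stretches large values more than small ones, so it pulls a left-skewed distribution back toward symmetry. A minimal sketch on synthetic data follows; the variable is named `thalach` for readability, but the numbers are simulated, not taken from the dataset.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)

# Synthetic left-skewed heart rates: a ceiling minus a right-skewed gamma term.
thalach = 202 - rng.gamma(shape=4.0, scale=13.0, size=2000)
thalach = np.clip(thalach, 60, None)  # keep values positive and plausible

# Square transformation: convex on positive values, so it lengthens the
# upper tail relative to the lower one and raises (less negative) skewness.
thalach_sq = thalach ** 2

print("raw skew:    ", round(skew(thalach), 2))
print("squared skew:", round(skew(thalach_sq), 2))
```

If the squared skewness is still clearly negative, the transformation has only partly corrected the asymmetry, which is consistent with the modest improvement described above.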

The graphic shows a pair plot (scatterplot matrix) of the dataset, which depicts the relationships between distinct variables. The diagonals of the pair plot contain histograms depicting the distribution of specific features. For example, features such as age and thalach (maximum heart rate) have continuous distributions with values that span a large range. In contrast, features such as sex, cp (chest pain type), fbs (fasting blood sugar), restecg, exang (exercise-induced angina), and slope have distinct values, indicating that they are most likely categorical or binary.

Off-diagonal scatter plots show relationships between pairs of variables. For example, age and cholesterol show a trend in which higher cholesterol levels are associated with older individuals.

There seems to be a negative relationship between thalach (maximum heart rate) and num (heart disease indicator), suggesting that people with higher heart rates are less likely to have severe heart disease. Additionally, cp (chest pain type) shows clusters, reflecting how different chest pain types are distributed across other variables. These patterns give valuable insights into the dataset’s structure and possible relationships between variables.

Distribution of heart disease presence in the training data.

The bar chart illustrates a binary classification for heart disease, distinguishing between presence (values 1–4) and absence (value 0).

The two bars represent the number of individuals without heart disease (labeled 0) and with heart disease (labeled 1). The distribution is slightly imbalanced, with more individuals in the “no heart disease” category. In the training set, 128 patients did not have heart disease, while 109 patients showed some level of heart disease. This imbalance, while not extreme, could influence model performance. It is worth noting that in a population-representative sample, most people would be free of heart disease.
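The binarization and the class balance can be sketched as follows. The `num` list is a hypothetical sample of raw labels; the counts of 128 and 109 are the ones reported above.

```python
from collections import Counter

# Hypothetical raw labels: num ranges from 0 (no disease) to 4.
num = [0, 2, 1, 0, 0, 3, 0, 4, 1, 0]

# Collapse values 1-4 into a single "disease present" class.
target = [1 if value > 0 else 0 for value in num]
print(Counter(target))

# Class shares computed from the training-set counts reported in the text.
counts = {0: 128, 1: 109}
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
print(total, {label: round(share, 3) for label, share in shares.items()})
```

With roughly 54% of patients in class 0 and 46% in class 1, a model that always predicts "no heart disease" would already reach about 54% accuracy, which is a useful baseline to keep in mind.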

Analyzing Individual Variable Impact on Heart Disease Presence.

Point-Biserial Correlation

The point-biserial correlation is a statistical measurement that assesses the relationship between a dichotomous variable and a continuous variable. It quantifies the strength and direction of the association, making it a valuable tool in analyzing mixed variable types. The point-biserial correlation is appropriate when one variable represents a binary outcome, and the other is measured on a continuous scale.

For the purposes of this research analysis, we are evaluating whether there is a significant relationship between the presence of heart disease, a dichotomous variable, and each of the following independent variables that are continuous in nature.
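As an illustration, SciPy provides `scipy.stats.pointbiserialr` for this measure. The sketch below uses a small made-up sample rather than the training data, and also shows that the point-biserial coefficient is the ordinary Pearson correlation computed with a 0/1 variable.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Made-up sample: 0 = no heart disease, 1 = heart disease.
disease = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
age = np.array([41, 45, 50, 52, 58, 49, 57, 60, 63, 66], dtype=float)

# Point-biserial correlation between the binary outcome and a continuous feature.
r_pb, p_pb = pointbiserialr(disease, age)

# The same number falls out of a plain Pearson correlation.
r_pearson, p_pearson = pearsonr(disease, age)

print("point-biserial r:", round(r_pb, 4), "p-value:", round(p_pb, 4))
print("pearson r:       ", round(r_pearson, 4))
```

A positive coefficient means the feature tends to be higher in the group coded 1; the p-value tests the null hypothesis of no correlation.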

Chi-Square Test

The chi-square test is a statistical test of whether two categorical variables are associated. It compares the observed category frequencies with those expected if the variables were independent, to determine whether the observed pattern is likely due to chance alone. A chi-square test is appropriate when we are looking at the frequency of different categories.

For the purposes of this research analysis, we are testing whether there is an association between the presence of heart disease and each of the following independent variables that are categorical in nature.
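A chi-square test of independence can be run with `scipy.stats.chi2_contingency` on a table of counts. The table below is hypothetical and only illustrates the mechanics. For 2x2 tables SciPy applies Yates' continuity correction by default; whether the analysis in this post used that correction is not stated.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table (rows: category A / B of some binary feature,
# columns: no heart disease / heart disease). Counts are invented.
table = np.array([
    [60, 20],
    [30, 50],
])

# Returns the statistic, the p-value, the degrees of freedom, and the counts
# expected under independence (row total * column total / grand total).
chi2_stat, p_value, dof, expected = chi2_contingency(table)

print("chi-square:", round(chi2_stat, 3))
print("p-value:   ", p_value)
print("dof:       ", dof)
print("expected:\n", expected)
```

The degrees of freedom equal (rows - 1) x (columns - 1), so a binary feature against a binary outcome has 1 degree of freedom, while a four-category feature such as cp has 3.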

Age:

On average, individuals with heart disease tend to be older, suggesting a potential positive relationship between age and heart disease. To investigate this further, a swarm plot was generated, and a point-biserial correlation was computed to assess the significance of the relationship.

The swarm plot highlights a noticeable clustering of values in the older age ranges for patients with heart disease (1) compared to those without heart disease (0). This positive correlation between age and the presence of heart disease is further supported by the Point-Biserial Correlation Coefficient of 0.1990, with a p-value of 0.002085. Since the p-value is well below the alpha threshold of 0.05, this indicates a statistically significant, albeit weak, positive correlation between age and heart disease presence in this dataset.
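The reported p-value can be reproduced from the coefficient alone. Under the null hypothesis, a correlation r from n observations yields a t statistic with n - 2 degrees of freedom. The sketch below assumes the correlation was computed on all 237 training rows (128 + 109, from the class counts reported earlier), which the post does not state explicitly.

```python
import math
from scipy.stats import t

r = 0.1990     # point-biserial coefficient reported for age
n = 128 + 109  # assumed sample size: the full training set

# t statistic for testing a correlation against zero.
df = n - 2
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Two-sided p-value from the t distribution.
p_value = 2 * t.sf(abs(t_stat), df)

print("t statistic:", round(t_stat, 3))
print("p-value:    ", round(p_value, 6))
```

The result lands close to the reported 0.002085; small differences are expected because r is rounded to four decimals here.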

Sex:

There are several key points to note regarding the differences in heart disease prevalence between males and females in this dataset. First, as previously mentioned, the dataset contains roughly twice as many men as women. Among females, the number without heart disease is considerably higher than the number with heart disease. Conversely, among males, the number with heart disease exceeds the number without. Additionally, the gap between those with and without heart disease is more pronounced among males than among females. This association is statistically significant, with a p-value of 3.26e-04, well below the alpha threshold of 0.05, further emphasizing the link between sex and heart disease presence.

Chest Pain:

The average chest pain score is approximately one level higher for patients with heart disease compared to those without. Additionally, patients without heart disease appear to have a slightly larger standard deviation and a wider interquartile range, indicating greater variability in chest pain scores within this group.

The analysis reveals a significant association between chest pain types (cp) and the presence of heart disease, as evidenced by the Chi-Square test results. The test statistic of 65.29635 and a p-value of 4.33e-14, well below the alpha level of 0.05, confirm this relationship. The stacked bar chart shows that among patients with heart disease, chest pain type 4.0 (asymptomatic chest pain) is predominant, while patients without heart disease exhibit a more varied distribution across chest pain types. The count plot further emphasizes this pattern, with chest pain types 3.0 (non-anginal pain) and 4.0 being significantly more frequent among patients with heart disease, whereas types 1.0 (typical angina) and 2.0 (atypical angina) are more evenly distributed or slightly higher among those without heart disease. These findings underscore the importance of chest pain types, particularly 3.0 and 4.0, as key indicators of heart disease presence.
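The p-value follows directly from the test statistic and the degrees of freedom. With four chest pain types and a binary outcome, the table is 4 x 2, giving (4 - 1) x (2 - 1) = 3 degrees of freedom; that figure is inferred from the variable coding rather than stated in the post.

```python
from scipy.stats import chi2

chi2_stat = 65.29635     # statistic reported for cp
dof = (4 - 1) * (2 - 1)  # inferred: 4 chest pain types x 2 outcome classes

# Survival function = probability of a statistic at least this large
# if chest pain type and heart disease were independent.
p_value = chi2.sf(chi2_stat, dof)

print("p-value:", p_value)
```

The value agrees with the reported 4.33e-14 to the precision shown.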

Resting Blood Pressure:

The average resting blood pressure is higher in individuals with heart disease than in those without. Additionally, the interquartile range is slightly wider among individuals with heart disease, indicating greater variability in blood pressure within this group. The standard deviation also appears slightly larger, suggesting that blood pressure levels are more dispersed among individuals with heart disease.

The swarm plot displays the distribution of resting blood pressure (trestbps) for individuals with and without heart disease. While the distributions for both groups overlap, there is a slight tendency for individuals with heart disease (1) to have higher resting blood pressure values compared to those without heart disease (0). The Point-Biserial Correlation Coefficient is 0.1632, with a p-value of 0.01188. Since the p-value is less than 0.05, the result suggests a statistically significant, albeit weak, positive correlation between resting blood pressure and the presence of heart disease in this dataset.

Cholesterol.

Among patients without heart disease, the mean cholesterol level is 246.172, with a standard deviation of 56.375, indicating greater variability in cholesterol levels. This group also has a wider range, with cholesterol levels spanning from a minimum of 126 to a maximum of 564. In contrast, the 109 patients with heart disease have a slightly higher average cholesterol level of 253.211 and a narrower range, with cholesterol levels ranging from 164 to 409.

Patients with heart disease tend to have higher average cholesterol levels and median values compared to those without heart disease. However, individuals without heart disease exhibit a wider cholesterol range and greater variability, reflecting more diverse cholesterol profiles within this group.

The box-and-whisker plot illustrates the distribution of cholesterol levels based on heart disease presence (0 = no heart disease, 1 = heart disease). Both groups have similar interquartile ranges, with a slightly higher median cholesterol level in patients with heart disease. Individuals without heart disease exhibit slightly greater variability in cholesterol levels, as indicated by the wider whiskers and more extreme outliers. The cholesterol levels for individuals with heart disease are more concentrated, with fewer outliers.

The Point-Biserial Correlation Coefficient is 0.0661, with a p-value of 0.31068. Since the p-value is greater than 0.05, the test indicates that there is no statistically significant correlation between cholesterol levels and the presence of heart disease.

Fasting blood sugar > 120mg/dl

Most individuals in both groups have normal fasting blood sugar levels. The proportion of individuals with elevated fasting blood sugar was similar between those with and without heart disease. Overall, far more people in the dataset have normal blood sugar levels. The Chi-Square statistic is 0.00129, with a p-value of 0.9710. Since the p-value is much greater than the alpha threshold of 0.05, there is no statistically significant association between fbs and heart disease presence.

Resting Electrocardiogram Results:

The bar graph indicates that most individuals without heart disease have normal resting electrocardiogram results (restecg = 0). A smaller proportion of these individuals have a resting electrocardiogram value of 2, which indicates probable or definite left ventricular hypertrophy. In contrast, the majority of individuals with heart disease have a resting electrocardiogram result of 2, while only a small number have normal results.

The Chi-Square statistic is 35.63319, with a p-value of 2.38e-09. Since the p-value is significantly below 0.05, there is a statistically significant association between restecg and heart disease presence. These findings highlight that abnormal electrocardiographic results (restecg = 2) are strongly associated with the presence of heart disease, while normal results (restecg = 0) are more prevalent among individuals without heart disease.

Maximum Heart Rate:

The average maximum heart rate was higher among individuals without heart disease than among those with heart disease. The heart rate range is slightly broader in patients with heart disease, with a lower minimum value. The results suggest that a lower maximum heart rate is associated with heart disease presence.

The following swarm plot further denotes this. The distribution for individuals with heart disease is more concentrated in the lower range (below 160). Individuals without heart disease generally achieve higher maximum heart rates. However, the distribution is more spread out, with a higher concentration in the upper range (above 160).

The negative correlation coefficient indicates an inverse relationship between thalach and heart disease presence: higher maximum heart rates are associated with a lower likelihood of heart disease. The p-value is far below the threshold of 0.05, indicating a statistically significant relationship. Based on these results, we can conclude that lower maximum heart rates are strongly associated with the presence of heart disease.

Exercise Induced Angina

The majority of individuals without heart disease also do not have exercise-induced angina (`exang = 0`), though a small proportion do. In contrast, a notable portion of individuals with heart disease have exercise-induced angina (`exang = 1`), while fewer individuals with heart disease do not. This indicates that exercise-induced angina is considerably more common in individuals with heart disease, whereas its absence is more prevalent among those without heart disease. The p-value of 2.38e-09, being far below the threshold of 0.05, confirms a statistically significant association between `exang` and heart disease presence. The Chi-Square test results further validate that exercise-induced angina is a strong indicator distinguishing individuals with heart disease from those without.

Oldpeak.

The distribution of Oldpeak for individuals without heart disease is sharply concentrated near 0, with a clear peak, and values rarely exceed 2. In contrast, among individuals with heart disease, the distribution is broader and more spread out, with values extending up to 6. This indicates a noticeable shift toward higher Oldpeak values in those with heart disease compared to those without.

The p-value, being below 0.05, confirms a statistically significant association between Oldpeak and heart disease presence. Therefore, higher Oldpeak values (indicating greater ST depression) are more common in individuals with heart disease, while lower Oldpeak values are primarily observed in those without heart disease. These findings suggest that Oldpeak is a significant factor associated with heart disease presence.

Slope.

Among individuals without heart disease, a slope value of 1 is the most common. In contrast, a smaller proportion of individuals with heart disease exhibit this slope value. A slope value of 2 is more frequent among individuals with heart disease and appears to be the most common slope in this group. While slope value 3 is rare overall, it is slightly more prevalent in individuals with heart disease. The p-value, being far below the threshold of 0.05, confirms a statistically significant association between slope and heart disease presence.

Correlation amongst individual variables.

The correlation heatmap illustrates the relationships between all variables in the dataset. Among these, Oldpeak (ST depression induced by exercise) and Slope (the slope of the ST segment) exhibit the strongest correlation, with a moderately strong positive value of 0.59. This makes sense within the context of the dataset, as both variables come from the exercise stress test and reflect the heart’s response to physical exertion. Higher Oldpeak values, which indicate greater ST depression, tend to accompany abnormal ST segment slopes, such as a flat or downsloping ST segment, which are markers of ischemia and heart disease.

The heatmap also shows that ca (number of major vessels) has a positive correlation with age (0.37) and thal (0.25). Given the context of the dataset, the number of major vessels involved tends to increase with age, possibly due to the natural progression of cardiovascular disease. Similarly, thal (the thallium stress test result, which flags blood flow defects) is associated with the severity of heart conditions, which is consistent with its positive correlation with ca.
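The matrix behind such a heatmap is a pairwise Pearson correlation table, which pandas computes with `DataFrame.corr()`. The sketch below builds a small synthetic frame with relationships loosely shaped like the ones discussed here; the column names mirror the dataset, but every value is simulated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Simulated features (not the Cleveland data).
age = rng.normal(55, 9, n)
thalach = 210 - age + rng.normal(0, 15, n)  # declines with age
oldpeak = rng.exponential(1.0, n)
slope = 1 + (oldpeak + rng.normal(0, 0.8, n) > 1.2).astype(int)  # tied to oldpeak

df = pd.DataFrame({
    "age": age,
    "thalach": thalach,
    "oldpeak": oldpeak,
    "slope": slope,
})

# Pairwise Pearson correlations; the result is symmetric with ones on the diagonal.
corr = df.corr()
print(corr.round(2))

# Individual cells can be read off by row and column label.
print("age vs thalach:", round(corr.loc["age", "thalach"], 2))
```

Pearson correlation treats ordinal codes such as slope as if they were evenly spaced numbers, so coefficients involving categorical features are best read as rough indicators of direction rather than precise measures of strength.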

There are also a few notable negative correlations in the dataset. As age increases, the maximum heart rate achieved (thalach) tends to decrease, which aligns with the natural decline in cardiovascular efficiency as people age. Additionally, the presence of exercise-induced angina (exang) is associated with lower maximum heart rates, as individuals with angina often experience reduced heart performance during physical exertion. Lastly, there is a negative relationship between the ST segment slope (slope) and maximum heart rate, suggesting that abnormal ST segment slopes (e.g., flat or down sloping) are often linked to lower heart rate levels.

Chest pain type (cp) and exercise-induced angina (exang) have a moderate positive correlation (0.36), suggesting that certain chest pain types may be linked to the occurrence of exercise-induced angina.

Features such as cholesterol (chol) and fasting blood sugar (fbs) show very weak correlations with other variables, suggesting limited interdependence within the dataset.

In conclusion, the heatmap highlights important connections between features in the dataset. There is a clear relationship between slope, exercise-induced angina (exang), and ST depression (Oldpeak), showing that these features are closely linked and likely reflect similar patterns of heart stress.

Other expected trends are also visible, such as age being negatively correlated with maximum heart rate (thalach), as heart rate naturally decreases with age, and vascular blockages (ca) increasing with age. Additionally, thal (the result of a thallium stress test, which measures blood flow abnormalities) is positively correlated with vascular blockages (ca) and other clinical indicators, reinforcing known relationships with heart conditions. On the other hand, features like cholesterol (chol) and fasting blood sugar (fbs) show very weak connections to other variables, suggesting they are less related within this dataset. Overall, the heatmap shows where strong relationships exist, particularly among heart-related features like Oldpeak, slope, and thal, while also identifying variables with limited connections.

