Optimizing Feature Engineering
In this second experiment, the focus is on enhancing the feature engineering component to improve model performance through targeted transformations. The following transformations will be applied: (1) a logarithmic transformation for Resting Blood Pressure and Cholesterol to reduce skewness and stabilize variance; (2) a squared transformation of Maximum Heart Rate, which emphasizes the importance of higher values in this feature; and (3) the creation of a combined feature using Oldpeak and Slope, aimed at capturing their interaction and improving predictive power.
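These three transformations can be sketched with scikit-learn. The column names (`trestbps`, `chol`, `thalach`) follow the common UCI heart-disease naming convention and are assumptions here, since the report does not list the raw column names:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Assumed UCI-style column names; adjust to the actual dataset.
log_cols = ["trestbps", "chol"]   # Resting Blood Pressure, Cholesterol
square_cols = ["thalach"]         # Maximum Heart Rate

preprocessor = ColumnTransformer(
    transformers=[
        # log1p reduces right skew and stabilizes variance, and is safe at zero
        ("log", FunctionTransformer(np.log1p), log_cols),
        # squaring emphasizes higher maximum heart-rate values
        ("square", FunctionTransformer(lambda x: x ** 2), square_cols),
    ],
    remainder="passthrough",
)

demo = pd.DataFrame({"trestbps": [120, 140], "chol": [200, 250],
                     "thalach": [150, 170]})
out = preprocessor.fit_transform(demo)
```

The combined Oldpeak/Slope feature is handled separately by a custom transformer, described below.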
Oldpeak measures ST depression induced by exercise relative to rest, which is a critical indicator of heart disease severity, as it reflects changes in the heart’s electrical activity under stress. Slope represents the slope of the peak exercise ST segment, indicating whether the ST segment rises, remains flat, or declines after exercise. Both features are commonly used in cardiology to assess the heart’s ability to respond to physical stress.
A high correlation of 0.59 between Oldpeak and Slope suggests a significant linear relationship, indicating that as Oldpeak values change, the Slope of the ST segment is also likely to vary. This strong correlation highlights the potential predictive synergy between these features, motivating their combination to create a single feature that better captures this relationship. The goal of this experiment is to evaluate whether these transformations and the combined feature improve the model’s ability to predict heart disease more accurately.
The combination was performed using an addition method, where the new feature oldpeak_slope_combined was created by summing the values of Oldpeak (ST depression induced by exercise) and Slope (the slope of the peak exercise ST segment). This approach was chosen based on the strong correlation (0.59) between the two variables, suggesting that their combined contribution could provide additional predictive power for the model. To ensure smooth integration into the pipeline, the custom transformer drops the original Oldpeak and slope columns after creating the combined feature. The modified data was subsequently standardized and processed alongside other transformed and encoded features. The inclusion of this combined feature aims to better represent the relationship between Oldpeak and Slope, potentially capturing their interaction to improve the model’s predictive accuracy.
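A minimal sketch of such a custom transformer, assuming lowercase UCI-style column names `oldpeak` and `slope` (the actual names and class name may differ from the original implementation):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class OldpeakSlopeCombiner(BaseEstimator, TransformerMixin):
    """Sum oldpeak and slope into one feature, then drop the originals."""

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data.
        return self

    def transform(self, X):
        X = X.copy()
        # Additive combination, as described in the report.
        X["oldpeak_slope_combined"] = X["oldpeak"] + X["slope"]
        # Drop the source columns so only the combined feature remains.
        return X.drop(columns=["oldpeak", "slope"])

demo = pd.DataFrame({"oldpeak": [1.0, 2.5], "slope": [2, 1], "age": [63, 54]})
combined = OldpeakSlopeCombiner().fit_transform(demo)
```

Because the transformer implements `fit`/`transform`, it slots directly into a scikit-learn `Pipeline` ahead of the standardization step.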
Cleveland
Random Forest Classifier.


The model achieved strong performance on the test set, with an accuracy of 83.33%, an F1 score of 0.8318, a precision of 0.8438, and a recall of 0.8333. The confusion matrix shows that 29 true negatives and 21 true positives were correctly predicted, while there were 3 false positives and 7 false negatives.
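As a quick arithmetic check, the headline accuracy follows directly from the reported confusion-matrix counts. Note that the reported 0.8333 recall equals the weighted average over both classes (which always coincides with accuracy), so it differs from the single-class values computed here:

```python
# Sanity-check the reported test-set results from the confusion-matrix counts.
tn, fp, fn, tp = 29, 3, 7, 21   # counts as reported above

total = tn + fp + fn + tp           # 60 test samples
accuracy = (tn + tp) / total        # 50 / 60 ~= 0.8333, as reported
precision_pos = tp / (tp + fp)      # 21 / 24 = 0.875 for the positive class
recall_pos = tp / (tp + fn)         # 21 / 28 = 0.75 for the positive class
```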
Cross-validation results further demonstrate the model’s consistency. The mean accuracy across folds was 81.67%, with precision averaging 0.8083, recall at 0.7614, and an F1 score of 0.7744. While there was some variability in individual fold results, particularly in recall, the overall performance indicates a balanced and reliable model. The model effectively identifies positive and negative classes, though some misclassifications, particularly false negatives, suggest areas for potential improvement in recall.
This feature engineering experiment shows a slight improvement in mean precision (0.8083 versus 0.775) over the first transformation, which had the same mean cross-fold accuracy of 81.67%, a mean recall of 0.7914, and a mean F1 score of 0.7761.
Grid Search Overview.

A grid search was conducted to identify the optimal hyperparameters for the model using 3-fold cross-validation across 80 candidate combinations, resulting in a total of 240 fits. The best-performing parameters were found to be max_depth = 2, n_estimators = 200, and random_state = 42. These settings achieved a best cross-validated weighted F1-score of 0.8342 for the training data, indicating strong and balanced model performance across all folds. The optimized hyperparameters contribute to improved generalization and highlight the model’s ability to effectively handle the dataset with minimal overfitting.
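A sketch of how such a search might be configured. The three-value grid below is a reduced stand-in for the 80 combinations actually evaluated (the full grid is not given in the report), and the synthetic data merely keeps the example self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative data; the real pipeline feeds the transformed Cleveland features.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Reduced stand-in grid; it includes the winning combination.
param_grid = {
    "max_depth": [2, 4, 8],
    "n_estimators": [50, 100, 200],
    "random_state": [42],
}

search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=3,                    # 3-fold cross-validation, as in the experiment
    scoring="f1_weighted",   # best score reported as a weighted F1
)
search.fit(X, y)
best = search.best_params_
```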
Application of Optimized Parameters for the Random Forest Classifier.


The model demonstrated strong performance on the test set, achieving an accuracy of 83.33%, an F1 score of 0.8345, a precision of 0.8556, and a recall of 0.8333. The confusion matrix highlights that the model correctly classified 30 true negatives and 20 true positives, with 2 false positives and 8 false negatives. These results indicate a balanced performance, with a slight trade-off between precision and recall for the positive class.
Cross-validation results confirm the model’s consistency across folds. The mean accuracy was 88.33%, with a mean precision of 91.79%, a mean recall of 82%, and a mean F1 score of 83.67%. The precision remained high, reflecting the model’s ability to minimize false positives, while recall variability suggests opportunities for improvement in detecting true positives.
Overall, the model exhibits strong generalization capabilities, supported by its stable cross-validation performance and robust test set results.
XGBoost Classifier Applied to the Transformed Data.
The XGB Classifier achieved an accuracy of 86.67% on the test set, showing strong overall performance. For individuals without heart disease (Class 0), the model achieved a precision of 0.85, a recall of 0.91, and an F1 score of 0.88, indicating a high ability to correctly identify those without the condition. For individuals with heart disease (Class 1), the model demonstrated a precision of 0.88, a recall of 0.82, and an F1 score of 0.85, showing solid performance in identifying true positives while maintaining precision. The macro and weighted averages for precision, recall, and F1 score were all approximately 0.87, reflecting balanced performance across both classes.
Cross-validation results showed a slightly lower mean accuracy of 76.67%, with a mean precision of 0.75, mean recall of 0.74, and a mean F1 score of 0.72. While accuracy and precision were stable, the recall exhibited variability across folds, ranging from 0.4 to 1.0, suggesting occasional challenges in identifying individuals with heart disease.
Overall, the XGB Classifier performed well on the test set, effectively distinguishing between individuals with and without heart disease. However, variability in recall across cross-validation folds highlights the need for further fine-tuning to ensure more consistent detection of positive cases.
Grid Search Overview.

A grid search with 3-fold cross-validation was performed to identify the optimal hyperparameters for the model. A total of 108 candidate combinations were evaluated, resulting in 324 fits. The best-performing parameters were: colsample_bytree = 1.0 (using all features at each split), learning_rate = 0.1 (controlling the step size for learning), max_depth = 2 (limiting tree depth to prevent overfitting), n_estimators = 50 (the number of boosting rounds), and subsample = 0.8 (using 80% of the training data per tree). These hyperparameters yielded a best cross-validation accuracy of 83.54% for the training data, reflecting strong and consistent performance across the folds while balancing model complexity and generalization.
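The reported 108 candidates are consistent with a five-parameter grid of sizes 2 × 3 × 3 × 3 × 2. The specific candidate values below are assumptions, apart from the winning combination, which they include:

```python
from math import prod

# One grid consistent with the reported 108 candidates; the exact values
# searched are an assumption apart from the winning combination.
param_grid = {
    "colsample_bytree": [0.8, 1.0],
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [2, 4, 6],
    "n_estimators": [50, 100, 200],
    "subsample": [0.8, 1.0],
}

candidates = prod(len(v) for v in param_grid.values())  # 2*3*3*3*2 = 108
fits = candidates * 3                                   # 3-fold CV -> 324 fits
```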
Application of Optimized Parameters with the XGBoost Classifier


The model achieved an accuracy of 80% on the test set, performing well across both classes. For individuals without heart disease (Class 0), it had a precision of 0.76, a recall of 0.91, and an F1 score of 0.83, showing strong performance in identifying true negatives. For individuals with heart disease (Class 1), the model recorded a precision of 0.86, a recall of 0.68, and an F1 score of 0.76, indicating slightly lower performance in detecting true positives.
The cross-validation results were consistent, with a mean accuracy of 80%, a mean precision of 0.75, a mean recall of 0.80, and a mean F1 score of 0.78. Although the model performed reliably overall, the variability in recall values across folds (ranging from 0.6 to 1.0) suggests occasional difficulty in detecting positive cases, which presents an opportunity for further improvement.
Ensemble Method.
The same ensemble method was applied to the transformed dataset using the newly optimized parameters.
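The report does not restate the ensemble's composition, so the sketch below assumes a soft-voting ensemble of the two tuned models, with scikit-learn's GradientBoostingClassifier standing in for XGBoost so the example needs no extra dependency:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score

# Illustrative data; the real pipeline feeds the transformed features.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Tuned RF parameters from the grid search; GradientBoostingClassifier
# stands in for the tuned XGBoost model, reusing its winning parameters.
rf = RandomForestClassifier(max_depth=2, n_estimators=200, random_state=42)
gb = GradientBoostingClassifier(learning_rate=0.1, max_depth=2,
                                n_estimators=50, subsample=0.8,
                                random_state=42)

# Soft voting averages the predicted class probabilities of both models.
ensemble = VotingClassifier(estimators=[("rf", rf), ("gb", gb)], voting="soft")
scores = cross_val_score(ensemble, X, y, cv=5, scoring="accuracy")
```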

The ensemble model demonstrated solid performance in both cross-validation and test set evaluations. During cross-validation, the model achieved a mean accuracy of 81.67%, a mean precision of 83.65%, a mean recall of 81.67%, and a mean F1 score of 81.03%, showing consistent and balanced results across folds. On the test set, the model achieved an accuracy of 80.00%, with a precision of 81.00%, a recall of 80.00%, and an F1 score of 79.66%, confirming reliable generalization to unseen data.
The classification report indicates strong performance for both classes. For individuals without heart disease (Class 0), the model achieved a precision of 0.76, a recall of 0.91, and an F1 score of 0.83, reflecting its ability to correctly identify negative cases. For individuals with heart disease (Class 1), it achieved a precision of 0.86, a recall of 0.68, and an F1 score of 0.76, showing slightly lower recall for positive cases. The macro and weighted averages for precision, recall, and F1 score were all approximately 0.80, indicating balanced overall performance. These results highlight the model’s robustness, with room for improvement in detecting positive cases.
VA Long Beach
An additional dataset from the United States was contributed by the Veterans Administration of Long Beach, California. The VA Long Beach Healthcare System, formerly known as Naval Hospital Long Beach, encompasses a network of Veterans Administration facilities in Long Beach and nearby cities. This dataset includes 200 patients, most of whom have some degree of heart disease, and features the same 14 columns as the Cleveland dataset.
Data Limitations.
The dataset has significant limitations that must be addressed. One key issue is the severe gender imbalance, with only approximately 6 females included in the dataset. This small number makes it impossible to test the model by gender, as there are insufficient entries to draw any meaningful conclusions for female patients.
Additionally, the dataset suffers from a pronounced class imbalance, with most instances representing individuals with heart disease. This imbalance introduces challenges for the model, as there is limited data available to effectively train the minority class (individuals without heart disease). As a result, there is a strong expectation of underfitting for the minority class, where the model may fail to accurately predict or generalize for these cases.
Another critical issue is the missing data. Columns such as “ca” (99% missing values) and “thal” (83% missing values) have so few valid entries that they will need to be removed from the analysis.
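A small pandas helper of the kind that could perform this pruning. The 50% cutoff and the helper name are illustrative; both "ca" (99% missing) and "thal" (83% missing) exceed any reasonable threshold:

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    missing_frac = df.isna().mean()
    return df.drop(columns=missing_frac[missing_frac > threshold].index)

# Toy frame mimicking the problem: 'ca' and 'thal' are mostly missing.
demo = pd.DataFrame({
    "age": [63, 54, 41, 58],
    "ca": [np.nan, np.nan, np.nan, 0.0],
    "thal": [np.nan, np.nan, np.nan, np.nan],
})
cleaned = drop_sparse_columns(demo)
```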
To assess the broader applicability of the top-performing models from the Cleveland dataset to other regions, I will evaluate their performance on this dataset. To ensure a fair comparison, I will modify the Cleveland models by removing the “ca” and “thal” columns, aligning the feature set with the limitations of this dataset. This approach allows for a rigorous evaluation of whether the success of these models in the Cleveland dataset translates effectively to data from different regions.
Random Forest Classifier with Optimized Parameters on the VA Long Beach Dataset


The Random Forest Classifier, applied to the VA Long Beach dataset with optimized parameters, reports a seemingly reasonable overall accuracy of 75%. Upon closer inspection, however, the model fails to predict any instances of the minority class (class 0, individuals without heart disease). This is evident in the confusion matrix, where no predictions at all were made for class 0. Consequently, the model’s precision, recall, and F1 score for the minority class are all 0. With the optimized parameters, the classifier is heavily biased toward the majority class (class 1, individuals with heart disease), achieving perfect recall for that class, with precision equal to its 75% prevalence. This imbalance drags the macro-average F1 score down to 0.43. The cross-validation results reinforce these findings: performance is consistent across folds, but the model’s inability to generalize for the minority class remains a critical issue.
Random Forest Classifier without Optimized Parameters on the VA Long Beach Dataset

The new model, without optimized parameters, demonstrates a clear improvement in its ability to predict instances of the minority class (those without heart disease). While its overall accuracy is also slightly higher, at 82.5% compared to the 75% of the optimized model, the primary improvement lies in its more balanced performance across both classes.
In this model, the confusion matrix reveals that the classifier correctly predicts 4 out of 10 instances of the minority class (0), resulting in a recall of 40%. Additionally, the precision for the minority class is 80%, with an F1 score of 0.53, indicating a more balanced trade-off between precision and recall for this class. For the majority class (1), the model maintains high precision (83%) and recall (97%), resulting in an F1 score of 0.89.
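These figures can be reproduced from the implied confusion matrix. The counts below are reconstructed from the reported per-class precision and recall, not taken from the raw output:

```python
# Rows = true class, columns = predicted class.
cm = [[4, 6],    # true class 0: 4 correct, 6 misclassified as class 1
      [1, 29]]   # true class 1: 1 misclassified as class 0, 29 correct

accuracy = (cm[0][0] + cm[1][1]) / 40             # 33/40 = 0.825
prec0 = cm[0][0] / (cm[0][0] + cm[1][0])          # 4/5   = 0.80
rec0 = cm[0][0] / (cm[0][0] + cm[0][1])           # 4/10  = 0.40
f1_0 = 2 * prec0 * rec0 / (prec0 + rec0)          # ~0.53
prec1 = cm[1][1] / (cm[1][1] + cm[0][1])          # 29/35 ~ 0.83
rec1 = cm[1][1] / (cm[1][1] + cm[1][0])           # 29/30 ~ 0.97
f1_1 = 2 * prec1 * rec1 / (prec1 + rec1)          # ~0.89
macro_f1 = (f1_0 + f1_1) / 2                      # ~0.71
```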
The macro-average F1 score of 0.71 and weighted-average F1 score of 0.82 suggest better overall performance compared to the previous model with optimized parameters, which failed to predict any minority class instances. This highlights that, while parameter optimization may increase accuracy for the majority class, it can lead to a complete neglect of the minority class, whereas a more generic configuration balances predictions across both classes more effectively.
Random Forest Classifier with Optimized Parameters on the Adjusted Cleveland Dataset
For comparison, I adjusted the Cleveland dataset to better align it with the characteristics of the VA Long Beach dataset, providing a clearer perspective on how the model would perform in predicting the presence and absence of heart disease across different regions. The Random Forest Classifier with optimized parameters achieved an overall accuracy of 75%, indicating similar predictive performance as observed in the previous experiments.


The model performed reasonably well for both outcomes. For predicting the absence of heart disease (class 0), it correctly identified 28 out of 32 instances, achieving a recall of 88%, a precision of 72%, and an F1 score of 0.79. For predicting the presence of heart disease (class 1), it correctly identified 17 out of 28 instances, with a recall of 61%, a precision of 81%, and an F1 score of 0.69. These results show that the model maintains a balanced ability to predict both classes, though there is a slight bias toward predicting the absence of heart disease. The macro-average F1 score of 0.74 and weighted-average F1 score of 0.75 further highlight the balanced performance. Cross-validation results reinforce this consistency, with mean accuracy and F1 scores of 0.75 and 0.69, respectively, and an average precision of 0.86 and recall of 0.57.
Performance Comparison
When comparing the Random Forest Classifier with optimized parameters applied to the VA Long Beach dataset and the adjusted Cleveland dataset, notable differences in the model’s ability to predict heart disease emerge. While both models achieved an overall accuracy of 75%, their performance in predicting the presence of heart disease (class 1) differed significantly.
For the VA Long Beach dataset, the model completely failed to predict any instances of heart disease. This failure is evident from the zero precision, recall, and F1 scores for class 1, highlighting a strong bias toward predicting the absence of heart disease (class 0). As a result, the model is ineffective for identifying individuals with heart disease in this dataset.
In contrast, the model applied to the adjusted Cleveland dataset demonstrated improved performance for predicting heart disease. It achieved a recall of 61%, precision of 81%, and an F1 score of 0.69 for class 1, showing a more balanced approach in predicting both outcomes. The macro-average F1 score of 0.74 for the Cleveland dataset surpasses the 0.43 observed for the VA Long Beach dataset, indicating better overall performance. These findings emphasize the importance of dataset-specific adjustments and reveal that while optimized parameters may improve accuracy, they cannot fully address class imbalances or dataset-specific challenges without additional considerations.
Random Forest Classifier without Optimized Parameters on the Adjusted Cleveland Dataset

The Random Forest Classifier applied to the adjusted Cleveland dataset without optimized parameters achieved an overall accuracy of 75%, matching the performance of the model with optimized parameters. The two models also align closely on the per-class metrics. Without optimization, the model maintained balanced performance across both classes, predicting 28 out of 32 instances correctly for the absence of heart disease (class `0`) with a precision of 72%, recall of 88%, and an F1 score of 0.79. For the presence of heart disease (class `1`), the model correctly predicted 17 out of 28 instances, yielding a precision of 81%, recall of 61%, and an F1 score of 0.69.
In comparison, the Cleveland model with optimized parameters demonstrated identical performance for class `0`, with a precision of 72%, recall of 88%, and an F1 score of 0.79. For class `1`, the optimized model likewise predicted 17 out of 28 instances correctly, with the same precision of 81% and recall of 61%. This suggests that optimization did not significantly alter the model’s handling of class imbalance and maintained the same overall predictive strength.
Both models achieved macro-average F1 scores of 0.74 and weighted-average F1 scores of 0.75, highlighting that the lack of parameter optimization did not hinder the model’s performance in this case. The cross-validation results were also consistent, further indicating that the model without optimization performs similarly to the optimized model in this scenario. Overall, this comparison suggests that parameter optimization had minimal impact on the adjusted Cleveland dataset, and the model’s inherent structure was sufficient for maintaining balanced performance across both classes.
Regional Performance Comparison.
The comparison between the Cleveland and VA Long Beach models without optimized parameters highlights two possible implications: optimized parameters may lead to overgeneralization for the majority class (absence of heart disease), or they may reduce the model’s ability to generalize effectively across regions. These potential drawbacks are particularly obscured by the small number of patients in the minority class (those without heart disease), which makes it challenging to draw definitive conclusions.
Without optimization, the Random Forest Classifier achieved balanced performance for the minority class on the Cleveland dataset, with a recall of 61% and an F1 score of 0.69, while still maintaining reasonable performance for the VA Long Beach dataset. However, with optimized parameters, the VA Long Beach model completely failed to predict any instances of the minority class, suggesting that the optimization may have tailored the model too closely to the majority class, resulting in overgeneralization.
These outcomes suggest that the small sample size of the minority class amplifies the difficulty in determining whether the reduced performance is due to overgeneralization for the majority class or a lack of adaptability across regions. This underscores the importance of balancing datasets and carefully evaluating the impact of optimization on both regional performance and minority class predictions.