
Transformation 1: Cleveland Only

Optimizing Feature Engineering

Two custom transformers are applied in the first transformation. The custom transformers, “Log Transformer” and “Square Transformer”, are defined using `BaseEstimator` and `TransformerMixin` to enable specialized transformations within a preprocessing pipeline. The “Log Transformer” applies a logarithmic transformation (`log1p`, which calculates log(x+1)) to specified columns, helping to reduce skewness and handle wide-ranging values. It includes an `__init__` method to specify the columns for transformation, a `fit` method as a placeholder for pipeline compatibility, and a `transform` method that applies the logarithmic transformation to the selected columns.

Similarly, the “Square Transformer” squares the values of specified columns to emphasize larger differences or handle non-linear relationships. Like the “Log Transformer”, it has an `__init__` method for defining the columns to transform, a `fit` method for compatibility, and a `transform` method that performs the squaring operation. These custom transformers provide flexibility for preprocessing specific features in a dataset and integrate seamlessly into scikit-learn pipelines.
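The two transformers described above can be sketched as follows. This is a minimal version under assumptions: the class names, column names (`trestbps`, `thalach`), and sample values are illustrative, not taken from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class LogTransformer(BaseEstimator, TransformerMixin):
    """Applies log1p (log(x + 1)) to the specified columns."""

    def __init__(self, columns):
        self.columns = columns  # columns to transform

    def fit(self, X, y=None):
        # Nothing to learn; present only for pipeline compatibility.
        return self

    def transform(self, X):
        X = X.copy()
        X[self.columns] = np.log1p(X[self.columns])
        return X


class SquareTransformer(BaseEstimator, TransformerMixin):
    """Squares the values of the specified columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.columns] = X[self.columns] ** 2
        return X


# Hypothetical columns: resting blood pressure and maximum heart rate.
df = pd.DataFrame({"trestbps": [120.0, 140.0], "thalach": [150.0, 170.0]})
df = LogTransformer(columns=["trestbps"]).fit_transform(df)
df = SquareTransformer(columns=["thalach"]).fit_transform(df)
```

Because both classes inherit from `TransformerMixin`, they gain `fit_transform` for free and can be dropped directly into a scikit-learn `Pipeline`.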

Logarithmic and square transformations are applied to specific features in the dataset to enhance data preprocessing. The logarithmic transformation is applied to resting blood pressure and cholesterol to reduce skewness and normalize their distributions, making these features better suited for machine learning models. The square transformation is applied to maximum heart rate to capture potential non-linear relationships and emphasize larger differences in the data. These transformations help tailor the preprocessing pipeline to the characteristics of the dataset.

Additionally, numeric features such as age, sex, chest pain, fasting blood sugar, resting electrocardiogram results, exercise-induced angina, oldpeak, and slope are standardized using `StandardScaler` to ensure they have a mean of zero and a standard deviation of one. Categorical features, including number of major vessels (ca) and thalassemia (thal), are converted into binary format using `OneHotEncoder`. Any columns not explicitly specified for transformation are dropped, ensuring the preprocessing pipeline is both precise and adaptable to the dataset’s needs.
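A minimal sketch of how this scaling, encoding, and column-dropping step can be wired together with `ColumnTransformer`. The column lists and sample values are assumptions based on the standard heart-disease schema, not the original code, and only a subset of columns is shown for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column subsets (assumed names).
scaled_cols = ["age", "oldpeak"]
onehot_cols = ["ca", "thal"]

preprocessor = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), scaled_cols),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), onehot_cols),
    ],
    remainder="drop",  # unspecified columns are dropped, as described above
)

df = pd.DataFrame({
    "age": [54, 61, 45],
    "oldpeak": [1.0, 2.3, 0.0],
    "ca": [0, 1, 2],
    "thal": [3, 6, 7],
    "ignored": [9, 9, 9],  # not listed, so it is dropped
})
X = preprocessor.fit_transform(df)
# 2 scaled columns + 3 one-hot levels for ca + 3 for thal = 8 output columns
```

Setting `remainder="drop"` (the default) is what makes the pipeline "precise": only explicitly listed columns survive preprocessing.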

Experiment 1: Random Forest Classifier Applied to Data Transformation.

The first model attempt demonstrated solid performance, achieving an overall accuracy of 85%. From the confusion matrix, the model correctly identified 29 negative cases (Class 0) and 22 positive cases (Class 1), while misclassifying 3 negatives as positives (false positives) and 6 positives as negatives (false negatives). For individuals without heart disease (Class 0), the precision, recall, and F1-score were 0.83, 0.91, and 0.87, respectively, showcasing the model’s strong ability to identify negative cases. For individuals with heart disease (Class 1), the precision was 0.88, recall was 0.79, and the F1-score was 0.83. The slightly lower recall indicates the model missed some positive cases. Both the macro and weighted averages for precision, recall, and F1-score were 0.85, reflecting balanced performance across both classes. While the model performs slightly better at identifying negative cases, its high precision for both classes suggests that most predictions are accurate.

To further evaluate these results, the test data underwent K-Fold Cross-Validation.  K-Fold Cross-Validation is a resampling technique used to evaluate and validate machine learning models by splitting the dataset into multiple subsets (folds). It helps assess a model’s performance by testing it on different subsets of the data, ensuring that the evaluation is robust and not overly dependent on a particular split. The dataset is divided into K equal-sized folds. For example, with 5-fold cross-validation (K=5), the dataset is split into 5 subsets. In each iteration, one-fold is used as the validation set, and the remaining K-1 folds are used as the training set. This process is repeated K times, where each fold gets to be the validation set once. After each iteration, the model’s performance (e.g., accuracy, precision, recall, or F1-score) is calculated on the validation set. The final model performance is obtained by averaging the metrics across all K iterations.
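The procedure just described can be sketched with scikit-learn's `KFold` and `cross_val_score`. The data here is synthetic (`make_classification`), standing in for the heart-disease split used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; the original study used the Cleveland split.
X, y = make_classification(n_samples=200, n_features=10, random_state=123)

model = RandomForestClassifier(n_estimators=50, random_state=123)

# 5 folds: each fold serves as the validation set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# The reported metric is the average across the 5 folds.
print(scores.mean())
```

Passing `scoring="precision"`, `"recall"`, or `"f1"` in place of `"accuracy"` yields the other per-fold metrics discussed below.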

The cross-validation results highlight that the Random Forest model is generally reliable, with a mean accuracy of 80% and balanced precision (0.775), recall (0.77), and F1-score (0.77). However, the variability between folds, particularly in recall (ranging from 0.60 to 1.00), suggests some inconsistencies in detecting positive cases depending on the data fold. This variability may indicate room for improvement, such as hyperparameter tuning or addressing class imbalances. Overall, the model performs well but can be further refined for more stable performance across folds.

Parameter Grid Search.

Parameter Grid Search was used to improve performance. Parameter Grid Search, commonly referred to as Grid Search, is a technique used in machine learning to systematically find the best combination of hyperparameters for a model. The goal is to optimize the model’s performance by evaluating it with different hyperparameter settings. Machine learning models often have hyperparameters that cannot be learned directly from the data (e.g., regularization strength, tree depth, or kernel type). The choice of these hyperparameters significantly impacts the model’s performance. Grid Search automates the process of trying multiple combinations of hyperparameters to determine which configuration performs best.

The model evaluated 80 different hyperparameter combinations using 3-fold cross-validation, resulting in a total of 240 fits (80 candidates × 3 folds). The optimal combination of hyperparameters was found to be a `max_depth` of 2, `n_estimators` of 50, and `random_state` set to 123.
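A sketch of such a grid search with `GridSearchCV`. The grid below is an assumption: only the best-found `max_depth=2` and `n_estimators=50` are reported in the text, so the value ranges and the third axis (`min_samples_split`) are hypothetical, chosen only to reproduce the 80-candidate × 3-fold = 240-fit shape.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for illustration.
X, y = make_classification(n_samples=100, n_features=10, random_state=123)

# Hypothetical grid: 4 x 5 x 4 = 80 candidate combinations.
param_grid = {
    "max_depth": [2, 3, 4, 5],
    "n_estimators": [10, 20, 30, 40, 50],
    "min_samples_split": [2, 4, 6, 8],  # assumed third axis
}

search = GridSearchCV(
    RandomForestClassifier(random_state=123),  # random_state held fixed
    param_grid,
    cv=3,                 # 80 candidates x 3 folds = 240 fits
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_)
```

`GridSearchCV` refits the estimator on the full training data with the winning combination, so `search.best_estimator_` is ready to use directly.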

Application of Optimized Parameters for Random Forest Classifier.

Based on the model’s performance, the first model achieved higher accuracy on the test set by about 2%, with a better overall balance between precision and recall as reflected in its higher F1 score. It also demonstrated superior precision, with fewer false positives, and better recall, indicating fewer false negatives. 

However, the cross-validation results show that the second model performed slightly better overall, particularly in mean accuracy and recall across folds, suggesting greater consistency during validation.

In summary, the first model performs better on the test set, with higher accuracy, F1 score, and reduced false negatives. Meanwhile, the second model demonstrates slightly better stability during cross-validation, particularly in accuracy and recall. 

XGBoost Classifier Applied to Data Transformation.

The XGBoost Classifier achieved strong overall performance with an accuracy of 0.85 on the test set. The model performed well across both classes, as reflected in the consistent macro and weighted averages for precision, recall, and F1-score, all at 0.85.

For Class 0 (Negative), the model achieved a precision of 0.83, a recall of 0.91, and an F1-score of 0.87, indicating it is particularly effective at identifying true negatives. For Class 1 (Positive), the model delivered a precision of 0.88, a recall of 0.79, and an F1-score of 0.83, showing strong precision with fewer false positives.

Overall, the XGBoost Classifier demonstrates balanced and reliable performance, achieving a good trade-off between precision and recall for both classes.

The XGBoost Classifier achieved a mean accuracy of 75% across the folds, showing moderate consistency. The mean precision of 0.7167 and mean recall of 0.7329 indicate a good balance between false positives and false negatives, while the mean F1 score of 0.7155 reflects overall balanced performance. However, there is some variability in the results, particularly in recall, which ranges from 0.4 to 1.0, suggesting the need for further tuning to improve stability.

Grid Search Overview.

The XGBoost Classifier tested 108 combinations of settings using 5-fold cross-validation. The best results came with: `colsample_bytree` = 1.0, `learning_rate` = 0.1, `max_depth` = 2, `n_estimators` = 50, and `subsample` = 0.8. This gave a cross-validation accuracy of 83.54%, showing good performance.

Application of Optimized Parameters with XGBoost Classifier.

The performance of the XGBoost Classifier on the test data, using optimized parameters, dropped slightly to 80% accuracy. For individuals without heart disease, the model achieved a precision of 0.76, recall of 0.91, and an F1-score of 0.83. For individuals with heart disease, the model recorded a precision of 0.86, recall of 0.68, and an F1-score of 0.76.

The cross-validation results were consistent, with a mean accuracy of 0.80, mean precision of 0.7538, mean recall of 0.8014, and a mean F1-score of 0.7755. 

Although the XGBoost Classifier initially performed worse on the test set before parameter optimization, it showed significantly improved consistency after applying K-fold cross-validation.

Comparison of Model Performance: XGBoost Classifier vs. Random Forest Classifier

The Random Forest Classifier performed best when run with optimized parameters. While the original model performed better on the test set, the cross-validation results show that the optimized model performed slightly better overall, particularly in mean accuracy and recall across folds, suggesting greater consistency during validation.

The same is true for the XGBoost Classifier: although it initially performed worse on the test set before parameter optimization, it showed improved consistency after applying K-fold cross-validation.

The XGBoost Classifier and the Random Forest Classifier performed similarly in terms of mean accuracy, precision, recall, and F1 score during cross-validation. While the XGBoost Classifier had a slightly higher recall (about 1%), the Random Forest Classifier demonstrated marginally better precision (just under 2%).

Ensemble Method.

Ensemble methods are machine learning techniques that combine multiple models to achieve better performance than any individual model could alone. By aggregating predictions from several models, ensemble methods reduce variance and bias and improve generalization. The key idea is that a group of “weak learners” can work together to form a stronger, more robust “ensemble” model.

In this study, we utilized an ensemble method to optimize classification performance by combining three distinct machine learning models: Random Forest Classifier, XGBoost Classifier, and Logistic Regression.

Base Classifiers

Three models were selected to form the ensemble due to their complementary strengths. The Random Forest Classifier is a bagging-based method known for its ability to reduce variance and handle overfitting by training multiple decision trees on bootstrap samples of the data. For this model, we set the following hyperparameters: `n_estimators=50` (number of trees), `max_depth=2` (to limit tree depth), and `random_state=123` (for reproducibility).

The XGBoost Classifier, a boosting algorithm, was chosen for its ability to iteratively improve predictions by focusing on misclassified instances from prior iterations. The hyperparameters were optimized as follows: `colsample_bytree=1.0` (use all features for splits), `learning_rate=0.1` (to control the step size), `max_depth=2` (limit tree depth), `n_estimators=50` (number of boosting rounds), `subsample=0.8` (sample 80% of the data at each iteration), and `random_state=42` (for reproducibility). XGBoost is particularly effective at capturing complex relationships in data while minimizing bias.

Lastly, Logistic Regression was incorporated as a linear model baseline. This model is widely used for its simplicity and interpretability, offering a linear decision boundary for classification. The `solver='saga'` setting was used for optimization, with `penalty='l2'` to apply ridge regularization and `max_iter=10000` to ensure convergence during training. A random state of 42 was applied to ensure consistency.

The ensemble model was constructed using a Voting Classifier with soft voting. In soft voting, the predicted probabilities from each base model are averaged, and the class with the highest probability is assigned as the final prediction.
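A sketch of this soft-voting ensemble. To keep the example dependency-free, `GradientBoostingClassifier` stands in for the XGBoost member; in the actual study it would be `xgboost.XGBClassifier` with the parameters listed above. The synthetic data is also a stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for the heart-disease split.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=123)
# Stand-in for XGBClassifier, configured with the shared boosting parameters.
gb = GradientBoostingClassifier(learning_rate=0.1, max_depth=2,
                                n_estimators=50, subsample=0.8,
                                random_state=42)
lr = LogisticRegression(solver="saga", penalty="l2", max_iter=10000,
                        random_state=42)

# Soft voting averages the predicted class probabilities of the base models
# and picks the class with the highest averaged probability.
ensemble = VotingClassifier(
    estimators=[("rf", rf), ("xgb", gb), ("lr", lr)],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```

Soft voting requires every base estimator to expose `predict_proba`, which all three models here do; `voting="hard"` would instead take a majority vote over predicted labels.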

Ensemble Method: Model Results Summary.

The updated results show that the ensemble model performs comparably to other models on the test data, achieving an accuracy of 80.00%, a precision of 0.81, a recall of 0.80, and an F1 score of 0.7966. For individuals without heart disease (Class 0), the model delivered a precision of 0.76 and a recall of 0.91, while for individuals with heart disease (Class 1), it achieved a precision of 0.86 and a recall of 0.68.

During cross-validation, the model achieved consistent performance with a mean accuracy of 80.00%, a mean precision of 82.26%, a mean recall of 80.00%, and a mean F1 score of 79.40%. These results indicate that the model effectively balances precision and recall while maintaining stable performance across folds. The strong cross-validation scores suggest that the ensemble model generalizes well to unseen data, highlighting its reliability and robustness.

