A comprehensive machine learning pipeline for binary classification tasks with automated feature selection, hyperparameter optimization, and model explainability.
- Automated Data Preprocessing: Handles missing values, data cleaning, and type conversions
- Iterative Imputation: Uses decision tree-based imputation for missing data
- Feature Selection: Boruta-SHAP algorithm for robust feature selection
- Multiple Models: Trains 6 different classifiers (LR, SVM, RF, XGBoost, LightGBM, Neural Network)
- Bayesian Optimization: Automated hyperparameter tuning using scikit-optimize
- Comprehensive Evaluation: AUC, calibration, discrimination metrics, and permutation tests
- Model Explainability: SHAP values for feature importance and instance-level explanations
```
pip install -r requirements.txt
```

Edit the configuration in `src/utils.py` to customize pipeline parameters:

```python
config = {
    'random_state': 42,         # Random seed
    'test_size': 0.2,           # Test set proportion
    'cv_folds': 5,              # Cross-validation folds
    'n_bayes_iter': 50,         # Bayesian optimization iterations
    'missing_threshold': 0.4,   # Drop vars with >40% missing
    'boruta_percentile': 90,    # Boruta feature selection threshold
    'boruta_pvalue': 0.05,      # Boruta p-value threshold
    'max_impute_iter': 100,     # Maximum imputation iterations
    'n_permutations': 1000,     # Permutation test iterations
    'output_dir': 'output'      # Output directory
}
```

Run the full pipeline:

```
python main.py
```

Or use the components individually:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

from src.preprocessing import clean_dataframe, drop_high_missing_vars
from src.imputation import DataImputer
from src.feature_selection import BorutaFeatureSelector
from src.model_training import ModelTrainer

# Load and clean data
df = pd.read_csv('data/data.csv')
df_clean = clean_dataframe(df, outcome_col='diagnosis',
                           outcome_mapping={'M': 1, 'B': 0})

# Separate features and outcome
X = df_clean.drop(columns=['diagnosis'])
y = df_clean['diagnosis']

# Impute missing values
imputer = DataImputer()
X_imputed, _ = imputer.fit_transform(X)

# Hold out a test set before feature selection and training
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.2, stratify=y, random_state=42)

# Feature selection
selector = BorutaFeatureSelector(percentile=90)
X_selected = selector.fit_transform(X_train, y_train)

# Train models
trainer = ModelTrainer(n_iter=50, cv=5)
results = trainer.train_all_models(X_train, y_train)
```

After running the pipeline, the following files are generated:
- `imputer.pkl` - Fitted imputer
- `scaler.pkl` - Fitted feature scaler
- `model_lr.pkl`, `model_xgb.pkl`, etc. - Trained models
- `evaluation_results.xlsx` - Comprehensive metrics for all models
- `selected_features.xlsx` - Features selected by Boruta
- `missingness_report.xlsx` - Missing data analysis
- `roc_curves_test.png` - ROC curves for all models
- `shap_bar_xgboost.png` - Feature importance bar plot
- `shap_beeswarm_xgboost.png` - Feature effect beeswarm plot
- `feature_importance_xgboost.csv` - Detailed feature importance
- `explainer_xgboost.pkl` - Saved SHAP explainer
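The `.pkl` artifacts can be reloaded for later prediction. A minimal sketch, assuming they were written with standard `pickle` serialization (the file name and stand-in object below are illustrative, not the pipeline's actual contents):

```python
import pickle

# Stand-in artifact so this sketch runs on its own; in the real pipeline the
# files live under output/ (e.g. output/models/model_xgb.pkl).
artifact = {"name": "xgb", "n_features": 12}
with open("model_xgb.pkl", "wb") as f:
    pickle.dump(artifact, f)

# Reload a saved artifact
with open("model_xgb.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded["name"])  # -> xgb
```

A fitted model loaded this way can then be used directly, e.g. `model.predict_proba(X_new)` for scikit-learn-style estimators.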
All models use Bayesian hyperparameter optimization for best performance:
| Model | Description |
|---|---|
| Logistic Regression | Linear baseline model |
| Support Vector Machine (SVM) | Non-linear kernel methods |
| Random Forest | Ensemble of decision trees |
| XGBoost | Gradient boosting with regularization |
| LightGBM | Fast gradient boosting |
| Neural Network | Multi-layer perceptron |
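As a rough illustration of how such a search might be configured with scikit-optimize, here is a hypothetical search space for the XGBoost model (these particular ranges and the direct use of `BayesSearchCV` are assumptions for illustration; the pipeline's real settings live in `src/model_training.py`):

```python
from skopt import BayesSearchCV
from skopt.space import Real, Integer
from xgboost import XGBClassifier

# Hypothetical search space -- not the pipeline's actual ranges
search_space = {
    'n_estimators': Integer(100, 1000),
    'max_depth': Integer(2, 8),
    'learning_rate': Real(1e-3, 0.3, prior='log-uniform'),
}

opt = BayesSearchCV(
    XGBClassifier(), search_space,
    n_iter=50, cv=5, scoring='roc_auc', random_state=42,
)
# opt.fit(X_train, y_train), then inspect opt.best_params_ / opt.best_estimator_
```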
- AUC (Area Under the ROC Curve)
- Discrimination slope
- Calibration slope
- Observed-to-Expected (O/E) ratio
- Accuracy
- Sensitivity (Recall)
- Specificity
- Positive Predictive Value (PPV/Precision)
- Negative Predictive Value (NPV)
- Permutation test p-value
- Cohen's kappa
- ROC curves
- Calibration curves
- Risk distributions
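Several of the threshold-based metrics above follow directly from the 2x2 confusion matrix; a minimal sketch (the function name and cell arguments are illustrative, not the pipeline's API):

```python
def binary_metrics(tp, fp, tn, fn):
    """Threshold-based classification metrics from confusion-matrix cells."""
    n = tp + fp + tn + fn
    po = (tp + tn) / n  # observed agreement (accuracy)
    # Chance agreement for Cohen's kappa: product of marginal rates per class
    pe = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    return {
        "accuracy": po,
        "sensitivity": tp / (tp + fn),   # recall
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),           # precision
        "npv": tn / (tn + fn),
        "kappa": (po - pe) / (1 - pe),
    }

m = binary_metrics(tp=40, fp=5, tn=45, fn=10)
```

For the cells above this gives accuracy 0.85, sensitivity 0.8, specificity 0.9, and kappa 0.7.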
```
SupervisedML/
├── data/                      # Place your data.csv here
├── src/
│   ├── __init__.py
│   ├── preprocessing.py       # Data cleaning, missingness analysis
│   ├── imputation.py          # IterativeImputer utilities
│   ├── feature_selection.py   # Boruta-SHAP wrapper
│   ├── model_training.py      # Model training with Bayesian optimization
│   ├── evaluation.py          # Metrics, ROC, calibration curves
│   ├── explainability.py      # SHAP analysis
│   └── utils.py               # General helpers (I/O, config, scaling)
├── output/                    # Generated outputs
│   ├── models/                # Saved trained models
│   ├── results/               # CSV/Excel results
│   ├── plots/                 # Visualization outputs
│   └── shap/                  # SHAP analysis outputs
├── main.py                    # Main pipeline script
├── requirements.txt
└── README.md
```
```python
from src.evaluation import ModelEvaluator, plot_roc_curves

evaluator = ModelEvaluator()
results_df = evaluator.evaluate_multiple_models(
    models_dict, X_train, y_train, X_test, y_test
)
plot_roc_curves(models_dict, X_test, y_test,
                save_path='roc_curves.png')
```

```python
from src.explainability import SHAPExplainer

explainer = SHAPExplainer(model, X_train)
explainer.calculate_shap_values(X_test)
explainer.plot_bar(max_display=15)
explainer.plot_beeswarm(max_display=15)

# Explain a single instance
instance_explanation = explainer.explain_instance(
    X_test.iloc[[0]], n_top_features=5
)
```
- Data Format: Ensure your `data.csv` includes:
  - An `id` column for sample identifiers
  - An outcome column (default: `diagnosis`)
  - Feature columns
- Customization: Modify the configuration in `src/utils.py` before running the pipeline
- Model Selection: Review `evaluation_results.xlsx` to compare model performance
- Feature Importance: Check SHAP outputs to understand feature contributions
Note: All models are trained using Bayesian optimization to ensure optimal hyperparameter selection for your specific dataset.
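The permutation-test p-value reported in evaluation can be illustrated with a label-shuffling sketch. The rank-based AUC and the `>=`-counting convention below are a common formulation, not necessarily the pipeline's exact implementation:

```python
import random

def auc(scores, labels):
    # Mann-Whitney formulation: P(random positive outranks random negative)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_pvalue(scores, labels, n_permutations=1000, seed=42):
    observed = auc(scores, labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            hits += 1
    # Add-one smoothing keeps the p-value strictly positive
    return (hits + 1) / (n_permutations + 1)

# Perfectly separated toy scores: observed AUC is 1.0, so shuffled labels
# almost never match it and the p-value comes out small
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p = permutation_pvalue(scores, labels, n_permutations=200)
```

A small p-value indicates the model's AUC is unlikely under the null hypothesis that scores and outcomes are unrelated.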