This project builds and evaluates a credit risk classification model using XGBoost and Scikit-Learn.
The goal is to predict whether a loan applicant is likely to be a good or bad credit risk, based on structured features from the classic German Credit dataset.
Pipeline summary:
- EDA: dataset inspection, class balance, and feature typing.
- Feature Engineering: numeric / categorical separation, encoding, and data preparation.
- Modeling: baseline XGBoost training with log-loss objective.
- Evaluation: accuracy, ROC-AUC, confusion matrix, and class-wise metrics.
- Hyperparameter Tuning: randomized search with stratified 5-fold CV.
- Threshold Optimization: F1/precision-recall trade-off for class 0 (bad credit).
- Model Export: final tuned model serialized with `joblib`.
```
credit-classification-xgboost/
├── data/                      # Input data (not versioned)
│   └── german_credit_cleaned.csv
├── notebooks/
│   └── 01_eda.ipynb           # Main Jupyter workflow
├── models/
│   └── xgb_best_model.joblib  # Trained XGBoost model
├── src/
│   └── credit_risk/           # Future Python package (helpers, pipelines)
├── scripts/                   # Automation or CLI utilities
├── reports/                   # Figures, plots, analysis outputs
│   └── figures/
├── docs/                      # Optional documentation
├── requirements.txt           # Environment dependencies
└── README.md
```
Python 3.13 (virtual environment)

Install dependencies:

```
pip install -r requirements.txt
```

Run the main notebook:

```
jupyter notebook notebooks/01_eda.ipynb
```

Outputs:
- Baseline + tuned XGBoost metrics
- ROC curve and threshold analysis
- Serialized model in `models/xgb_best_model.joblib`
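Reloading the exported model for scoring is a one-liner with `joblib`. A minimal sketch, using a small stand-in estimator and a temporary path; in the repo the artifact lives at `models/xgb_best_model.joblib` and the tuned XGBoost model takes the stand-in's place:

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression

# Stand-in for the tuned XGBoost model; any fitted scikit-learn-compatible
# estimator serializes the same way.
clf = LogisticRegression().fit([[0.0], [0.1], [0.9], [1.0]], [0, 0, 1, 1])

path = os.path.join(tempfile.gettempdir(), "xgb_best_model.joblib")
joblib.dump(clf, path)        # in the repo: models/xgb_best_model.joblib
restored = joblib.load(path)  # reload in a later session for scoring
pred = restored.predict([[0.95]])
```

`joblib` handles the large NumPy arrays inside fitted estimators more efficiently than plain `pickle`, which is why scikit-learn's docs recommend it for model persistence.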
- `StratifiedKFold` ensures balanced folds for the binary target.
- Randomized search (40 iterations) achieved ROC-AUC ≈ 0.79.
- Decision threshold tuning improved interpretability for loan approval policy.
- The notebook is modular: each markdown section corresponds to a distinct training phase.