A reproducible end-to-end ML system covering data ingestion, feature engineering, model comparison, SMOTE-based imbalance correction, and CLI-driven training and evaluation workflows.
Demonstrates how to go from raw CSV → cleaned features → baseline models → reproducible CLI pipeline, with optional SMOTE to address severe class imbalance.
| Goal | Description |
|---|---|
| Prediction Task | Predict whether a patient died during hospitalization (DIED = 1). |
| Dataset | Clinical + administrative attributes (age, sex, length of stay, DRG, charges, diagnosis codes). |
| Class Imbalance | Positive class ≈ 9% → minority recall is the key metric. |
| Key Finding | SMOTE significantly improves DIED=1 recall across multiple models; Decision Tree delivered the strongest baseline. |
| Architecture Diagram | Heart Attack ML Pipeline |
|---|---|
| (image not available) | (image not available) |
- Core: Python, pandas, scikit-learn, imbalanced-learn
- ML Pipeline: modular scripts, SMOTE support, artifact saving
- Environment: virtualenv + VS Code / terminal
- Extras: Jupyter Notebook for EDA & visualizations
<project-dir>/
├─ data/
│ ├─ raw/ # Raw CSV dataset (place whole_table.csv here)
│ └─ processed/ # Cleaned data, metrics, trained models (.joblib)
├─ notebooks/
│ └─ heart_attack_eda.ipynb # EDA + visualizations
├─ src/
│ ├─ __init__.py
│ ├─ config.py # Paths & constants
│ ├─ data_loader.py # CSV loading + basic cleaning (e.g., coercing CHARGES numeric)
│ ├─ features.py # Preprocessing (OHE for categoricals, scaling for numerics)
│ ├─ models.py # Model factory, SMOTE, training logic
│ └─ evaluate.py # Metric utilities (accuracy, recall, reports)
├─ scripts/
│ ├─ train.py # CLI script for model training
│ └─ evaluate_model.py # CLI script for model evaluation
├─ requirements.txt
└─ README.md
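The preprocessing role described for features.py in the tree above (one-hot encoding for categoricals, scaling for numerics) could look roughly like the sketch below. It is not the project's actual code: the column names, the `build_preprocessor` helper, the median imputation step, and the toy rows are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real dataset's headers may differ.
NUMERIC_COLS = ["AGE", "LOS", "CHARGES"]
CATEGORICAL_COLS = ["SEX", "DRG"]


def build_preprocessor():
    """Scale numeric columns and one-hot encode categorical ones."""
    numeric = Pipeline([
        # Assumption: missing numerics are filled with the column median.
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    # Categories unseen during fit are encoded as all zeros at predict time.
    categorical = OneHotEncoder(handle_unknown="ignore")
    return ColumnTransformer([
        ("num", numeric, NUMERIC_COLS),
        ("cat", categorical, CATEGORICAL_COLS),
    ])


# Toy rows invented for the example (not taken from the dataset).
toy = pd.DataFrame({
    "AGE": [63, 71, 58, 80],
    "LOS": [4, 9, 3, 12],
    "CHARGES": [4752.0, float("nan"), 12034.5, 8100.0],
    "SEX": ["M", "F", "F", "M"],
    "DRG": ["121", "122", "121", "123"],
})

features = build_preprocessor().fit_transform(toy)
print(features.shape)  # 3 numeric columns + 2 SEX levels + 3 DRG levels
```

Wrapping both branches in a single `ColumnTransformer` means the same fitted object can be saved alongside the model and reused at evaluation time.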
The project trains six classic ML models:
- Naive Bayes
- KNN
- Decision Tree
- Logistic Regression
- SVM
- MLP (Neural Network)
SMOTE is optionally applied to mitigate heavy class imbalance.
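To make the idea behind SMOTE concrete, here is a deliberately simplified, dependency-free sketch of its core step: a synthetic minority sample is placed on the line segment between an existing minority sample and one of its nearest minority neighbours. This is an illustration only, not the imbalanced-learn implementation the project depends on; the function name, parameters, and toy points are hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy 2-D minority samples, invented for the example.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point is an interpolation, it stays inside the region already spanned by the minority class. Oversampling of this kind is normally applied to the training split only, so that evaluation data keeps its original class balance.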
Metrics evaluated:
- Accuracy
- Recall for class 0 and 1
- Emphasis: positive-class recall (DIED=1)
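The emphasis on positive-class recall can be motivated with a small worked example. The sketch below uses plain Python and made-up labels (100 patients, 9 positives, mirroring the roughly 9% positive rate) to show that a model predicting "survived" for everyone scores high accuracy while finding no DIED=1 cases. The helper names are hypothetical; in practice scikit-learn's `accuracy_score` and `recall_score` compute the same quantities.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_for_class(y_true, y_pred, label):
    """Fraction of samples whose true class is `label` that were predicted as `label`."""
    preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    if not preds_for_label:
        return 0.0  # no samples of this class; recall is undefined, report 0
    return sum(p == label for p in preds_for_label) / len(preds_for_label)


# Made-up labels: 9 positives (DIED=1) out of 100 patients.
y_true = [1] * 9 + [0] * 91
# A trivial "model" that always predicts the majority class.
y_all_negative = [0] * 100

acc = accuracy(y_true, y_all_negative)
recall_died = recall_for_class(y_true, y_all_negative, label=1)
recall_survived = recall_for_class(y_true, y_all_negative, label=0)
print(acc, recall_died, recall_survived)
```

Accuracy alone would rate this trivial model highly, which is why recall is reported per class.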
Below are the key model insights, including feature importance rankings and
a visualization of class distribution before and after applying SMOTE.
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
Put the dataset here:
<project-dir>/data/raw/whole_table.csv
Or pass a custom path via --raw-csv.
Example: Decision Tree + SMOTE
cd <project-dir>
python -m scripts.train --model decision_tree --smote
Example: Logistic Regression (no SMOTE)
python -m scripts.train --model logistic_regression
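For readers curious how flags like those in the commands above might be parsed, here is a minimal argparse sketch. The flags `--model`, `--smote`, and `--raw-csv` come from this README; everything else is an assumption, including the `build_parser` helper, making `--model` required, and the model names other than `decision_tree` and `logistic_regression`. The real scripts/train.py may be organized differently.

```python
import argparse

# Only decision_tree and logistic_regression appear in this README;
# the remaining names are guesses for the other four models.
MODEL_CHOICES = [
    "naive_bayes",
    "knn",
    "decision_tree",
    "logistic_regression",
    "svm",
    "mlp",
]


def build_parser():
    """Build a parser mirroring the flags shown in the example commands."""
    parser = argparse.ArgumentParser(description="Train a model (illustrative sketch).")
    parser.add_argument("--model", choices=MODEL_CHOICES, required=True,
                        help="Which classifier to train.")
    parser.add_argument("--smote", action="store_true",
                        help="Oversample the minority class before training.")
    parser.add_argument("--raw-csv", default="data/raw/whole_table.csv",
                        help="Path to the raw CSV dataset.")
    return parser


# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args(["--model", "decision_tree", "--smote"])
print(args.model, args.smote, args.raw_csv)
```

Using `action="store_true"` keeps SMOTE off unless the flag is passed, which matches the two example invocations.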
Outputs generated:
- data/processed/trained_model.joblib
- data/processed/metrics.json
python -m scripts.evaluate_model --model-path data/processed/trained_model.joblib
jupyter notebook
# open: notebooks/heart_attack_eda.ipynb
The notebook mirrors the full workflow (EDA → preprocessing → modeling → visualization).
- CHARGES may contain "."; such values are coerced to numeric with errors="coerce"
- Missing values handled in preprocessing
- SMOTE activated via --smote
- Minority recall emphasized because DIED=1 is medically critical
- Designed as a teaching example, not a clinical decision tool
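The CHARGES note above can be illustrated with a short pandas sketch. The values are toy data invented for the example, and the project's data_loader.py may apply the coercion differently; the point is only that `pd.to_numeric(..., errors="coerce")` turns unparseable entries such as "." into NaN instead of raising.

```python
import pandas as pd

# Toy values invented for the example; "." stands in for a missing charge.
raw = pd.DataFrame({"CHARGES": ["4752.00", ".", "12034.5", "."]})

# Unparseable strings become NaN rather than raising a ValueError.
raw["CHARGES"] = pd.to_numeric(raw["CHARGES"], errors="coerce")

n_missing = int(raw["CHARGES"].isna().sum())
print(raw["CHARGES"].dtype, n_missing)
```

The resulting NaN values are then left for the preprocessing stage to handle.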
This project is for educational purposes only and not intended for clinical or medical use.



