A Machine Learning Approach using the PaySim Dataset
Fraud cases are difficult to obtain and practice on, as real records remain highly confidential. A few public datasets exist on Kaggle; PaySim is one of them.
This project compares six machine learning classifiers for detecting fraudulent mobile money transactions, using the synthetic PaySim dataset.
The dataset simulates real transaction patterns from an African mobile money service to enable safe experimentation without revealing confidential data.
Fraud detection is challenging due to:
- Extremely imbalanced data (only 0.129% fraudulent transactions)
- The need to balance accuracy with interpretability in finance
- The high cost of false negatives (missed fraud cases)
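To see why plain accuracy is misleading at this level of imbalance, consider a trivial baseline that predicts every transaction as genuine. Using the row and fraud counts stated in this document (6.36 million rows, 8,213 fraud cases):

```python
# A "predict all genuine" baseline on PaySim's class balance.
# Figures taken from the dataset description in this project.
total, fraud = 6_360_000, 8_213

accuracy = (total - fraud) / total  # every prediction is "genuine"
recall = 0 / fraud                  # zero fraud cases caught

print(f"baseline accuracy: {accuracy:.4%}, recall: {recall:.0%}")
```

The baseline scores roughly 99.87% accuracy while catching no fraud at all, which is why recall and ROC-AUC matter far more here than raw accuracy.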
Dataset:
- Source: PaySim on Kaggle
- Size: 6.36 million rows, 11 columns
- Target variable: `isFraud` (1 = fraud, 0 = genuine)

Preprocessing:
- Removed non-informative columns: `step`, `nameOrig`, `nameDest`, `isFlaggedFraud`
- Label-encoded the categorical `type` column
- Standardized numerical features
- Balanced the dataset via random undersampling (8,213 fraud + 8,213 non-fraud)
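The preprocessing steps above can be sketched as follows. The tiny DataFrame is a hypothetical stand-in for the PaySim frame (only `amount` is used as a numeric feature here for brevity), and the undersampling is done with plain pandas sampling rather than a dedicated library:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical stand-in for the PaySim frame; column names follow the dataset.
df = pd.DataFrame({
    "step": [1, 2, 3, 4],
    "type": ["CASH_OUT", "TRANSFER", "PAYMENT", "CASH_OUT"],
    "amount": [181.0, 181.0, 1000.0, 229133.9],
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "nameDest": ["M1", "C5", "M2", "C6"],
    "isFlaggedFraud": [0, 0, 0, 0],
    "isFraud": [1, 1, 0, 0],
})

# Drop non-informative identifier/flag columns.
df = df.drop(columns=["step", "nameOrig", "nameDest", "isFlaggedFraud"])

# Label-encode the categorical `type` column.
df["type"] = LabelEncoder().fit_transform(df["type"])

# Standardize numerical features (zero mean, unit variance).
df[["amount"]] = StandardScaler().fit_transform(df[["amount"]])

# Random undersampling: keep all fraud rows, sample an equal number
# of genuine rows, then shuffle.
fraud = df[df["isFraud"] == 1]
genuine = df[df["isFraud"] == 0].sample(n=len(fraud), random_state=42)
balanced = pd.concat([fraud, genuine]).sample(frac=1, random_state=42)
print(balanced["isFraud"].value_counts())
```

On the real data this produces the 8,213 + 8,213 balanced set described above.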
Classifiers compared:
- Logistic Regression (LR)
- Naive Bayes (NB)
- Decision Tree (DT)
- Random Forest (RF)
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
Pipeline:
- Feature Selection & Encoding
- Data Standardization
- Class Balancing (Random Undersampling)
- Train-Test Split (70% / 30%)
- Model Training
- Hyperparameter Tuning (Grid Search for LR, SVM, KNN)
- Evaluation using:
- Accuracy
- Precision
- Recall
- F1-score
- ROC-AUC
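The split, tuning, and evaluation stages of the pipeline can be sketched for one model (Logistic Regression). This is a minimal illustration on synthetic data, not the project's exact code; the hyperparameter grid is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Synthetic stand-in for the balanced PaySim features.
X, y = make_classification(n_samples=2000, n_features=6, random_state=42)

# 70% / 30% train-test split, stratified to keep classes balanced.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Grid search over the LR regularization strength (illustrative grid).
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1, 10]}, cv=5, scoring="f1")
grid.fit(X_tr, y_tr)

# Evaluate with the five metrics listed above.
y_pred = grid.predict(X_te)
y_prob = grid.predict_proba(X_te)[:, 1]
for name, score in [("accuracy", accuracy_score(y_te, y_pred)),
                    ("precision", precision_score(y_te, y_pred)),
                    ("recall", recall_score(y_te, y_pred)),
                    ("f1", f1_score(y_te, y_pred)),
                    ("roc-auc", roc_auc_score(y_te, y_prob))]:
    print(f"{name}: {score:.3f}")
```

The same split and metric loop apply to the other five classifiers; only SVM and KNN additionally go through the grid search.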
| Model | Accuracy | Precision | Recall | F1-score | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.910 | 0.961 | 0.853 | 0.904 | N/A |
| Naive Bayes | 0.675 | 0.892 | 0.393 | 0.546 | N/A |
| Decision Tree | 0.992 | 0.989 | 0.995 | 0.992 | 0.992 |
| Random Forest | 0.992 | 0.988 | 0.996 | 0.992 | 0.999 |
| KNN | 0.956 | 0.946 | 0.965 | 0.956 | N/A |
| SVM | 0.914 | 0.959 | 0.863 | 0.909 | N/A |
- Random Forest achieved the highest recall (99.6%) and AUC (0.999), making it the best at catching fraud.
- Decision Tree is almost as accurate (recall: 99.5%) but far more interpretable.
- Naive Bayes underperformed for this dataset, despite good results in credit card fraud literature.
- In finance, interpretability can outweigh minor performance gains, so we favor the Decision Tree for its explainability.
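What makes a single decision tree explainable is that its learned rules can be printed and audited directly, unlike an ensemble of 100 trees. A minimal sketch on toy data (the feature names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data standing in for the balanced PaySim features.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The full rule set prints as nested if/else thresholds.
rules = export_text(
    tree, feature_names=["type", "amount", "balance_before", "balance_after"])
print(rules)
```

Each leaf shows the predicted class and the thresholds that lead to it, which is the kind of artifact a compliance reviewer can actually read.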
Visualizations:
- Feature Importance (Random Forest)
- Confusion Matrices for each classifier
- ROC Curves comparison
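The Random Forest feature-importance figure can be reproduced from the fitted model's impurity-based importances. A sketch on synthetic data; the feature names are an assumption about which PaySim columns survive preprocessing:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative feature names for the columns kept after preprocessing.
feature_names = ["type", "amount", "oldbalanceOrg", "newbalanceOrig",
                 "oldbalanceDest", "newbalanceDest"]
X, y = make_classification(n_samples=1000, n_features=6, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances, sorted most-to-least informative;
# these are the values behind the feature-importance bar chart.
order = np.argsort(rf.feature_importances_)[::-1]
for i in order:
    print(f"{feature_names[i]:>16s}: {rf.feature_importances_[i]:.3f}")
```

The importances sum to 1, so each value reads directly as that feature's share of the forest's split decisions.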
