Hotel Booking Demand Prediction #734

Merged
merged 8 commits into from
Nov 14, 2024
7 changes: 7 additions & 0 deletions Hotel Booking Demand Prediction/Dataset/README.md
@@ -0,0 +1,7 @@
# Hotel Booking Demand Dataset

**Source**: [Hotel Booking Demand - Kaggle](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand)

**Description**:
This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.
All personally identifying information has been removed from the data.
119,391 changes: 119,391 additions & 0 deletions Hotel Booking Demand Prediction/Dataset/hotel_bookings.csv

Large diffs are not rendered by default.

102 changes: 102 additions & 0 deletions Hotel Booking Demand Prediction/Model/README.md
@@ -0,0 +1,102 @@
# Hotel Booking Demand Prediction

## 🎯 **Goal**

Predict booking cancellations from booking data in order to estimate demand for hotel rooms.

## 🧵 **Dataset**

[Hotel Booking Demand Dataset](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand)

## 🧾 **Description**

This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. The problem is binary classification of cancellation status to estimate hotel booking demand.

## 🧮 **What I have done**

1. Exploratory analysis of features: cleaning, preprocessing, and data visualization.
2. Feature engineering:
   * re-categorizing categorical features based on target splits
   * target-encoding high-cardinality categorical features
   * discretizing numerical features with a low number of unique values
3. Feature selection:
   * Statistical tests: Pearson correlation, mutual information scores, ANOVA F-test, Chi-squared test of independence
   * Model-based feature importances using Extremely Randomized Trees
4. Created a holdout set for testing using stratified sampling to preserve the class imbalance ratio.
5. Trained and validated: Logistic Regression, Naive Bayes, K-Nearest Neighbours, Decision Tree, Random Forest, AdaBoost, Multi-Layer Perceptron, and gradient-boosted trees (XGBoost, CatBoost, LightGBM).
6. Model ensembling by averaging predictions under different configurations.
7. Models were tuned and evaluated on ROC-AUC score instead of accuracy, since the target classes are imbalanced.
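
As an illustration of the target-encoding step listed under feature engineering, here is a minimal sketch using a smoothed category mean. The helper `smoothed_target_encode`, the `smoothing` parameter, and the toy values are assumptions made for this example; the notebook's actual encoder may differ. The column names `country` and `is_canceled` are assumed to match the Kaggle CSV.

```python
import pandas as pd


def smoothed_target_encode(train, column, target, smoothing=10.0):
    """Map each category to a smoothed mean of the binary target.

    Categories with few rows are pulled toward the global mean, which
    limits overfitting on rare levels of a high-cardinality feature.
    (Hypothetical helper, not taken from the notebook.)
    """
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + smoothing)
    encoding = weight * stats["mean"] + (1.0 - weight) * global_mean
    return encoding, global_mean


# Toy frame with invented values, for illustration only.
toy = pd.DataFrame({
    "country": ["PRT", "PRT", "PRT", "GBR", "GBR", "FRA"],
    "is_canceled": [1, 1, 0, 0, 0, 1],
})

encoding, fallback = smoothed_target_encode(
    toy, "country", "is_canceled", smoothing=2.0
)

# Categories not seen when fitting fall back to the global cancellation rate.
toy["country_te"] = toy["country"].map(encoding).fillna(fallback)
unseen = pd.Series(["ESP"]).map(encoding).fillna(fallback)
```

In practice the encoding would be fitted on training folds only and then applied to validation or holdout rows, so that the target does not leak into the features.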

## 🚀 **Models Implemented**

* Logistic Regression
* Naive Bayes: Gaussian
* K-Nearest Neighbours
* Decision Tree
* Random Forest
* AdaBoost
* Neural network: Multi-layer Perceptron
* Gradient-boosting models: XGBoost, CatBoost, LightGBM
* Model Ensembling: Simple/Power/Weighted averaging
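
The three averaging schemes named in the last item could look roughly like the sketch below. The function names and the sample probabilities are hypothetical, and power averaging is shown under one common definition (mean of probabilities raised to a power); the notebook's exact formula is not shown here.

```python
import numpy as np


def simple_average(preds):
    """Unweighted mean of the models' predicted probabilities."""
    return np.mean(preds, axis=0)


def power_average(preds, power=2.0):
    """Mean of probabilities raised to a power, emphasising confident predictions."""
    return np.mean(np.power(preds, power), axis=0)


def weighted_average(preds, weights):
    """Weighted mean; weights are normalised so they sum to 1."""
    weights = np.asarray(weights, dtype=float)
    return np.average(preds, axis=0, weights=weights / weights.sum())


# Rows are models, columns are bookings; the numbers are made up.
preds = np.array([
    [0.9, 0.2, 0.6],
    [0.8, 0.4, 0.5],
    [0.7, 0.3, 0.1],
])

simple = simple_average(preds)
powered = power_average(preds, power=2.0)
weighted = weighted_average(preds, weights=[1, 1, 2])
```

Power-averaged values are no longer calibrated probabilities, but ROC-AUC depends only on the ranking of the scores, so they can still be compared on that metric.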

## 📚 **Libraries Needed**

* Pandas
* NumPy
* Scikit-learn
* XGBoost
* CatBoost
* LightGBM
* Matplotlib
* Seaborn

## 📊 **Exploratory Data Analysis Results**

**Feature distributions**
![Image](../Images/featdist_leadtime.png)
![Image](../Images/featdist_arrivalweek.png)
![Image](../Images/featdist_arrivaldayofmonth.png)
![Image](../Images/featdist_staysweekend.png)
![Image](../Images/featdist_staysweekday.png)
![Image](../Images/featdist_totalstay.png)
![Image](../Images/featdist_adults.png)
![Image](../Images/featdist_adr.png)

**Feature selection**:
Correlation between features:
![Image](../Images/featselect_corrfeatures.png)
Correlation with target:
![Image](../Images/featselect_corrtarget.png)
Mutual Information:
![Image](../Images/featselect_mutualinfo.png)
Model-based feature importances:
![Image](../Images/featselect_modelfimp.png)
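
For readers who want to reproduce scores of this kind, the following sketch computes mutual information, ANOVA F-statistics, and Extremely Randomized Trees importances with scikit-learn. It runs on synthetic data from `make_classification`, not on the hotel bookings, and the hyperparameters are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import f_classif, mutual_info_classif

# Synthetic stand-in: with shuffle=False the first 3 columns are informative
# and the remaining 3 are noise.
X, y = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)

# Statistical scores: mutual information and ANOVA F-test per feature.
mi_scores = mutual_info_classif(X, y, random_state=0)
f_scores, p_values = f_classif(X, y)

# Model-based importances from Extremely Randomized Trees.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Feature indices ordered from most to least important according to the forest.
ranking = np.argsort(-importances)
```

The different scores do not always agree, which is one reason to look at several of them side by side before dropping a feature.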

## 📈 **Performance of the Models**

Models were evaluated on ROC-AUC score because of the imbalanced class ratio.

| Model configuration | ROC-AUC Score
|:-----|:-----:
| Logistic Regression | 0.8470
| Gaussian Naive Bayes | 0.7944
| K-Nearest Neighbours | 0.8810
| Decision Tree | 0.8820
| Random Forest | 0.8958
| AdaBoost | 0.8959
| Multi-layer Perceptron | 0.9039
| XGBoost | 0.9138
| LightGBM | 0.9146
| CatBoost | 0.9154
| Simple averaging | 0.9108
| Power averaging | 0.9062
| **Weighted averaging** | **0.9159**
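
The scores in the table are ROC-AUC values on a stratified holdout set. As a rough sketch of that evaluation setup, the example below uses synthetic data with an arbitrary 70/30 class imbalance and a plain logistic regression; none of the numbers it produces correspond to the table.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: roughly 70% negatives, 30% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.7, 0.3], random_state=0
)

# stratify=y keeps the class ratio the same in the training and holdout sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, proba)
acc = accuracy_score(y_test, model.predict(X_test))

# A constant prediction can look acceptable on accuracy under imbalance,
# but it has no ranking ability, so its ROC-AUC is 0.5.
majority_acc = max(np.mean(y_test), 1 - np.mean(y_test))
constant_auc = roc_auc_score(y_test, np.zeros(len(y_test)))
```

Comparing `majority_acc` with `constant_auc` shows why accuracy can flatter a trivial model on imbalanced targets while ROC-AUC does not.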

## 📢 **Conclusion**

Trained a variety of models and built ensembles using averaging methods. Since the target classes are imbalanced, models were evaluated on ROC-AUC score. The weighted-averaging ensemble performed best (0.9159), narrowly ahead of the best single model, CatBoost (0.9154).

## ✒️ **Your Signature**

Siddhant Tiwari
([GitHub](https://www.github.com/siddhant4ds) - [Kaggle](https://www.kaggle.com/sid4ds) - [LinkedIn](https://www.linkedin.com/in/siddhant-tiwari-ds/))

Large diffs are not rendered by default.

26 changes: 26 additions & 0 deletions Hotel Booking Demand Prediction/README.md
@@ -0,0 +1,26 @@
# Hotel Booking Demand Prediction

## Project structure

.
├── Dataset
│   ├── hotel_bookings.csv
│   └── README.md
├── Images
│   ├── featdist_adr.png
│   ├── featdist_adults.png
│   ├── featdist_arrivaldayofmonth.png
│   ├── featdist_arrivalweek.png
│   ├── featdist_leadtime.png
│   ├── featdist_staysweekday.png
│   ├── featdist_staysweekend.png
│   ├── featdist_totalstay.png
│   ├── featselect_corrfeatures.png
│   ├── featselect_corrtarget.png
│   ├── featselect_modelfimp.png
│   └── featselect_mutualinfo.png
├── Model
│   ├── eda_modeling_ensembling.ipynb
│   └── README.md
├── requirements.txt
└── README.md
8 changes: 8 additions & 0 deletions Hotel Booking Demand Prediction/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
pandas==2.2.1
numpy==1.26.4
matplotlib==3.8.4
seaborn==0.13.2
scikit-learn==1.5.0
xgboost==2.1.0
catboost==1.2.5
lightgbm==4.5.0