Tools used: Python (Numpy, Pandas, Scikit-learn, Matplotlib)
Content: Exploratory data analysis, imputing null values, feature engineering, model building and tuning.
Kaggle Notebook: https://www.kaggle.com/code/youusha/spaceship-titanic-80-5-data-imputing-focus
Notes:
This is what I did to get an 80.5% accuracy on my Spaceship Titanic competition submission, as pretty much a beginner. But it is far from perfect and I would really appreciate any constructive feedback on this project.
Since I knew extremely little about all the machine learning models and hyperparameters and what-not when I was working on this, I just decided to follow the basics and do my best to fill null values as accurately as possible. Then I just trained and predicted using the basic models that I knew, and a couple I know nothing about (haha dw I'll learn them soon).
If you wish, you can directly use the final dataset that I created in this notebook and used for prediction here.
And this is the notebook in this competition that helped me a lot on this project. It basically gave me a sense of how to view data, and how to create more features from what I have. The new features that I use are the ones created in this notebook.
With all that out of the way, let's get started :)
First, we are going to import the libraries and modules we will be using:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
And the datasets:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
Let's take a look at the data:
train_df.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
train_df.isna().any()
PassengerId False
HomePlanet True
CryoSleep True
Cabin True
Destination True
Age True
VIP True
RoomService True
FoodCourt True
ShoppingMall True
Spa True
VRDeck True
Name True
Transported False
dtype: bool
Observation: All the columns have null values, except for `PassengerId` and `Transported`.
Let's look at exactly how many nulls we have:
print('Sum of nulls:')
train_df.isna().sum()
Sum of nulls:
PassengerId 0
HomePlanet 201
CryoSleep 217
Cabin 199
Destination 182
Age 179
VIP 203
RoomService 181
FoodCourt 183
ShoppingMall 208
Spa 183
VRDeck 188
Name 200
Transported 0
dtype: int64
So, the first thing we have to do is start filling these null values, or the ML models won't work. First, I want to figure out a way to fill up the `CryoSleep` values.
I am much more comfortable with doing all this on a new copy, just in case I mess up.
train_df_copy = train_df.copy()
Now I want to temporarily make an `Expenses` column. I'm making this column because if someone is in cryosleep, they are not spending any money. So, knowing someone's expenses can help us impute values for `CryoSleep`.
train_df_copy['Expenses'] = train_df_copy[['RoomService', 'FoodCourt',
'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)
Since we can't really guess anyone's age, I'll just impute these null values with the median. An interesting thing I observed is that only people who are 13+ have expenses. I guess the little kids don't get pocket money. Quite unfair.
train_df_copy.Age = train_df_copy.Age.fillna(train_df_copy.Age.median())
Let's take a look at our dataset now:
train_df_copy.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported | Expenses | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False | 0.0 |
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True | 736.0 |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False | 10383.0 |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False | 5176.0 |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True | 1091.0 |
We can see our newly created column after `Transported`. Funnily enough, the first person in the dataset, Maham, didn't spend any money, and they weren't even in cryosleep. Maybe they are broke? In that case I can relate with them.
This is where the real fun begins. First, we're going to make a new column for cryosleep, with all values equal to False (or 0):
train_df_copy['Cryosleep'] = 0
Now, for every row where `Expenses` is `0`, we're going to put `1` as the value, because if someone has not spent any money, they are probably in cryosleep. But don't worry, we'll deal with the exceptions, like Maham, later.
train_df_copy.loc[train_df_copy['Expenses'] == 0, 'Cryosleep'] = 1
Now, we are going to set this column's value to `1` wherever the original `CryoSleep` is equal to True.
train_df_copy.loc[train_df_copy.CryoSleep.astype('str') == 'True', 'Cryosleep'] = 1
Conversely, we will put it to `0` wherever `CryoSleep` is False.
train_df_copy.loc[train_df_copy.CryoSleep.astype('str') == 'False', 'Cryosleep'] = 0
Let's take a look at this new column now:
train_df_copy.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported | Expenses | Cryosleep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False | 0.0 | 0 |
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True | 736.0 | 0 |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False | 10383.0 | 0 |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False | 5176.0 | 0 |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True | 1091.0 | 0 |
What we have done here is:
- First, we set all values for cryosleep as false.
- Next, we set cryosleep as true for everyone who hasn't spent any money.
- Finally, we used the original `CryoSleep` column to correct the cryosleep status for the people who haven't spent any money but aren't in cryosleep, just in case our last step incorrectly classified them as being in cryosleep.
Logical, right?
Now, let's just replace the original column with this one. There's probably a better way of doing this than how I did it here:
train_df_copy['Cryosleep'] = train_df_copy['Cryosleep'].astype('bool')
train_df_copy['CryoSleep'] = train_df_copy['Cryosleep']
train_df_copy.drop('Cryosleep',axis=1,inplace=True)
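For what it's worth, the three steps above can be collapsed into a single `fillna`: keep the known `CryoSleep` values, and where they're missing, assume cryosleep exactly when the passenger spent nothing. A sketch on a toy frame (made-up values, not the real data):

```python
import pandas as pd

# Toy frame mirroring the relevant columns (made-up values)
df = pd.DataFrame({
    'CryoSleep': [True, None, None, False],  # two missing values
    'Expenses':  [0.0, 0.0, 500.0, 0.0],
})

# Where CryoSleep is missing, assume cryosleep iff the passenger spent nothing;
# known values are kept as-is.
df['CryoSleep'] = df['CryoSleep'].fillna(df['Expenses'] == 0).astype(bool)

print(df['CryoSleep'].tolist())  # [True, True, False, False]
```

This should give the same result as the three `.loc` assignments above, without the temporary column.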
Let's take another look at our dataset now:
train_df_copy.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported | Expenses | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False | 0.0 |
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True | 736.0 |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False | 10383.0 |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False | 5176.0 |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True | 1091.0 |
We have now replaced the values of our original `CryoSleep` column, which had missing values, with the values of our newly created `Cryosleep` column, which doesn't have any null values. Then we dropped the new column.
The new column also states accurately that Maham is not in cryosleep, and he still hasn't spent any money on amenities, i.e., `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa` and `VRDeck`.
The new column shouldn't have any null values now. Let's check just in case:
train_df_copy.CryoSleep.isnull().any()
False
Since the only important person in this dataset is Maham, we don't need the names column. (Or maybe we actually do and could use it to further improve prediction, but I'm just not good enough to figure out how to do that yet.)
train_df_copy.drop('Name',axis=1,inplace=True)
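One idea for using the names, sketched here but not actually used in this notebook: extract the surname and count how many passengers share it, since people travelling with family might share a fate. The names below are from the `head()` output above; the feature itself is hypothetical:

```python
import pandas as pd

# Hypothetical sketch: surname and family size as features (not used in this notebook)
names = pd.Series(['Maham Ofracculy', 'Juanna Vines', 'Altark Susent', 'Solam Susent'])

surnames = names.str.split().str[-1]              # last whitespace-separated token
family_size = surnames.map(surnames.value_counts())  # how many passengers share it

print(surnames.tolist())     # ['Ofracculy', 'Vines', 'Susent', 'Susent']
print(family_size.tolist())  # [1, 1, 2, 2]
```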
Now for the amenities: we can easily impute null values for rows where `CryoSleep` == True, since we know they are going to be zero, as the person is in cryosleep.
train_df_copy.loc[train_df_copy.CryoSleep == True,['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']] = 0
train_df_copy.loc[train_df_copy.CryoSleep == True,['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']].isna().sum()
RoomService 0
FoodCourt 0
ShoppingMall 0
Spa 0
VRDeck 0
dtype: int64
Before dealing with the rest of the amenities' values, let's make some more new columns to aid us.
train_df_copy['Adults'] = train_df_copy['Age'] >= 13
I know 13-year-olds aren't adults, okay. What I mean is that they are able to spend money at this age. Unlike a certain someone we know of. If someone has any spare change, do let me know.
I'm not picking on Maham, I just want him to enjoy his journey on the spaceship titanic to the absolute fullest, especially with the tragedy that happens. To be honest, I am extremely happy that he didn't get transported into who-knows-what dimension. He is still with us, and we are all grateful for that, I am sure.
Jokes aside, let's make a column now that tells us if someone is 13+ and is spending money.
train_df_copy['Adult_and_spending'] = (train_df_copy['Expenses'] > 0) & (train_df_copy['Age'] >=13)
Let's take a look at the rows that are True for our new `Adult_and_spending` column:
train_df_copy.loc[train_df_copy.Adult_and_spending == True]
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Expenses | Adults | Adult_and_spending | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | 736.0 | True | True |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | 10383.0 | True | True |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | 5176.0 | True | True |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | 1091.0 | True | True |
5 | 0005_01 | Earth | False | F/0/P | PSO J318.5-22 | 44.0 | False | 0.0 | 483.0 | 0.0 | 291.0 | 0.0 | True | 774.0 | True | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8687 | 9275_03 | Europa | False | A/97/P | TRAPPIST-1e | 30.0 | False | 0.0 | 3208.0 | 0.0 | 2.0 | 330.0 | True | 3540.0 | True | True |
8688 | 9276_01 | Europa | False | A/98/P | 55 Cancri e | 41.0 | True | 0.0 | 6819.0 | 0.0 | 1643.0 | 74.0 | False | 8536.0 | True | True |
8690 | 9279_01 | Earth | False | G/1500/S | TRAPPIST-1e | 26.0 | False | 0.0 | 0.0 | 1872.0 | 1.0 | 0.0 | True | 1873.0 | True | True |
8691 | 9280_01 | Europa | False | E/608/S | 55 Cancri e | 32.0 | False | 0.0 | 1049.0 | 0.0 | 353.0 | 3235.0 | False | 4637.0 | True | True |
8692 | 9280_02 | Europa | False | E/608/S | TRAPPIST-1e | 44.0 | False | 126.0 | 4688.0 | 0.0 | 0.0 | 12.0 | True | 4826.0 | True | True |
5040 rows × 16 columns
So there are 5040 people who are 13+ and are spending money.
Now we are going to impute the values for our amenities.
Wherever `Adult_and_spending` is False, the person either is below 13, which means they definitely haven't spent on any amenities, or has zero recorded expenses (for example, because they are in cryosleep), which again means they haven't spent on amenities.
So, wherever `Adult_and_spending` == False, we'll impute the amenities with `0`; for everyone else, remaining nulls get the column mean.
train_df_copy.RoomService = train_df_copy.RoomService.fillna(train_df_copy.RoomService.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'RoomService'] = 0
train_df_copy.FoodCourt = train_df_copy.FoodCourt.fillna(train_df_copy.FoodCourt.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'FoodCourt'] = 0
train_df_copy.ShoppingMall = train_df_copy.ShoppingMall.fillna(train_df_copy.ShoppingMall.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'ShoppingMall'] = 0
train_df_copy.Spa = train_df_copy.Spa.fillna(train_df_copy.Spa.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'Spa'] = 0
train_df_copy.VRDeck = train_df_copy.VRDeck.fillna(train_df_copy.VRDeck.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'VRDeck'] = 0
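As a side note, the five near-identical fill blocks above can be collapsed into one loop over the amenity columns. A sketch on a toy frame (made-up values), assuming the same mean-then-zero logic:

```python
import pandas as pd
import numpy as np

amenities = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Toy frame standing in for train_df_copy (made-up values)
df = pd.DataFrame({
    'RoomService':  [100.0, np.nan, 0.0],
    'FoodCourt':    [np.nan, 50.0, 0.0],
    'ShoppingMall': [0.0, 0.0, np.nan],
    'Spa':          [0.0, 0.0, 0.0],
    'VRDeck':       [0.0, 0.0, 0.0],
    'Adult_and_spending': [True, True, False],
})

for col in amenities:
    df[col] = df[col].fillna(df[col].mean())    # mean for adult spenders
    df.loc[~df['Adult_and_spending'], col] = 0  # zero for everyone else

print(df[amenities].isna().sum().sum())  # 0
```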
Neat. Now we are done with imputing these columns as well.
Let's take a look:
train_df_copy[['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']].isna().sum()
RoomService 0
FoodCourt 0
ShoppingMall 0
Spa 0
VRDeck 0
dtype: int64
Perfect.
For the remaining columns, we can't figure out what values to fill in this manner. So we are just going to fill them with the value that the majority of people have in the dataset, i.e., the mode.
train_df_copy.HomePlanet.mode()
0 Earth
Name: HomePlanet, dtype: object
train_df_copy.Destination.mode()
0 TRAPPIST-1e
Name: Destination, dtype: object
train_df_copy.VIP.mode()
0 False
Name: VIP, dtype: object
So, these are the values we will be imputing with.
train_df_copy.HomePlanet = train_df_copy.HomePlanet.fillna('Earth')
train_df_copy.Destination = train_df_copy.Destination.fillna('TRAPPIST-1e')
train_df_copy.VIP = train_df_copy.VIP.fillna(False)  # fill with the boolean, not the string 'False', which would become True under astype('bool')
train_df_copy.VIP = train_df_copy.VIP.astype('bool')
Aaand done!
Let's see how far along we are:
train_df_copy.isnull().sum()
PassengerId 0
HomePlanet 0
CryoSleep 0
Cabin 199
Destination 0
Age 0
VIP 0
RoomService 0
FoodCourt 0
ShoppingMall 0
Spa 0
VRDeck 0
Transported 0
Expenses 0
Adults 0
Adult_and_spending 0
dtype: int64
The cabin is the only column that remains with null values!
Filling this properly is not easy given my limited skill, so I am just going to use ffill to fill these null values. What that does is use the previous non-null value to impute the missing one.
So, for example, if we have a dataset like:
[1, 2, 3, null, 4]
If we use ffill on this, it'll become:
[1, 2, 3, 3, 4].
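The example above can be reproduced directly with pandas' `.ffill()` (note the ints come back as floats, since the NaN forces a float dtype):

```python
import pandas as pd

# Forward-fill demo: the None is replaced by the previous value, 3
s = pd.Series([1, 2, 3, None, 4])
print(s.ffill().tolist())  # [1.0, 2.0, 3.0, 3.0, 4.0]
```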
train_df_copy['Cabin'] = train_df_copy.Cabin.ffill()  # fillna(method='ffill') is deprecated in newer pandas
train_df_copy.isnull().sum()
PassengerId 0
HomePlanet 0
CryoSleep 0
Cabin 0
Destination 0
Age 0
VIP 0
RoomService 0
FoodCourt 0
ShoppingMall 0
Spa 0
VRDeck 0
Transported 0
Expenses 0
Adults 0
Adult_and_spending 0
dtype: int64
And so, we are done with imputing. Time to move on to feature engineering.
These are the features that I am going to add to this dataset (again, I got the idea for them here).
train_df_copy['Group_nums'] = train_df_copy.PassengerId.apply(lambda x: x.split('_')).apply(lambda x: x[0])
train_df_copy['Grouped'] = ((train_df_copy['Group_nums'].value_counts() > 1).reindex(train_df_copy['Group_nums'])).tolist()
train_df_copy['Deck'] = train_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[0])
train_df_copy['Side'] = train_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[2])
train_df_copy['Has_expenses'] = train_df_copy['Expenses'] > 0
train_df_copy['Is_Embryo'] = train_df_copy['Age'] == 0
These specify:
- If someone was alone or in a group.
- Which deck someone was in.
- Which side (Starboard or Port).
- If the passenger has any expenses.
- If the passenger was 0 years old (i.e., an embryo).
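Incidentally, the `Grouped` logic above (the `value_counts`/`reindex` trick) can be written a bit more directly with `map`; a sketch on toy `PassengerId`s:

```python
import pandas as pd

# Toy PassengerIds in the gggg_pp format (made-up values)
pid = pd.Series(['0001_01', '0002_01', '0003_01', '0003_02'])

group = pid.str.split('_').str[0]             # the group number before the underscore
grouped = group.map(group.value_counts()) > 1  # True if the group has more than one member

print(grouped.tolist())  # [False, False, True, True]
```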
Let's get rid of our temporary columns:
train_df_copy.drop(['Adult_and_spending','Group_nums','Expenses'],axis=1,\
inplace=True)
This is our final dataset:
train_df_copy.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Adults | Grouped | Deck | Side | Has_expenses | Is_Embryo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | True | False | B | P | False | False |
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | True | False | F | S | True | False |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | True | True | A | S | True | False |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | True | True | A | S | True | False |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | True | False | F | S | True | False |
Saving it just in case.
train_df_copy.to_csv('Cleaned and imputed data.csv',index=False)
Since even our test data has missing values, we have to do all that to our test data as well.
test_df_copy = test_df.copy()
test_df_copy['Expenses'] = test_df_copy[['RoomService', 'FoodCourt',
'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)
test_df_copy.Age = test_df_copy.Age.fillna(test_df_copy.Age.median())
test_df_copy['Adult_spending_awake'] = (test_df_copy['Expenses'] > 0)\
& (test_df_copy['Age'] >= 13)\
& (test_df_copy['CryoSleep'] == False)
test_df_copy['Cryosleep'] = 0
test_df_copy.loc[test_df_copy['Expenses'] == 0, 'Cryosleep'] = 1
test_df_copy.loc[test_df_copy.CryoSleep.astype('str') == 'True', 'Cryosleep'] = 1
test_df_copy.loc[test_df_copy.CryoSleep.astype('str') == 'False', 'Cryosleep'] = 0
test_df_copy['Cryosleep'] = test_df_copy['Cryosleep'].astype('bool')
test_df_copy['CryoSleep'] = test_df_copy['Cryosleep']
test_df_copy.drop('Cryosleep',axis=1,inplace=True)
test_df_copy.drop('Name',axis=1,inplace=True)
test_df_copy.loc[test_df_copy.CryoSleep == True,['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']] = 0
test_df_copy['Adults'] = test_df_copy['Age'] >= 13
test_df_copy['Adult_and_spending'] = (test_df_copy['Expenses'] > 0) & (test_df_copy['Age'] >=13)
test_df_copy.loc[test_df_copy.Adult_and_spending == True]
test_df_copy.RoomService = test_df_copy.RoomService.fillna(test_df_copy.RoomService.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'RoomService'] = 0
test_df_copy.FoodCourt = test_df_copy.FoodCourt.fillna(test_df_copy.FoodCourt.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'FoodCourt'] = 0
test_df_copy.ShoppingMall = test_df_copy.ShoppingMall.fillna(test_df_copy.ShoppingMall.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'ShoppingMall'] = 0
test_df_copy.Spa = test_df_copy.Spa.fillna(test_df_copy.Spa.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'Spa'] = 0
test_df_copy.VRDeck = test_df_copy.VRDeck.fillna(test_df_copy.VRDeck.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'VRDeck'] = 0
test_df_copy.HomePlanet = test_df_copy.HomePlanet.fillna('Earth')
test_df_copy.Destination = test_df_copy.Destination.fillna('TRAPPIST-1e')
test_df_copy.VIP = test_df_copy.VIP.fillna(False)  # boolean False, not the string 'False'
test_df_copy.VIP = test_df_copy.VIP.astype('bool')
test_df_copy['Cabin'] = test_df_copy.Cabin.ffill()
test_df_copy['Group_nums'] = test_df_copy.PassengerId.apply(lambda x: x.split('_')).apply(lambda x: x[0])
test_df_copy['Grouped'] = ((test_df_copy['Group_nums'].value_counts() > 1).reindex(test_df_copy['Group_nums'])).tolist()
test_df_copy['Deck'] = test_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[0])
test_df_copy['Side'] = test_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[2])
test_df_copy['Has_expenses'] = test_df_copy['Expenses'] > 0
test_df_copy['Is_Embryo'] = test_df_copy['Age'] == 0
test_df_copy.columns
test_df_copy.drop(['Expenses', 'Adult_spending_awake', 'Adult_and_spending','Adults'],axis=1, inplace=True)
test_df_copy.to_csv('Cleaned and imputed test data.csv',index=False)
Simple enough.
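A possible cleanup, not done in this notebook: since the test set repeats every training step, factoring the shared logic into one function would keep the two copies from drifting apart. A hypothetical skeleton:

```python
import pandas as pd

AMENITIES = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: the imputation steps shared by train and test,
    factored into one function instead of being written out twice."""
    df = df.copy()
    df['Expenses'] = df[AMENITIES].sum(axis=1)
    df['Age'] = df['Age'].fillna(df['Age'].median())
    # ... the remaining imputation and feature-engineering steps go here ...
    return df
```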
Time to build some models.
Let's import Logistic Regression. I'm also going to import train-test split, just for some light evaluation.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
Now, we import the CSVs that we saved earlier.
df_train = pd.read_csv('Cleaned and imputed data.csv')
df_test = pd.read_csv('Cleaned and imputed test data.csv')
df_train.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Adults | Grouped | Deck | Side | Has_expenses | Is_Embryo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | True | False | B | P | False | False |
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | True | False | F | S | True | False |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | True | True | A | S | True | False |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | True | True | A | S | True | False |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | True | False | F | S | True | False |
df_test.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Group_nums | Grouped | Deck | Side | Has_expenses | Is_Embryo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0013_01 | Earth | True | G/3/S | TRAPPIST-1e | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13 | False | G | S | False | False |
1 | 0018_01 | Earth | False | F/4/S | TRAPPIST-1e | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | 18 | False | F | S | True | False |
2 | 0019_01 | Europa | True | C/0/S | 55 Cancri e | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 19 | False | C | S | False | False |
3 | 0021_01 | Europa | False | C/1/S | TRAPPIST-1e | 38.0 | False | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | 21 | False | C | S | True | False |
4 | 0023_01 | Earth | False | F/5/S | TRAPPIST-1e | 20.0 | False | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | 23 | False | F | S | True | False |
All looks good.
Now we are going to do some feature selection.
df_train.dtypes
PassengerId object
HomePlanet object
CryoSleep bool
Cabin object
Destination object
Age float64
VIP bool
RoomService float64
FoodCourt float64
ShoppingMall float64
Spa float64
VRDeck float64
Transported bool
Adults bool
Grouped bool
Deck object
Side object
Has_expenses bool
Is_Embryo bool
dtype: object
features = ['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP',
'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
'Grouped', 'Deck', 'Has_expenses', 'Side', 'Is_Embryo']
These are the features that I decided to use for model training and testing. I don't know if these are the best ones. So you can try different ones, and could even get a better result than mine!
Now we will assign the data in the training set to feature and target variables, and do a train-test split for evaluation.
X = pd.get_dummies(df_train[features])
y = df_train['Transported']
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=1)
Let's fit and score:
model = LogisticRegression(max_iter=10000)
model.fit(X_train,y_train)
model.score(X_test,y_test)
0.8003679852805887
Not bad.
Since we actually have to predict the test set that Kaggle has provided, we want to use all of the train data to train the model. The more data the model gets to learn from, the better the prediction.
model2 = LogisticRegression(max_iter=10000)
model2.fit(X,y)
model2.score(X,y)
0.792016565052341
Let's predict our test set now and save it:
y_pred_log2 = model2.predict(pd.get_dummies(df_test[features]))
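A caveat worth knowing: calling `get_dummies` on train and test separately only works if every category appears in both sets; if one is missing, the column counts differ and `predict` fails. A defensive `reindex` (my addition, not in the original notebook) guards against that:

```python
import pandas as pd

# Toy example: 'Mars' never appears in the test split
train = pd.DataFrame({'HomePlanet': ['Earth', 'Europa', 'Mars']})
test = pd.DataFrame({'HomePlanet': ['Earth', 'Europa']})

X = pd.get_dummies(train)
# Force the test dummies onto the training columns; missing ones are filled with 0
X_new = pd.get_dummies(test).reindex(columns=X.columns, fill_value=0)

print(list(X_new.columns) == list(X.columns))  # True
```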
Now I'll use the only other classification model I knew at the time, K-Neighbors Classifier.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
And use GridSearchCV to get the optimal K value (code commented out as it takes time to run):
# knn = KNeighborsClassifier()
# param_grid = {'n_neighbors':np.arange(2,15)}
# knn_gscv = GridSearchCV(knn, param_grid, cv=5)
# knn_gscv.fit(X,y)
# knn_gscv.best_params_
knn2 = KNeighborsClassifier(n_neighbors=14)
knn2.fit(X,y)
knn2.score(X,y)
0.8149085471068676
And save:
y_pred_knn = knn2.predict(pd.get_dummies(df_test[features]))
Now, I had seen that a model that reportedly performs well on this data is the Gradient Boosting Classifier. So I looked it up and just used it with the default hyperparameters:
from sklearn.ensemble import GradientBoostingClassifier
gbr = GradientBoostingClassifier(random_state = 1)
# Fit to training set
gbr.fit(X, y)
gbr.score(X,y)
0.8130679857356494
Its training score seems slightly worse than our K-Neighbors Classifier's. But still, we'll keep its predictions as well.
pred_y_gbr = gbr.predict(pd.get_dummies((df_test[features])))
Since Gradient Boosting was performing well, and I had also stumbled upon Extreme Gradient Boosting, it only seems logical to try that out as well (maybe we'll get extremely good results):
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X,y)
xgb.score(X,y)
0.8887610721269987
y_pred_xgb = xgb.predict(pd.get_dummies((df_test[features])))
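A caveat on these numbers: `score(X, y)` on the very data the model was fit to flatters flexible models (compare XGBoost's 0.889 here with its ~0.79 leaderboard score). `cross_val_score` gives a more honest estimate; a sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (the real X, y come from the notebook above)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)

model = GradientBoostingClassifier(random_state=1)
train_score = model.fit(X_demo, y_demo).score(X_demo, y_demo)  # fit-set accuracy
cv_score = cross_val_score(model, X_demo, y_demo, cv=5).mean()  # held-out accuracy

# The training score is typically the higher (more optimistic) of the two
print(train_score >= cv_score)
```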
The last thing I want to do is tune the Gradient Boosting Classifier further with a randomized search (my 4 GB laptop dies when running this, ok, so yes, I will comment it out again):
# gbc = GradientBoostingClassifier()
# parameters = {
# "n_estimators":[5,50,100],
# "max_depth":[1,3,5],
# "learning_rate":[0.01,0.1,1]
# }
# from sklearn.model_selection import GridSearchCV
# from sklearn.model_selection import RandomizedSearchCV
# cv = RandomizedSearchCV(gbc, parameters, n_iter=27, scoring='accuracy', n_jobs=-1, cv=5, random_state=1)
# cv.fit(X,y)
# cv.best_params_
gbc1 = GradientBoostingClassifier(n_estimators=50,max_depth=5,learning_rate=0.1) # best params from the search
gbc1.fit(X,y)
gbc1.score(X,y)
0.831013459105027
pred_y_gbr2 = gbc1.predict(pd.get_dummies((df_test[features])))
And so, we are done!
Time for submission.
# Logist_out2 = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported': y_pred_log2})
# Logist_out2.to_csv('submission.csv',index=False)
Logistic Regression competition Score = 0.79448
# knn_out = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported': y_pred_knn})
# knn_out.to_csv('submission.csv',index=False)
KNN competition score = 0.79261
# xgb_out = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported':y_pred_xgb.astype('bool')})
# xgb_out.to_csv('submission.csv',index=False)
XGB competition score = 0.79307
# gbr_out = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported': pred_y_gbr})
# gbr_out.to_csv('submission.csv',index=False)
Gradient Boost competition score = 0.80056
gbc_out = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported':pred_y_gbr2})
gbc_out.to_csv('submission.csv',index=False)
Tuned Gradient Boost competition score = 0.80476
And so, we have a winner.