Tools used: Python (Numpy, Pandas, Scikit-learn, Matplotlib)
Content: Exploratory data analysis, imputing null values, feature engineering, model building and tuning.
Kaggle Notebook: https://www.kaggle.com/code/youusha/spaceship-titanic-80-5-data-imputing-focus
Notes:
This is what I did to get an 80.5% accuracy on my Spaceship Titanic competition submission, as pretty much a beginner. But it is far from perfect and I would really appreciate any constructive feedback on this project.
Since I knew extremely little about all the machine learning models and hyperparameters and what-not when I was working on this, I just decided to follow the basics and do my best to fill null values as accurately as possible. Then I just trained and predicted using the basic models that I knew, and a couple I know nothing about (haha dw I'll learn them soon).
If you wish, you can directly use the final dataset that I created in this notebook and used for prediction here.
And this is the notebook in this competition that helped me a lot on this project. It basically gave me a sense of how to view data, and how to create more features from what I have. The new features that I use are the ones created in this notebook.
With all that out of the way, let's get started :)
First, we are going to import the libraries and modules we will be using:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
And the datasets:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
Let's take a look at the data:
train_df.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
train_df.isna().any()
PassengerId False
HomePlanet True
CryoSleep True
Cabin True
Destination True
Age True
VIP True
RoomService True
FoodCourt True
ShoppingMall True
Spa True
VRDeck True
Name True
Transported False
dtype: bool
Observation: All the columns have null values, except for `PassengerId` and `Transported`.
Let's look at exactly how many nulls we have:
print('Sum of nulls:')
train_df.isna().sum()
Sum of nulls:
PassengerId 0
HomePlanet 201
CryoSleep 217
Cabin 199
Destination 182
Age 179
VIP 203
RoomService 181
FoodCourt 183
ShoppingMall 208
Spa 183
VRDeck 188
Name 200
Transported 0
dtype: int64
So, the first thing we have to do is start filling these null values, or the ML models won't work. First, I want to figure out a way to fill up the `CryoSleep` values.
I am much more comfortable with doing all this on a new copy, just in case I mess up.
train_df_copy = train_df.copy()
Now I want to temporarily make an `Expenses` column. I'm making this column because if someone is in cryosleep, they are not spending any money. So, knowing someone's expenses can help us impute values for `CryoSleep`.
train_df_copy['Expenses'] = train_df_copy[['RoomService', 'FoodCourt',
'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)
Since we can't really guess anyone's age, I'll just impute these null values with the median. An interesting thing I observed is that only people who are 13+ have expenses. I guess the little kids don't get pocket money. Quite unfair.
train_df_copy.Age = train_df_copy.Age.fillna(train_df_copy.Age.median())
Let's take a look at our dataset now:
train_df_copy.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported | Expenses | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False | 0.0 |
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True | 736.0 |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False | 10383.0 |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False | 5176.0 |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True | 1091.0 |
We can see our newly created column after `Transported`. Funnily enough, the first person in the dataset, Maham, didn't spend any money, and they weren't even in cryosleep. Maybe they are broke? In that case I can relate with them.
This is where the real fun begins. First, we're going to make a new column for cryosleep, with all values equal to False (or 0):
train_df_copy['Cryosleep'] = 0
Now, for every row where `Expenses` is `0`, we're going to put `1` as the value, because if someone has not spent any money, they are probably in cryosleep. But don't worry, we'll deal with the exceptions, like Maham, later.
train_df_copy.loc[train_df_copy['Expenses'] == 0, 'Cryosleep'] = 1
Now, we are going to set this column's value to `1` wherever the original `CryoSleep` is equal to True.
train_df_copy.loc[train_df_copy.CryoSleep.astype('str') == 'True', 'Cryosleep'] = 1
Conversely, we will put it to `0` wherever `CryoSleep` is False.
train_df_copy.loc[train_df_copy.CryoSleep.astype('str') == 'False', 'Cryosleep'] = 0
Let's take a look at this new column now:
train_df_copy.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported | Expenses | Cryosleep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False | 0.0 | 0 |
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True | 736.0 | 0 |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False | 10383.0 | 0 |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False | 5176.0 | 0 |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True | 1091.0 | 0 |
What we have done here is:
- First, we set all values for cryosleep as false.
- Next, we set cryosleep as true for everyone who hasn't spent any money.
- Finally, we used the original `CryoSleep` column to correct the cryosleep status for the people who haven't spent any money but aren't in cryosleep, just in case our last step incorrectly classified them as being in cryosleep.
Logical, right?
Now, let's just replace the original column with this one. There's probably a better way of doing this than how I did it here:
train_df_copy['Cryosleep'] = train_df_copy['Cryosleep'].astype('bool')
train_df_copy['CryoSleep'] = train_df_copy['Cryosleep']
train_df_copy.drop('Cryosleep',axis=1,inplace=True)
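For what it's worth, the three steps above can be collapsed into a single `fillna`: keep the known `CryoSleep` values, and where they're missing, assume cryosleep exactly when the passenger spent nothing. A sketch on a toy frame (made-up values, not the real data):

```python
import pandas as pd

# Toy frame mirroring the relevant columns (made-up values)
df = pd.DataFrame({
    'CryoSleep': [True, None, None, False],  # two missing values
    'Expenses':  [0.0, 0.0, 500.0, 0.0],
})

# Where CryoSleep is missing, assume cryosleep iff the passenger spent nothing;
# known values are kept as-is.
df['CryoSleep'] = df['CryoSleep'].fillna(df['Expenses'] == 0).astype(bool)

print(df['CryoSleep'].tolist())  # [True, True, False, False]
```

This should give the same result as the three `.loc` assignments above, without the temporary column.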
Let's take another look at our dataset now:
train_df_copy.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported | Expenses | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False | 0.0 |
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True | 736.0 |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False | 10383.0 |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False | 5176.0 |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True | 1091.0 |
We have now replaced the values of our original `CryoSleep` column, which had missing values, with the values of our newly created `Cryosleep` column, which doesn't have any null values. Then we dropped the new column.
The new column also states accurately that Maham is not in cryosleep, and he still hasn't spent any money on amenities, i.e., `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa` and `VRDeck`.
The new column shouldn't have any null values now. Let's check just in case:
train_df_copy.CryoSleep.isnull().any()
False
Since the only important person in this dataset is Maham, we don't need the names column. (Or maybe we actually do and could use it to further improve prediction, but I'm just not good enough to figure out how to do that yet.)
train_df_copy.drop('Name',axis=1,inplace=True)
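One idea for using the names, sketched here but not actually used in this notebook: extract the surname and count how many passengers share it, since people travelling with family might share a fate. The names below are from the `head()` output above; the feature itself is hypothetical:

```python
import pandas as pd

# Hypothetical sketch: surname and family size as features (not used in this notebook)
names = pd.Series(['Maham Ofracculy', 'Juanna Vines', 'Altark Susent', 'Solam Susent'])

surnames = names.str.split().str[-1]              # last whitespace-separated token
family_size = surnames.map(surnames.value_counts())  # how many passengers share it

print(surnames.tolist())     # ['Ofracculy', 'Vines', 'Susent', 'Susent']
print(family_size.tolist())  # [1, 1, 2, 2]
```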
Now for the amenities: we can easily impute null values for rows where `CryoSleep` == True, since we know they are going to be zero, as the person is in cryosleep.
train_df_copy.loc[train_df_copy.CryoSleep == True,['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']] = 0
train_df_copy.loc[train_df_copy.CryoSleep == True,['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']].isna().sum()
RoomService 0
FoodCourt 0
ShoppingMall 0
Spa 0
VRDeck 0
dtype: int64
Before dealing with the rest of the amenities' values, let's make some more new columns to aid us.
train_df_copy['Adults'] = train_df_copy['Age'] >= 13
I know 13-year-olds aren't adults, okay. What I mean is that they are able to spend money at this age. Unlike a certain someone we know of. If someone has any spare change, do let me know.
I'm not picking on Maham, I just want him to enjoy his journey on the spaceship titanic to the absolute fullest, especially with the tragedy that happens. To be honest, I am extremely happy that he didn't get transported into who-knows-what dimension. He is still with us, and we are all grateful for that, I am sure.
Jokes aside, let's make a column now that tells us if someone is 13+ and is spending money.
train_df_copy['Adult_and_spending'] = (train_df_copy['Expenses'] > 0) & (train_df_copy['Age'] >=13)
Let's take a look at the rows that are True for our new `Adult_and_spending` column:
train_df_copy.loc[train_df_copy.Adult_and_spending == True]
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Expenses | Adults | Adult_and_spending | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | 736.0 | True | True |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | 10383.0 | True | True |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | 5176.0 | True | True |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | 1091.0 | True | True |
5 | 0005_01 | Earth | False | F/0/P | PSO J318.5-22 | 44.0 | False | 0.0 | 483.0 | 0.0 | 291.0 | 0.0 | True | 774.0 | True | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8687 | 9275_03 | Europa | False | A/97/P | TRAPPIST-1e | 30.0 | False | 0.0 | 3208.0 | 0.0 | 2.0 | 330.0 | True | 3540.0 | True | True |
8688 | 9276_01 | Europa | False | A/98/P | 55 Cancri e | 41.0 | True | 0.0 | 6819.0 | 0.0 | 1643.0 | 74.0 | False | 8536.0 | True | True |
8690 | 9279_01 | Earth | False | G/1500/S | TRAPPIST-1e | 26.0 | False | 0.0 | 0.0 | 1872.0 | 1.0 | 0.0 | True | 1873.0 | True | True |
8691 | 9280_01 | Europa | False | E/608/S | 55 Cancri e | 32.0 | False | 0.0 | 1049.0 | 0.0 | 353.0 | 3235.0 | False | 4637.0 | True | True |
8692 | 9280_02 | Europa | False | E/608/S | TRAPPIST-1e | 44.0 | False | 126.0 | 4688.0 | 0.0 | 0.0 | 12.0 | True | 4826.0 | True | True |
5040 rows × 16 columns
So there are 5040 people who are 13+ and are spending money.
Now we are going to impute the values for our amenities.
Wherever `Adult_and_spending` is False, the person either is below 13, which means they definitely haven't spent on any amenities, or has zero recorded expenses (for example, because they are in cryosleep), which again means they haven't spent on amenities.
So, wherever `Adult_and_spending` == False, we'll impute the amenities with `0`; for everyone else, remaining nulls get the column mean.
train_df_copy.RoomService = train_df_copy.RoomService.fillna(train_df_copy.RoomService.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'RoomService'] = 0
train_df_copy.FoodCourt = train_df_copy.FoodCourt.fillna(train_df_copy.FoodCourt.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'FoodCourt'] = 0
train_df_copy.ShoppingMall = train_df_copy.ShoppingMall.fillna(train_df_copy.ShoppingMall.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'ShoppingMall'] = 0
train_df_copy.Spa = train_df_copy.Spa.fillna(train_df_copy.Spa.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'Spa'] = 0
train_df_copy.VRDeck = train_df_copy.VRDeck.fillna(train_df_copy.VRDeck.mean())
train_df_copy.loc[train_df_copy.Adult_and_spending ==False, 'VRDeck'] = 0
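As a side note, the five near-identical fill blocks above can be collapsed into one loop over the amenity columns. A sketch on a toy frame (made-up values), assuming the same mean-then-zero logic:

```python
import pandas as pd
import numpy as np

amenities = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Toy frame standing in for train_df_copy (made-up values)
df = pd.DataFrame({
    'RoomService':  [100.0, np.nan, 0.0],
    'FoodCourt':    [np.nan, 50.0, 0.0],
    'ShoppingMall': [0.0, 0.0, np.nan],
    'Spa':          [0.0, 0.0, 0.0],
    'VRDeck':       [0.0, 0.0, 0.0],
    'Adult_and_spending': [True, True, False],
})

for col in amenities:
    df[col] = df[col].fillna(df[col].mean())    # mean for adult spenders
    df.loc[~df['Adult_and_spending'], col] = 0  # zero for everyone else

print(df[amenities].isna().sum().sum())  # 0
```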
Neat. Now we are done with imputing these columns as well.
Let's take a look:
train_df_copy[['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']].isna().sum()
RoomService 0
FoodCourt 0
ShoppingMall 0
Spa 0
VRDeck 0
dtype: int64
Perfect.
For the remaining columns, we can't figure out what values to fill in this manner. So we are just going to fill them with the value that the majority of people have in the dataset, i.e., the mode.
train_df_copy.HomePlanet.mode()
0 Earth
Name: HomePlanet, dtype: object
train_df_copy.Destination.mode()
0 TRAPPIST-1e
Name: Destination, dtype: object
train_df_copy.VIP.mode()
0 False
Name: VIP, dtype: object
So, these are the values we will be imputing with.
train_df_copy.HomePlanet = train_df_copy.HomePlanet.fillna('Earth')
train_df_copy.Destination = train_df_copy.Destination.fillna('TRAPPIST-1e')
train_df_copy.VIP = train_df_copy.VIP.fillna(False)  # fill with the boolean, not the string 'False', which would become True under astype('bool')
train_df_copy.VIP = train_df_copy.VIP.astype('bool')
Aaand done!
Let's see how far along we are:
train_df_copy.isnull().sum()
PassengerId 0
HomePlanet 0
CryoSleep 0
Cabin 199
Destination 0
Age 0
VIP 0
RoomService 0
FoodCourt 0
ShoppingMall 0
Spa 0
VRDeck 0
Transported 0
Expenses 0
Adults 0
Adult_and_spending 0
dtype: int64
The cabin is the only column that remains with null values!
Filling this properly is not easy given my limited skill, so I am just going to use ffill to fill these null values. What that does is use the previous non-null value to impute the missing one.
So, for example, if we have a dataset like:
[1, 2, 3, null, 4]
If we use ffill on this, it'll become:
[1, 2, 3, 3, 4].
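The example above can be reproduced directly with pandas' `.ffill()` (note the ints come back as floats, since the NaN forces a float dtype):

```python
import pandas as pd

# Forward-fill demo: the None is replaced by the previous value, 3
s = pd.Series([1, 2, 3, None, 4])
print(s.ffill().tolist())  # [1.0, 2.0, 3.0, 3.0, 4.0]
```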
train_df_copy['Cabin'] = train_df_copy.Cabin.ffill()  # fillna(method='ffill') is deprecated in newer pandas
train_df_copy.isnull().sum()
PassengerId 0
HomePlanet 0
CryoSleep 0
Cabin 0
Destination 0
Age 0
VIP 0
RoomService 0
FoodCourt 0
ShoppingMall 0
Spa 0
VRDeck 0
Transported 0
Expenses 0
Adults 0
Adult_and_spending 0
dtype: int64
And so, we are done with imputing. Time to move on to feature engineering.
These are the features that I am going to add to this dataset (again, I got the idea for them here).
train_df_copy['Group_nums'] = train_df_copy.PassengerId.apply(lambda x: x.split('_')).apply(lambda x: x[0])
train_df_copy['Grouped'] = ((train_df_copy['Group_nums'].value_counts() > 1).reindex(train_df_copy['Group_nums'])).tolist()
train_df_copy['Deck'] = train_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[0])
train_df_copy['Side'] = train_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[2])
train_df_copy['Has_expenses'] = train_df_copy['Expenses'] > 0
train_df_copy['Is_Embryo'] = train_df_copy['Age'] == 0
These specify:
- If someone was alone or in a group.
- Which deck someone was in.
- Which side (Starboard or Port).
- If the passenger has any expenses.
- If the passenger was 0 years old (i.e., an embryo).
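Incidentally, the `Grouped` logic above (the `value_counts`/`reindex` trick) can be written a bit more directly with `map`; a sketch on toy `PassengerId`s:

```python
import pandas as pd

# Toy PassengerIds in the gggg_pp format (made-up values)
pid = pd.Series(['0001_01', '0002_01', '0003_01', '0003_02'])

group = pid.str.split('_').str[0]             # the group number before the underscore
grouped = group.map(group.value_counts()) > 1  # True if the group has more than one member

print(grouped.tolist())  # [False, False, True, True]
```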
Let's get rid of our temporary columns:
train_df_copy.drop(['Adult_and_spending','Group_nums','Expenses'],axis=1,\
inplace=True)
This is our final dataset:
train_df_copy.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Adults | Grouped | Deck | Side | Has_expenses | Is_Embryo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | True | False | B | P | False | False |
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | True | False | F | S | True | False |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | True | True | A | S | True | False |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | True | True | A | S | True | False |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | True | False | F | S | True | False |
Saving it just in case.
train_df_copy.to_csv('Cleaned and imputed data.csv',index=False)
Since even our test data has missing values, we have to do all that to our test data as well.
test_df_copy = test_df.copy()
test_df_copy['Expenses'] = test_df_copy[['RoomService', 'FoodCourt',
'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)
test_df_copy.Age = test_df_copy.Age.fillna(test_df_copy.Age.median())
test_df_copy['Adult_spending_awake'] = (test_df_copy['Expenses'] > 0)\
& (test_df_copy['Age'] >= 13)\
& (test_df_copy['CryoSleep'] == False)
test_df_copy['Cryosleep'] = 0
test_df_copy.loc[test_df_copy['Expenses'] == 0, 'Cryosleep'] = 1
test_df_copy.loc[test_df_copy.CryoSleep.astype('str') == 'True', 'Cryosleep'] = 1
test_df_copy.loc[test_df_copy.CryoSleep.astype('str') == 'False', 'Cryosleep'] = 0
test_df_copy['Cryosleep'] = test_df_copy['Cryosleep'].astype('bool')
test_df_copy['CryoSleep'] = test_df_copy['Cryosleep']
test_df_copy.drop('Cryosleep',axis=1,inplace=True)
test_df_copy.drop('Name',axis=1,inplace=True)
test_df_copy.loc[test_df_copy.CryoSleep == True,['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']] = 0
test_df_copy['Adults'] = test_df_copy['Age'] >= 13
test_df_copy['Adult_and_spending'] = (test_df_copy['Expenses'] > 0) & (test_df_copy['Age'] >=13)
test_df_copy.loc[test_df_copy.Adult_and_spending == True]
test_df_copy.RoomService = test_df_copy.RoomService.fillna(test_df_copy.RoomService.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'RoomService'] = 0
test_df_copy.FoodCourt = test_df_copy.FoodCourt.fillna(test_df_copy.FoodCourt.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'FoodCourt'] = 0
test_df_copy.ShoppingMall = test_df_copy.ShoppingMall.fillna(test_df_copy.ShoppingMall.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'ShoppingMall'] = 0
test_df_copy.Spa = test_df_copy.Spa.fillna(test_df_copy.Spa.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'Spa'] = 0
test_df_copy.VRDeck = test_df_copy.VRDeck.fillna(test_df_copy.VRDeck.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'VRDeck'] = 0
test_df_copy.HomePlanet = test_df_copy.HomePlanet.fillna('Earth')
test_df_copy.Destination = test_df_copy.Destination.fillna('TRAPPIST-1e')
test_df_copy.VIP = test_df_copy.VIP.fillna(False)  # boolean False, not the string 'False'
test_df_copy.VIP = test_df_copy.VIP.astype('bool')
test_df_copy['Cabin'] = test_df_copy.Cabin.ffill()
test_df_copy['Group_nums'] = test_df_copy.PassengerId.apply(lambda x: x.split('_')).apply(lambda x: x[0])
test_df_copy['Grouped'] = ((test_df_copy['Group_nums'].value_counts() > 1).reindex(test_df_copy['Group_nums'])).tolist()
test_df_copy['Deck'] = test_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[0])
test_df_copy['Side'] = test_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[2])
test_df_copy['Has_expenses'] = test_df_copy['Expenses'] > 0
test_df_copy['Is_Embryo'] = test_df_copy['Age'] == 0
test_df_copy.columns
test_df_copy.drop(['Expenses', 'Adult_spending_awake', 'Adult_and_spending','Adults'],axis=1, inplace=True)
test_df_copy.to_csv('Cleaned and imputed test data.csv',index=False)
Simple enough.
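A possible cleanup, not done in this notebook: since the test set repeats every training step, factoring the shared logic into one function would keep the two copies from drifting apart. A hypothetical skeleton:

```python
import pandas as pd

AMENITIES = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: the imputation steps shared by train and test,
    factored into one function instead of being written out twice."""
    df = df.copy()
    df['Expenses'] = df[AMENITIES].sum(axis=1)
    df['Age'] = df['Age'].fillna(df['Age'].median())
    # ... the remaining imputation and feature-engineering steps go here ...
    return df
```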
Time to build some models.
Let's import Logistic Regression. I'm also going to import train-test split, just for some light evaluation.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
Now, we import the CSVs that we saved earlier.
df_train = pd.read_csv('Cleaned and imputed data.csv')
df_test = pd.read_csv('Cleaned and imputed test data.csv')
df_train.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Adults | Grouped | Deck | Side | Has_expenses | Is_Embryo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | True | False | B | P | False | False |
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | True | False | F | S | True | False |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | True | True | A | S | True | False |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | True | True | A | S | True | False |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | True | False | F | S | True | False |
df_test.head()
PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Group_nums | Grouped | Deck | Side | Has_expenses | Is_Embryo | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0013_01 | Earth | True | G/3/S | TRAPPIST-1e | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13 | False | G | S | False | False |
1 | 0018_01 | Earth | False | F/4/S | TRAPPIST-1e | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | 18 | False | F | S | True | False |
2 | 0019_01 | Europa | True | C/0/S | 55 Cancri e | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 19 | False | C | S | False | False |
3 | 0021_01 | Europa | False | C/1/S | TRAPPIST-1e | 38.0 | False | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | 21 | False | C | S | True | False |
4 | 0023_01 | Earth | False | F/5/S | TRAPPIST-1e | 20.0 | False | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | 23 | False | F | S | True | False |
All looks good.
Now we are going to do some feature selection.
df_train.dtypes
PassengerId object
HomePlanet object
CryoSleep bool
Cabin object
Destination object
Age float64
VIP bool
RoomService float64
FoodCourt float64
ShoppingMall float64
Spa float64
VRDeck float64
Transported bool
Adults bool
Grouped bool
Deck object
Side object
Has_expenses bool
Is_Embryo bool
dtype: object
features = ['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP',
'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
'Grouped', 'Deck', 'Has_expenses', 'Side', 'Is_Embryo']
These are the features that I decided to use for model training and testing. I don't know if these are the best ones. So you can try different ones, and could even get a better result than mine!
Now we will assign the data in the training set to feature and target variables, and do a train-test split for evaluation.
X = pd.get_dummies(df_train[features])
y = df_train['Transported']
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=1)
Let's fit and score:
model = LogisticRegression(max_iter=10000)
model.fit(X_train,y_train)
model.score(X_test,y_test)
0.8003679852805887
Not bad.
Since we actually have to predict the test set that Kaggle has provided, we want to use all of the train data to train the model. The more data the model gets to learn from, the better the prediction.
model2 = LogisticRegression(max_iter=10000)
model2.fit(X,y)
model2.score(X,y)
0.792016565052341
Let's predict our test set now and save it:
y_pred_log2 = model2.predict(pd.get_dummies(df_test[features]))
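A caveat worth knowing: calling `get_dummies` on train and test separately only works if every category appears in both sets; if one is missing, the column counts differ and `predict` fails. A defensive `reindex` (my addition, not in the original notebook) guards against that:

```python
import pandas as pd

# Toy example: 'Mars' never appears in the test split
train = pd.DataFrame({'HomePlanet': ['Earth', 'Europa', 'Mars']})
test = pd.DataFrame({'HomePlanet': ['Earth', 'Europa']})

X = pd.get_dummies(train)
# Force the test dummies onto the training columns; missing ones are filled with 0
X_new = pd.get_dummies(test).reindex(columns=X.columns, fill_value=0)

print(list(X_new.columns) == list(X.columns))  # True
```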
Now I'll use the only other classification model I knew at the time, K-Neighbors Classifier.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
And use GridSearchCV to get the optimal K value (code commented out as it takes time to run):
# knn = KNeighborsClassifier()
# param_grid = {'n_neighbors':np.arange(2,15)}
# knn_gscv = GridSearchCV(knn, param_grid, cv=5)
# knn_gscv.fit(X,y)
# knn_gscv.best_params_
knn2 = KNeighborsClassifier(n_neighbors=14)
knn2.fit(X,y)
knn2.score(X,y)
0.8149085471068676
And save:
y_pred_knn = knn2.predict(pd.get_dummies(df_test[features]))
Now, I had seen that a model that reportedly performs well on this data is the Gradient Boosting Classifier. So I looked it up and just used it with the default hyperparameters:
from sklearn.ensemble import GradientBoostingClassifier
gbr = GradientBoostingClassifier(random_state = 1)
# Fit to training set
gbr.fit(X, y)
gbr.score(X,y)
0.8130679857356494
Its training score seems slightly worse than our K-Neighbors Classifier's. But still, we'll keep its predictions as well.
pred_y_gbr = gbr.predict(pd.get_dummies((df_test[features])))
Since Gradient Boosting was performing well, and I had also stumbled upon Extreme Gradient Boosting, it only seems logical to try that out as well (maybe we'll get extremely good results):
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X,y)
xgb.score(X,y)
0.8887610721269987
y_pred_xgb = xgb.predict(pd.get_dummies((df_test[features])))
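A caveat on these numbers: `score(X, y)` on the very data the model was fit to flatters flexible models (compare XGBoost's 0.889 here with its ~0.79 leaderboard score). `cross_val_score` gives a more honest estimate; a sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (the real X, y come from the notebook above)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)

model = GradientBoostingClassifier(random_state=1)
train_score = model.fit(X_demo, y_demo).score(X_demo, y_demo)  # fit-set accuracy
cv_score = cross_val_score(model, X_demo, y_demo, cv=5).mean()  # held-out accuracy

# The training score is typically the higher (more optimistic) of the two
print(train_score >= cv_score)
```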
The last thing I want to do is tune the Gradient Boosting Classifier further with a randomized search (my 4 GB laptop dies when running this, ok, so yes, I will comment it out again):
# gbc = GradientBoostingClassifier()
# parameters = {
# "n_estimators":[5,50,100],
# "max_depth":[1,3,5],
# "learning_rate":[0.01,0.1,1]
# }
# from sklearn.model_selection import GridSearchCV
# from sklearn.model_selection import RandomizedSearchCV
# cv = RandomizedSearchCV(gbc, parameters, n_iter=27, scoring='accuracy', n_jobs=-1, cv=5, random_state=1)
# cv.fit(X,y)
# cv.best_params_
gbc1 = GradientBoostingClassifier(n_estimators=50,max_depth=5,learning_rate=0.1) # best params from the search
gbc1.fit(X,y)
gbc1.score(X,y)
0.831013459105027
pred_y_gbr2 = gbc1.predict(pd.get_dummies((df_test[features])))
And so, we are done!
Time for submission.
# Logist_out2 = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported': y_pred_log2})
# Logist_out2.to_csv('submission.csv',index=False)
Logistic Regression competition Score = 0.79448
# knn_out = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported': y_pred_knn})
# knn_out.to_csv('submission.csv',index=False)
KNN competition score = 0.79261
# xgb_out = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported':y_pred_xgb.astype('bool')})
# xgb_out.to_csv('submission.csv',index=False)
XGB competition score = 0.79307
# gbr_out = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported': pred_y_gbr})
# gbr_out.to_csv('submission.csv',index=False)
Gradient Boost competition score = 0.80056
gbc_out = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported':pred_y_gbr2})
gbc_out.to_csv('submission.csv',index=False)
Tuned Gradient Boost competition score = 0.80476
And so, we have a winner.