Mar 1, 2017

Kaggle project: Predicting Titanic Survivors


RMS Titanic is considered to be one of the most infamous shipwrecks in history; 1502 out of 2224 passengers and crew were killed when the ship collided with an iceberg. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class. Here, we will work through a simple Kaggle competition to predict which sorts of people were more likely to survive.


1. Jupyter Notebook


Before we begin, let's first install Jupyter Notebook. Jupyter Notebook is a web-based environment for interactive computing in multiple programming languages. It is a very important tool in scientific computing projects because it makes it easy for professionals to explore data and communicate their work. For those of you who installed Python via Anaconda, there is no need to install Jupyter Notebook separately because it comes bundled with Anaconda.

Once Jupyter Notebook is installed, let's download our training data from Kaggle.

Kaggle provides training and testing data.

Now let's open up a new Jupyter notebook. If you are on a Mac, fire up the Terminal app and type jupyter notebook. If you are on Windows, click on the Jupyter Notebook launcher. If a web browser doesn't open automatically, type http://localhost:8888/ into your web browser and you will be directed to the Jupyter Notebook interactive shell.

Jupyter Notebook interactive shell.
Once in the interactive shell, we can click New on the top right and open a blank Jupyter notebook with Python selected as the kernel.



2. Load Data


Let's dive right into Python coding! First, let's load essential libraries and our data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

NumPy is a Python library that is a fundamental package for scientific computing. Pandas is another Python library that is fundamental to data science; it is designed around DataFrames, similar to R's data frames. Matplotlib is a Python 2D plotting library. Matplotlib works nicely with Jupyter because you can type %matplotlib inline to display your plots directly in your notebook.

We will be working with the train.csv data from Kaggle. Making sure the data file is in the same directory as your Jupyter notebook, let's load and review our data.

data = pd.read_csv("train.csv")
print(data.head(5))
print(data.describe())

After typing this into your notebook, press ctrl+enter to run the code. Pandas' read_csv function reads the .csv file and stores it as a DataFrame object, so our data variable is now a DataFrame. Once you have a DataFrame object, you can call various methods on it, such as .head() and .describe(). .head() shows you the first few rows of the DataFrame and .describe() gives you a statistical summary of its numeric columns.

Running the code in Jupyter Notebook.

The nice thing about Jupyter Notebook is that you can have multiple code blocks to break up your code and test it at different points (without messing up your already functioning code). If you press option+enter (alt+enter on Windows), Jupyter Notebook will not only run the current code block but will also create a new code block below the current one. Dataquest provides a good summary of Jupyter Notebook tips and tricks.



3. Clean Data


Now it's time to clean our data. I want to emphasize that data wrangling (cleaning, mining, etc.) is the hardest part of a data science project. Kaggle datasets are relatively clean, but real-life datasets are messy, with a lot of human error. So it is important to go through your data and think carefully about how you could organize it to answer the project question.

The first thing we want to do is look at the data. I usually like to print a list of the columns in our DataFrame and check what data types they are.

print(data.columns) #print the columns of our dataframe
for column in data.columns:
    print(data[column].dtype) #look at the data type of the column

Pandas has a .columns method that will return the columns of a DataFrame object. After looking at the columns, I then loop through the columns and look at the data type of those columns. The next basic thing we could check is whether or not these columns have any missing data.
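As a side note, Pandas also exposes a .dtypes attribute, which returns the same information in a single call; a minimal equivalent to the loop above:

print(data.dtypes) # column name and data type for every column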

print(pd.isnull(data))

I use the Pandas .isnull() method on the data to check whether a value is NaN (Not a Number) or null. This method returns True where there is a NaN or null value and False otherwise. We can see that some columns have missing values; let's check exactly which columns have missing data.

for column in data.columns:
    if np.any(pd.isnull(data[column])):
        print(column)  # this column contains at least one missing value

The above code loops through the columns and checks whether any value in a column is null. Running it, we see that the columns Age, Cabin, and Embarked have missing values. We will fill in the missing values and convert non-numeric data types into numerics so that we can use them as features for our machine learning model.
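As a quicker alternative (not in the original code above), chaining .isnull() with .sum() counts the missing values in each column in one line:

print(data.isnull().sum()) # number of missing values per column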

# fill in the missing values with the median age
data["Age"] = data["Age"].fillna(data["Age"].median())

# convert female/male to numeric values (male=0, female=1)
data.loc[data["Sex"] == "male", "Sex"] = 0
data.loc[data["Sex"] == "female", "Sex"] = 1

# do the same for Embarked
data["Embarked"] = data["Embarked"].fillna("S")
data.loc[data["Embarked"] == "S", "Embarked"] = 0
data.loc[data["Embarked"] == "C", "Embarked"] = 1
data.loc[data["Embarked"] == "Q", "Embarked"] = 2

Indexing/Selecting in Pandas.

Above, I introduced several Pandas methods. The first is .fillna(), which fills NaN/null values with the value passed into the method; we passed in data["Age"].median() to fill the missing values with the median age. Then, we located males and females in the Sex column and mapped the strings to numeric values, and did the same for Embarked. These are very simple data cleaning steps; one could do further cleaning based on the columns selected as features, or even create new features by combining existing ones (see the sketch below). For this tutorial's purpose, we will move on to do some machine learning at this point.
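For example, here is a minimal sketch of creating a new feature by combining existing ones; the column name FamilySize is just an illustrative choice and is not used elsewhere in this post:

# hypothetical new feature: siblings/spouses + parents/children + the passenger
data["FamilySize"] = data["SibSp"] + data["Parch"] + 1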



4. Train a model: Linear Regression


Python has an awesome machine learning library called Scikit-learn. We will be using this library to train different models and test our accuracy.

from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import KFold

# columns we will use as features for our model
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# instantiate the model
linreg = LinearRegression()

# generate cross-validation folds
kf = KFold(data.shape[0], n_folds=10)

predictions = []
for train, test in kf:
    # load X (predictors) and y (outcome) for this fold's training rows
    X = data[predictors].iloc[train,:]
    y = data["Survived"].iloc[train]

    # fit the model on the training fold
    linreg.fit(X, y)

    # predict on the held-out test fold
    test_pred = linreg.predict(data[predictors].iloc[test,:])
    predictions.append(test_pred)

print(predictions)

Here, I first import the necessary libraries and select the columns I want to use as predictors/features to train a machine learning model. Then, I instantiate a linear regression model with linreg = LinearRegression(). For cross-validation, I make use of KFold, which takes the length of your data, the number of folds you want, and optionally a random seed as arguments. KFold returns the row indices for the train/test samples of each fold, which we can use to select rows from our data with .iloc[train/test,:]. For each fold, we fit the instantiated linear regression model to the training data with .fit(X_train, y_train) and predict on the testing data with .predict(X_test).

Now if we look at the returned list of predictions, we have 10 arrays containing the predicted survival values for the 10 test folds. Let's concatenate the arrays and calculate the model accuracy.

prediction = np.concatenate(predictions, axis=0)

# map the probability to binary outcome (survived or not)
prediction[prediction > 0.5] = 1
prediction[prediction <= 0.5] = 0

# calculate accuracy: fraction of predictions that match the actual outcome
accuracy = sum(prediction == data["Survived"]) / len(prediction)
print(accuracy)

Since linear regression outputs a continuous value (roughly a probability of survival) that falls around 0 and 1, we need to map that continuous number to a binary outcome. If the value is less than or equal to 0.5, the prediction is that the passenger did not survive (0), and vice versa. prediction[prediction > 0.5] selects only the elements of the prediction array where the condition prediction > 0.5 is True, so the assignment changes just those elements. To calculate the accuracy, since KFold does not shuffle the original data (unless you set shuffle to True), we can directly compare the predictions with the actual outcomes (survived or not) and divide by the number of samples to get a fraction, which turns out to be 0.789.
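As a side note, the thresholding and accuracy calculation can also be written more compactly; a minimal equivalent sketch:

# threshold the concatenated predictions at 0.5 and compare with the actual outcomes
prediction = (np.concatenate(predictions, axis=0) > 0.5).astype(int)
accuracy = (prediction == data["Survived"]).mean()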




5. Train a model: Logistic Regression


In the previous section, we were working with a linear regression model. As the word 'linear' suggests, linear regression is a good model to use when the variable you are predicting is continuous. For our Titanic dataset, the outcome we are predicting is binary (survived or not), which is categorical rather than continuous. So using a logistic regression model makes more sense than using a linear regression model. The two models have different cost functions, which is why each does better on a different type of outcome.
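The key difference is that logistic regression passes the linear combination of the features through a sigmoid (logistic) function, which squashes any real number into the range (0, 1) so the output can be read as a probability. A minimal sketch of that function, for illustration only (not part of the model-training code):

def sigmoid(z):
    # maps any real-valued score z to a value between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-2), sigmoid(0), sigmoid(2)) # roughly 0.12, 0.5, 0.88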

Let's repeat what we did in the previous section; this time for logistic regression.

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score

# instantiate the model
logreg = LogisticRegression()

# perform cross-validation
print(cross_val_score(logreg, data[predictors], data['Survived'], cv=10, scoring='accuracy').mean())

Just as we did previously, we import and instantiate the logistic regression model. However, this time I took a different approach to cross-validate and check the accuracy of the model. We could manually divide our dataset into folds for cross-validation, but the Scikit-learn library also provides a convenient way to do this. I imported cross_val_score, which takes the machine learning model, the predictor data, the outcome, the number of folds, and the type of scoring as arguments. It returns the specified score for each fold, so I take the average at the end to get the overall accuracy. This time the accuracy is 0.794, slightly higher than that of the linear regression model (0.789).




6. Kaggle Submission


Now that we have a machine learning model we can use to make predictions, let's make some predictions on the test dataset that Kaggle provides. As with the training set, the test set can be downloaded from the Kaggle website.

# read the data
test = pd.read_csv("test.csv")

# clean the data
test['Age'] = test['Age'].fillna(data['Age'].median())

test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1

test.loc[test["Embarked"] == "S", "Embarked"] = 0
test.loc[test["Embarked"] == "C", "Embarked"] = 1
test.loc[test["Embarked"] == "Q", "Embarked"] = 2

test["Fare"] = test["Fare"].fillna(test["Fare"].median())

logreg.fit(data[predictors], data["Survived"])
prediction = logreg.predict(test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset
submission = pd.DataFrame({ 
    "PassengerId" : test["PassengerId"],
    "Survived" : prediction
    })

print(submission)

Note that for the test data, values were missing in the Fare column rather than Embarked. It is very important to do all the data cleaning/checking on the test dataset as well! The same data cleaning procedures were applied to the test set, then the logistic regression model was fit on the full training dataset and used to predict the outcomes for the test dataset. Since Kaggle only cares about the passenger IDs and whether or not they survived, we created a new DataFrame with pd.DataFrame({}) containing just that information.
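To double-check that nothing was missed, the same missing-value check from earlier can be run on the test set; a quick sketch:

print(test.isnull().sum()) # Cabin still has missing values, but it is not one of our predictors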

The very last step is to write the prediction DataFrame to a file for submission.

submission.to_csv("submission.csv", index=False)

This code will write your predictions to a .csv file called submission.csv. Setting index=False means the DataFrame's row index is not written to the file, so the output contains only the PassengerId and Survived columns. The only thing left now is submission! To submit, go to Kaggle again and click on Submit Predictions.

Submitting your predictions.




7. It's the End


That's it for a very simple data science pipeline! In this post, I tried to introduce Jupyter Notebook as an integral part of data science projects and to go over some basics of the Pandas library. Dataquest provides a simple cheat sheet for Python Pandas; check it out here. Please let me know if there is anything that should be corrected or updated in this post. Thanks, and good luck on your data science journey!


About

I am a computational scientist finishing my PhD at U of Penn. I picked up programming coming into graduate school, and after years of computational research, I'm amazed by what data can do. I love to use data analytics to find trends which, when exposed, empower people to make informed decisions about the world they live in. I'm also the co-founder of the Penn Data Science Group.