
1. Jupyter Notebook
Before we begin, let's first install Jupyter Notebook. Jupyter Notebook is a web-based interactive computing environment that supports multiple programming languages. It is a very important tool in scientific computing projects because it lets professionals explore their data and communicate their work easily. For those of you who have installed Python via Anaconda, there is no need to install Jupyter Notebook separately because Anaconda already includes it.
Once Jupyter Notebook is installed, let's download our training data from Kaggle.

Now let's open up a new Jupyter notebook. If you are on a Mac, fire up the Terminal app and type jupyter notebook. If you are on Windows, click on the Jupyter Notebook launcher.
If a web browser doesn't open automatically, type http://localhost:8888/ into your browser and you will be taken to the Jupyter Notebook interface.

2. Load Data
Let's dive right into Python coding! First, let's load essential libraries and our data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Numpy is a Python library that is a fundamental package for scientific computing.
Pandas is another Python library that is a fundamental package for data science. It is designed around DataFrames, similar to data frames in R.
Matplotlib is a Python 2D plotting library. Matplotlib works nicely with Jupyter because typing %matplotlib inline displays your plots directly in your notebook.
We will be working with the train.csv data from Kaggle. Making sure the data file is in the same directory as your Jupyter notebook, let's load and review our data.
data = pd.read_csv("train.csv")
print(data.head(5))
print(data.describe())
After typing this into your notebook, press ctrl+enter. This will run the code. Pandas' read_csv method reads the .csv file and stores it as a DataFrame object, so our data variable is now a DataFrame. Once you have a DataFrame object, you can call various methods on it, such as .head() and .describe(). .head() shows you the first few rows of the DataFrame and .describe() gives you a statistical summary of its numeric columns.
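Because we enabled %matplotlib inline earlier, quick plots render directly beneath the cell. As a small, optional sanity check (not part of the original pipeline), you could plot the age distribution:
data["Age"].hist(bins=20)   # pandas plots with matplotlib under the hood; missing values are dropped
plt.xlabel("Age")
plt.ylabel("Number of passengers")
plt.show()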

The nice thing about Jupyter Notebook is that you can have multiple code blocks to break up your code and test the code at different points (and not mess up your already functioning code).
If you press option+enter (alt+enter on Windows), Jupyter Notebook will not only run the current code block but also create a new code block below it. This is a good summary of Jupyter Notebook tips and tricks, provided by Dataquest.
3. Clean Data
Now it's time to clean our data. I want to emphasize that data wrangling (cleaning, mining etc) is the hardest part of a data science project. Kaggle datasets are relatively clean, but real life datasets are messy with a lot of human error. So it is important to go through your data and think a lot about how you could organize it to answer the project question.
The first thing we want to do is look at the data. I usually like to print a list of the columns in our DataFrame and check what data types they are.
print(data.columns)  # print the columns of our dataframe
for column in data.columns:
    print(data[column].dtype)  # look at the data type of the column
Pandas DataFrames have a .columns attribute that returns the columns of the DataFrame. After listing the columns, I loop through them and print each column's data type. The next basic thing we could check is whether or not these columns have any missing data.
print(pd.isnull(data))
I use the Pandas .isnull() method on the data to check whether each value is NaN (Not a Number) or null. This method returns True where a value is NaN or null, and False otherwise.
We can see that some columns have missing values; let's check exactly which columns have missing data.
for column in data.columns:
    if np.any(pd.isnull(data[column])):
        print(column)
The above code checks each column for any True values in the null mask and prints the column name when it finds one. Running the code, we see that the columns Age, Cabin, and Embarked have missing values.
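A more compact way to get the same answer, if you prefer a one-liner, is to count the missing values per column:
# number of missing values in each column; non-zero entries are the columns to clean
print(pd.isnull(data).sum())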
We will fill in the missing values and convert non-numeric data types into numerics so that we can use them as features for our machine learning model.
# fill in the missing values with the median age
data["Age"] = data["Age"].fillna(data["Age"].median())

# convert female/male to numeric values (male=0, female=1)
data.loc[data["Sex"] == "male", "Sex"] = 0
data.loc[data["Sex"] == "female", "Sex"] = 1

# do the same for Embarked
data["Embarked"] = data["Embarked"].fillna("S")
data.loc[data["Embarked"] == "S", "Embarked"] = 0
data.loc[data["Embarked"] == "C", "Embarked"] = 1
data.loc[data["Embarked"] == "Q", "Embarked"] = 2

Above, I introduced several Pandas methods. The first one is .fillna(), which fills NaN/null values with the value passed into the method. We passed in data["Age"].median() to fill the missing values with the median age.
Then, we located males and females in the Sex column and mapped the strings into a numeric value. The same was done for Embarked.
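As a side note, the same string-to-number conversion can be written with Pandas' .map() and a dictionary. This is only an alternative to the .loc assignments above, so run one form or the other, not both:
# equivalent to the .loc assignments above (run instead of them, not after)
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
data["Embarked"] = data["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})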
These are very simple data cleaning steps; one could do further cleaning based on which columns are selected as features for the model, or even create new features by combining existing ones. For this tutorial's purpose, we will move on to some machine learning at this point.
4. Train a model: Linear Regression
Python has an awesome machine learning library called Scikit-learn. We will be using this library to train different models and test our accuracy.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# columns we will use as features for our model
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# instantiate the model
linreg = LinearRegression()

# generate cross-validation folds
kf = KFold(n_splits=10)

predictions = []
for train, test in kf.split(data):
    # load X (predictors) and y (outcome) for this fold's training rows
    X = data[predictors].iloc[train, :]
    y = data["Survived"].iloc[train]
    # fit the model
    linreg.fit(X, y)
    # predict on the held-out rows
    test_pred = linreg.predict(data[predictors].iloc[test, :])
    predictions.append(test_pred)

print(predictions)
Here, I first import the necessary libraries and select columns I want to use as predictors/features to train a machine learning model.
Then, I instantiate a linear regression model with linreg = LinearRegression().
For cross-validation, I make use of KFold, which takes the number of folds you want (and optionally whether to shuffle and a random seed) as arguments.
Calling kf.split(data) yields the row indices of the train and test samples for each fold, which we can use to select rows from our data with .iloc.
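If you want to see what these folds actually contain, you can peek at the first split; a minimal sketch, assuming kf was created as above:
# kf.split(data) is a generator; grab the first (train, test) pair
first_train, first_test = next(kf.split(data))
print(len(first_train), len(first_test))  # roughly 90% / 10% of the rows
print(first_test[:10])                    # row positions held out in the first fold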
With the folds, we fit the instantiated linear regression model to the training data with .fit(X_train, y_train) and predict on the testing data with .predict(X_test).
Now if we look at the returned list of predictions, we have 10 arrays containing the predicted survival probabilities for the 10 test folds. Let's concatenate the arrays and calculate the model accuracy.
prediction = np.concatenate(predictions, axis=0)

# map the probability to a binary outcome (survived or not)
prediction[prediction > 0.5] = 1
prediction[prediction <= 0.5] = 0

# calculate accuracy: fraction of predictions that match the actual outcome
accuracy = sum(prediction == data["Survived"]) / len(prediction)
print(accuracy)
Since linear regression gives a survival probability that falls roughly between 0 and 1, we need to map the continuous number to a binary outcome.
If the probability is less than or equal to 0.5, the prediction is that the passenger did not survive (0), and otherwise that they did (1). prediction[prediction > 0.5] selects the entries of the prediction array where the condition prediction > 0.5 is True, and only those entries are overwritten with the value you assign.
To calculate the accuracy: since KFold does not shuffle the original data (unless you set shuffle to True), the concatenated predictions line up row for row with the original DataFrame, so we can compare them directly with the actual outcomes (survived or not). Counting the matches and dividing by the number of samples gives a percentage, which turns out to be 0.789.
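As a cross-check, the same accuracy can be computed with the mean of the boolean comparison, or with scikit-learn's accuracy_score; both should agree with the number above:
from sklearn.metrics import accuracy_score

# fraction of predictions that match the actual outcome
print((prediction == data["Survived"]).mean())
print(accuracy_score(data["Survived"], prediction))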
5. Train a model: Logistic Regression
In the previous section, we worked with a linear regression model. As the word 'linear' suggests, linear regression is a good model to use when the quantity you are predicting is continuous. For our Titanic dataset, the outcome is a binary variable, so using a logistic regression model makes more sense than using a linear regression model. The two models optimize different cost functions, which is why each one does better with a different type of outcome.
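Concretely, logistic regression passes the usual linear combination of the features through the sigmoid function, which squashes any real number into the (0, 1) range so the output can be read as a probability. A small illustration of that squashing, separate from the model we train below:
def sigmoid(z):
    # maps any real-valued score to a value strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-2.0), sigmoid(0.0), sigmoid(2.0))  # ~0.12, 0.5, ~0.88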
Let's repeat what we did in the previous section; this time for logistic regression.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# instantiate the model
logreg = LogisticRegression()

# perform cross-validation
print(cross_val_score(logreg, data[predictors], data['Survived'], cv=10, scoring='accuracy').mean())
Just as we did previously, we import and instantiate the logistic regression model. However, this time I took a different approach to cross-validate and check the accuracy of the model.
We could manually divide up our data set into different folds for cross-validation, but the Scikit-learn library also provides a convenient way to do cross-validation.
I imported cross_val_score, which takes the machine learning model, the predictor data, the outcome, the number of folds, and the type of scoring as arguments. It returns the specified score for each fold, so I take the average at the end to look at the overall accuracy.
This time the accuracy is 0.794, slightly higher than that of the linear regression model (0.789).
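If you are curious how the accuracy varies from fold to fold before averaging, you can keep the array that cross_val_score returns:
scores = cross_val_score(logreg, data[predictors], data['Survived'], cv=10, scoring='accuracy')
print(scores)         # accuracy for each of the 10 folds
print(scores.mean())  # the overall cross-validated accuracy reported above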
6. Kaggle Submission
Now that we have a machine learning model we can use to make predictions, let's make some predictions on the test dataset that Kaggle provides. As with the training set, the test set can be downloaded from the Kaggle website.
# read the data
test = pd.read_csv("test.csv")

# clean the data
test['Age'] = test['Age'].fillna(data['Age'].median())
test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1
test.loc[test["Embarked"] == "S", "Embarked"] = 0
test.loc[test["Embarked"] == "C", "Embarked"] = 1
test.loc[test["Embarked"] == "Q", "Embarked"] = 2
test["Fare"] = test["Fare"].fillna(test["Fare"].median())

logreg.fit(data[predictors], data["Survived"])
prediction = logreg.predict(test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": prediction
})
print(submission)
Note that for the test data, data values were missing in the column Fare rather than Embarked. It is very important to do all the data cleaning/checking with the test dataset also!
The same data cleaning procedures were done to the test set, then the logistic regression model was trained with the training dataset and this model was used to predict the test dataset.
Since Kaggle only cares about the passenger IDs and whether or not they survived, we created a new DataFrame with pd.DataFrame({}) containing just that information.
The very last step is to write the prediction DataFrame to a .csv file for submission.
submission.to_csv("submission.csv", index=False)
This code will write your predictions to a .csv file called submission.csv. Setting index to False keeps the DataFrame's index out of the file, so the CSV contains only the PassengerId and Survived columns. The only thing left now is to submit!
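Before uploading, it can be worth re-reading the file to confirm it contains only the two columns Kaggle expects:
# quick sanity check on the file we just wrote
check = pd.read_csv("submission.csv")
print(check.columns.tolist())  # should be ['PassengerId', 'Survived']
print(check.head())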
To submit, go to Kaggle again and click on Submit Predictions.

7. It's the End
That's it for a very simple data science pipeline! In this post, I tried to introduce Jupyter Notebook as an integral part of data science projects, and I tried to go over some basics of the Pandas library. Dataquest provides a simple cheat sheet for Python Pandas; check it out over here. Please let me know if there is anything to be corrected/updated in this post. Thanks and good luck on your data science journey!