We have seen an introduction to logistic regression with a simple example: predicting a student's admission to university based on past exam results.

This was done in Python, defining the sigmoid function and the gradient descent from scratch, and we have also seen the same example solved with the *statsmodels* library.

Now we are going to see how to solve a logistic regression problem using the popular *scikit-learn* library, specifically its *LogisticRegression* class.

The example this time is to predict survival on the Titanic (the ship that sank after striking an iceberg).

It’s a basic learning competition on the ML platform Kaggle: a simple introduction to machine learning concepts, specifically binary classification (survived / did not survive).

Here we are looking into how to apply Logistic Regression to the Titanic dataset.

You can follow along the Python notebook on GitHub or the Python kernel on Kaggle.

# 1. Collect and understand the data

The data can be downloaded directly from Kaggle:

```python
import pandas as pd

# get the Titanic training CSV file as a DataFrame
titanic = pd.read_csv("../input/train.csv")
titanic.shape
```

```
(891, 12)
```

Each row represents a passenger (891 in total) and each one has 12 features.

The features are the passenger’s name, sex and age, the ticket class (1st, 2nd or 3rd) and so on. The feature Survived is a flag: 1 if that passenger survived, 0 otherwise.

Not all features are numeric:

```python
titanic.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
```

# 2. Process the Data

Categorical variables need to be transformed into numeric variables.

### Transform the embarkment port

There are three ports: C = Cherbourg, Q = Queenstown, S = Southampton

```python
ports = pd.get_dummies(titanic.Embarked, prefix='Embarked')
```

Now the feature Embarked (a category) has been transformed into 3 binary features, e.g. Embarked_C = 0 means the passenger did not embark in Cherbourg, 1 means they did.

Finally, the 3 new binary features substitute the original one in the data frame:

```python
titanic = titanic.join(ports)  # add the three binary columns
titanic.drop(['Embarked'], axis=1, inplace=True)  # then drop the original column
```
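On a toy Series (not the real column), get_dummies produces one binary column per category; the `.astype(int)` here is only to show 0/1 instead of booleans on recent pandas versions:

```python
import pandas as pd

# toy embarkment column with the three port codes
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

dummies = pd.get_dummies(embarked, prefix='Embarked').astype(int)
print(dummies)
#    Embarked_C  Embarked_Q  Embarked_S
# 0           0           0           1
# 1           1           0           0
# 2           0           1           0
# 3           0           0           1
```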

### Transform the gender feature

This one is easier, being already binary (male or female; this was 1912). A mapping will be enough:

```python
titanic.Sex = titanic.Sex.map({'male': 0, 'female': 1})
```
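One gotcha worth knowing: `map` silently turns any value missing from the dictionary into NaN, so it pays to check the result. A toy sketch:

```python
import pandas as pd

sex = pd.Series(['male', 'female', 'female'])
mapped = sex.map({'male': 0, 'female': 1})
print(mapped.tolist())  # [0, 1, 1]

# any value not present in the mapping dict becomes NaN:
odd = pd.Series(['male', 'unknown']).map({'male': 0, 'female': 1})
print(odd.isnull().tolist())  # [False, True]
```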

## Extract the target variable

Create an X dataframe with the input features and a y series with the target.

As we said *Survived* is the target variable:

```python
y = titanic.Survived.copy()             # copy the "y" column values out
X = titanic.drop(['Survived'], axis=1)  # then drop the y column
```

### Drop not so important features

For the first model, we ignore some categorical features which are unlikely to add much signal.

```python
X.drop(['Cabin'], axis=1, inplace=True)
X.drop(['Ticket'], axis=1, inplace=True)
X.drop(['Name'], axis=1, inplace=True)
X.drop(['PassengerId'], axis=1, inplace=True)

X.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
Pclass        891 non-null int64
Sex           891 non-null int64
Age           714 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked_C    891 non-null uint8
Embarked_Q    891 non-null uint8
Embarked_S    891 non-null uint8
dtypes: float64(2), int64(5), uint8(3)
memory usage: 51.4 KB
```

All features are now numeric, ready for regression.

But we still have a couple of processing steps to do.

## Check if there are any missing values

```python
X.isnull().values.any()
```

```
True
```

```python
#X[pd.isnull(X).any(axis=1)]  # uncomment to inspect the rows with missing values
```

True, there are missing values in the data (NaN) and a quick look at the data reveals that they are all in the Age feature.

One possibility could be to remove the feature entirely; another is to fill the missing values with a fixed number or with the average age.

```python
X.Age.fillna(X.Age.mean(), inplace=True)  # replace NaN with the average age
X.isnull().values.any()
```

```
False
```

Now all missing values have been filled in.

The logistic regression would otherwise not work with missing values.
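The alternatives mentioned above can be sketched on a toy Age column (the values are made up):

```python
import pandas as pd
import numpy as np

age = pd.Series([22.0, np.nan, 35.0, np.nan, 28.0])

# option 1: fill with a fixed number
fixed = age.fillna(-1)

# option 2: fill with the average (what this post does)
mean_filled = age.fillna(age.mean())

# option 3: drop the rows with missing values (loses data, not used here)
dropped = age.dropna()

print(mean_filled.tolist())
```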

## Split the dataset into training and validation

The **training** set will be used to build the machine learning model. The model will be based on features like the passengers’ gender and class, together with the known survival flag.

The **validation** set should be used to see how well the model performs on unseen data. For each passenger in the validation set, I use the trained model to predict whether or not they survived the sinking of the Titanic; the predictions are then compared with the actual survival flags.

```python
from sklearn.model_selection import train_test_split

# 80% go into the training set, 20% into the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2,
                                                      random_state=7)
```
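On toy data (a made-up 100-row array, not the Titanic set) the 80/20 proportions can be checked directly:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy rows with 2 features each
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2,
                                          random_state=7)
print(len(X_tr), len(X_va))  # 80 20
```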

# 3. Modelling

## Get a baseline

A baseline is always useful to check whether the trained model behaves significantly better than an easy-to-obtain alternative, such as a random guess or a simple heuristic like "all and only female passengers survived". In this case, after a quick look at the training dataset (where the survival outcome is present) I am going to use the following:

```python
def simple_heuristic(titanicDF):
    '''
    Predict whether the passengers survived or perished.
    The algorithm: predict that the passenger survived
      1) if the passenger is female, or
      2) if his socio-economic status is high AND the passenger is under 18.
    '''
    predictions = []  # a list
    for passenger_index, passenger in titanicDF.iterrows():
        if passenger['Sex'] == 1:  # female
            predictions.append(1)  # survived
        elif passenger['Age'] < 18 and passenger['Pclass'] == 1:  # male, but minor and rich
            predictions.append(1)  # survived
        else:
            predictions.append(0)  # everyone else perished
    return predictions
```

Let’s see how this simple algorithm will behave on the validation dataset and we will keep that number as our baseline:

```python
simplePredictions = simple_heuristic(X_valid)
correct = sum(simplePredictions == y_valid)
print("Baseline: ", correct / len(y_valid))
```

```
Baseline: 0.731843575419
```

Baseline: this simple algorithm correctly predicts 73% of the validation cases.

Now let’s see if the model can do better.

## Logistic Regression

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
```

```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
```

# 4. Evaluate the model

```python
model.score(X_train, y_train)
```

```
0.8117977528089888
```

```python
model.score(X_valid, y_valid)
```

```
0.75977653631284914
```

Two things:

– the score on the training set is noticeably better than on the validation set, an indication that the model could be overfitting rather than generalising, e.g. to other shipwrecks.

– the score on the validation set is better than the baseline, so the model adds some value at a minimal cost (logistic regression is not computationally expensive, at least not for smaller datasets).
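One way to firm up the overfitting check (not done in this post) is k-fold cross-validation, which averages the score over several train/validation splits instead of relying on a single one. A minimal sketch on synthetic data (make_classification is a stand-in for the real Titanic features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic binary-classification data, NOT the Titanic set
X_demo, y_demo = make_classification(n_samples=500, n_features=9,
                                     random_state=7)

# 5-fold cross-validation: five scores, one per held-out fold
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_demo, y_demo, cv=5)
print(scores.mean())  # average accuracy over the 5 folds
```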

An advantage of logistic regression (e.g. against a neural network) is that it’s easily interpretable. It can be written as a math formula:

```python
model.intercept_  # the fitted intercept
```

```
array([ 1.32382218])
```

```python
model.coef_  # the fitted coefficients
```

```
array([[  2.94418113e-04,  -9.28482957e-01,   2.83925340e+00,
         -3.94599533e-02,  -3.88779847e-01,   1.14992067e-02,
          1.91946485e-03,   7.02467086e-01,   4.32510542e-01,
          1.88844553e-01]])
```

Which means that the formula is:

$$P(\text{survived}) = \frac{1}{1 + e^{-z}}$$

where the logit $z$ is:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

where $\beta_0$ is the model intercept and the other beta parameters are the model coefficients from above, each multiplied by the related feature $x_i$.
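The formula can also be verified numerically: applying the sigmoid to the linear combination of intercept and coefficients reproduces the library's own probabilities. A sketch on synthetic data (not the Titanic set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# logit z = beta_0 + sum_i beta_i * x_i, then probability = sigmoid(z)
z = model.intercept_ + X_demo @ model.coef_.ravel()
manual = 1.0 / (1.0 + np.exp(-z))

# matches the library's probability for the positive class
print(np.allclose(manual, model.predict_proba(X_demo)[:, 1]))  # True
```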

# 5. Iterate on the model

The model could be improved, for example by transforming the excluded features above or by creating new ones (e.g. I could extract titles from the names, which could be another indication of socio-economic status).
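A hedged sketch of the title idea (the names below are examples in the "Last, Title. First" format used by the Kaggle file; the regex is an assumption, not code from the post):

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
])

# title = the word between the comma and the following dot
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```

The extracted titles could then be one-hot encoded with get_dummies, just like the embarkment port.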

The resulting score on the Kaggle test set (a separate dataset, unseen during training) is **0.75119**

Note that the score on the validation set has been a good predictor!

Thanks Ed, that’s a good question.

It’s not easy to assess feature precision and significance in logistic regression via SKlearn.

One way to assess the importance is to do some feature engineering / selection (e.g. https://mashimo.wordpress.com/2014/10/09/features-selection-for-linear-regression/) using the overall model performance metrics (precision, recall, confusion matrix, …)

SKlearn feature selection is described in detail here: http://scikit-learn.org/stable/modules/feature_selection.html#

The easiest way to assess feature importance for a classification problem would likely be to use a random forest; forests make it very easy to measure the relative importance of each feature, also via SKlearn (the fitted model has a `feature_importances_` attribute).
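A minimal sketch of that random-forest approach, on synthetic data (not the Titanic set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for a feature matrix with 5 columns
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     random_state=1)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_demo, y_demo)

# one importance value per feature; the values sum to 1
print(forest.feature_importances_)
```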

SKlearn deliberately does not support statistical inference. For out-of-the-box coefficients significance tests (and much more), you can use the Logit estimator from Statsmodels.

http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.html

```python
import statsmodels.api as sm

sm_model = sm.Logit(y, X).fit()
sm_model.pvalues
sm_model.summary()
```

Thanks for the great explanation.

One question though: how would you assess the precision and significance of each variable in the logistic regression using SKlearn? (i.e. metrics equivalent to p-value, chi2 or t-stat)
