Logistic regression using SKlearn

We have already seen an introduction to logistic regression with a simple example: predicting a student's admission to university based on past exam results.
That was done in Python from scratch, defining the sigmoid function and the gradient descent, and we have also seen the same example using the statsmodels library.

Now we are going to see how to solve a logistic regression problem using the popular scikit-learn library, specifically its LogisticRegression class.

The example this time is to predict survival on the Titanic (the ship that sank after hitting an iceberg).
It’s a basic learning competition on the ML platform Kaggle and a gentle introduction to machine learning concepts, specifically binary classification (survived / not survived).
Here we look at how to apply logistic regression to the Titanic dataset.

You can follow along with the Python notebook on GitHub or the Python kernel on Kaggle.

1. Collect and understand the data

The data can be downloaded directly from Kaggle:

import pandas as pd

# get the Titanic training CSV file as a DataFrame
titanic = pd.read_csv("../input/train.csv")

titanic.shape
(891, 12)

Each row is a passenger (891 in total) and each one is described by 12 columns.
These include the passenger’s name, sex and age, the ticket class (1st, 2nd or 3rd) and so on. The column Survived is a flag: 1 if that passenger survived, 0 otherwise.

Not all features are numeric:

titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
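
Before any processing, a couple of quick optional checks (not part of the original walkthrough) give a feel for the data:

titanic.head()                   # first few rows, to see what the raw features look like
titanic.Survived.value_counts()  # class balance: how many survived (1) vs perished (0)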

2. Process the data

Categorical variables need to be transformed into numeric variables.

Transform the embarkation port

There are three ports: C = Cherbourg, Q = Queenstown, S = Southampton

ports = pd.get_dummies(titanic.Embarked, prefix='Embarked')

Now the feature Embarked (a category) has been transformed into 3 binary features, e.g. Embarked_C = 0 means not embarked in Cherbourg, 1 means embarked in Cherbourg.
Finally, the 3 new binary features replace the original one in the data frame:

titanic = titanic.join(ports)
titanic.drop(['Embarked'], axis=1, inplace=True) # then drop the original column
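
As a quick sanity check (an optional step, not in the original post), the new columns can be inspected while ports is still in scope:

ports.head()   # three 0/1 indicator columns: Embarked_C, Embarked_Q, Embarked_S
ports.sum()    # how many passengers embarked at each port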

Transform the gender feature

This one is easier, as it is already binary (male or female, this was 1912). A simple mapping will do:

titanic.Sex = titanic.Sex.map({'male':0, 'female':1})

Extract the target variable

Create an X dataframe with the input features and a y series with the target.
As we said, Survived is the target variable:

y = titanic.Survived.copy() # copy the "y" column values out

X = titanic.drop(['Survived'], axis=1) # then, drop y column

Drop less important features

For this first model, we drop some features (mostly free-text or identifier columns) that are unlikely to add much signal.

X.drop(['Cabin'], axis=1, inplace=True) 

X.drop(['Ticket'], axis=1, inplace=True) 

X.drop(['Name'], axis=1, inplace=True) 

X.drop(['PassengerId'], axis=1, inplace=True) 
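
Equivalently, the four columns can be dropped in a single call; a minor stylistic variation on the lines above, not from the original notebook:

X.drop(['Cabin', 'Ticket', 'Name', 'PassengerId'], axis=1, inplace=True)  # same effect as the four separate drops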

X.info()
<class 'pandas.core.frame.DataFrame'>
 RangeIndex: 891 entries, 0 to 890
 Data columns (total 10 columns):
 Pclass 891 non-null int64
 Sex 891 non-null int64
 Age 714 non-null float64
 SibSp 891 non-null int64
 Parch 891 non-null int64
 Fare 891 non-null float64
 Embarked_C 891 non-null uint8
 Embarked_Q 891 non-null uint8
 Embarked_S 891 non-null uint8
 dtypes: float64(2), int64(5), uint8(3)
 memory usage: 51.4 KB

All features are now numeric and ready for the regression.
But there are still a couple of processing steps to do.

Check if there are any missing values

X.isnull().values.any()
True
#X[pd.isnull(X).any(axis=1)]  # uncomment to list the rows that contain missing values

True: there are missing values (NaN) in the data, and a quick look reveals that they are all in the Age feature.
One possibility would be to remove the feature altogether; another is to fill the missing values with a fixed number or with the average age.

X.Age.fillna(X.Age.mean(), inplace=True)  # replace NaN with average age

X.isnull().values.any()
False

Now all missing values have been filled in.
The logistic regression would otherwise not work, since it cannot handle missing values.
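
As an alternative sketch, the same mean imputation can be expressed with scikit-learn's SimpleImputer (note: SimpleImputer lives in sklearn.impute only in more recent scikit-learn versions; the fillna call above is what this post actually uses):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')        # replace NaN with the column mean, like fillna above
X[['Age']] = imputer.fit_transform(X[['Age']])  # fit on Age and transform it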

Split the dataset into training and validation

The training set will be used to build the machine learning model, using features such as the passengers’ gender and class together with the known survival flag.

The validation set is used to see how well the model performs on unseen data. For each passenger in the validation set, the trained model predicts whether or not they survived the sinking of the Titanic, and the prediction is then compared with the actual survival flag.

from sklearn.model_selection import train_test_split
  # 80% of the passengers go into the training set, 20% into the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=7)
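
Optionally, the split can also preserve the survived/perished proportions in both subsets by passing stratify (a variation not used in the original run):

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)  # keep the same class balance in both sets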

3. Modelling

Get a baseline

A baseline is always useful, to check whether the trained model behaves significantly better than something easy to obtain, such as a random guess or a simple heuristic like "all and only the female passengers survived". In this case, after a quick look at the training dataset (where the survival outcome is present) I am going to use the following:

def simple_heuristic(titanicDF):
    '''
    predict whether each passenger survived or perished.
    Here's the algorithm: predict that the passenger survived
    1) if the passenger is female, or
    2) if their socioeconomic status is high AND they are under 18
    '''

    predictions = [] # a list

    for passenger_index, passenger in titanicDF.iterrows():

        if passenger['Sex'] == 1:
                    # female
            predictions.append(1)  # survived
        elif passenger['Age'] < 18 and passenger['Pclass'] == 1:
                    # male but minor and rich
            predictions.append(1)  # survived
        else:
            predictions.append(0) # everyone else perished

    return predictions

Let’s see how this simple algorithm behaves on the validation dataset; we will keep that number as our baseline:

simplePredictions = simple_heuristic(X_valid)
correct = sum(simplePredictions == y_valid)
print ("Baseline: ", correct/len(y_valid))
Baseline: 0.731843575419

Baseline: the simple heuristic correctly predicts about 73% of the validation cases.
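
As an aside, the even simpler rule mentioned earlier ("all and only the female passengers survived") can be scored in a couple of lines; a hedged extra check, not part of the original post:

femaleOnlyPredictions = (X_valid.Sex == 1).astype(int)     # predict survival for every female passenger
print("All-female baseline:", (femaleOnlyPredictions == y_valid).mean())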
Now let’s see if the model can do better.

Logistic Regression

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
 intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
 penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
 verbose=0, warm_start=False)

4. Evaluate the model

model.score(X_train, y_train)
0.8117977528089888
model.score(X_valid, y_valid)
0.75977653631284914

Two things to note:
– the score on the training set is noticeably better than on the validation set, a hint that the model could be overfitting and may not generalise to unseen data (e.g. to other ship sinkings).
– the score on the validation set is better than the baseline, so the model adds some value at minimal cost (logistic regression is not computationally expensive, at least not on small datasets).
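
Accuracy alone does not show how the errors are split between the two classes. A short sketch of a more detailed look at the validation predictions, using standard utilities from sklearn.metrics (not part of the original post):

from sklearn.metrics import confusion_matrix, classification_report

y_pred = model.predict(X_valid)                # hard class predictions on the validation set
print(confusion_matrix(y_valid, y_pred))       # rows are the actual classes, columns the predicted ones
print(classification_report(y_valid, y_pred))  # precision, recall and F1 for each class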

An advantage of logistic regression (e.g. compared to a neural network) is that it is easily interpretable: the model can be written as a mathematical formula:

model.intercept_ # the fitted intercept
array([ 1.32382218])
model.coef_  # the fitted coefficients
array([[ 2.94418113e-04, -9.28482957e-01, 2.83925340e+00,
 -3.94599533e-02, -3.88779847e-01, 1.14992067e-02,
 1.91946485e-03, 7.02467086e-01, 4.32510542e-01,
 1.88844553e-01]])

Which means that the formula is:

P(\text{survived}) = \frac{1}{1+e^{-\text{logit}}}

where the logit is:

\text{logit} = \beta_{0} + \beta_{1}\cdot x_{1} + ... + \beta_{n}\cdot x_{n}

where \beta_{0} is the model intercept and the other beta parameters are the model coefficients printed above, each multiplied by the corresponding feature:

\text{logit} = 1.3238 - 0.9285 \cdot Pclass + ... + 0.1888 \cdot Embarked\_S
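
To avoid matching coefficients to features by position, they can be paired with the column names; a small convenience sketch, not in the original post:

coefficients = pd.Series(model.coef_[0], index=X_train.columns)  # one coefficient per input feature
print(coefficients.sort_values())                                # from most negative to most positive effect on the logit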

5. Iterate on the model

The model could be improved, for example by transforming the features excluded above or by creating new ones (e.g. titles could be extracted from the names as another indication of socio-economic status; see the sketch below).
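
A hedged sketch of the title idea: names in the dataset look like "Braund, Mr. Owen Harris", so the title can be pulled out with a simple regular expression (the exact pattern is an assumption, not code from the original post):

titles = titanic.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)  # the word ending with a dot, e.g. Mr, Mrs, Miss
print(titles.value_counts())                                        # how common each title is
title_dummies = pd.get_dummies(titles, prefix='Title')              # could be joined to X like the Embarked dummies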

The resulting score on the Kaggle test set (a separate dataset, not seen during training) is 0.75119.
Note that the score on the validation set was a good predictor of it!
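
For reference, this is roughly how that submission file is produced: the Kaggle test.csv gets the same transformations as the training data, and the predictions are written out next to the passenger ids (a condensed sketch following the steps above; details such as the extra missing Fare value in the test set are assumptions, check the linked notebook for the exact code):

test = pd.read_csv("../input/test.csv")                             # Kaggle test set, same folder convention as train.csv
test = test.join(pd.get_dummies(test.Embarked, prefix='Embarked'))  # same dummy encoding as for the training data
test.Sex = test.Sex.map({'male': 0, 'female': 1})                   # same gender mapping
X_test = test.drop(['Embarked', 'Cabin', 'Ticket', 'Name', 'PassengerId'], axis=1)
X_test.Age.fillna(X.Age.mean(), inplace=True)                       # fill missing ages with the training mean
X_test.Fare.fillna(X.Fare.mean(), inplace=True)                     # the test set also has a missing Fare value
submission = pd.DataFrame({'PassengerId': test.PassengerId,
                           'Survived': model.predict(X_test)})
submission.to_csv('submission.csv', index=False)                    # file to upload to Kaggle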
