# An introduction to logistic regression

Variables can be described as either quantitative or qualitative.
Quantitative variables have a numerical value, e.g. a person’s income, or the price of a house.
Qualitative variables take values from one of a set of classes or categories, e.g. a person’s gender (male or female), the type of house purchased (villa, flat, penthouse, …), eye colour (brown, blue, green) or a cancer diagnosis.

Linear regression predicts a continuous variable, but sometimes we want to predict a categorical variable, i.e. a variable with a small number of possible discrete outcomes, usually unordered (there is no order among the outcomes).

Problems of this kind are called classification problems.

# Classification

Given a feature vector X and a qualitative response y taking values from a fixed set C, the classification task is to build a function f(X) that takes the feature vector X as input and predicts a value for y.
Often we are also (or even more) interested in estimating the probability that X belongs to each category in C.
For example, it is more valuable to know the probability that an insurance claim is fraudulent than to get a bare fraudulent / not-fraudulent classification.

There are many possible classification techniques, or classifiers, available to predict a qualitative response.

We will now look at one of them, called logistic regression.

Note: this post is part of a series about Machine Learning with Python.

# Binary logistic regression

Frequently we care about outcomes that can take only two values; this case is called binary logistic regression.

Binary means that we want to predict the output from a set with only two categories, such as true or false, good or bad, yes or no, success or failure, win or loss.

For convenience we are using the generic 0 or 1 coding for the response.

One property of binary logistic regression is that, having only two possible outcomes, knowing the probability of one outcome we can always derive the other.
Let’s say we know that the probability of the outcome being zero is $p_{0}$; then:

$P(y=0) = p_{0} \Rightarrow P(y=1) = 1 - P(y=0) = 1 - p_{0}$
(the two probabilities must sum to 1).

As an example we will use one explained by Andrew Ng in his CS229 course on Machine Learning at Stanford, which has also inspired his MOOC on Coursera:
we will build a model to predict whether a student gets admitted into a university (i.e. the output classes are Yes or No), based on their results on past exams.
We encode Yes as 1 and No as 0.

This example is also available as an interactive Jupyter notebook on my GitHub.

Could we use Linear Regression to build such a model?

Since a classifier that estimates class probabilities outputs a number, classification can be cast as a regression problem.

In the case of a binary outcome, linear regression could do a reasonable job as a classifier;
the linear regression model would be given by this hypothesis function:

$p(X) = \beta_{0} + \beta_{1} \cdot X$

where X holds the exam results and p(X) is the predicted probability of being admitted.

However, linear regression might produce probabilities less than zero or greater than one, which makes little sense, and it performs very poorly on classification problems with more than two classes. Logistic regression is more appropriate.
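To see the problem concretely, here is a minimal sketch (the coefficients are made up for illustration) comparing the unbounded linear output with its sigmoid-transformed counterpart:

```python
import numpy as np

# Hypothetical coefficients, for illustration only
beta0, beta1 = -3.0, 0.05
scores = np.array([0.0, 60.0, 100.0])  # made-up exam scores

linear = beta0 + beta1 * scores       # can fall outside [0, 1]
logistic = 1 / (1 + np.exp(-linear))  # always strictly inside (0, 1)

print(linear)    # -3.0, 0.0, 2.0: outside the [0, 1] range
print(logistic)
```

The linear output cannot be read as a probability, while the logistic one can.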

We have historical data from previous applicants (results of two exams and admission result) that we can use as training set.

## The logistic function

To use logistic regression, we need to change the form of our hypothesis function, as well as the cost function that we used in linear regression.

Logistic regression instead uses the logistic response function to model how the probability of the event “Admitted”, P(y=1), is affected by our variables (the exam scores x):

$P(y=1) = \frac{1}{1 +e^{-(\beta_{0} + \beta_{1} \cdot x_{1} + ... + \beta_{n} \cdot x_{n}) }}$

This gives the probability that the outcome y is equal to 1.
It has an easily calculated derivative, which makes it ideal (computationally speaking) for finding the optimal beta parameters.

Just as a side note, the linear combination in the exponent is called the logit:

$P(y=1) = \frac{1}{1 +e^{-logit }}$

The bigger the logit, the bigger P(y=1).

## Sigmoid function

The generic function $P(y=1) = \frac{1}{1 +e^{-z }}$ is called the sigmoid function.
We have already seen it in the sigmoid neurons.

Its main advantage is that it is a nonlinear transformation of the linear regression equation that produces a number between 0 and 1.

Notice that g(z) tends towards 1 as z → ∞, and g(z) tends towards 0 as z → −∞.

Other functions that smoothly increase from 0 to 1 can also be used, but the sigmoid is the most widespread choice for logistic regression.

Our first step is to implement the sigmoid function:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# we could easily get h(x) given beta and input X (my exams, for example):
h = sigmoid(np.dot(X_my_exams, beta))  # probability of admission
```
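A quick numerical check of the limiting behaviour described above (the function is redefined here so the snippet runs standalone):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))                        # 0.5: the midpoint
print(sigmoid(np.array([-10.0, 10.0])))  # close to 0 and 1 respectively
```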


So we now have P(y=1) = h(x) and P(y=0) = 1 - h(x), where h(x) is the hypothesis function using a set of beta values and the sigmoid function.

We still need to find the correct beta parameters for the model.

## Fit the beta parameters

The coefficients β0 and β1 are unknown, and must be estimated based on the available training data.
This is our next step – similar to the linear regression: fit the beta values, given the input X values.

We fitted the beta parameters for linear regression using a method called maximum likelihood (least squares is just a special case of it) and the gradient descent algorithm.

We can use a similar method for the logistic regression too.
The basic intuition behind using maximum likelihood is as follows: we seek estimates for the beta parameters such that the predicted probability for each individual, corresponds as closely as possible to the individual’s observed status.

The mathematical details of maximum likelihood are beyond the scope of this post.

But in short, we want a cost function with this property:

$Cost(h_{\beta }(x),y) = \begin{cases} -\log (h_{\beta }(x)) & \text{if } y=1 \\ -\log (1 - h_{\beta }(x)) & \text{if } y=0 \end{cases}$

The cost is zero if y=1 and h=1, but tends to infinity as h tends to 0 while y=1.
This captures the intuition that if the model predicts P(y=1)=0 when in reality y=1, we penalise the learning algorithm with a very large cost.

These can be combined into:

$Cost(h_{\beta }(x),y) = - y \cdot \log (h_{\beta }(x)) - (1-y) \cdot \log (1 - h_{\beta }(x))$
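Plugging in the two possible values of y recovers the piecewise definition:

$y=1: \; Cost = -1 \cdot \log (h_{\beta }(x)) - 0 = -\log (h_{\beta }(x))$

$y=0: \; Cost = -0 - 1 \cdot \log (1 - h_{\beta }(x)) = -\log (1 - h_{\beta }(x))$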

Finally:

$J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} Cost(h_{\beta }(x_{i}),y_{i})$

which is guaranteed to be convex and non-negative.
Here is the code:

```python
def getCost(beta, X, y):
    """
    Compute the cost of a particular choice of beta as the
    parameter for logistic regression.

    Arguments:
    beta: parameters, list
    X: input data points, array
    y: output data points, array

    Returns:
    float (the cost)
    """
    # Initialize some useful values
    y = np.squeeze(y)  # avoid broadcasting when multiplying element-wise
    m = len(y)  # number of training examples

    h = sigmoid(np.dot(X, beta))

    # J cost function
    term1 = y * np.log(h)
    term2 = (1 - y) * np.log(1 - h)
    cost = -np.sum(term1 + term2) / m

    return cost
```
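As a sanity check on getCost: with all beta parameters at zero the sigmoid outputs 0.5 for every example, so the cost must be -log(0.5) = log 2 ≈ 0.693 whatever the labels are. A standalone sketch (the function is re-stated compactly and the toy data is made up):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def getCost(beta, X, y):
    y = np.squeeze(y)
    m = len(y)
    h = sigmoid(np.dot(X, beta))
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

# Toy data: 4 students, intercept column plus two exam scores
X = np.array([[1.0, 45.0, 85.0],
              [1.0, 30.0, 40.0],
              [1.0, 95.0, 70.0],
              [1.0, 60.0, 55.0]])
y = np.array([1, 0, 1, 0])

cost = getCost(np.zeros(3), X, y)
print(cost)  # log(2) ≈ 0.693
```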

Now that we have a working cost function J, the next step is to find the beta parameters that will minimise the cost for the given training set X and y.

To do this, we write a function that computes the gradient of the model parameters. Recall that with gradient descent we update the parameters in a direction that reduces the training error (the “cost”); we can do this because the cost function is differentiable:

Goal: min J(beta)

Repeat:

$\beta_{j} = \beta_{j} - \alpha \cdot \frac{\partial }{\partial \beta_{j}}J(\beta )$
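For the logistic cost J above, the partial derivative has a simple closed form: the prediction error averaged against each feature,

$\frac{\partial }{\partial \beta_{j}}J(\beta ) = \frac{1}{m} \sum_{i=1}^{m} \left( h_{\beta }(x_{i}) - y_{i} \right) \cdot x_{i,j}$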

This time we will take a different approach: instead of implementing the entire gradient descent algorithm ourselves, we just compute a single gradient step.
Then we will call an optimiser function from the scientific package SciPy, which can perform different optimisation algorithms.

The advantage, besides it being a highly tuned function, is that we do not have to write any loops or set a learning rate alpha.
This is all done by the optimisation function: we only need to provide an objective function that calculates the cost and the gradient.
You could also keep them separate, but it is nice to re-use the sigmoid computation.
Therefore, we are going to add the gradient step to the cost function above and rename it:

```python
def getCostGradient(beta, X, y):
    """
    Compute the cost of a particular choice of beta as the
    parameter for logistic regression, and the gradient of the cost
    w.r.t. the parameters.

    Arguments:
    beta: parameters, list
    X: input data points, array
    y: output data points, array

    Returns:
    float - the cost
    array of float - the gradient (same dimension as the beta parameters)
    """
    # Initialize some useful values
    y = np.squeeze(y)  # avoid broadcasting when multiplying element-wise
    m = len(y)  # number of training examples
    grad = np.zeros(beta.shape)  # grad has the same dimensions as beta

    h = sigmoid(np.dot(X, beta))

    # J cost function
    y0 = y * np.log(h)
    y1 = (1 - y) * np.log(1 - h)
    cost = -np.sum(y0 + y1) / m

    # gradient: partial derivatives of the cost w.r.t. each beta parameter
    error = h - y
    grad = np.dot(error, X) / m

    return (cost, grad)
```
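One way to gain confidence in the gradient code is a finite-difference check: perturb each beta in turn and compare the numerical slope of the cost with the analytical gradient. A standalone sketch (the data values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def getCostGradient(beta, X, y):
    y = np.squeeze(y)
    m = len(y)
    h = sigmoid(np.dot(X, beta))
    cost = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
    grad = np.dot(h - y, X) / m
    return cost, grad

X = np.array([[1.0, 0.5, -1.2],
              [1.0, -0.3, 0.8],
              [1.0, 1.1, 0.4]])
y = np.array([1, 0, 1])
beta = np.array([0.1, -0.2, 0.3])

_, grad = getCostGradient(beta, X, y)

# Numerical gradient via central differences
eps = 1e-6
num_grad = np.zeros_like(beta)
for j in range(len(beta)):
    bp, bm = beta.copy(), beta.copy()
    bp[j] += eps
    bm[j] -= eps
    num_grad[j] = (getCostGradient(bp, X, y)[0] - getCostGradient(bm, X, y)[0]) / (2 * eps)

print(np.max(np.abs(grad - num_grad)))  # should be tiny
```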

The scipy.optimize package provides several commonly used optimisation algorithms, including what we need: unconstrained and constrained minimisation and least-squares algorithms.
Specifically, we can use the minimize() function with these main arguments:

• fun — the objective function, providing the cost, i.e. our getCostGradient().
• x0 — the initial guess, i.e. our initial beta parameters.
• args — the extra arguments for the objective function, in our case the input values X and y.
• jac — the gradient (Jacobian) of the objective function. Passing True tells the optimiser that the objective function returns the gradient together with the cost.
• method — the type of solver; several are available. Here we could use the BFGS, TNC or Newton-CG algorithms.
```python
import scipy.optimize as opt

result = opt.minimize(fun=getCostGradient, x0=initialBeta, args=(X, y),
                      method='Newton-CG', jac=True)
optimalBeta = result.x  # and here we have our beta parameters!
```
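Since the notebook’s exam data is not reproduced here, the whole pipeline can be sketched end to end on synthetic data (the data-generating process below is invented purely for illustration):

```python
import numpy as np
import scipy.optimize as opt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def getCostGradient(beta, X, y):
    y = np.squeeze(y)
    m = len(y)
    h = sigmoid(np.dot(X, beta))
    cost = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
    grad = np.dot(h - y, X) / m
    return cost, grad

# Synthetic two-feature data with some class overlap
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
y = (features.sum(axis=1) + rng.normal(size=200) > 0).astype(float)
X = np.column_stack([np.ones(200), features])  # add intercept column

initialBeta = np.zeros(3)
result = opt.minimize(fun=getCostGradient, x0=initialBeta,
                      args=(X, y), method='Newton-CG', jac=True)
optimalBeta = result.x
print(optimalBeta)
```

The fitted cost should drop well below log 2 ≈ 0.693, the cost of the all-zero starting point.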

## Decision boundary

This final beta value can then be used to plot the decision boundary on the training data (the resulting figure can be seen in the notebook linked above).

The blue line is our decision boundary: when your exam scores lie below the line, the prediction is that you will not be admitted to university; if they lie above it, the prediction is that you will.
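With two exam scores plus an intercept, the boundary is the line where the linear part is zero, i.e. where P(y=1) = 0.5. A small sketch, using hypothetical beta values (not the fitted ones):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical fitted parameters: [intercept, exam1, exam2]
optimalBeta = np.array([-25.0, 0.2, 0.2])

def boundary_x2(x1, beta):
    # Solve beta0 + beta1*x1 + beta2*x2 = 0 for x2
    return -(beta[0] + beta[1] * x1) / beta[2]

x1 = 60.0
x2 = boundary_x2(x1, optimalBeta)
p = sigmoid(optimalBeta @ np.array([1.0, x1, x2]))
print(x2, p)  # any point on the line has probability exactly 0.5
```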

As you can see, the boundary does not predict perfectly on the historical training data. It’s a model: not perfect, but useful.
What we can do is measure its accuracy.

## Accuracy

Now let’s write a function to predict if a new student will be admitted.

```python
def predict(beta, X):
    probabilities = sigmoid(np.dot(X, beta))
    return [1 if p >= 0.5 else 0 for p in probabilities]

predictions = predict(optimalBeta, X)

correct = [1 if a == b else 0
           for (a, b) in zip(predictions, np.squeeze(y))]
accuracy = sum(correct) / len(correct) * 100  # fraction correct, as a percentage

print('accuracy = {0:.0f}%'.format(accuracy))
```
Out[1]: accuracy = 89%

In the next post of the series we can see how to use higher level packages to model a logistic regression.
