Multi-class logistic regression

We have seen several examples of binary logistic regression where the outcomes that we wanted to predict had two classes, such as a model predicting if a student will be admitted to the University (Yes or No) based on the previous exam results  or if a random Titanic passenger will survive or not.

Binary classification such as these are very common but you can also encounter classification problems where the outcome is a multi-class of more than two: for example if tomorrow weather will be sunny, cloudy or rainy; or if an incoming email shall be tagged as work, family, friends or hobby.

We see now a couple of approaches to handle such classification problems with a practical example: to classify a sky object based on a set of observed variables.
Data is from the Sloan Digital Sky Survey (Release 14).
For the example that we are using the sky object to be classified can be one of three classes: Star, Galaxy or Quasar.

The code is available also on a notebook in GitHub. Data is also available on GitHub.

Data Acquisition

As usual, first of all we import the data into a Pandas dataframe and explore it.

import pandas as pd

sdss = pd.read_csv('../datasets/Skyserver_SQL2_27_2018 6_51_39 PM.csv', skiprows=1)

objid ra dec u g r i z run rerun camcol field specobjid class redshift plate mjd fiberid
0 1237648704577142822 183.531326 0.089693 19.47406 17.04240 15.94699 15.50342 15.22531 752 301 4 267 3722360139651588096 STAR -0.000009 3306 54922 491

The class variable identifies an object to be either a galaxy, star or quasar.
This will be the output variable which we will predict. A multi-class target variable.

 GALAXY 4998
 STAR 4152
 QSO 850
 Name: class, dtype: int64
<class 'pandas.core.frame.DataFrame'>
 RangeIndex: 10000 entries, 0 to 9999
 Data columns (total 18 columns):
 objid 10000 non-null int64
 ra 10000 non-null float64
 dec 10000 non-null float64
 u 10000 non-null float64
 g 10000 non-null float64
 r 10000 non-null float64
 i 10000 non-null float64
 z 10000 non-null float64
 run 10000 non-null int64
 rerun 10000 non-null int64
 camcol 10000 non-null int64
 field 10000 non-null int64
 specobjid 10000 non-null uint64
 class 10000 non-null object
 redshift 10000 non-null float64
 plate 10000 non-null int64
 mjd 10000 non-null int64
 fiberid 10000 non-null int64
 dtypes: float64(8), int64(8), object(1), uint64(1)
 memory usage: 1.4+ MB

The dataset has 10000 examples, 17 features and 1 target.

objid ra dec u g r i z run rerun camcol field specobjid redshift plate mjd fiberid
count 1.000000e+04 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.0 10000.000000 10000.000000 1.000000e+04 10000.000000 10000.000000 10000.000000 10000.000000
mean 1.237650e+18 175.529987 14.836148 18.619355 17.371931 16.840963 16.583579 16.422833 981.034800 301.0 3.648700 302.380100 1.645022e+18 0.143726 1460.986400 52943.533300 353.069400
std 1.173967e+12 47.783439 25.212207 0.828656 0.945457 1.067764 1.141805 1.203188 273.305024 0.0 1.666183 162.577763 2.013998e+18 0.388774 1788.778371 1511.150651 206.298149
min 1.237647e+18 8.235100 -5.382632 12.988970 12.799550 12.431600 11.947210 11.610410 308.000000 301.0 1.000000 11.000000 2.995782e+17 -0.004136 266.000000 51578.000000 1.000000
25% 1.237649e+18 157.370946 -0.539035 18.178035 16.815100 16.173333 15.853705 15.618285 752.000000 301.0 2.000000 184.000000 3.389250e+17 0.000081 301.000000 51900.000000 186.750000
50% 1.237649e+18 180.394514 0.404166 18.853095 17.495135 16.858770 16.554985 16.389945 756.000000 301.0 4.000000 299.000000 4.966580e+17 0.042591 441.000000 51997.000000 351.000000
75% 1.237651e+18 201.547279 35.649397 19.259232 18.010145 17.512675 17.258550 17.141447 1331.000000 301.0 5.000000 414.000000 2.881300e+18 0.092579 2559.000000 54468.000000 510.000000
max 1.237652e+18 260.884382 68.542265 19.599900 19.918970 24.802040 28.179630 22.833060 1412.000000 301.0 6.000000 768.000000 9.468834e+18 5.353854 8410.000000 57481.000000 1000.000000

No missing values!

Data preprocessing

Remove unnecessary features

array(['objid', 'ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'run', 'rerun',
 'camcol', 'field', 'specobjid', 'class', 'redshift', 'plate', 'mjd',
 'fiberid'], dtype=object)

From the project website, we can see that objid and specobjid are just identifiers for accessing the rows in the original database. Therefore we will not need them for classification as they are not related to the outcome.

Moreover, the features run, rerun, camcol and field are values which describe parts of the camera at the moment when making the observation, e.g. ‘run’ represents the corresponding scan which captured the object.

We will drop these columns as any correlation to the outcome would be coincidentally.

sdss.drop(['objid', 'run', 'rerun', 'camcol', 'field', 'specobjid'], axis=1, inplace=True)
ra dec u g r i z class redshift plate mjd fiberid
0 183.531326 0.089693 19.47406 17.04240 15.94699 15.50342 15.22531 STAR -0.000009 3306 54922 491
1 183.598371 0.135285 18.66280 17.21449 16.67637 16.48922 16.39150 STAR -0.000055 323 51615 541
2 183.680207 0.126185 19.38298 18.19169 17.47428 17.08732 16.80125 GALAXY 0.123111 287 52023 513
3 183.870529 0.049911 17.76536 16.60272 16.16116 15.98233 15.90438 STAR -0.000111 3306 54922 510
4 183.883288 0.102557 17.55025 16.26342 16.43869 16.55492 16.61326 STAR 0.000590 3306 54922 512
# Extract output values
y = sdss['class'].copy() # copy “y” column values out

sdss.drop(['class'], axis=1, inplace=True) # then, drop y column


Split the dataset

We will split the data into a training and a test part (70-30). The models will be trained on the training data set and tested on the test data set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(sdss, y, test_size=0.3)

Data scaling

Scaling all values to be within the (0, 1) interval will reduce the distortion due to exceptionally high values and make some algorithms converge faster.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)

One vs all

The first approach – called also one vs rest – is the simplest classifier and comes directly from the binary logistic regression.

Our example has three classes but this classifier (in short OvA) can work with any number of classes.

The One-versus-all model is extremely simple and is exactly what you expect
of the words: you train a classifier for each class.

So for example, you train a first model that says the positive class is going to be all the Stars and the negative class is going to be everything else (in our example: galaxies and quasars);  it is just trying to learn a classifier that separates the stars from the other sky objects.

Boundary between stars and the rest

In particular – on the star side – the probability that y is equal to a star given the input x_i and the parameter w is going to be greater than 0.5 and on the other side – the rest – the probability that y is equal to a star given the input x_i and the parameters w is going to be less than 0.5

Next, we’ll learn a model for each one of the other cases. So, it’s going to be 1 versus all for each one of the classes.
We will learn a model that tries to compare and separate galaxies from stars and quasars. And lastly, we have a model for the quasars prediction.

Different lines = different classifiers

The weights fitted for each model will be different, as you can see by the lines being different.

Once we trained these models, what do we output?
Whatever class has the highest probability, it wins. So in other words, if the probability that an input is a star against everything else is higher, then the object is a star.

And with that simple algorithm, we now have a multi-class classification system by using a number of these binary classifiers.

It’s possible to use the Logistic Regression we have already used  to create all the needed models and then implement a simple for loop to find out which one has the highest probability but sklearn offers a simpler way to do it:

from sklearn.linear_model import LogisticRegression

  # Create One-vs-rest logistic regression object
ovr = LogisticRegression(<strong>multi_class='ovr',</strong> solver='liblinear')

modelOvr =, y_train)

modelOvr.score(X_test_scaled, y_test)

A very simple classifier like this has already a high accuracy of 90%

As you can imagine, if the target variable has many classes, this classifier will take a long time to train and therefore it’s not really scalable.

Using the multinomial classifier

Sklearn offers also a classifier that uses the Softmax function instead of the Sigmoid function and the cross-entropy as loss function.

This softmax regression classifier is appropriate when the target classes are mutually exclusive, such as in our sky objects case.

If the case is to classify an image as indoor scene or black & white or contains people, than these would be non mutually exclusive classes (e.g. a picture could be indoor AND contains people) and to build three separate logistic classifiers would be more appropriate.

You can skip the following paragraph if you are not interested in what are Softmax and a cross-entropy function and prefer to jump directly to the implementation.

Softmax function

The Softmax function is a generalisation of the sigmoid function. It calculates the probabilities distribution of the event over ‘K’ different events; the predicted probability for the i-th class given a sample input vector x would be:

P(y=i|x) =\frac{e^{X^{T}w_{i}}} {\sum_1^K e^{ x^{T} w_{i} }}

where K is the number of classes of the target variable, x is the input vector and w its weights

It returns the probabilities of each class and the target class will have the high probability.

Here is an example:

import numpy as np

K = 7 # seven classes
x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0] # input X

  # assuming - for simplicity - that the weights are all 1
softmax = lambda x : np.exp(x)/np.sum(np.exp(x))

probabilities = softmax(x)
array([ 0.02364054, 0.06426166, 0.1746813 , 0.474833 , 0.02364054,
0.06426166, 0.1746813 ])
print("The target class with the highest probability is the index={0} and has prob={1}"
      .format(probabilities.argmax(), max(probabilities)))
The target class with the highest probability is the index=3 and has prob=0.4748329997443803

The cross-entropy loss function

When we train a model, we need a way to measure the difference between the predicted probabilities ŷ  and the ground-truth probabilities y, so that we tune the parameters to minimise this difference. Cross entropy is a reasonable choice for this measure.

Entropy originated from the Information Theory by Shannon.
The theory says that when you represent an information using e.g. bits and you need to transmit them, the cost is proportional to the number of bits, so you want to use less bits for the most common information.
For example, I could say that a Star is 0, a Galaxy is 1 and a Quasar (the least common object) is a 10, and so on.
By fully exploiting the known (and correct) distribution over sky objects models in this way, we reduce the number of bits needed on average.
This optimal number of bits is known as entropy.

In contrast, cross entropy is the number of bits needed if we encode symbols from the true distribution y using the wrong distribution ŷ (e.g. the distribution trained by the model) and can be used to define the loss function in machine learning and optimisation. Minimising cross entropy is equivalent to minimising the negative log likelihood of our data, which is a direct measure of the predictive power of our model.

Cross entropy is always larger than entropy; encoding symbols according to the approximate distribution ŷ  will always need more bits.

In practice a multi-class logistic classifier can use the cross-entropy function to find the similarity distance between the probabilities calculated from the softmax function and the target. For the right target class, the distance value will be less.

Sklearn logistic regression using the cross entropy measure

In the multiclass case, the training algorithm uses the cross- entropy loss if the ‘multi_class’ option is set to ‘multinomial’.

For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs
handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.

# Create cross-entropy-loss logistic regression object
xe = LogisticRegression(multi_class='multinomial', solver='newton-cg')
# Train model
modelXE =, y_train)

preds = modelXE.predict(X_test_scaled)

modelXE.score(X_test_scaled, y_test)

A small improvement.

More important, it generalises to situations where the number of classes is very large and one-vs-all is intractable

Later we will see different, non-linear approaches to the multi-class problems, such as decision trees and random forests.

Note: this post is part of a series about Machine Learning with Python.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s