We have seen several examples of binary logistic regression where the outcome to predict had **two classes**, such as a model predicting whether a student will be admitted to a university (Yes or No) based on previous exam results, or whether a random Titanic passenger will survive or not.

**Binary classifications** such as these are very common, but you can also encounter classification problems where the outcome is **multi-class**, with more than two possible values: for example, whether tomorrow's weather will be sunny, cloudy or rainy; or whether an incoming email should be tagged as work, family, friends or hobby.

We will now see a couple of approaches to handle such classification problems, with a practical example: classifying a sky object based on a set of observed variables.

Data is from the Sloan Digital Sky Survey (Release 14).

In our example, the sky object to be classified can belong to one of three classes: Star, Galaxy or Quasar.

The code is also available in a notebook on GitHub, and so is the data.

## Data Acquisition

As usual, first of all we import the data into a Pandas dataframe and explore it.

```python
import pandas as pd

sdss = pd.read_csv('../datasets/Skyserver_SQL2_27_2018 6_51_39 PM.csv', skiprows=1)
sdss.head(1)
```

The *class* variable identifies an object to be either a galaxy, star or quasar.

This will be the output variable which we will predict. A multi-class target variable.

```python
sdss['class'].value_counts()
```

```
GALAXY    4998
STAR      4152
QSO        850
Name: class, dtype: int64
```

```python
sdss.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
objid        10000 non-null int64
ra           10000 non-null float64
dec          10000 non-null float64
u            10000 non-null float64
g            10000 non-null float64
r            10000 non-null float64
i            10000 non-null float64
z            10000 non-null float64
run          10000 non-null int64
rerun        10000 non-null int64
camcol       10000 non-null int64
field        10000 non-null int64
specobjid    10000 non-null uint64
class        10000 non-null object
redshift     10000 non-null float64
plate        10000 non-null int64
mjd          10000 non-null int64
fiberid      10000 non-null int64
dtypes: float64(8), int64(8), object(1), uint64(1)
memory usage: 1.4+ MB
```

The dataset has 10000 examples, 17 features and 1 target.

```python
sdss.describe()
```

No missing values!

## Data preprocessing

### Remove unnecessary features

```python
sdss.columns.values
```

```
array(['objid', 'ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'run', 'rerun',
       'camcol', 'field', 'specobjid', 'class', 'redshift', 'plate',
       'mjd', 'fiberid'], dtype=object)
```

From the project website, we can see that *objid* and *specobjid* are just identifiers for accessing the rows in the original database. Therefore we will not need them for classification as they are not related to the outcome.

Moreover, the features *run*, *rerun*, *camcol* and *field* are values that describe the camera setup at the moment of the observation, e.g. *run* identifies the scan that captured the object.

We will drop these columns, as any correlation with the outcome would be coincidental.

```python
sdss.drop(['objid', 'run', 'rerun', 'camcol', 'field', 'specobjid'], axis=1, inplace=True)
```

```python
sdss.head()
```

```python
# Extract output values
y = sdss['class'].copy()                     # copy the "y" column values out
sdss.drop(['class'], axis=1, inplace=True)   # then, drop the y column
```

## Training

### Split the dataset

We will split the data into a training set and a test set (70/30). The models will be trained on the training set and evaluated on the test set.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(sdss, y, test_size=0.3)
```

### Data scaling

Scaling all values into the [0, 1] interval reduces the distortion caused by exceptionally large values and makes some algorithms converge faster.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

# One vs all

The first approach, also called one-vs-rest, is the simplest classifier and follows directly from binary logistic regression.

Our example has three classes, but this classifier (OvA for short) can work with any number of classes.

The One-versus-all model is extremely simple and is exactly what the name suggests: you train a classifier **for each class**.

So for example, you train a first model that says the positive class is going to be all the Stars and the negative class is going to be everything else (in our example: galaxies and quasars); it is just trying to learn a classifier that separates the stars from the other sky objects.

In particular, on the star side, the probability that y equals Star given the input x_i and the parameters w will be greater than 0.5; on the other side (the rest), the probability that y equals Star given the input x_i and the parameters w will be less than 0.5.

Next, we’ll learn a model for each one of the other cases. So, it’s going to be 1 versus all for each one of the classes.

We will learn a model that tries to compare and separate galaxies from stars and quasars. And lastly, we have a model for the quasars prediction.

The weights fitted for each model will be different, so each model draws a different decision boundary.

Once we trained these models, what do we output?

**Whichever class has the highest probability wins.** In other words, if the probability that an input is a star, against everything else, is the highest, then the object is a star.

And with that simple algorithm, we now have a multi-class classification system by using a number of these binary classifiers.

It’s possible to use the logistic regression we have already seen to create all the needed models and then implement a simple for loop to find the one with the highest probability, but *sklearn* offers a simpler way to do it.
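To see what that for loop would look like, here is a minimal sketch of a hand-rolled one-vs-all classifier, using synthetic data in place of the SDSS features (the dataset and variable names here are illustrative, not the article's actual data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled features, with 3 classes
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
classes = np.unique(y)

# One binary classifier per class: "this class" vs "everything else"
models = []
for c in classes:
    clf = LogisticRegression(solver='liblinear')
    clf.fit(X, (y == c).astype(int))   # positive class = c, negative = the rest
    models.append(clf)

# For each sample, pick the class whose model outputs the highest probability
probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
preds = classes[probs.argmax(axis=1)]
print((preds == y).mean())             # training accuracy of the hand-rolled OvA
```

This is exactly what `multi_class='ovr'` does internally, minus the bookkeeping.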

```python
from sklearn.linear_model import LogisticRegression

# Create One-vs-rest logistic regression object
ovr = LogisticRegression(multi_class='ovr', solver='liblinear')
modelOvr = ovr.fit(X_train_scaled, y_train)
modelOvr.score(X_test_scaled, y_test)
```

```
0.90233333333333332
```

A very simple classifier like this already achieves a high accuracy of 90%.

As you can imagine, if the target variable has many classes, this classifier will take a long time to train and therefore it’s not really scalable.

## Using the multinomial classifier

*Sklearn* also offers a classifier that uses the Softmax function instead of the sigmoid, and cross-entropy as the loss function.

This softmax regression classifier is appropriate when the target classes are **mutually exclusive,** such as in our sky objects case.

If the task were to classify an image as an indoor scene, or black & white, or containing people, then these would be non-mutually-exclusive classes (e.g. a picture could be indoor AND contain people), and building three separate logistic classifiers would be more appropriate.
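A sketch of how that non-mutually-exclusive case could look, with one independent binary classifier per label (the data is synthetic and the tag names are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression

# Synthetic multi-label data: each sample may carry any subset of 3 tags
X, Y = make_multilabel_classification(n_samples=200, n_features=8,
                                      n_classes=3, random_state=0)
tags = ['indoor', 'black_white', 'contains_people']   # illustrative names

# One independent binary logistic classifier per (non-exclusive) label
classifiers = {tag: LogisticRegression(solver='liblinear').fit(X, Y[:, i])
               for i, tag in enumerate(tags)}

# A sample can now receive several tags at once
sample = X[:1]
predicted_tags = [t for t in tags if classifiers[t].predict(sample)[0] == 1]
print(predicted_tags)
```

In contrast, a softmax classifier would force exactly one of the three tags per sample.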

You can skip the following paragraphs if you are not interested in what the Softmax and cross-entropy functions are, and prefer to jump directly to the implementation.

### Softmax function

The Softmax function is a generalisation of the sigmoid function. It calculates the probability distribution of the event over K different classes; the predicted probability for the i-th class given a sample input vector x is:

$$P(y = i \mid x) = \frac{e^{w_i^T x}}{\sum_{k=1}^{K} e^{w_k^T x}}$$

where K is the number of classes of the target variable, x is the input vector and w_i are the weights for class i.

It returns the probabilities of each class, and the target class will have the highest probability.

Here is an example:

```python
import numpy as np

K = 7                                      # seven classes
x = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]   # input X
# assuming - for simplicity - that the weights are all 1
softmax = lambda x: np.exp(x) / np.sum(np.exp(x))
probabilities = softmax(x)
probabilities
```

```
array([ 0.02364054,  0.06426166,  0.1746813 ,  0.474833  ,  0.02364054,
        0.06426166,  0.1746813 ])
```

```python
print("The target class with the highest probability is the index={0} and has prob={1}"
      .format(probabilities.argmax(), max(probabilities)))
```

```
The target class with the highest probability is the index=3 and has prob=0.4748329997443803
```

### The cross-entropy loss function

When we train a model, we need a way to measure the difference between the predicted probabilities ŷ and the ground-truth probabilities y, so that we tune the parameters to minimise this difference. Cross entropy is a reasonable choice for this measure.

**Entropy** originates from Shannon's information theory.

The theory says that when you represent information using, for example, bits and you need to transmit them, the cost is proportional to the number of bits, so you want to use fewer bits for the most common pieces of information.

For example, I could say that a Star is 0, a Galaxy is 1 and a Quasar (the least common object) is 10, and so on.

By fully exploiting the known (and correct) distribution over sky objects in this way, we reduce the average number of bits needed.

This optimal number of bits is known as entropy.
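Using the class counts shown earlier (GALAXY 4998, STAR 4152, QSO 850), the entropy of our class distribution can be computed with a few lines; this is a sketch of the standard formula, not something the original post computes:

```python
import numpy as np

counts = np.array([4998, 4152, 850])   # GALAXY, STAR, QSO counts from above
p = counts / counts.sum()              # empirical class distribution

# Entropy in bits: the optimal average code length for this distribution
entropy = -np.sum(p * np.log2(p))
print(entropy)                          # about 1.33 bits
```

Note that this is less than log2(3) ≈ 1.585 bits, the cost of encoding three equally likely classes: the skew towards galaxies and stars is exactly what a good code exploits.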

In contrast, cross entropy is the number of bits needed if we encode symbols from the true distribution y using the wrong distribution ŷ (e.g. the distribution trained by the model) and can be used to define the loss function in machine learning and optimisation. Minimising cross entropy is equivalent to minimising the negative log likelihood of our data, which is a direct measure of the predictive power of our model.

Cross entropy is always larger than entropy; encoding symbols according to the approximate distribution ŷ will always need more bits.

In practice, a multi-class logistic classifier can use the cross-entropy function to measure the distance between the probabilities calculated by the softmax function and the target: for the right target class, the distance value will be smaller.
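A minimal numeric sketch of that behaviour, with values chosen purely for illustration: the loss is small when the softmax output puts most of its mass on the true class, and large when it does not.

```python
import numpy as np

# One-hot ground truth: the true class is index 1 (say, STAR)
y_true = np.array([0.0, 1.0, 0.0])

# Two candidate softmax outputs
good_pred = np.array([0.1, 0.8, 0.1])   # confident and correct
bad_pred  = np.array([0.6, 0.3, 0.1])   # puts most mass on the wrong class

# Cross entropy between the true and the predicted distribution
cross_entropy = lambda y, y_hat: -np.sum(y * np.log(y_hat))

print(cross_entropy(y_true, good_pred))  # -log(0.8) ≈ 0.223
print(cross_entropy(y_true, bad_pred))   # -log(0.3) ≈ 1.204
```

Minimising this quantity over the training set is exactly the negative log likelihood minimisation mentioned above.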

### Sklearn logistic regression using the cross entropy measure

In the multiclass case, the training algorithm uses the cross-entropy loss if the ‘*multi_class*’ option is set to ‘*multinomial*’.

For multiclass problems, only ‘*newton-cg*’, ‘*sag*’, ‘*saga*’ and ‘*lbfgs*’ handle the multinomial loss; ‘*liblinear*’ is limited to one-versus-rest schemes.

```python
# Create cross-entropy-loss logistic regression object
xe = LogisticRegression(multi_class='multinomial', solver='newton-cg')

# Train model
modelXE = xe.fit(X_train_scaled, y_train)
preds = modelXE.predict(X_test_scaled)
modelXE.score(X_test_scaled, y_test)
```

```
0.91133333333333333
```

A small improvement.

More importantly, it generalises to situations where the number of classes is very large and one-vs-all becomes intractable.

Later we will see different, non-linear approaches to the multi-class problems, such as decision trees and random forests.

Note: this post is part of a series about Machine Learning with Python.