Linear Discriminant Analysis – LDA

Finally, let’s talk about multiclass classification. We have already seen a couple of examples: multi-class logistic regression and multinomial Naive Bayes.
In particular, we’re going to briefly cover perhaps the simplest (but useful) approach to multiclass classification, the one-versus-all approach, and then a more sophisticated one: LDA.

For that we are going to use this example: a prediction model for Weight Lifting based on sensors to predict how well an exercise is performed.

You can follow along the notebook on GitHub, where the dataset is also uploaded (and the tables are much better displayed than on wordpress…)

Project goal

In this project we will use data from accelerometers on the belt, forearm, arm and dumbbell of 6 participants.
They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

More information is available from this website  (see the section on the Weight Lifting Exercise Dataset).

The goal of the project is to predict the manner in which they did the exercise. This is the “classe” variable in the data set: A – E

  • exactly according to the specification (Class A)
  • throwing the elbows to the front (Class B)
  • lifting the dumbbell only halfway (Class C)
  • lowering the dumbbell only halfway (Class D)
  • throwing the hips to the front (Class E).

One vs all

The one-versus-all model is extremely simple and is exactly what the name suggests: you train one binary classifier per category.
For example, you train one classifier where the positive category is class A and the negative category is everything else.

What you do is learn a classifier that separates most of the A-class points from the other classes: one which outputs +1 if the input x is more likely to be class A than anything else.

y = 1 … number of classes (five in our example):

  • class A -> y = 1
  • class B -> y = 2
  • class C -> y = 3
  • class D -> y = 4
  • class E -> y = 5

In the same way we estimate the probability that an input x is a B-class, a C-class, and so on, for all five classes.

We learn a model for each one of these cases, i.e. one-versus-all for each of the classes.
At the end we will have five classifiers, one per class i, each predicting the probability that y = i:

h_{\theta}^{(i)}(x) = P(y=i|x)

where i = 1, …, 5

That alone doesn’t tell us how to do multiclass classification in general.
We have trained these one-versus-all models: what do we output?

As a prediction, we just say: whatever class has the highest probability wins. In other words, if the A-versus-everything-else classifier gives the highest probability, then the point is classified as A.

To make a prediction on a new input x, pick the class i that maximises the predicted probability: \hat{y} = \underset{i}{\arg\max}\; h_{\theta }^{(i)}(x).

We start with the maximum probability set to zero. Then we go class by class and ask: is the probability, according to the model for the A-class, then the model for the B-class, and so on, higher than the maximum probability so far?
If it is, that class is currently winning, so we update the maximum probability to the probability given by that class's model and set y to that class.
As we iterate over the classes, the maximum wins.
And with that simple algorithm, we now have a multi-class classification system by using a number of these binary classifiers.
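Here is a minimal sketch of this scheme, assuming scikit-learn and using LogisticRegression as the underlying binary classifier (an assumption for illustration; it is not the model used later in the project):

import numpy as np
from sklearn.linear_model import LogisticRegression

def trainOneVsAll(X, y, classes):
    # one binary classifier per class: class c versus everything else
    models = {}
    for c in classes:
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, (y == c).astype(int))  # 1 for class c, 0 otherwise
        models[c] = clf
    return models

def predictOneVsAll(models, X):
    # column i holds P(y = class i | x) according to classifier i
    probas = np.column_stack([m.predict_proba(X)[:, 1] for m in models.values()])
    classNames = list(models.keys())
    # the class with the highest probability wins
    return [classNames[i] for i in probas.argmax(axis=1)]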

The first thing to note is that such an approach works for any number of classes, not only five, but it doesn't scale very well: training one classifier per class soon becomes unmanageable.

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) can be used as a technique for feature extraction to increase the computational efficiency and reduce the degree of over-fitting due to the curse of dimensionality in nonregularized models.

The general concept behind LDA is very similar to PCA: whereas PCA attempts to find the orthogonal component axes of maximum variance in a dataset, the goal in LDA is to find the feature subspace that optimizes class separability.

Both LDA and PCA are linear transformation techniques that can be used to reduce the number of dimensions in a dataset; the difference is that PCA is an unsupervised algorithm, whereas LDA is supervised.

The assumptions that we make when we are using LDA are that the features are normally distributed and independent of each other. However, even if we violate those assumptions to a certain extent, LDA may still work reasonably well.

LDA in Scikit learn

This is the intro of the LDA classifier on Scikit Learn:

A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule.

This classifier is attractive because it has closed-form solutions that can be easily computed, is inherently multiclass, has proven to work well in practice, and has no hyperparameters to tune.

Let’s see – through the weight lifting example – how LDA works.

Read the data

The dataset is available online:

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H.: Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

import pandas as pd # Start by importing the data
X = pd.read_csv('../datasets/pml-training.csv', low_memory=False) 
X.shape
(19622, 160)

The data set has 19622 observations (rows) of 160 variables (columns).

Let’s have a closer look at these features:

X.columns
Index(['Unnamed: 0', 'user_name', 'raw_timestamp_part_1',
'raw_timestamp_part_2', 'cvtd_timestamp', 'new_window', 'num_window',
'roll_belt', 'pitch_belt', 'yaw_belt',
...
'gyros_forearm_x', 'gyros_forearm_y', 'gyros_forearm_z',
'accel_forearm_x', 'accel_forearm_y', 'accel_forearm_z',
'magnet_forearm_x', 'magnet_forearm_y', 'magnet_forearm_z', 'classe'],
dtype='object', length=160)

This is not really necessary, but I like to rename the first (unnamed) column to a meaningful name:

X.rename(columns = {X.columns[0] : 'ID'}, inplace = True)

Explore the data

X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19622 entries, 0 to 19621
Columns: 160 entries, Unnamed: 0 to classe
dtypes: float64(94), int64(29), object(37)
memory usage: 24.0+ MB
 

This is how the five classes are distributed:

X.classe.hist();

[Histogram: distribution of the five classe categories]

Classes are almost uniformly distributed, with a small prevalence for A-class.
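The same distribution can also be read off numerically (a quick check on the classe column, not in the original notebook):

X.classe.value_counts(normalize=True)  # relative frequency of each class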

Preprocess the data

y = X.classe.copy() # copy “y” column (the target) values out

X.drop(['classe'], axis=1, inplace=True) # then, drop y column

Feature extraction

Training a model on 160 features is computationally expensive, so we aim to reduce their number.

Remove user- and time-dependent features

First of all, it’s clear from the summary above that the first seven variables (row ID, user name, timestamps and window indicators) are of no use for prediction, since they are user- and time-dependent; so we remove them:

columnsToDelete = ['ID','user_name', 'raw_timestamp_part_1', 'raw_timestamp_part_2','cvtd_timestamp', 'new_window', 'num_window']

X.drop(columnsToDelete, axis = 1, inplace=True)

Remove features with missing values

X.isnull().values.any() 
True

From the summary above, some variables have missing values (NaN).
We could substitute them with the average of that variable, but there are many missing values and imputation does not improve the model accuracy (I tried).
So, instead of a less-accurate imputation of missing data, I just remove all predictors with NaN values. Hard but fair …
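Before dropping them, a quick look at how widespread the missing values are (a sketch, not part of the original notebook):

# fraction of missing values per column, highest first
nanFraction = X.isnull().mean().sort_values(ascending=False)
nanFraction[nanFraction > 0].head()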

X.dropna(axis=1, inplace=True) 

X.shape
(19622, 52)

We have now reduced the dataset to 52 features, less than one third of the original 160.

 

Remove features with near zero variance

There is something else we can do: drop features whose variance is near zero, meaning they don't provide enough information for predictions.
We will use the sklearn class VarianceThreshold to find out which features have a large enough variance.

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold()
selector.fit(X)
VarianceThreshold(threshold=0.0)

Find all features with variance larger than 0.5 (you can tune it):

mask = selector.variances_ > 0.5 # arbitrary value 0.5

Copy all “high variance” features into a new dataframe X_hv:

X_hv = X.loc[:, mask]

X_hv.shape
(19622, 46)
totalObs = X_hv.shape[0] # this variable will come in handy soon
totalObs
19622

Another 6 features removed.
A final reduction could be to remove features that are highly correlated with each other; I skip this step here.
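For completeness, a possible sketch of that correlation-based pruning with pandas (the 0.9 threshold is an arbitrary assumption and the step is not applied in this project):

import numpy as np

corr = X_hv.corr().abs()  # absolute pairwise correlations
upperTriangle = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
tooCorrelated = [c for c in upperTriangle.columns if (upperTriangle[c] > 0.9).any()]
# X_hv.drop(tooCorrelated, axis=1) would remove one of each highly correlated pair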

Normalisation

LDA (and in general other classification models) works better if the data set is normalised.

from sklearn import preprocessing

normaliser = preprocessing.Normalizer()

  # this will keep the column names
xNormalised = pd.DataFrame(normaliser.fit_transform(X_hv), columns = X_hv.columns)

Split data into training and testing

For each model I will measure the out-of-sample error, that is, the error rate obtained on a new data set.
The purpose of using a data set different from the training one is model checking: I want to validate how well the model was trained.
I will measure the out-of-sample error by looking at the accuracy on the test set.

from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X_hv, y, test_size=0.2, random_state=7)

Baseline prediction

A baseline is needed to see whether any trained model is really useful: can it beat the simplest baseline?
Our baseline will be to pick the “classe” category at random, according to its frequency in the training set (classe A about 28% of the time, classe B about 19%, and so on …).
The expected accuracy of such frequency-matched guessing is the sum of the squared class frequencies, which for five nearly balanced classes comes out at roughly 0.2.
Any model with a higher accuracy than the baseline is a better model.
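As a quick worked check, using the class frequencies listed just below:

P(correct) = \sum_{i=1}^{5} p_i^{2} \approx 0.284^2 + 0.193^2 + 0.184^2 + 0.174^2 + 0.164^2 \approx 0.21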

y.value_counts() # frequency of each category
A 5580
B 3797
E 3607
C 3422
D 3216
Name: classe, dtype: int64

Let’s compute the baseline by hard-coding the class frequencies:

def baselinePredict(aNumber):
  # aNumber: expects a number between 0 and 1
  # totalObs is a global variable
  if (aNumber < 5580 / totalObs):                           # classe A
    return 'A'
  elif (aNumber < (5580 + 3797) / totalObs):                # classe B
    return 'B'
  elif (aNumber < (5580 + 3797 + 3422) / totalObs):         # classe C
    return 'C'
  elif (aNumber < (5580 + 3797 + 3422 + 3216) / totalObs):  # classe D
    return 'D'
  else:                                                     # classe E
    return 'E'

Let’s try a test:

import random

test = random.random() # a number between 0 and 1

baselinePredict(test)
'C'
yTotals = ytest.count()
correctPredictions = 0

for i in range(yTotals):
  rndPrediction = baselinePredict(random.random())

  if (rndPrediction == ytest.iloc[i]):
    correctPredictions += 1

print("Percentage of correct predictions for all test dataset: ", correctPredictions / yTotals)
Percentage of correct predictions for all test dataset: 0.202

This is our baseline. Since it is randomly generated it varies, usually between roughly 0.2 and 0.22.
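The same baseline can also be obtained without the hand-coded function, using scikit-learn's DummyClassifier (a sketch; the 'stratified' strategy guesses classes according to their training frequencies):

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='stratified', random_state=7)
dummy.fit(Xtrain, ytrain)
dummy.score(Xtest, ytest)  # around 0.2, like the hand-coded baseline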

Now we have everything to start training our models. Let’s model!

Multi-class classification: One vs All

We now train a one-versus-all model, using a linear SVM as the base binary classifier.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
import time # to measure the time needed for training

modelOneVsAll = OneVsRestClassifier(SVC(kernel='linear'))

print ("Training the One versus All model ...")

s = time.time()

modelOneVsAll.fit(Xtrain, ytrain)

print("Done! Completed in: ", time.time() - s, "seconds")

Training the One versus All model ...
Done! Completed in: 84.21 seconds
modelOneVsAll.score(Xtest, ytest)
0.624

The model scores a better accuracy than the baseline, but it took a long time to train.
Using multinomial logistic regression gives similar accuracy results.
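For reference, a sketch of that multinomial logistic regression on the same split (the solver and max_iter values are assumptions, not taken from the original notebook):

from sklearn.linear_model import LogisticRegression

modelMultinomial = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
modelMultinomial.fit(Xtrain, ytrain)
modelMultinomial.score(Xtest, ytest)  # accuracy in the same range as the one-versus-all SVC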

Reduce the features even more

One possibility would be to apply the one-versus-all model to a much smaller subset of features.
To do that, we first need to find out which are the most important features.
As we know, a good way to get them is to use a tree-based model; let's use one as a shortcut to find the most important features.

Train a tree-based classifier

from sklearn.ensemble import ExtraTreesClassifier
tree = ExtraTreesClassifier()

tree.fit(Xtrain, ytrain)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
 
feat_importances = pd.Series(tree.feature_importances_, index = Xtest.columns)
feat_importances.nlargest(4)
pitch_forearm     0.056728
magnet_dumbbell_z 0.044789
yaw_belt 0.033881
roll_forearm 0.033668
dtype: float64

Re-train the oneVsAll model with the top four features

Now that we have the top 4 features, we can train the one-versus-all model again using only these:

top4features = ['pitch_forearm', 'yaw_belt', 'magnet_dumbbell_z', 'roll_forearm']

Xtrain_reduced = Xtrain[top4features].copy()

Xtrain_reduced.shape
(15697, 4)
Xtest_reduced = Xtest[top4features].copy()

print ("Training the OneVsAll model on 4 features ...")
s = time.time()

modelOneVsAll.fit(Xtrain_reduced, ytrain)

print("Done! Completed in: ", time.time() - s, "seconds")
Training the OneVsAll model on 4 features ...
Done! Completed in: 14.98 seconds
modelOneVsAll.score(Xtest_reduced, ytest)

0.31

Well, the training time is greatly reduced, but the score is now very low, barely better than the baseline …
Not so useful. What can we do?

Reduce the features using LDA

LDA comes to the rescue, in the form of a data reduction similar to PCA.
An LDA model can be fitted to the dataset and used to transform it into a smaller set of components, between 1 and the number of classes minus one.
Since we have 5 classes, we can reduce the data to at most 4 components (this is why we chose the top 4 features earlier).

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis # LDA

reduceLDA = LinearDiscriminantAnalysis(n_components = 4)
reduceLDA.fit(Xtrain, ytrain)
LinearDiscriminantAnalysis(n_components=4, priors=None, shrinkage=None, solver='svd', store_covariance=False, tol=0.0001)
XtrainLDAred = reduceLDA.transform(Xtrain)

XtestLDAred = reduceLDA.transform(Xtest)
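A quick sanity check on the transformed data (a sketch; the shape follows from the 15697 training rows and the 4 components, and explained_variance_ratio_ is available with the default svd solver):

XtrainLDAred.shape  # (15697, 4): same rows, only 4 discriminant components
reduceLDA.explained_variance_ratio_  # share of between-class variance captured by each component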

And now we can again apply the OneVsAll model but this time to the reduced dataset:

print ("Training the OneVsAll model on reduced 4-components dataset ...")
s = time.time()

modelOneVsAll.fit(XtrainLDAred, ytrain)

print("Done! Completed in: ", time.time() - s, "seconds")
Training the OneVsAll model on reduced 4-components dataset ...
Done! Completed in: 34.41 seconds
modelOneVsAll.score(XtestLDAred, ytest)
0.687

The score is now similar to the one obtained on the entire dataset, but the training is much faster.
This shows how the LDA data reduction can help to get good results and training times by applying the model to a reduced dataset.

LDA Model

But LDA can do more: it can also be used as a classifier directly!
We will train it (and time it) on the original dataset, not the reduced one; the LDA classifier handles the projection internally.


modelLDA = LinearDiscriminantAnalysis()
print ("Training the LDA model ...")
s = time.time()

modelLDA.fit(Xtrain, ytrain)

print("Done! Completed in: ", time.time() - s, "seconds")
Training the LDA model ...
Done! Completed in: 0.39 seconds
print("LDA Model accuracy:", modelLDA.score(Xtest,ytest))
LDA Model accuracy: 0.694

Great! The LDA model is much better than the baseline and similar to the one-versus-all results, and it was very fast to train.

Note: 70% of correct predictions is not exactly perfect, but that was not the goal of this exercise: you can get almost perfect results using decision trees or random forests.

tree.score(Xtest, ytest)
0.989
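As a reference point, a random forest could be trained in the same way (a sketch; n_estimators=100 and the random_state are assumptions):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=7)
forest.fit(Xtrain, ytrain)
forest.score(Xtest, ytest)  # typically close to the ExtraTrees score above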
