Support Vector Machines

Support Vector Machines (SVMs) are a set of supervised learning algorithms that can be used for classification, regression and outlier detection.
Their optimisation objective is to maximise the margin, i.e. the distance between the separating decision boundary (hyperplane) and the training samples closest to it. Larger margins generally translate into better generalisation and less overfitting.
These algorithms are also called Maximal Margin Classifiers.

In the case of support-vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers); the training points that lie closest to the separating hyperplane are the so-called support vectors.

We want to know whether we can separate such points with a (p-1)-dimensional hyperplane. For p=2, the hyperplane is a line. There are many hyperplanes that might classify the data.

The figure shows a p=2 example:

[Figure: SVM separating hyperplanes H1, H2 and H3. Image from Wikimedia Commons.]

H3 does not separate the classes.
H1 does, but only with a small margin.
H2 separates them with the maximal margin.

H2 represents the largest separation, or margin, between the two classes. By choosing – among all separating hyperplanes – the one for which the gap or margin between the two classes is largest, we aim to lower the generalisation error and avoid overfitting.
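
To make this concrete: in the standard hard-margin formulation, with training points x_i and class labels y_i ∈ {-1, +1}, the maximal-margin hyperplane w·x + b = 0 is the solution of

    \min_{\mathbf{w},\,b} \ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
    \quad \text{subject to} \quad
    y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1, \qquad i = 1,\dots,N

The margin equals 2/‖w‖, so minimising ‖w‖² is the same as maximising the margin.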

This is a convex quadratic program and can be solved efficiently, for example with the SVM module of scikit-learn.
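
As a minimal, purely illustrative sketch – not the examples from the notebook – fitting a linear SVM with scikit-learn could look like this (the iris data and the value of C are arbitrary choices):

    # Minimal sketch: a linear SVM with scikit-learn (illustrative only).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # SVMs are sensitive to feature scales, so standardise first.
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    # C controls the trade-off between a wide margin and misclassified points.
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X_train, y_train)

    print("Test accuracy:", clf.score(X_test, y_test))
    print("Support vectors per class:", clf.n_support_)

The support vectors reported at the end are exactly the training samples that sit on or inside the margin and therefore define the decision boundary.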

Let’s see a couple of examples of classifying data via SVM.

You can also follow the code along in my notebook on GitHub.

Continue reading “Support Vector Machines”

K-Nearest Neighbours (KNN)

K-Nearest Neighbours (KNN in short) is a supervised classification algorithm.
The idea behind nearest-neighbours classification is that looking just at the neighbours that are close to a new data point – across all dimensions – is a good predictor of its properties.

Most of the techniques we have encountered look at the data set as a whole in order to learn patterns in the data. Nearest neighbours, on the other hand, deliberately neglects a lot of information and lets the prediction for each new point depend only on the handful of points closest to it.

The model

Nearest neighbours is one of the simplest predictive models there is. It makes no mathematical assumptions about the shape of the decision boundary and it doesn’t require any sort of heavy machinery. The only things it requires are:

  • Some notion of distance
  • An assumption that points that are close to one another are similar

Here is how the K-nearest neighbours (KNN) classifier works:

In the general situation, we have some training data points and we have a corresponding set of labels. K is a given positive integer (a parameter) which represents the number of neighbours to look at.

When we attempt to classify a new, never before seen, data point x0, the classifier finds the K labelled points in the training data that are nearest to x0 and lets them vote on the new test observation’s output (basically, KNN estimates the conditional probability of each class as the fraction of the K neighbours that belong to it and, applying the Bayes rule, classifies x0 to the class with the largest probability).
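
A minimal from-scratch sketch of that voting rule, purely for illustration (the tiny two-cluster data below is made up; the post’s real example uses the wheat dataset):

    # Minimal sketch of the KNN voting rule (illustrative only).
    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x0, k=5):
        """Classify x0 by majority vote among its k nearest training points."""
        distances = np.linalg.norm(X_train - x0, axis=1)  # Euclidean distances
        nearest = np.argsort(distances)[:k]               # indices of the k closest points
        votes = Counter(y_train[nearest])                 # count the neighbours' labels
        return votes.most_common(1)[0][0]                 # class with the most votes

    # Tiny made-up example: two well-separated clusters in 2D.
    X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
    y_train = np.array([0, 0, 0, 1, 1, 1])
    print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # -> 0
    print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3))  # -> 1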

The choice of K has a drastic effect on the KNN classifier obtained.

With K = 1, the KNN training error rate is 0, but the test error rate may be quite high. In general, as we use more flexible classification methods, the training error rate will decline but the test error rate may not.

KNN is a completely non-parametric approach; we can therefore expect it to dominate classifiers such as logistic regression when the decision boundary is highly non-linear. On the other hand, KNN does not tell us which predictors are important: we don’t get a table of coefficients, for example.

Let’s see an example, using the wheat dataset and the scikit-learn library.
You can find the code and the data sets on my GitHub repository.

Continue reading “K-Nearest Neighbours (KNN)”

PCA: Principal Component Analysis

Principal component analysis (PCA) is an unsupervised linear transformation technique that is widely used across different fields, most prominently for dimensionality reduction, noise reduction and the identification of data patterns based on the correlation between features.

The basic idea is that we might not need every predictor: some predictors are correlated, and/or a weighted combination of predictors may work better than the original ones. We can then pick the combination that captures the most information possible.
Besides reducing the number of predictors, a further benefit is a reduction of the noise (thanks to the averaging involved).

In a nutshell, PCA aims to find the directions of maximum variance in high-dimensional data and projects it onto a new subspace with equal or fewer dimensions than the original one.

This is why PCA is useful for dimensionality reduction: it allows us to map a sample vector onto a new smaller feature subspace that has fewer dimensions than the original feature space.

The PCA algorithm for dimensionality reduction (from n dimensions to k dimensions) consists of a few simple steps, sketched in code right after the list:

  • Pre-processing: standardise the n-dimensional dataset by feature scaling and mean normalisation (every feature has mean zero) on the training set.
  • Compute the covariance matrix.
  • Decompose the covariance matrix into its eigenvectors and eigenvalues.
  • Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace.
  • Construct a projection matrix from these “top” k eigenvectors.
  • Transform the n-dimensional input dataset using the projection matrix to obtain the new k-dimensional feature subspace.
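
Here is a minimal sketch of these steps with NumPy, on a small random matrix that simply stands in for real data (the shapes and k = 2 are arbitrary choices):

    # Minimal sketch of the PCA steps above with NumPy (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))    # toy dataset: 100 samples, n = 5 features
    k = 2                            # dimensionality of the new feature subspace

    # 1. Standardise: zero mean and unit variance per feature.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix of the features.
    cov = np.cov(X_std, rowvar=False)

    # 3. Eigen-decomposition (eigh, since the covariance matrix is symmetric).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Sort the eigenvectors by decreasing eigenvalue and keep the top k.
    order = np.argsort(eigenvalues)[::-1]
    W = eigenvectors[:, order[:k]]   # projection matrix, shape (n, k)

    # 5. Project the data onto the new k-dimensional subspace.
    X_pca = X_std @ W
    print(X_pca.shape)               # (100, 2)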

Variance and eigenvalues

We will tackle the first four steps of a principal component analysis through a simple example: standardising the data, constructing the covariance matrix, obtaining the eigenvalues and eigenvectors of the covariance matrix and finally sorting the eigenvalues in decreasing order to rank the eigenvectors.

Continue reading “PCA: Principal Component Analysis”

Random Forest

We know that decision trees have a tendency to overfit. One way of avoiding this is to build and combine multiple decision trees, a technique called random forests.
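
As a minimal, purely illustrative teaser – using scikit-learn’s built-in breast-cancer data rather than the dataset of the post’s own example – a random forest can be fitted like this:

    # Minimal sketch: a random forest with scikit-learn (illustrative only).
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # n_estimators is the number of trees; each tree is grown on a bootstrap
    # sample of the data and considers a random subset of features at each split.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    print("Cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())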

But before talking about forests, a word about Bootstrap aggregation or Bagging and one about Boosting. If you want to see instead the random forest’s coding example, jump to the “Human Activity Prediction” chapter. Continue reading “Random Forest”

Decision Trees

Decision trees are supervised, probabilistic machine learning classifiers that are often used as decision support tools. Like any other classifier, they are capable of predicting the label of a sample, and the way they do this is by examining the probabilistic outcomes associated with the sample’s features.

Decision trees are among the oldest and most used machine learning algorithms, perhaps even pre-dating machine learning itself. They have remained very popular for decades, partly because following through a sequence of cause-and-effect decisions comes very naturally.
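
As a minimal, purely illustrative sketch (the iris data and max_depth = 3 are arbitrary choices), a small tree can be fitted and its decisions printed like this:

    # Minimal sketch: a decision tree with scikit-learn (illustrative only).
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    X, y = iris.data, iris.target

    # max_depth limits how many sequential decisions the tree may take,
    # which is one simple way to curb overfitting.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X, y)

    # Print the learned sequence of if/else decisions on the features.
    print(export_text(tree, feature_names=iris.feature_names))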

Continue reading “Decision Trees”

Classification metrics and Naive Bayes

We have seen how classification via logistic regression works and here we will look into a special classifier called Naive Bayes and the metrics used in classification problems, all using a text classification example.

We build an analytics model using text as our data, specifically trying to understand the sentiment of tweets about the company Apple. This is a special kind of classification problem, often called Sentiment Analysis.

The challenge is to see if we can correctly classify tweets as being negative, positive, or neutral about Apple.
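
As a minimal, purely illustrative sketch of the general approach – a bag-of-words representation fed to a multinomial Naive Bayes classifier – on a handful of made-up tweets (not the Apple dataset used in the post):

    # Minimal sketch: text classification with Naive Bayes (illustrative only;
    # the tweets below are made up, not the Apple dataset used in the post).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import confusion_matrix
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    tweets = ["I love my new iPhone", "Apple service was terrible",
              "Just bought an Apple laptop", "Worst update ever, so slow",
              "Great camera, very happy", "Battery life is disappointing"]
    labels = ["positive", "negative", "neutral", "negative", "positive", "negative"]

    # Bag-of-words counts fed into a multinomial Naive Bayes classifier.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(tweets, labels)

    print(model.predict(["so happy with this phone", "this update is terrible"]))
    # Confusion matrix on the training tweets, just to show the metric's shape.
    print(confusion_matrix(labels, model.predict(tweets)))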

The code is available as a Python notebook on GitHub. Continue reading “Classification metrics and Naive Bayes”