Support Vector Machines

Support Vector Machines (SVMs) are a set of supervised learning algorithms that can be used for classification, regression and outlier detection.
Their optimisation objective is to maximise the margin, i.e. the distance between the separating decision boundary (hyperplane) and the training samples closest to that boundary. Larger margins tend to mean less overfitting.
These algorithms are therefore also called Maximal Margin Classifiers.

In the case of support-vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers); the training points that lie closest to the decision boundary are the so-called support vectors.

We want to know whether we can separate such points with a (p-1)-dimensional hyperplane. For p=2, the hyperplane is a line. There are many hyperplanes that might classify the data.

The figure shows a p=2 example:

Image from Wikimedia Commons

H3 does not separate the classes.
H1 does, but only with a small margin.
H2 separates them with the maximal margin.

H2 represents the largest separation, or margin, between the two classes. By choosing the hyperplane – among all separating hyperplanes – so that the gap or margin between the two classes is the biggest, we intend to lower the generalisation error and avoid overfitting.
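Formally – a standard formulation, added here for reference – for linearly separable classes the maximal-margin hyperplane solves the following problem:

```latex
% Maximal-margin (hard-margin) SVM: the margin width is 2/||w||,
% so minimising ||w|| maximises the margin.
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \left( w \cdot x_i + b \right) \ge 1, \qquad i = 1, \dots, n
```

Here each training sample $x_i$ has label $y_i \in \{-1, +1\}$, and the constraints force every point to lie on the correct side of the margin.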

This is a convex quadratic program and can be solved efficiently, for example using the SVM module from scikit-learn.

Let’s see a couple of examples of classifying data via SVM.

You can also follow the code along in my notebook on GitHub.

scikit-learn has many classes for SVM usage, depending on the purpose. The one we’ll focus on is the Support Vector Classifier, SVC.

OCR via Support Vector Classifier (SVC)

In 1982 the first computer-driven OCR (Optical Character Recognition) machine was installed by the United States Postal Service (USPS) in Los Angeles, and by the end of 1984 over 250 OCR machines were installed in 118 major mail processing centres across the country.

Let’s see if it’s possible to train a support vector classifier in a few seconds, and whether its classification accuracy is similar to or better than the advertised USPS stats.

Reading and pre-processing the data

We start by reading the dataset, which comes from the UCI Machine Learning Repository and consists of bitmaps of handwritten digits from a preprinted form.

import pandas as pd # Library to create and handle Dataframes

# The Dataset comes from:

# Load up the data.
with open('Datasets/optdigits.tes', 'r') as f: testing  = pd.read_csv(f)
with open('Datasets/optdigits.tra', 'r') as f: training = pd.read_csv(f)

The number of samples in the training and testing sets can differ, but the number of features must remain the same!

n_features = testing.shape[1]

X_test  = testing.iloc[:,:n_features-1]
X_train = training.iloc[:,:n_features-1]
y_test  = testing.iloc[:,n_features-1:].values.ravel()
y_train = training.iloc[:,n_features-1:].values.ravel()

print (n_features)

Let’s have a look at what these handwritten digits look like:

import matplotlib.pyplot as plt # library to plot

# The 'targets' or labels are stored in y. The 'samples' or data is stored in X
print ("Peeking your data...")
fig = plt.figure()

cnt = 0
for col in range(5):
    for row in range(10):
        plt.subplot(5, 10, cnt + 1)
        plt.imshow(X_train.iloc[cnt, :].values.reshape(8, 8), cmap=plt.cm.gray_r, interpolation='nearest')
        cnt += 1

Peeking your data...


Train the SVM Classifier

Now we are ready to train the Support Vector Classifier, using the scikit-learn library.
We leave all parameters at their defaults, setting only the kernel to be linear.
More on the kernels later.

from sklearn import svm # Support Vector Machine library

# Create and train an SVM classifier.
print ("Training SVM Classifier...")
svc = svm.SVC(kernel='linear'), y_train)

Training SVM Classifier...

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)


Print the predicted digit and the actual label for one specific example: the 1000th digit in the test set.

# Print out the TRUE value of the 1000th digit in the test set
# By TRUE value, we mean, the actual provided label for that sample
true_1000th_test_value = y_test[999]
print ("1000th test label is: ", true_1000th_test_value)
1000th test label is: 1
# Predict the value of the 1000th digit in the test set.
# Was the model's prediction correct?
guess_1000th_test_value = svc.predict(X_test[999:1000])

print ("1000th test prediction is: ", guess_1000th_test_value)

1000th test prediction is: [1]

The model’s prediction was correct.
We can display that image, so we can visually check if it was a hard image or an easy image:

# Use IMSHOW to display the 1000th test image
plt.imshow(X_test.iloc[999, :].values.reshape(8, 8), cmap=plt.cm.gray_r, interpolation='nearest');


Visual confirmation of accuracy

Here we can print more digits, with an indication of the predicted label (in red if it was wrong):

# Visual Confirmation of accuracy

fig = plt.figure()

# Make some guesses
y_guess = svc.predict(X_test)

num_rows = 10
num_cols = 5

index = 0
for col in range(num_cols):
    for row in range(num_rows):
        plt.subplot(num_cols, num_rows, index + 1)

        # 8x8 is the size of the image, 64 pixels
        plt.imshow(X_test.iloc[index, :].values.reshape(8, 8), cmap=plt.cm.gray_r, interpolation='nearest')

        # Green = guessed right
        # Red = fail!
        fontcolor = 'g' if y_test[index] == y_guess[index] else 'r'
        plt.title('Label: %i' % y_guess[index], fontsize=7, color=fontcolor)
        index += 1



We can see that – on this sample of 50 handwritten digits – 4 of them are wrong, that’s 8%.
And here we calculate the score on the entire testing dataset:

# Calculate the score of the SVC against the testing data
print ("Scoring SVM Classifier...")
score = svc.score(X_test, y_test)
print ("Score: ", score)
Scoring SVM Classifier...
Score: 0.9610244988864143

Not bad, the model was correct more than 96% of the time!

Non-linearities and the Kernel trick

Sometimes the data are separable, but noisy. This can lead to a poor solution for the maximal-margin classifier.

One way to overcome this is to act on the regularisation parameter C, which effectively makes the margin wider or narrower.
C is 1 by default, and it’s a reasonable default choice. If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more regularisation.
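As a quick illustration – a hedged sketch on toy data, not part of the USPS example – a smaller C regularises more, widening the margin and leaving more points as support vectors:

```python
# Toy illustration: how C affects the number of support vectors.
# Smaller C -> more regularisation -> wider margin -> more support vectors.
from sklearn import svm
from sklearn.datasets import make_blobs

# Two noisy, partially overlapping blobs
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

n_sv = {}
for C in (0.01, 1.0, 100.0):
    clf = svm.SVC(kernel='linear', C=C).fit(X, y)
    n_sv[C] = len(clf.support_)
    print(f"C={C}: {n_sv[C]} support vectors")
```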

Sometimes a linear boundary simply won’t work, no matter the value of C.

In this case the best approach would be to enlarge the feature space by including transformations, hence going from a p-dimensional space to a higher, M > p dimensional space via a mapping function. We then train a linear SVM model to classify the data in this new feature space. Of course, the same mapping function needs to be applied to new, unseen data before classifying it.

Going back to the original picture – a 2-dimensional space with a line as the decision boundary – the approach would enlarge this into a 3-D space, where the boundary can be a surface better able to separate the classes.

This results in non-linear decision boundaries in the original space and could be expensive if there are a lot of points and the mapping is complicated.
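A minimal sketch of this idea, on assumed toy data (two concentric circles, which no line can separate): adding the squared radius as a third feature makes the classes linearly separable in 3-D.

```python
# Map 2-D points into 3-D with the (assumed) feature map
# (x1, x2) -> (x1, x2, x1^2 + x2^2), then fit a linear SVC in each space.
import numpy as np
from sklearn import svm
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

score_2d = svm.SVC(kernel='linear').fit(X, y).score(X, y)

X3 = np.c_[X, (X ** 2).sum(axis=1)]  # append the squared radius
score_3d = svm.SVC(kernel='linear').fit(X3, y).score(X3, y)

print(f"linear SVC in 2-D: {score_2d:.2f}, after mapping to 3-D: {score_3d:.2f}")
```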

There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers — through the use of kernels.

This is called the kernel trick because, rather than actually mapping the points into the higher-dimensional space, we use a “kernel” function to compute dot products as if in that space, and use those to find a hyperplane.
The beauty of the “kernel trick” is that, even if the implicit basis is infinite-dimensional, we need only look at the n² inner products between pairs of training points.
This hinges on the role of inner products between vectors in support-vector classifiers, but the maths goes beyond the scope of this article.
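For reference (standard results, not derived here), the kernelised decision function depends on the training data only through the kernel values:

```latex
% Decision function of a kernel SVM; the alpha_i are the learned dual
% weights, and only the support vectors have alpha_i > 0.
f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) + b \right),
\qquad
K_{\text{rbf}}(x, x') = \exp\left( -\gamma \, \lVert x - x' \rVert^2 \right)
```

The RBF kernel shown on the right is one popular choice of kernel function.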

Let’s see instead a concrete example using kernels.

Non-linear Kernels for the SVC

We experiment now with different kernels, starting with the polynomial kernel.

When training an SVM with a kernel, two additional parameters must be considered: C and gamma.
The parameter C, common to all SVM kernels, trades off misclassification of training examples against simplicity of the decision surface.
A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly.
We keep C at the default value = 1.

Gamma defines how much influence a single training example has.
The larger gamma is, the closer other examples must be to be affected.
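To see this concretely – a hedged sketch on toy data, not on the digits – a very large gamma memorises the training set, while a small gamma gives a smoother, more general boundary:

```python
# Toy illustration of gamma: large values overfit (train score near 1),
# small values give smoother, less flexible boundaries.
from sklearn import svm
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for gamma in (0.01, 1.0, 100.0):
    clf = svm.SVC(kernel='rbf', gamma=gamma).fit(X_tr, y_tr)
    scores[gamma] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"gamma={gamma}: train={scores[gamma][0]:.2f}, test={scores[gamma][1]:.2f}")
```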

USPS has an advertised accuracy score of 98%, which is higher than what our SVC with a linear kernel achieved.
Let’s see if we can beat it with a non-linear kernel!

# We start with the POLY kernel

svc = svm.SVC(kernel='poly', C=1.0, gamma=0.001), y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='poly',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
# Calculate the score of the SVC against the testing data
print ("Scoring SV poly Classifier...")
score = svc.score(X_test, y_test)
print ("Score: ", score)
Scoring SV poly Classifier...
Score: 0.9749443207126949

This is slightly better, but we can try a different, more performant kernel: the Radial Basis Function (RBF) kernel.

# change  SVC's kernel to 'rbf'

svc = svm.SVC(kernel='rbf', C=1.0, gamma=0.001), y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
# Calculate the score of SVC against the testing data
print ("Scoring SVM rbf Classifier...")
score = svc.score(X_test, y_test)
print ("Score: ", score)

Scoring SVM rbf Classifier...
Score: 0.982739420935412

Now it’s better than the USPS score!

Hyper-parameters tuning for SVC Kernels

Proper choice of C and gamma is critical to the SVM’s performance.
One is advised to use sklearn.model_selection.GridSearchCV with C and gamma spaced exponentially far apart to choose good values.

We will tune them – and the pre-processor – using a different example.

A Parkinson’s classifier

We apply SVC to the Parkinson’s Data Set, provided courtesy of UCI’s Machine Learning Repository. The dataset was created at the University of Oxford, in collaboration with 10 medical centres around the US and with Intel, who developed the device used to record the primary features of the dataset: speech signals.

Goals: first, to see whether it’s possible to differentiate between people who have Parkinson’s and people who don’t, using scikit-learn’s support vector classifier; then, to take a first stab at a naive way of fine-tuning our parameters, in an attempt to maximise the accuracy on the testing set.

Read and pre-process the data

X = pd.read_csv("Datasets/")

X.drop(['name'], axis=1, inplace=True) # drop name column

y = X.status.copy() # copy “y” values out from status

X.drop(['status'], axis=1, inplace=True) # drop status column

# Perform a train/test split. 30% test group size, with a random_state equal to 7.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

We can apply different scalers for the pre-processing.
The standard scaler seems the best; feel free to experiment with the others by uncommenting one below:

from sklearn import preprocessing

# tried with different scaler, standard is the best

scaler = preprocessing.StandardScaler() # best score was 0.932
#scaler = preprocessing.MinMaxScaler()  # best score was 0.881
#scaler = preprocessing.MaxAbsScaler()  # best score was 0.881
#scaler = preprocessing.Normalizer()    # best score was 0.797
#scaler = preprocessing.KernelCenterer() # best score was 0.915

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Same for the dimensionality reduction: feel free to experiment with PCA or Isomap:

from sklearn.decomposition import PCA
from sklearn import manifold

usePCA = False # change this to use PCA as dimensionality reducer

if usePCA:
    reducer = PCA(n_components=7).fit(X_train)
else:
    reducer = manifold.Isomap(n_neighbors=3, n_components=6).fit(X_train)

X_train = reducer.transform(X_train)
X_test  = reducer.transform(X_test)
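Both reducers expose the same fit/transform API, which is what makes the usePCA switch above possible; a tiny sketch on random stand-in data:

```python
# PCA and Isomap are interchangeable here because both implement
# fit/transform (and fit_transform).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

rng = np.random.RandomState(0)
X_toy = rng.rand(100, 10)  # stand-in data, not the Parkinson's features

shapes = {}
for reducer in (PCA(n_components=2), Isomap(n_neighbors=3, n_components=2)):
    X_red = reducer.fit_transform(X_toy)
    shapes[type(reducer).__name__] = X_red.shape
    print(type(reducer).__name__, "->", X_red.shape)
```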

Train the SVM classifier

Create and fit an SVC based on the RBF kernel against the training data and then finally score the testing data.

To search for the best hyper-parameters we just use simple nested loops,
with C ranging from 0.05 to 2 and gamma from 0.001 to 0.1:

import numpy as np

# a naive, best-parameter search using nested for-loops
best_score = 0
for c in np.arange(0.05, 2, 0.05):
    for gamma in np.arange(0.001, 0.1, 0.001):
        svc = svm.SVC(kernel='rbf', C=c, gamma=gamma), y_train)

        score = svc.score(X_test, y_test)
        if score > best_score:
            best_score = score
            print(f"New best score: {score} using C= {c:.2f} and gamma = {gamma:.3f}")

New best score: 0.7966101694915254 using C= 0.05 and gamma = 0.001
New best score: 0.8135593220338984 using C= 0.1 and gamma = 0.01
New best score: 0.8305084745762712 using C= 0.1 and gamma = 0.012
New best score: 0.847457627118644 using C= 0.15 and gamma = 0.013
New best score: 0.864406779661017 using C= 0.2 and gamma = 0.013
New best score: 0.8813559322033898 using C= 0.35 and gamma = 0.02
New best score: 0.8983050847457628 using C= 0.45 and gamma = 0.046
New best score: 0.9152542372881356 using C= 0.5 and gamma = 0.087
New best score: 0.9322033898305084 using C= 0.55 and gamma = 0.096
New best score: 0.9491525423728814 using C= 0.85 and gamma = 0.088

Comparing KNN vs. SVC

How does SVC compare with other classifiers, such as the KNN?
We classify UCI’s wheat-seeds dataset – which we used previously with the KNN algorithm – using the SVC, and compare the results.
First, we benchmark how long SVC takes to train and predict relative to KNeighbors, and then compare the decision-boundary plots produced by the two.

Defining some parameters

# INFO: Parameters can be adjusted here
C = 1
kernel = 'linear'
iterations = 5000 # this will take a long time; you can start with 100

# INFO: You can set this to False if you want to
# draw the full square matrix of plots
FAST_DRAW = True

Read the data

# Load up the wheat dataset into dataframe 'X'
df = pd.read_csv("Datasets/", index_col='id')

Data pre-processing

This is all standard. You can refer to the previous examples for more details, especially the KNN example.

# INFO: An easy way to show which rows have nans in them
print (df[pd.isnull(df).any(axis=1)])

id  area  perimeter compact. length width asymm. groove wheat_type
7   14.11 14.10   0.8911  5.4200 3.302   2.700   NaN   canadian
35  16.12 15.00   NaN     0.9000 NaN     5.709   3.485 canadian
60  11.42 12.86   0.8683  5.0080 2.850   2.700   NaN   canadian
135 15.38 14.66   0.8990  5.4770 3.465   3.600   NaN   canadian
169 11.24 13.00   NaN     0.8359 5.090   2.715   3.521 canadian
170 11.02 13.00   NaN     0.8189 5.325   2.701   6.735 canadian
201 12.67 13.32   0.8977  4.9840 3.135   2.300   NaN   canadian

# Go ahead and drop any row with a nan
df.dropna(axis=0, inplace=True)

# INFO: you might try setting the nan values to the
# mean value of that column, the mean should only be calculated for
# the specific class rather than across all classes, now that you
# have the labels

# Copy the labels out of the dset into variable 'y' then Remove
# them from X. Encode the labels -- canadian:0, kama:1, and rosa:2
labels = df.wheat_type.copy() # copy “y” values out
df.drop(['wheat_type'], axis=1, inplace=True) # drop output column

labels ={'canadian':0, 'kama':1, 'rosa':2})

Split into training and testing data sets

# Split data into test / train sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df, labels, test_size=0.3, random_state=7)

Utility function: draw the plots

This is a convenience function to break any higher-dimensional space down and view cross sections of it.

import matplotlib as mpl
import matplotlib.pyplot as plt

def drawPlots(model, X_train, X_test, y_train, y_test, wintitle='Figure 1'):

    # If this line throws an error, use plt.style.use('ggplot') instead'ggplot') # Look Pretty

    padding = 3
    resolution = 0.5
    max_2d_score = 0
    score = 0

    y_colors = ['#ff0000', '#00ff00', '#0000ff']
    my_cmap = mpl.colors.ListedColormap(['#ffaaaa', '#aaffaa', '#aaaaff'])
    colors = [y_colors[i] for i in y_train]
    num_columns = len(X_train.columns)

    fig = plt.figure()

    cnt = 0
    for col in range(num_columns):
        for row in range(num_columns):
            # Easy out: skip the upper triangle of the plot matrix
            if FAST_DRAW and col > row:
                cnt += 1
                continue

            ax = plt.subplot(num_columns, num_columns, cnt + 1)

            # Intersection: print the feature name instead of a plot
            if col == row:
                plt.text(0.5, 0.5, X_train.columns[row], verticalalignment='center', horizontalalignment='center', fontsize=12)
                cnt += 1
                continue

            # Only select two features to display, then train the model
            X_train_bag = X_train.iloc[:, [row, col]]
            X_test_bag = X_test.iloc[:, [row, col]]
  , y_train)

            # Create a mesh to plot in
            x_min, x_max = X_train_bag.iloc[:, 0].min() - padding, X_train_bag.iloc[:, 0].max() + padding
            y_min, y_max = X_train_bag.iloc[:, 1].min() - padding, X_train_bag.iloc[:, 1].max() + padding
            xx, yy = np.meshgrid(np.arange(x_min, x_max, resolution),
                                 np.arange(y_min, y_max, resolution))

            # Plot boundaries
            plt.xlim(xx.min(), xx.max())
            plt.ylim(yy.min(), yy.max())

            # Prepare the contour
            Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            plt.contourf(xx, yy, Z, cmap=my_cmap, alpha=0.8)
            plt.scatter(X_train_bag.iloc[:, 0], X_train_bag.iloc[:, 1], c=colors, alpha=0.5)

            score = round(model.score(X_test_bag, y_test) * 100, 3)
            plt.text(0.5, 0, f"Score: {score}", transform=ax.transAxes, horizontalalignment='center', fontsize=8)
            max_2d_score = score if score > max_2d_score else max_2d_score

            cnt += 1

    print("Max 2D Score: ", max_2d_score)

Utility function: benchmark times

import time

def benchmark(model, X_train, X_test, y_train, y_test, wintitle='Figure 1'):
    print('\n\n' + wintitle + ' Results')

    # the only purpose of doing many iterations is to get a more accurate
    # measurement of the time each classifier takes
    s = time.time()
    for i in range(iterations):
        # train the classifier on the training data / labels:, y_train)

    print(f"{iterations} Iterations Training Time: {time.time() - s:.3f}")

    scoreBch = 0

    s = time.time()
    for i in range(iterations):
        # score the classifier on the testing data / labels:
        scoreBch = model.score(X_test, y_test)

    print(f"{iterations} Iterations Scoring Time: {time.time() - s:.3f}")

    print("High-Dimensionality Score: ", round((scoreBch * 100), 3))

Train the KNN classifier

# Create a KNeighbors classifier
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

And get its benchmark:

benchmark(knn, X_train, X_test, y_train, y_test, 'KNeighbors')

KNeighbors Results
5000 Iterations Training Time: 1.889
5000 Iterations Scoring Time: 3.780
High-Dimensionality Score: 83.607
drawPlots(knn, X_train, X_test, y_train, y_test, 'KNeighbors')

Max 2D Score: 90.164


Train the SVM Classifier

# Create an SVM classifier
# Use a linear kernel, and set the C value to C (see initial parameters)
from sklearn.svm import SVC

svc = SVC(kernel='linear', C=C)

benchmark(svc, X_train, X_test, y_train, y_test, 'SVC')

SVC Results
5000 Iterations Training Time: 3.799 
5000 Iterations Scoring Time: 1.655 
High-Dimensionality Score: 86.885 

drawPlots(svc, X_train, X_test, y_train, y_test, 'SVC')

Max 2D Score: 93.443


SVC in high dimensions, even with a provided kernel, still attempts to find the best separating hyperplane (in the implicit feature space) to split your classes.
If you have ‘dirty’ features thrown into the mix, it’s entirely possible they will end up hurting your overall SVC performance, as opposed to just having a few really good features.

KNN trained faster but predicted more slowly than SVC. This makes sense: KNN is a lazy learner, so training amounts to storing the samples while prediction searches for neighbours, whereas SVC pays its cost up front solving the optimisation problem and then predicts using only the support vectors.
SVC had the best score.

When to use SVM or Logistic Regression

In practical classification tasks, linear logistic regression and linear SVMs often yield very similar results.

Given n = number of features and m = number of training examples:

  • If n is large (relative to m, e.g. n=10K and m = 10..1K):
    Use logistic regression, or SVM without a kernel (“linear kernel”)
  • If n is small and m is intermediate (e.g. n = 1..1K and m = 10..10K):
    Use SVM with Gaussian kernel
  • If n is small and m is large:
    Create/add more features, then use logistic regression or SVM without a kernel

P.S. A neural network is likely to work well in most of these settings, but may be slower to train.

When the classes are (nearly) separable, SVM does better than LR.
When they are not, LR (with a ridge penalty) and SVM are very similar.

Logistic regression tries to maximise the conditional likelihood of the training data, which makes it more sensitive to outliers than SVMs.
On the other hand, logistic regression has the advantage that it is a simpler model that can be implemented more easily. Furthermore, logistic regression models can be easily updated, which is attractive when working with streaming data.

The advantages of support vector machines are:

  • Effective in high dimensional spaces.
  • Still effective in cases where number of dimensions is greater than the number of samples.
  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

If you wish to estimate probabilities, LR is the natural choice:
SVMs do not directly provide probability estimates.
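That said, SVC can still emit probability estimates via an internal Platt-scaling step (probability=True), at extra training cost; a small sketch on toy data:

```python
# probability=True fits an additional calibration model on top of the SVC,
# enabling predict_proba at the price of slower training.
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)
clf = SVC(kernel='linear', probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:3])
print(proba.shape)  # one row per sample, one column per class
```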

For nonlinear boundaries, kernel SVMs are popular. You can use kernels with LR as well but computations are more expensive.
Note: this post is part of a series about Machine Learning with Python.
