K-Nearest Neighbours (KNN)

K-Nearest Neighbours (KNN for short) is a supervised classification algorithm.
The idea behind nearest-neighbours classification is that the points closest to a new data point – across all dimensions – are a good predictor of its properties.

Most of the techniques we have encountered look at the data set as a whole in order to learn patterns. Nearest neighbours, on the other hand, deliberately neglects a lot of information and lets the prediction for each new point depend only on the handful of points closest to it.

The Model

Nearest neighbours is one of the simplest predictive models there is. It makes no mathematical assumptions (in particular, none about the shape of the decision boundary) and it doesn’t require any sort of heavy machinery. The only things it requires are:

  • Some notion of distance
  • An assumption that points that are close to one another are similar

Here is how the K-nearest neighbours (KNN) classifier works:

In the general situation, we have some training data points and we have a corresponding set of labels. K is a given positive integer (a parameter) which represents the number of neighbours to look at.

When we attempt to classify a new, never-before-seen data point x0, the classifier finds the K labelled points in the training data that are nearest to x0 and lets them vote on the new observation’s label (in effect, KNN estimates the conditional probability of each class among those K neighbours and, applying the Bayes rule, assigns x0 to the class with the largest probability).
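To make the voting rule concrete, here is a minimal from-scratch sketch (plain NumPy, Euclidean distance, majority vote). The names are illustrative only; the scikit-learn classifier used later in this post does the same job, but far more efficiently:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k=3):
    # X_train and y_train are NumPy arrays; x0 is a single query point.
    # Euclidean distance from x0 to every training point.
    distances = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]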

The choice of K has a drastic effect on the KNN classifier obtained.

With K = 1, the KNN training error rate is 0, but the test error rate may be quite high. In general, as we use more flexible classification methods, the training error rate will decline but the test error rate may not.
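A quick way to see this is to fit a 1-nearest-neighbour classifier and compare its training and test accuracy. A minimal, self-contained sketch on synthetic data (the variables here are illustrative and not part of the wheat example below):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 2)
y_demo = (X_demo[:, 0] + rng.normal(scale=0.3, size=200) > 0.5).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

knn1 = KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr)
print(knn1.score(Xtr, ytr))  # 1.0: each training point is its own nearest neighbour
print(knn1.score(Xte, yte))  # typically noticeably lower on noisy data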

KNN is a completely non-parametric approach:  therefore, we can expect this approach to dominate other classifiers such as logistic regression when the decision boundary is highly non-linear. On the other hand, KNN does not tell us which predictors are important; we don’t get a table of coefficients for example.

Let’s see an example, using the wheat dataset and the scikit-learn library.
You can find the code and the data sets on my GitHub repository.

A wheat example

We do the usual data reading and pre-processing, concluding with normalisation and splitting the data into test and training datasets.

Read the data

import pandas as pd
#
# Load up the dataset into a variable called X.
#
X = pd.read_csv("datasets/wheat.data", index_col='id')
X.head()

id  area   perimeter  compactness  length  width  asymmetry  groove  wheat_type
0   15.26  14.84      0.8710       5.763   3.312  2.221      5.220   kama
1   14.88  14.57      0.8811       5.554   3.333  1.018      4.956   kama
2   14.29  14.09      0.9050       5.291   3.337  2.699      4.825   kama
3   13.84  13.94      0.8955       5.324   3.379  2.259      4.805   kama
4   16.14  14.99      0.9034       5.658   3.562  1.355      5.175   kama

Data pre-processing

First we separate the target variable:

#
# Copy the 'wheat_type' series slice out of X, and into a series
# called 'y'. Then drop the original 'wheat_type' column from the X
#
y = X.wheat_type.copy()
X.drop(['wheat_type'], axis=1, inplace=True)

# Do a quick, "ordinal" conversion of 'y'.
#
y = y.astype("category").cat.codes

Fix the invalid values:

#
# Basic nan munging. Fill each row's nans with the mean of the feature
#
X.fillna(X.mean(), inplace=True)

Split the data into training and testing datasets:

from sklearn.model_selection import train_test_split

#
# Split X into training and testing data sets
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=1)

Data normalisation

We apply the Normalizer pre-processor from scikit-learn:

from sklearn import preprocessing

#
# Create an instance of SKLearn's Normalizer class and then train it
# using its .fit() method against the *training* data.
#
#
normaliser = preprocessing.Normalizer().fit(X_train)

#
# With the trained pre-processor, transform both training AND
# testing data.
#
# NOTE: Any testing data has to be transformed with the preprocessor
# that has been fit against the training data, so that it exist in the same
# feature-space as the original data used to train the models.
#
X_train_normalised = normaliser.transform(X_train)
X_train = pd.DataFrame(X_train_normalised)

X_test_normalised = normaliser.transform(X_test)
X_test = pd.DataFrame(X_test_normalised)

X_train/test_normalised are now the cleaned and normalised datasets.
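Note that, unlike the other scalers used later in this post, Normalizer rescales each sample (each row) to unit length rather than standardising each feature. A quick check of this on the transformed training data (a small sketch, reusing the arrays defined above):

import numpy as np

# Every row of the Normalizer output has (approximately) unit Euclidean norm.
print(np.linalg.norm(X_train_normalised, axis=1)[:5])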

Apply PCA

Finally apply a PCA transformation.

This is done because the only way to visualise the decision boundary in 2D is for the KNN algorithm to run in 2D as well.
Note that removing the PCA step would improve the accuracy, since K-Neighbours would then be applied to the full training data rather than just the two principal components.

from sklearn.decomposition import PCA

#
# Just like the preprocessing transformation, create a PCA
# transformation as well. Fit it against the training data, and then
# project the training and testing features into PCA space using the
# PCA model's .transform() method.
#
#

pca_reducer = PCA(n_components=2).fit(X_train_normalised)

X_train_pca = pca_reducer.transform(X_train_normalised)
X_test_pca = pca_reducer.transform(X_test_normalised)

X_train/test_pca are now the reduced, cleaned and normalised datasets.
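As a quick sanity check of how much information the two components retain, we can look at the explained variance ratio of the fitted PCA (reusing the pca_reducer object created above):

# Fraction of the total variance captured by each of the two principal components.
print(pca_reducer.explained_variance_ratio_)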

KNN algorithm

Now we finally apply the K-Neighbours algorithm, using the corresponding module from scikit-learn.

For K-Neighbours, generally the higher your K value, the smoother and less jittery your decision surface becomes.
Higher K values also mean the model provides probabilistic information about the ratio of samples in each class (we check this with predict_proba right after fitting the model below).
There is a trade-off, though: higher K values make the algorithm less sensitive to local fluctuations, since samples farther away are also taken into account. This causes it to model only the overall classification function, with less attention to detail, and it increases the computational cost of classification.

Here we use K = 9:

from sklearn.neighbors import KNeighborsClassifier

#
# Create and train a KNeighborsClassifier. Start with K=9 neighbours.
# NOTE: Be sure to train the classifier against the pre-processed, PCA-
# transformed training data above!
#
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train_pca, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=9, p=2,
weights='uniform')
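The probabilistic information mentioned above is exposed through predict_proba: with uniform weights, each probability is simply the fraction of the K = 9 nearest neighbours belonging to that class. A small sketch on the first few test points, reusing the knn model just fitted:

# Column order follows knn.classes_ (the ordinal wheat-type codes 0, 1, 2);
# each row gives the share of the 9 nearest neighbours falling in each class.
print(knn.classes_)
print(knn.predict_proba(X_test_pca[:3]))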

Decision Boundaries

A distinguishing feature of supervised classification algorithms is their decision boundary or, more generally, their n-dimensional decision surface: a threshold or region which, once crossed, results in a sample being assigned to that class.

The decision surface isn’t always spherical. In fact, it can take many different types of shapes depending on the algorithm that generated it.

Let’s prepare a function to plot the decision boundaries, which we can reuse for other examples later:

import matplotlib.pyplot as plt
import matplotlib

matplotlib.style.use('ggplot') # Look Pretty

import numpy as np

def plotDecisionBoundary(model, X, y, colors, padding=0.6, resolution = 0.0025):

  fig = plt.figure(figsize=(8,6))
  ax = fig.add_subplot(111)

  # Calculate the boundaries
  x_min, x_max = X[:, 0].min(), X[:, 0].max()
  y_min, y_max = X[:, 1].min(), X[:, 1].max()
  x_range = x_max - x_min
  y_range = y_max - y_min
  x_min -= x_range * padding
  y_min -= y_range * padding
  x_max += x_range * padding
  y_max += y_range * padding

  # Create a 2D grid matrix. The values stored in the matrix
  # are the class predictions at each location
  xx, yy = np.meshgrid(np.arange(x_min, x_max, resolution),
                       np.arange(y_min, y_max, resolution))

  # What class does the classifier say?
  Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
  Z = Z.reshape(xx.shape)

  # Plot the contour map using the rainbow colourmap
  #cs = plt.contourf(xx, yy, Z, cmap=plt.cm.rainbow)
  ax.contourf(xx, yy, Z, cmap=plt.cm.rainbow)
  fig.tight_layout(pad=2)

  # Plot the testing original points as well...
  for label in np.unique(y):
    indices = np.where(y == label)
    ax.scatter(X[indices, 0], X[indices, 1], c=colors[label], alpha=0.8)

  # print the title
  p = model.get_params()
  fig.suptitle('Decision boundaries, K = ' + str(p['n_neighbors']))

We can use this new function to plot the decision boundaries of the wheat classifier: canadian (blue), kama (green) or rosa (white).

myColours = ['royalblue','forestgreen','ghostwhite']

plotDecisionBoundary(knn, X_train_pca, y_train, colors =  myColours)

[Figure: decision boundaries of the K = 9 classifier on the wheat training data]

The KNN algorithm (with K = 9) divided the space into three regions, one for each wheat type.
The regions fit the data quite well, but not perfectly: some data points are misclassified, as the accuracy score shows:

#
# Display the accuracy score of the test data/labels, computed by
# the KNeighbours model.
#
# NOTE: You do NOT have to run .predict before calling .score, since
# .score will take care of running the predictions automatically.
#
print(knn.score(X_test_pca, y_test))

0.8714285714285714
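To see where those errors occur, a confusion matrix is a convenient next step (a small sketch, reusing the knn model and the PCA-projected test data from above):

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predictions; the class order follows
# knn.classes_, i.e. the ordinal codes produced by the category conversion.
print(confusion_matrix(y_test, knn.predict(X_test_pca)))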

K-Neighbours is particularly useful when no other model fits your data well, as it is a non-parametric approach to classification. For example, you don’t have to worry about whether or not your data is linearly separable.

A few caution points to keep in mind while using K-Neighbours: your data needs to be measurable, i.e. there must be a metric for discerning distance between your samples; if there isn’t, K-Neighbours cannot help you. As with all algorithms that depend on distance measures, it is also sensitive to feature scaling, to perturbations and to the local structure of your dataset, particularly at lower K values.
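On the distance point: KNeighborsClassifier defaults to the Euclidean distance (Minkowski with p = 2), but the metric is configurable. For example, a Manhattan-distance variant of the wheat classifier above could be tried as follows (a sketch; whether it helps depends on the data):

from sklearn.neighbors import KNeighborsClassifier

# Same classifier as before, but using the Manhattan (L1) distance.
knn_l1 = KNeighborsClassifier(n_neighbors=9, metric='manhattan')
knn_l1.fit(X_train_pca, y_train)
print(knn_l1.score(X_test_pca, y_test))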

KNN with hyper-parameters

We now explore in more depth how the algorithm’s parameters affect it, using as an example a dataset for classifying a breast tumor as benign or malignant.

Breast cancer doesn’t develop overnight and, like any other cancer, can be treated very effectively if detected in its earlier stages. Part of understanding cancer is knowing that not all irregular cell growths are malignant; some are benign, non-dangerous, non-cancerous growths.
Being able to properly assess whether a tumor is benign and ignorable, or malignant and alarming, is therefore important, and it is also a problem that might be solvable through data and machine learning.
We use the Breast Cancer Wisconsin (Original) data set, provided courtesy of the UCI Machine Learning Repository.

Read the data

#
# Load in the dataset, identify nans, and set proper headers.
#
X = pd.read_csv("Datasets/breast-cancer-wisconsin.data", header=None,
                 names=['sample', 'thickness', 'size', 'shape', 'adhesion',
                        'epithelial', 'nuclei', 'chromatin', 'nucleoli',
                        'mitoses', 'status'], index_col='sample', na_values='?')

X.head()

sample   thickness  size  shape  adhesion  epithelial  nuclei  chromatin  nucleoli  mitoses  status
1000025  5          1     1      1         2           1.0     3          1         1        2
1002945  5          4     4      5         7           10.0    3          2         1        2
1015425  3          1     1      1         2           2.0     3          1         1        2
1016277  6          8     8      1         3           4.0     3          7         1        2
1017023  4          1     1      3         2           1.0     3          1         1        2

Data Pre-processing

Extract the target values, remove all NaN values and split into testing and training data:

#
# Copy out the status column into a slice, then drop it from the main
# dataframe.
#
#
y = X.status.copy()
X.drop(['status'], axis=1, inplace=True)
#
# With the labels safely extracted from the dataset, replace any nan values
# with the mean feature / column value
#
if X.isnull().values.any():
  print("Preprocessing data: substituted all NaN with mean value")
  X.fillna(X.mean(), inplace=True)
else:
  print("Preprocessing data: No NaN found!")

Preprocessing data: substituted all NaN with mean value

#
# Do train_test_split. Set random_state=7 for reproducibility, and keep
# the test_size at 0.5 (50%).
#
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.5, random_state=7)

Define hyper-parameters

We will run the KNN algorithm in a loop with different parameters, specifically:

  • different scalers for normalisation
  • two dimensionality reductions down to 2D (PCA when the flag below is True, Isomap otherwise)
  • different weight functions
  • and different values of K

# Automate the tuning of hyper-parameters using for-loops to traverse the search space.
reducers = [False, True]
weights = ['uniform', 'distance']

# Experiment with the basic SKLearn preprocessing scalers. We know that
# the features consist of different units mixed in together, so it might be
# reasonable to assume feature scaling is necessary.
scalers = [preprocessing.Normalizer, preprocessing.StandardScaler,
           preprocessing.MinMaxScaler, preprocessing.RobustScaler]

from sklearn.decomposition import PCA
from sklearn import manifold

Hyper-parameters tuning

Loop through all the parameters: fit the model and print the result every time

# f-strings (used below) require Python 3.6 or later; use str.format() otherwise
separator = "--------------------------------------"
print('*** Starting K-neighbours classifier')
print(separator)

bestScore = 0.0

# outer loop: the scalers
for scaler in scalers:
  print("* Scaler = ", scaler)

  scalerTrained = scaler().fit(X_train)

  X_train_scaled = scalerTrained.transform(X_train)
  X_test_scaled  = scalerTrained.transform(X_test)

  print("PCA?  | K  | Weight | Score")
  print(separator)

  # next, loop over the two dimensionality reductions (PCA or Isomap)

  reducer = None

  for isPCA in reducers:
    if isPCA:
      #
      # PCA: reduce down to two dimensions.
      #
      reducer = PCA(n_components=2).fit(X_train_scaled)

    else:
      #
      # Isomap (n_neighbors can be anywhere from 5 to 10): reduce down
      # to two dimensions.
      #
      reducer = manifold.Isomap(n_neighbors=10, n_components=2).fit(X_train_scaled)

    # 2D transformation on both datasets
    X_train_reduced = reducer.transform(X_train_scaled)
    X_test_reduced  = reducer.transform(X_test_scaled)

    #
    # Implement and train KNeighborsClassifier on the projected 2D
    # training data. You can use any K value from 1 to 15, so play around
    # with it and see what results you can come up with. The goal is to find a
    # good balance where you aren't too specific (low K) nor too
    # general (high K). Also experiment with how changing the weights
    # parameter affects the results.
    #
    for k in range(1,16):
        for weight in weights:

            #
            # Train the model against data_train.
            #
            knmodel = KNeighborsClassifier(n_neighbors = k, weights = weight)
            knmodel.fit(X_train_reduced, y_train) 

            # INFO: Always keep the domain of the problem in mind! It is far
            # less harmful to errantly classify a benign tumor as malignant,
            # and have it removed, than to leave a malignant tumor in place,
            # believing it to be benign, and have the patient's cancer progress.
            # Since the weights don't carry any class information, the only way
            # to introduce this preference into SKLearn's KNN classifier is by
            # "baking" it into your data, for example by randomly reducing the
            # ratio of benign to malignant samples in the training set.

            #
            # Calculate + print the accuracy on the testing set
            #
            currentScore = knmodel.score(X_test_reduced, y_test)

            print(f"{isPCA} | {k} | {weight} | {currentScore}")

            # save the best model for plotting it later
            if (currentScore > bestScore):
                bestScore = currentScore
                bestPCA = isPCA
                bestK = k
                bestWeight = weight
                bestScaler = scaler
*** Starting K-neighbours classifier
--------------------------------------
* Scaler = <class 'sklearn.preprocessing.data.Normalizer'>
PCA? | K | Weight | Score
--------------------------------------
False | 1 | uniform | 0.8228571428571428
False | 1 | distance | 0.8228571428571428
False | 2 | uniform | 0.8142857142857143
False | 2 | distance | 0.8228571428571428
False | 3 | uniform | 0.8285714285714286
False | 3 | distance | 0.8257142857142857
False | 4 | uniform | 0.8285714285714286
False | 4 | distance | 0.8314285714285714
False | 5 | uniform | 0.8371428571428572
False | 5 | distance | 0.8428571428571429
...
< see all results on the GitHub notebook!>
...
True | 10 | uniform | 0.96
True | 10 | distance | 0.9571428571428572
True | 11 | uniform | 0.9628571428571429
True | 11 | distance | 0.9571428571428572
True | 12 | uniform | 0.96
True | 12 | distance | 0.9571428571428572
True | 13 | uniform | 0.9628571428571429
True | 13 | distance | 0.9571428571428572
True | 14 | uniform | 0.96
True | 14 | distance | 0.9571428571428572
True | 15 | uniform | 0.96
True | 15 | distance | 0.9571428571428572
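As an aside, the same kind of search can also be written with scikit-learn’s Pipeline and GridSearchCV. The sketch below is not equivalent to the loop above: it cross-validates on the training data instead of scoring against the single 50% hold-out, so the numbers will not match exactly. It assumes the X_train and y_train variables defined earlier:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn import manifold, preprocessing
from sklearn.neighbors import KNeighborsClassifier

# Each pipeline step can itself be swapped out through the parameter grid.
pipe = Pipeline([
    ('scale',  preprocessing.MinMaxScaler()),          # placeholder, replaced by the grid
    ('reduce', PCA(n_components=2)),                    # placeholder, replaced by the grid
    ('knn',    KNeighborsClassifier())
])

param_grid = {
    'scale':  [preprocessing.Normalizer(), preprocessing.StandardScaler(),
               preprocessing.MinMaxScaler(), preprocessing.RobustScaler()],
    'reduce': [PCA(n_components=2),
               manifold.Isomap(n_neighbors=10, n_components=2)],
    'knn__n_neighbors': list(range(1, 16)),
    'knn__weights': ['uniform', 'distance'],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)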

Re-apply the best parameters to the model

We recorded the parameters that resulted in the best score.
Let’s re-apply the model using those parameters and print its decision boundaries.

print("These are the best parameters for the model:")
print("PCA?  | K  | Weight | Scaler | Score")
print(f"{bestPCA}  | {bestK} | {bestWeight} | {bestScaler} | {bestScore}")
These are the best parameters for the model:
PCA? | K | Weight | Scaler | Score
True | 8 | distance | <class 'sklearn.preprocessing.data.MinMaxScaler'> | 0.9685714285714285
BestScalerTrained = bestScaler().fit(X_train)

X_train_scaled = BestScalerTrained.transform(X_train)
X_test_scaled  = BestScalerTrained.transform(X_test)

if bestPCA:
    #
    # PCA: reduce down to two dimensions.
    #
    reducer = PCA(n_components=2).fit(X_train_scaled)
else:
    #
    # Isomap: reduce down to two dimensions.
    #
    reducer = manifold.Isomap(n_neighbors=10, n_components=2).fit(X_train_scaled)

#
# Transform both the training and the testing data with the fitted reducer.
#
X_train_reduced = reducer.transform(X_train_scaled)
X_test_reduced  = reducer.transform(X_test_scaled)

#
# Train a KNeighborsClassifier on the projected 2D training data,
# using the best K and weight values found above.
#
bestKnmodel = KNeighborsClassifier(n_neighbors = bestK, weights = bestWeight)
bestKnmodel.fit(X_train_reduced, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=8, p=2,
weights='distance')

Plotting the decision boundaries

# (2 for benign, 4 for malignant)
myColours = {2:'royalblue',4:'lightsalmon'} 

plotDecisionBoundary(bestKnmodel, X_test_reduced, y_test, colors =  myColours, padding = 0.1, resolution = 0.1)
[Figure: decision boundaries for benign (blue) and malignant (red) classifications]

As noted in the code comments above, a false negative (a malignant tumor wrongly classified as benign) is far more costly than a false positive (a benign tumor wrongly classified as malignant).
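To check how the best model behaves in exactly this respect, we can look at the per-class figures; the recall on class 4 (malignant) is the number to watch. A small sketch, reusing bestKnmodel and the reduced test data from above:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1; label order is the sorted class values, i.e. 2 then 4.
y_pred = bestKnmodel.predict(X_test_reduced)
print(classification_report(y_test, y_pred, target_names=['benign (2)', 'malignant (4)']))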

See the GitHub notebook for another example using the KNN algorithm with dimensionality reduction (the same example as the Isomap faces).

Note: this post is part of a series about Machine Learning with Python.