Principal component analysis (PCA) is an unsupervised **linear transformation** technique that is widely used across different fields, most prominently for dimensionality reduction, noise reduction and to identify data patterns based on the correlation between features.

The basic idea is that **we might not need every predictor**: some predictors are correlated and / or a weighted combination of predictors may be better. We can simply pick that combination to capture the most information possible.

The benefit – beside reducing the number of predictors – is also to reduce the noise (due to averaging).

In a nutshell, PCA aims to find **the directions of maximum variance** in high-dimensional data and projects it onto a new subspace with equal or fewer dimensions that the original one.

This is why PCA is useful for dimensionality reduction: it allows us to map a sample vector onto a new smaller feature subspace that has fewer dimensions than the original feature space.

The PCA algorithm for dimensionality reduction (from n-dimensions to k-dimensions) consists of a few simple steps:

- Pre-processing: standardise the -dimensional dataset by feature scaling and mean normalisation (every feature has mean zero) on the training set.
- Compute the covariance matrix.
- Decompose the covariance matrix into its eigenvectors and eigenvalues.
- Select eigenvectors that correspond to the largest eigenvalues, where is the dimensionality of the new feature subspace.
- Construct a projection matrix from the “top” eigenvectors.
- Transform the n-dimensional input dataset using the projection matrix to obtain the new k-dimensional feature subspace.

# Variance and eigenvalues

We will tackle the first four steps of a principal component analysis through a simple example: standardising the data, constructing the covariance matrix, obtaining the eigenvalues and eigenvectors of the covariance matrix and finally sorting the eigenvalues by decreasing order to rank the eigenvectors.

We do that by looking at the wheat dataset that we already used previously for the Decision Tree example where we had to choose just two predictors: the perimeter and the groove. Maybe a different pair would have been better?

The dataset is available on the UCI’s wheat-seeds page.

You can find the dataset and this code on my GitHub repository as well

import pandas as pd # # Load up the dataset into a variable called X. # X = pd.read_csv("datasets/wheat.data", index_col='id') X.head()

## Standardise the data

Pre-process the data: drop the *NaN* end extract the labels for the classification:

X.dropna(axis=0, inplace=True) labels = X.wheat_type.copy() # copy “y” values out X.drop(['wheat_type'], axis=1, inplace=True) # drop output column labels = labels.map({'canadian':0, 'kama':1, 'rosa':2}) # encoding

Next we split the data into separate training and testing sets, using the usual 70-30 proportions.

from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)

Until now we followed the same procedure as with the Decision Tree model.

Now we will standardise the Wheat data to unit variance using *SKlearn*.*SKLearn* has many different methods for transforming features by scaling them (this is a type of pre-processing): *RobustScaler*, *Normalizer*, *MinMaxScaler, MaxAbsScaler, StandardScaler.*..

However in order to be effective at PCA, there are a few requirements that must be met, and which will drive the selection of the scaler.

PCA required that the data is standardised — in other words its mean is equal to 0, and it has unit variance.

*SKLearn’s* regular *Normalizer* doesn’t zero out the mean of your data, it only clamps it, so it’s inappropriate to use here (depending on the data).*MinMaxScaler* and *MaxAbsScaler* both fail to set a unit variance, so we won’t be using them either.*RobustScaler* can work, again depending on the data (but watch for outliers).

For these reasons we’re going to use the *StandardScaler.*

from sklearn.preprocessing import StandardScaler sc = StandardScaler() X_train_std = sc.fit_transform(X_train) X_test_std = sc.transform(X_test)

## Construct the covariance matrix

After completing the mandatory preprocessing step we move to the second step: constructing the covariance matrix. We use N*umPy* for this.

The symmetric covariance matrix stores the **pairwise covariances** between the different features.

import numpy as np covMatrix = np.cov(X_train_std.T) covMatrix

array([[ 1.0070922 , 1.0011297 , 0.61670387, 0.95285342, 0.97793977, -0.22604464, 0.87202777], [ 1.0011297 , 1.0070922 , 0.53529585, 0.97686436, 0.95131999, -0.21918194, 0.90047068], [ 0.61670387, 0.53529585, 1.0070922 , 0.36459673, 0.76670126, -0.29691044, 0.22746871], [ 0.95285342, 0.97686436, 0.36459673, 1.0070922 , 0.86062855, -0.18497896, 0.94432928], [ 0.97793977, 0.95131999, 0.76670126, 0.86062855, 1.0070922 , -0.24239365, 0.75483918], [-0.22604464, -0.21918194, -0.29691044, -0.18497896, -0.24239365, 1.0070922 , -0.0444695 ], [ 0.87202777, 0.90047068, 0.22746871, 0.94432928, 0.75483918, -0.0444695 , 1.0070922 ]])

As we started from a dataset with 7 features , we got a 7×7 covariance matrix.

A positive covariance between two features indicates that the features increase or decrease together, whereas a negative covariance indicates that the features vary in opposite directions. I talked about covariance in a previous post.

## Obtain and sort the eigenvalues and eigenvectors

The **eigenvectors** of the covariance matrix represent the principal components (the **directions of maximum variance**), whereas the corresponding **eigenvalues** will define **their magnitude**.

In the case of the Wheat dataset, we would obtain 7 eigenvectors and eigenvalues from the 7-dimensional covariance matrix.

Since the manual computation of eigenvectors and eigenvalues is a tedious task, we will use the *linalg.eig* function from *NumPy* to obtain the eigenpairs of the Wheat covariance matrix:

eigenvalues, eigenvectors = np.linalg.eig(covMatrix) f"Eigenvalues: {eigenvalues}." # the f print function works from Python 3.6, you can use print otherwise

'Eigenvalues: [5.06540460e+00 1.15619185e+00 7.36028723e-01 6.50774882e-02 2.04535544e-02 7.88788482e-04 5.70039211e-03].'

Since we want to reduce the dimensionality of our dataset by compressing it onto a new feature subspace, we only select the subset of the eigenvectors (principal components) that **contains most of the information** (variance).

As the eigenvalues define the magnitude of the eigenvectors, we can sort the eigenvalues by descending magnitude and select the top eigenvectors based on the values of their corresponding eigenvalues.

tot = sum(eigenvalues) varianceExplained = [(i / tot) for i in sorted(eigenvalues, reverse=True)] cumVarianceExp = np.cumsum(varianceExplained) # cumulative varianceExplained

[0.7185332477964889, 0.16400709284053022, 0.10440648886885688, 0.009231313716071133, 0.0029013593303879085, 0.0008086069285229908, 0.0001118905191418737]

Let’s plot the variance explained ratios of the eigenvalues:

import matplotlib.pyplot as plt fig, ax = plt.subplots(figsize=(10, 5)) width = 0.8 tickLocations = np.arange(0,7) rectLocations = tickLocations-(width/2.0) labels = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7'] ax.bar(rectLocations, varianceExplained, width, color='wheat', edgecolor='brown', linewidth=1.0, align='center',label='individual explained variance') ax.step(tickLocations, cumVarianceExp, where='mid',label='cumulative explained variance') # --- pretty-up the plot ax.set_xticks(ticks= tickLocations) ax.set_xticklabels(labels) fig.suptitle("Explained Variance for the eigenvalues") ax.yaxis.grid(True) ax.set_ylabel('Explained variance ratio') ax.set_xlabel('Principal components') ax.legend(loc='center right');

The resulting plot indicates that the first principal component alone accounts for 70 percent of the variance. Also, we can see that **the first two principal components combined explain almost 90 percent of the variance** in the data.

## Transform the feature space into a smaller one

After we have successfully decomposed the covariance matrix into *eigenpairs*, let’s now proceed with the last steps to transform the Wheat dataset onto the new principal component axes.

In this section, we will use the sorted *eigenpairs* by descending order of the eigenvalues, to construct a projection matrix from the selected eigenvectors and use the projection matrix to transform the data onto the lower-dimensional subspace.

eigenpairs =[(np.abs(eigenvalues[i]),eigenvectors[:,i]) for i in range(len(eigenvalues))] eigenpairs.sort(reverse=True)

Next, we collect the two eigenvectors that correspond to** the two largest values to capture about 90 percent of the variance in this dataset.**

Note that we only chose two eigenvectors for the purpose of illustration, since we are going to plot the data via a two-dimensional scatter plot later in this subsection.

In practice, the number of principal components has to be determined from a trade-off between computational efficiency and the performance of the classifier.

In this case two is a great choice as it amounts to 90% of the variance.

V = np.hstack((eigenpairs[0][1][:, np.newaxis], eigenpairs[1][1][:, np.newaxis])) print('Matrix V:\n', V)

Matrix V: [[ 0.44455801 0.02566934] [ 0.44188066 0.08388916] [ 0.27591508 -0.53312904] [ 0.42307046 0.20911146] [ 0.43198716 -0.11511336] [-0.11816204 0.71635014] [ 0.38912303 0.37140397]]

We have created a 7×2-dimensional projection matrix V from the top two eigenvectors. Using the projection matrix, we can now transform a sample X onto the PCA subspace obtaining *X_pca*, a now two-dimensional sample vector consisting of two new features.

X_train_std[0].dot(V) # how the first row looks like

array([3.06229007, 0.67948187])

X_train_pca = X_train_std.dot(V)

Lastly, let’s visualise the transformed Wheat training set in a two-dimensional scatterplot:

fig, ax = plt.subplots(figsize=(10, 7)) colors = ['r', 'b', 'g'] markers = ['s', 'x', 'o'] classes = ['canadian', 'kama', 'rosa'] for l, c, m in zip(np.unique(y_train), colors, markers): ax.scatter(X_train_pca[y_train==l, 0], X_train_pca[y_train==l, 1], c=c, label=classes[l], marker=m) ax.set_xlabel('PC 1') ax.set_ylabel('PC 2') ax.legend(loc='lower right') fig.suptitle("Wheat type distribution by Principal Components");

As we can see in the plot, the data is more spread along the x-axis — the first principal component — than the second principal component (y-axis), which is consistent with the explained variance ratio plot that we created in the previous subsection.

Moreover, we can intuitively see that a linear classifier will likely be able to separate the classes quite well.

# Examples of Principal Component Analysis using SKlearn

Let’s see how PCA (using *SKlearn*) can be used in concrete examples.

## Kidney study

Goal: reduce the dense kidney clinic study feature set to its two main components.

We use a dataset to predict the chronic kidney disease, that can be found on the usual UCI repository:

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

You can take a look at the dataset webpage in the attribute info section.

As usual, we start by reading the data:

df = pd.read_csv("Datasets/kidney_disease.csv", index_col='id') df.head()

5 rows × 25 columns

There are some *NaN*. Let’s drop them.

df.dropna(inplace=True) df.info()

Int64Index: 158 entries, 3 to 399 Data columns (total 25 columns): # Column Non-Null Count Dtype -- ------ -------- -------- ----- 0 age 158 non-null float64 1 bp 158 non-null float64 2 sg 158 non-null float64 3 al 158 non-null float64 4 su 158 non-null float64 5 rbc 158 non-null object 6 pc 158 non-null object 7 pcc 158 non-null object 8 ba 158 non-null object 9 bgr 158 non-null float64 10 bu 158 non-null float64 11 sc 158 non-null float64 12 sod 158 non-null float64 13 pot 158 non-null float64 14 hemo 158 non-null float64 15 pcv 158 non-null object 16 wc 158 non-null object 17 rc 158 non-null object 18 htn 158 non-null object 19 dm 158 non-null object 20 cad 158 non-null object 21 appet 158 non-null object 22 pe 158 non-null object 23 ane 158 non-null object 24 classification 158 non-null object dtypes: float64(11), object(14) memory usage: 32.1+ KB

There are 158 entries.

Each one has the classification target column, plus 24 other classes: 11 numeric and 14 nominal.

See the UCI site page for an explanation of each class.

# Create some color coded labels; the actual label feature # will be removed prior to executing PCA, since it's unsupervised. # we're only labeling by color so you can see the effects of PCA labelsCol = ['red' if i=='ckd' else 'green' for i in df.classification]

We start by transforming the nominal, specifically the *object* features:

# Get rid of nominal columns # df.drop(labels=['classification'], axis=1, inplace=True) df = pd.get_dummies(df,columns=['rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane'])

And the last three nominal can be changed to a numeric type:

# Print out and check the dataframe's dtypes. # df.wc = pd.to_numeric(df.wc, errors='coerce') df.rc = pd.to_numeric(df.rc, errors='coerce') df.pcv = pd.to_numeric(df.pcv, errors='coerce') print(df.dtypes)

age float64 bp float64 sg float64 al float64 su float64 bgr float64 bu float64 sc float64 sod float64 pot float64 hemo float64 pcv int64 wc int64 rc float64 rbc_abnormal uint8 rbc_normal uint8 pc_abnormal uint8 pc_normal uint8 pcc_notpresent uint8 pcc_present uint8 ba_notpresent uint8 ba_present uint8 htn_no uint8 htn_yes uint8 dm_no uint8 dm_yes uint8 cad_no uint8 cad_yes uint8 appet_good uint8 appet_poor uint8 pe_no uint8 pe_yes uint8 ane_no uint8 ane_yes uint8 dtype: object

After adding in all the other features and properly encoding those that need to be, the separation between CKD and Non-CKD patients increases significantly: by taking many secondary features and combining them, machine learning is usually able to come up with more informative descriptions of the data.

PCA operates based on variance. The variable with the greatest variance will dominate.

Let’s peek into the data and check the variance of every feature in the dataset.

df.var()

age 2.406297e+02 bp 1.248891e+02 sg 3.023865e-05 al 1.996936e+00 su 6.616141e-01 bgr 4.217182e+03 bu 2.246322e+03 sc 9.471717e+00 sod 5.609143e+01 pot 1.208501e+01 hemo 8.307100e+00 pcv 8.290402e+01 wc 9.777380e+06 rc 1.039104e+00 rbc_abnormal 1.015883e-01 rbc_normal 1.015883e-01 pc_abnormal 1.508103e-01 pc_normal 1.508103e-01 pcc_notpresent 8.127066e-02 pcc_present 8.127066e-02 ba_notpresent 7.062807e-02 ba_present 7.062807e-02 htn_no 1.699589e-01 htn_yes 1.699589e-01 dm_no 1.467387e-01 dm_yes 1.467387e-01 cad_no 6.518584e-02 cad_yes 6.518584e-02 appet_good 1.064662e-01 appet_poor 1.064662e-01 pe_no 1.112634e-01 pe_yes 1.112634e-01 ane_no 9.159074e-02 ane_yes 9.159074e-02 dtype: float64

Large variance amount: **wc, bgr, bu**

df.describe()

The first thing PCA does is center the dataset about its mean by subtracting the mean from each sample. Looking at the .*describe()* output of the dataset, particularly the min, max, and mean readings per feature, some features will dominate the X axis (likely *wc*) and some the Y axis (likely *bgr*).

Now we use PCA from SKLearn:

from sklearn.decomposition import PCA # Run PCA on the dataset and reduce it to 2 components pcaModel = PCA(n_components=2) pcaModel.fit(df) # fit the model T = pcaModel.transform(df) # T is an array of len 158 (one for each df's entry) and two components pcaModel.explained_variance_ratio_

array([9.99309365e-01, 4.57758701e-04])

*explained_variance_ratio_* is the percentage of variance explained by each of the selected components.

*singular_values_* is an array. The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space. Equal to n_components largest eigenvalues of the covariance matrix of X.

pcaModel.singular_values_

array([39180.18992145, 838.56138993])

T[0]

array([-1.77596096e+03, -1.26057430e+00])

Let’s plot the features with their vectors.

First we define a helper function to draw the vectors on a plot:

import math def drawVectors(transformed_features, components_, columns, plt): num_columns = len(columns) # This function will project your *original* feature (columns) # onto your principal component feature-space, so that you can # visualise how "important" each one was in the # multi-dimensional scaling # Scale the principal components by the max value in # the transformed set belonging to that component xvector = components_[0] * max(transformed_features[:,0]) yvector = components_[1] * max(transformed_features[:,1]) # Sort each column by its length. These are your *original* # columns, not the principal components. important_features = { columns[i] : math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns) } important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True) print ("Features by importance:\n", important_features) ## visualize projections ax = plt.axes() for i in range(num_columns): # Use an arrow to project each original feature as a # labeled vector on your principal component axes plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75) plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75) return ax

And now we can plot:

# Plot the transformed data as a scatter plot. Recall that transforming # the data will result in a NumPy NDArray. Therefore we convert it to DataFrame and have pandas # do the plot. # # Since we transformed via PCA, we no longer have column names. We know we # are in P.C. space, so we'll just define the coordinates accordingly: ax = drawVectors(T, pca.components_, df.columns.values, plt) T = pd.DataFrame(T) T.columns = ['PC 1', 'PC 2'] T.plot.scatter(x='PC 1', y='PC 2', marker='o', c=labelsCol, alpha=0.75, ax=ax);

Features by importance: [(17924.12862704472, 'wc'), (309.72755212441524, 'bgr'), (133.4661920624122, 'bu'), (29.873170788201644, 'pcv'), (26.924712516320835, 'age'), (15.348887614206896, 'sod'), (14.728205659896712, 'bp'), (9.26958992968382, 'hemo'), (8.315782432584017, 'sc'), (4.812817492433102, 'al'), (3.854866211705058, 'pot'), (2.9506615569902213, 'rc'), (2.939116338871963, 'su'), (1.4500345016834413, 'dm_yes'), (1.4500345016834406, 'dm_no'), (1.4262548164785005, 'htn_no'), (1.4262548164785, 'htn_yes'), (1.0748414625899176, 'pc_normal'), (1.0748414625899176, 'pc_abnormal'), (0.8619132020709073, 'appet_poor'), (0.861913202070907, 'appet_good'), (0.8592833054679994, 'pe_yes'), (0.8592833054679994, 'pe_no'), (0.8504820097620381, 'rbc_abnormal'), (0.8504820097620378, 'rbc_normal'), (0.6274850804078085, 'cad_yes'), (0.6274850804078085, 'cad_no'), (0.49341600089668086, 'pcc_present'), (0.4934160008966808, 'pcc_notpresent'), (0.49150767269010703, 'ane_yes'), (0.49150767269010703, 'ane_no'), (0.4774667056822345, 'ba_notpresent'), (0.4774667056822342, 'ba_present'), (0.018194804297560947, 'sg')]

As expected, the two most important features (highest variance) are **bgr** and **wc**.

According to the labelling, red plots correspond to chronic kidney disease and green plots are non-CKD patients.

Looking at the scatter plot, the two classes are not completely separable: a few records are mixed together.

Now we repeat the PCA but with scaling before!

We prepare a helper function:

# A Note on SKLearn .transform() calls: # # Any time you transform your data, you lose the column header names. # This actually makes complete sense. There are essentially two types # of transformations, those that change the scale of your features, # and those that change your features entire. Changing the scale would # be like changing centimeters to inches. Changing the features would # be like using PCA to reduce 300 columns to 30. In either case, the # original column's units have been altered or no longer exist, so it's # up to you to rename your columns after ANY transformation. Due to # this, SKLearn returns an NDArray from *transform() calls. # --------- # Feature scaling is the type of transformation that only changes the # scale and not number of features, so we'll use the original dataset # column names. However we'll keep in mind that the _units_ have been # altered: def scaleFeatures(usdf): scaled = preprocessing.StandardScaler().fit_transform(usdf) scaled = pd.DataFrame(scaled, columns = usdf.columns) print ("New Variances:\n", scaled.var()) print ("New Describe:\n", scaled.describe()) return scaled

And now we can proceed:

from sklearn import preprocessing scaled = preprocessing.StandardScaler().fit_transform(df) # returns an array S = pd.DataFrame(scaled, columns = df.columns) # make it a data frame print ("New Variances:\n", S.var())

New Variances: age 1.006369 bp 1.006369 sg 1.006369 al 1.006369 su 1.006369 bgr 1.006369 bu 1.006369 sc 1.006369 sod 1.006369 pot 1.006369 hemo 1.006369 pcv 1.006369 wc 1.006369 rc 1.006369 rbc_abnormal 1.006369 rbc_normal 1.006369 pc_abnormal 1.006369 pc_normal 1.006369 pcc_notpresent 1.006369 pcc_present 1.006369 ba_notpresent 1.006369 ba_present 1.006369 htn_no 1.006369 htn_yes 1.006369 dm_no 1.006369 dm_yes 1.006369 cad_no 1.006369 cad_yes 1.006369 appet_good 1.006369 appet_poor 1.006369 pe_no 1.006369 pe_yes 1.006369 ane_no 1.006369 ane_yes 1.006369 dtype: float64

print ("New Describe:\n", S.describe())

New Describe: age bp sg al su \ count 1.580000e+02 1.580000e+02 1.580000e+02 1.580000e+02 1.580000e+02 mean 8.432074e-17 5.846238e-16 -1.304161e-15 -1.349132e-16 -2.248553e-17 std 1.003180e+00 1.003180e+00 1.003180e+00 1.003180e+00 1.003180e+00 min -2.817246e+00 -2.158952e+00 -2.713365e+00 -5.661221e-01 -3.122333e-01 25% -6.669624e-01 -1.261282e+00 2.309247e-02 -5.661221e-01 -3.122333e-01 50% 6.057713e-02 5.340564e-01 2.309247e-02 -5.661221e-01 -3.122333e-01 75% 6.749439e-01 5.340564e-01 9.352451e-01 1.437770e-01 -3.122333e-01 max 2.162358e+00 3.227064e+00 9.352451e-01 2.273474e+00 5.854375e+00 bgr bu sc sod pot \ count 1.580000e+02 1.580000e+02 158.000000 1.580000e+02 1.580000e+02 mean -4.497106e-17 8.994212e-17 0.000000 9.893633e-16 5.621382e-17 std 1.003180e+00 1.003180e+00 1.003180 1.003180e+00 1.003180e+00 min -9.475974e-01 -9.011706e-01 -0.583015 -3.730148e+00 -6.165957e-01 25% -5.305059e-01 -5.625116e-01 -0.485227 -5.154386e-01 -2.703085e-01 50% -2.447210e-01 -2.767680e-01 -0.354843 2.034626e-02 -3.945044e-02 75% 6.306235e-03 -5.981458e-02 -0.191863 6.900774e-01 7.597862e-02 max 5.540492e+00 5.427520e+00 4.241194 1.493755e+00 1.222489e+01 ... dm_no dm_yes cad_no cad_yes \ count ... 1.580000e+02 158.0000 1.580000e+02 1.580000e+02 mean ... -1.349132e-16 0.000000 1.798842e-16 -6.745659e-17 std ... 1.003180e+00 1.003180 1.003180e+00 1.003180e+00 min ... -2.154729e+00 -0.464095 -3.655631e+00 -2.735506e-01 25% ... 4.640955e-01 -0.464095 2.735506e-01 -2.735506e-01 50% ... 4.640955e-01 -0.464095 2.735506e-01 -2.735506e-01 75% ... 4.640955e-01 -0.464095 2.735506e-01 -2.735506e-01 max ... 4.640955e-01 2.154729 2.735506e-01 3.655631e+00 appet_good appet_poor pe_no pe_yes ane_no \ count 1.580000e+02 1.580000e+02 158.0000 158.0000 1.580000e+02 mean 4.497106e-17 -4.497106e-17 0.000000 0.000000 4.497106e-17 std 1.003180e+00 1.003180e+00 1.003180 1.003180 1.003180e+00 min -2.704772e+00 -3.697170e-01 -2.626785 -0.380693 -2.979094e+00 25% 3.697170e-01 -3.697170e-01 0.380693 -0.380693 3.356725e-01 50% 3.697170e-01 -3.697170e-01 0.380693 -0.380693 3.356725e-01 75% 3.697170e-01 -3.697170e-01 0.380693 -0.380693 3.356725e-01 max 3.697170e-01 2.704772e+00 0.380693 2.626785 3.356725e-01 ane_yes count 1.580000e+02 mean -8.994212e-17 std 1.003180e+00 min -3.356725e-01 25% -3.356725e-01 50% -3.356725e-01 75% -3.356725e-01 max 2.979094e+00 [8 rows x 34 columns]

S.head()

# Run PCA on your dataset and reduce it to 2 components # pcaModel.fit(S) TS = pcaModel.transform(S) # TS is an array of length 158 and two components pcaModel.explained_variance_ratio_

array([0.46548191, 0.08269405])

The two values together are higher.

pcaModel.singular_values_

array([50.00568791, 21.07682243])

TS[0]

array([ 7.40571153, -5.4338862 ])

# Plot the transformed data as a scatter plot. Recall that transforming # the data will result in a NumPy NDArray. You can either use MatPlotLib # to graph it directly, or you can convert it to DataFrame and have pandas # do it for you. # # Since we transformed via PCA, we no longer have column names. We know we # are in P.C. space, so we'll just define the coordinates accordingly: ax = drawVectors(TS, pcaModel.components_, S.columns.values, plt) TS = pd.DataFrame(TS) TS.columns = ['PC 1', 'PC 2'] TS.plot.scatter(x='PC 1', y='PC 2', marker='o', c=labelsCol, alpha=0.75, ax=ax);

Features by importance:

[(2.969615039499333, 'ane_yes'), (2.969615039499333, 'ane_no'),

(2.758875925894978, 'bgr'), (2.7155049553940604, 'dm_yes'),

(2.7155049553940604, 'dm_no'), (2.6589320941392414, 'pcv'),

(2.645572100109875, 'hemo'), (2.60211266284851, 'al'),

(2.5934172193046927, 'htn_no'), (2.593417219304692, 'htn_yes'),

(2.5767305659998674, 'su'), (2.4859440233443406, 'cad_yes'),

(2.4859440233443406, 'cad_no'), (2.484132428132544, 'sc'),

(2.4794711137049417, 'pc_normal'), (2.4794711137049417, 'pc_abnormal'),

(2.465109416186055, 'appet_poor'), (2.465109416186054, 'appet_good'),

(2.4575665060850933, 'bu'), (2.393726969612571, 'sg'),

(2.388197579628534, 'rc'), (2.210578246131017, 'pe_yes'),

(2.210578246131017, 'pe_no'), (2.182818062607651, 'sod'),

(2.005926150817528, 'rbc_normal'), (2.005926150817528, 'rbc_abnormal'),

(1.9861731688066884, 'ba_present'), (1.986173168806688, 'ba_notpresent'),

(1.9842911319072, 'pcc_present'), (1.9842911319071999, 'pcc_notpresent'),

(1.2846796771566376, 'age'), (1.054194604216631, 'bp'),

(0.919384598744807, 'wc'), (0.5640412875529991, 'pot')]

Note that scaling the features did affect their variances. After scaling, the green patients without chronic kidney disease are more cleanly separable from the red patients with chronic kidney disease.

## Armadillo: From 3D rendering to 2D plot

The goal is to reduce the dimensionality of a 3D scan from three to two using PCA to cast a shadow of the data onto its two most important principal components. Then render the resulting 2D scatter plot.

The scan is a real life armadillo sculpture scanned using a Cyberware 3030 MS 3D scanner at Stanford University. The sculpture is available as part of their 3D Scanning Repository – where the PLY file is available – and is a very dense 3D mesh consisting of 172974 vertices!

### Read the data

Every 100 data samples, we save 1.

If things run too slow, try increasing this number.

If things run too fast, try decreasing it… =)

reduce_factor = 100 # this is the number to change from plyfile import PlyData # Load up the scanned armadillo plyfile = PlyData.read('Datasets/stanford_armadillo.ply') armadillo = pd.DataFrame({ 'x':plyfile['vertex']['z'][::reduce_factor], 'y':plyfile['vertex']['x'][::reduce_factor], 'z':plyfile['vertex']['y'][::reduce_factor] })

### Render the Original Armadillo

import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import Axes3D fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.set_title('Armadillo 3D') ax.set_xlabel('X') ax.set_ylabel('Y') ax.set_zlabel('Z') ax.scatter(armadillo.x, armadillo.y, armadillo.z, c='green', marker='.', alpha=0.75);

### Decompose to 2D using PCA

from sklearn.decomposition import PCA def do_PCA(armadillo): # # import the libraries required for PCA. # Then, train your PCA on the armadillo dataframe. Finally, # drop one dimension (reduce it down to 2D) and project the # armadillo down to the 2D principal component feature space. # pca = PCA(n_components=2) pca.fit(armadillo) reducedArmadillo = pca.transform(armadillo) return reducedArmadillo

Time the execution of PCA 5000x : PCA is ran 5000x in order to help decrease the potential of rogue processes altering the speed of execution.

import datetime t1 = datetime.datetime.now() for i in range(5000): pca = do_PCA(armadillo) time_delta = datetime.datetime.now() - t1

Render the newly transformed PCA armadillo!

if not pca is None: fig = plt.figure() ax = fig.add_subplot(111) ax.set_title('PCA, build time: ' + str(time_delta)) ax.scatter(pca[:,0], pca[:,1], c='blue', marker='.', alpha=0.75)

Just as an example, we can run PCA also changing the solver algorithm, specifically the *Randomized* (truncated) method for the SVD solver. To find out how to use it, check out the full docs.

def do_RandomizedPCA(armadillo): pca = PCA(n_components=2, svd_solver='randomized') pca.fit(armadillo) reducedArmadillo = pca.transform(armadillo) return reducedArmadillo

Time the execution of *randomized* PCA 5000x

t1 = datetime.datetime.now() for i in range(5000): rpca = do_RandomizedPCA(armadillo) time_delta = datetime.datetime.now() - t1

Render the newly transformed armadillo!

if not rpca is None: fig = plt.figure() ax = fig.add_subplot(111) ax.set_title('RandomizedPCA, build time: ' + str(time_delta)) ax.scatter(rpca[:,0], rpca[:,1], c='red', marker='.', alpha=0.75)

No visual differences between the transformed PCA results and the transformed *RandomizedPCA* results.

And was not executed faster.

# Extra: the Isomap algorithm

Goal: Replicate Joshua Tenenbaum’s – the primary creator of the isometric feature mapping algorithm – canonical, dimensionality reduction research experiment for visual perception.

His original dataset from December 2000 consists of 698 samples of 4096-dimensional vectors.

These vectors are the coded brightness values of 64×64-pixel heads that have been rendered facing various directions and lighted from many angles.

The dataset – originally stored on Stanford’s website – is now accessible on webArchive or on my gitHub.

Tasks:

– Applying both PCA and Isomap to the 698 raw images to derive 2D principal components and a 2D embedding of the data’s intrinsic geometric structure.

– Project both onto a 2D and 3D scatter plot, with a few superimposed face images on the associated samples.

## Read the data

Note: a .MAT file is a .MATLAB file.

import scipy.io mat = scipy.io.loadmat('Datasets/face_data.mat') df = pd.DataFrame(mat['images']).T num_images, num_pixels = df.shape num_pixels = int(math.sqrt(num_pixels)) df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 698 entries, 0 to 697

Columns: 4096 entries, 0 to 4095

dtypes: float64(4096)

memory usage: 21.8 MB

Pictures are rotated.

Rotate them, so we don’t have to crane our necks:

for i in range(num_images): df.loc[i,:] = df.loc[i,:].values.reshape(num_pixels, num_pixels).T.reshape(-1)

Prepare the function to plot the 2D images representation:

import random, math </pre><pre># The format is: Plot2D(T, title, x, y, num_to_plot=40): # T is the transformed data, NDArray. # title is the chart title # x is the principal component you want displayed on the x-axis, Can be 0 or 1 # y is the principal component you want displayed on the y-axis, Can be 1 or 2 def Plot2D(T, title, x, y, num_to_plot=40): # This method picks a bunch of random samples (images in our case) # to plot onto the chart: fig = plt.figure() ax = fig.add_subplot(111) ax.set_title(title) ax.set_xlabel('Component: {0}'.format(x)) ax.set_ylabel('Component: {0}'.format(y)) x_size = (max(T[:,x]) - min(T[:,x])) * 0.08 y_size = (max(T[:,y]) - min(T[:,y])) * 0.08 for i in range(num_to_plot): img_num = int(random.random() * num_images) x0, y0 = T[img_num,x]-x_size/2., T[img_num,y]-y_size/2. x1, y1 = T[img_num,x]+x_size/2., T[img_num,y]+y_size/2. img = df.iloc[img_num,:].values.reshape(num_pixels, num_pixels) ax.imshow(img, aspect='auto', cmap=plt.cm.gray, interpolation='nearest', zorder=100000, extent=(x0, x1, y0, y1)) # It also plots the full scatter: ax.scatter(T[:,x],T[:,y], marker='.',alpha=0.7)

## Reduce components using PCA

Reduce the dataframe down to THREE components.

pca = PCA(n_components=3) pca.fit(df) T = pca.transform(df)

We plot two of the components:

Plot2D(T, "PCA transformation", 0, 1, num_to_plot=40)

## Reduce components using Isomap

Again, reduce the dataframe down to THREE components:

from sklearn import manifold iso = manifold.Isomap(n_neighbors=8, n_components=3) iso.fit(df) manifold = iso.transform(df) Plot2D(manifold, "ISO transformation", 0, 1, num_to_plot=40)

Between linear PCA and the non-linear Isomap, the latter algorithm is better able to capture the true nature of the faces dataset when reduced to two components.

Each coordinate axis of the 3D manifold correlate highly with one degree of freedom from the original, underlying data.

In the isomap plot of the first two components (0 and 1), above, you can see that’s the Left and Right Head Position ‘degree of freedom’ was encoded onto the first component (the X-axis). In other words, what varies as you move horizontally in the manifold rendering.

Look at the code on GitHub for more insights on the Isomap example.

# Final thoughts on PCA

- It is most useful for linear-type models.
- It can make it harder to interpret predictors (as they are combined)
- watch out for outliers!
- transform first with logs or boxcox plot predictors to identify problems
- bad use of PCA: to prevent overfitting. Use regularisation instead