PCA: Principal Component Analysis

Principal component analysis (PCA) is an unsupervised linear transformation technique that is widely used across many fields, most prominently for dimensionality reduction, noise reduction and identifying patterns in the data based on the correlations between features.

The basic idea is that we might not need every predictor: some predictors are correlated, and a weighted combination of predictors may work better than the raw features. We can simply pick the combinations that capture the most information possible.
The benefit, besides reducing the number of predictors, is also a reduction in noise (thanks to the averaging).

In a nutshell, PCA aims to find the directions of maximum variance in high-dimensional data and projects it onto a new subspace with equal or fewer dimensions than the original one.

This is why PCA is useful for dimensionality reduction: it allows us to map a sample vector onto a new smaller feature subspace that has fewer dimensions than the original feature space.

The PCA algorithm for dimensionality reduction (from n dimensions down to k dimensions) consists of a few simple steps (a condensed NumPy sketch follows the list):

  • Pre-processing: standardise the n-dimensional dataset by feature scaling and mean normalisation (so that every feature has zero mean) on the training set.
  • Compute the covariance matrix.
  • Decompose the covariance matrix into its eigenvectors and eigenvalues.
  • Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace.
  • Construct a projection matrix from the “top” k eigenvectors.
  • Transform the n-dimensional input dataset using the projection matrix to obtain the new k-dimensional feature subspace.
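For readers who like to see the whole recipe at once, here is the same algorithm condensed into a few lines of plain NumPy. It is only a sketch of the steps above (the function name pca_project is just illustrative); the rest of the post performs the same steps one at a time.

import numpy as np

def pca_project(X, k):
    # 1. standardise: zero mean and unit variance for every feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. covariance matrix of the standardised features
    cov = np.cov(X_std.T)
    # 3. eigendecomposition (eigh is fine here: the covariance matrix is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. sort the eigenvectors by decreasing eigenvalue and keep the top k
    order = np.argsort(eigenvalues)[::-1]
    W = eigenvectors[:, order[:k]]   # 5. projection matrix (n features x k)
    # 6. project the data onto the new k-dimensional subspace
    return X_std @ W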

Variance and eigenvalues

We will tackle the first four steps of a principal component analysis through a simple example: standardising the data, constructing the covariance matrix, obtaining the eigenvalues and eigenvectors of the covariance matrix and finally sorting the eigenvalues by decreasing order to rank the eigenvectors.

We will do that using the wheat dataset that we already used for the Decision Tree example, where we had to pick just two predictors: the perimeter and the groove. Maybe a different pair would have been better?

The dataset is available on the UCI’s wheat-seeds page.

You can find the dataset and this code in my GitHub repository as well.

import pandas as pd 

#
# Load up the dataset into a variable called X.
#
X = pd.read_csv("datasets/wheat.data", index_col='id')
X.head()

     area  perimeter  compactness  length  width  asymmetry  groove wheat_type
id
0   15.26      14.84       0.8710   5.763  3.312      2.221   5.220       kama
1   14.88      14.57       0.8811   5.554  3.333      1.018   4.956       kama
2   14.29      14.09       0.9050   5.291  3.337      2.699   4.825       kama
3   13.84      13.94       0.8955   5.324  3.379      2.259   4.805       kama
4   16.14      14.99       0.9034   5.658  3.562      1.355   5.175       kama

Standardise the data

Pre-process the data: drop the NaN values and extract the labels for the classification:

X.dropna(axis=0, inplace=True)

labels = X.wheat_type.copy() # copy “y” values out
X.drop(['wheat_type'], axis=1, inplace=True) # drop output column

labels = labels.map({'canadian':0, 'kama':1, 'rosa':2}) # encoding

Next we split the data into separate training and testing sets, using the usual 70-30 proportions.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)

Up to now we have followed the same procedure as with the Decision Tree model.
Now we will standardise the wheat data to zero mean and unit variance using SKLearn.
SKLearn has many different methods for transforming features by scaling them (a type of pre-processing): RobustScaler, Normalizer, MinMaxScaler, MaxAbsScaler, StandardScaler...

However, for PCA to be effective there are a few requirements that must be met, and they drive the selection of the scaler.
PCA requires the data to be standardised: in other words, every feature should have zero mean and unit variance.

SKLearn’s regular Normalizer doesn’t zero out the mean of your data (it rescales each sample to unit norm), so it’s inappropriate to use here.
MinMaxScaler and MaxAbsScaler both fail to produce unit variance, so we won’t be using them either.
RobustScaler can work, depending on the data, since it uses statistics that are resistant to outliers.
For these reasons we’re going to use the StandardScaler.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

Construct the covariance matrix

After completing the mandatory preprocessing step we move to the second step: constructing the covariance matrix. We use NumPy for this.
The symmetric covariance matrix stores the pairwise covariances between the different features.

import numpy as np
covMatrix = np.cov(X_train_std.T)
covMatrix

array([[ 1.0070922 , 1.0011297 , 0.61670387, 0.95285342, 0.97793977, -0.22604464, 0.87202777],
[ 1.0011297 , 1.0070922 , 0.53529585, 0.97686436, 0.95131999, -0.21918194, 0.90047068],
[ 0.61670387, 0.53529585, 1.0070922 , 0.36459673, 0.76670126, -0.29691044, 0.22746871],
[ 0.95285342, 0.97686436, 0.36459673, 1.0070922 , 0.86062855, -0.18497896, 0.94432928],
[ 0.97793977, 0.95131999, 0.76670126, 0.86062855, 1.0070922 , -0.24239365, 0.75483918],
[-0.22604464, -0.21918194, -0.29691044, -0.18497896, -0.24239365, 1.0070922 , -0.0444695 ],
[ 0.87202777, 0.90047068, 0.22746871, 0.94432928, 0.75483918, -0.0444695 , 1.0070922 ]])

As we started from a dataset with 7 features, we got a 7×7 covariance matrix.

A positive covariance between two features indicates that the features increase or decrease together, whereas a negative covariance indicates that the features vary in opposite directions. I talked about covariance in a previous post.
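For standardised (zero-mean) data the covariance matrix is simply X.T @ X divided by n - 1, so we can double-check NumPy's result by hand. This also explains why the diagonal entries above read 1.007 rather than exactly 1: StandardScaler divides by n while np.cov divides by n - 1.

n = X_train_std.shape[0]
covManual = X_train_std.T @ X_train_std / (n - 1)   # same formula np.cov applies to centred data
print(np.allclose(covManual, covMatrix))            # True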

Obtain and sort the eigenvalues and eigenvectors

The eigenvectors of the covariance matrix represent the principal components (the directions of maximum variance), whereas the corresponding eigenvalues will define their magnitude.

In the case of the Wheat dataset, we would obtain 7 eigenvectors and eigenvalues from the 7-dimensional covariance matrix.

Since the manual computation of eigenvectors and eigenvalues is a tedious task, we will use the linalg.eig function from NumPy to obtain the eigenpairs of the Wheat covariance matrix:

eigenvalues, eigenvectors = np.linalg.eig(covMatrix)
f"Eigenvalues:  {eigenvalues}." # the f print function works from Python 3.6, you can use print otherwise

 'Eigenvalues: [5.06540460e+00 1.15619185e+00 7.36028723e-01 6.50774882e-02 2.04535544e-02 7.88788482e-04 5.70039211e-03].'
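A small aside: since the covariance matrix is symmetric, NumPy's linalg.eigh is arguably the better tool here, as it guarantees real eigenvalues and returns them in ascending order. A quick cross-check against the eig result:

w, v = np.linalg.eigh(covMatrix)                      # eigh is meant for symmetric matrices
print(np.allclose(np.sort(w), np.sort(eigenvalues)))  # True: same spectrum, different ordering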

Since we want to reduce the dimensionality of our dataset by compressing it onto a new feature subspace, we only select the subset of the eigenvectors (principal components) that contains most of the information (variance).
As the eigenvalues define the magnitude of the eigenvectors, we can sort the eigenvalues by descending magnitude and select the top eigenvectors based on the values of their corresponding eigenvalues.

tot = sum(eigenvalues)
varianceExplained = [(i / tot) for i in sorted(eigenvalues, reverse=True)]
cumVarianceExp = np.cumsum(varianceExplained) # cumulative
varianceExplained

[0.7185332477964889,
0.16400709284053022,
0.10440648886885688,
0.009231313716071133,
0.0029013593303879085,
0.0008086069285229908,
0.0001118905191418737]

Let’s plot the variance explained ratios of the eigenvalues:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 5))
width = 0.8
tickLocations = np.arange(0,7)
rectLocations = tickLocations-(width/2.0)
labels = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7']

ax.bar(rectLocations, varianceExplained, width, color='wheat', edgecolor='brown', linewidth=1.0, align='center',label='individual explained variance')
ax.step(tickLocations, cumVarianceExp, where='mid',label='cumulative explained variance')

# --- pretty-up the plot
ax.set_xticks(ticks= tickLocations)
ax.set_xticklabels(labels)

fig.suptitle("Explained Variance for the eigenvalues")

ax.yaxis.grid(True)
ax.set_ylabel('Explained variance ratio')
ax.set_xlabel('Principal components')
ax.legend(loc='center right');

[Figure: bar plot of the individual explained variance per principal component, with the cumulative explained variance shown as a step line]

The resulting plot indicates that the first principal component alone accounts for about 72 percent of the variance, and that the first two principal components combined explain almost 90 percent of the variance in the data.

Transform the feature space into a smaller one

After we have successfully decomposed the covariance matrix into eigenpairs, let’s now proceed with the last steps to transform the Wheat dataset onto the new principal component axes.
In this section, we will use the sorted eigenpairs by descending order of the eigenvalues, to construct a projection matrix from the selected eigenvectors and use the projection matrix to transform the data onto the lower-dimensional subspace.

eigenpairs = [(np.abs(eigenvalues[i]), eigenvectors[:, i]) for i in range(len(eigenvalues))]
eigenpairs.sort(key=lambda pair: pair[0], reverse=True)  # sort by eigenvalue, largest first

Next, we collect the two eigenvectors that correspond to the two largest eigenvalues, which together capture almost 90 percent of the variance in this dataset.
Note that we chose only two eigenvectors for the purpose of illustration, since we are going to plot the data in a two-dimensional scatter plot later in this subsection.
In practice, the number of principal components has to be determined as a trade-off between computational efficiency and the performance of the classifier.
In this case two is a good choice, as it accounts for almost 90% of the variance.
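If we wanted to pick the number of components programmatically rather than by eye, a common rule of thumb is to keep the smallest k whose cumulative explained variance reaches a chosen threshold. A small sketch reusing cumVarianceExp from above:

threshold = 0.85
k = int(np.argmax(cumVarianceExp >= threshold) + 1)   # smallest k reaching the threshold
print(k)   # 2 in our case: the first two components explain about 88% of the variance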

V = np.hstack((eigenpairs[0][1][:, np.newaxis], eigenpairs[1][1][:, np.newaxis]))
print('Matrix V:\n', V)

Matrix V:
[[ 0.44455801 0.02566934]
[ 0.44188066 0.08388916]
[ 0.27591508 -0.53312904]
[ 0.42307046 0.20911146]
[ 0.43198716 -0.11511336]
[-0.11816204 0.71635014]
[ 0.38912303 0.37140397]]

We have created a 7×2-dimensional projection matrix V from the top two eigenvectors. Using this projection matrix, we can now transform a sample X onto the PCA subspace, obtaining X_pca, now a two-dimensional sample vector consisting of two new features.

X_train_std[0].dot(V) # what the first row looks like

array([3.06229007, 0.67948187])

X_train_pca = X_train_std.dot(V)

Lastly, let’s visualise the transformed Wheat training set in a two-dimensional scatterplot:

fig, ax = plt.subplots(figsize=(10, 7))

colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']
classes = ['canadian', 'kama', 'rosa']

for l, c, m in zip(np.unique(y_train), colors, markers):
    ax.scatter(X_train_pca[y_train==l, 0], X_train_pca[y_train==l, 1], c=c, label=classes[l], marker=m)

ax.set_xlabel('PC 1')
ax.set_ylabel('PC 2')
ax.legend(loc='lower right')

fig.suptitle("Wheat type distribution by Principal Components");

[Figure: scatter plot of the wheat training set projected onto the first two principal components, coloured by wheat type]

As we can see in the plot, the data is more spread along the x-axis — the first principal component — than the second principal component (y-axis), which is consistent with the explained variance ratio plot that we created in the previous subsection.
Moreover, we can intuitively see that a linear classifier will likely be able to separate the classes quite well.
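As a cross-check, SKLearn's PCA reproduces the same projection in a couple of lines; the components may come out with flipped signs, which is harmless since an eigenvector is only defined up to its sign:

from sklearn.decomposition import PCA

pcaCheck = PCA(n_components=2)
X_train_pca_sk = pcaCheck.fit_transform(X_train_std)
print(pcaCheck.explained_variance_ratio_)   # approximately [0.719, 0.164], matching the manual computation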

Examples of Principal Component Analysis using SKlearn

Let’s see how PCA (using SKlearn) can be used in concrete examples.

Kidney study

Goal: reduce the dense kidney clinic study feature set to its two main components.

We use a dataset for predicting chronic kidney disease, which can be found in the usual UCI repository:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

You can take a look at the dataset webpage in the attribute info section.

As usual, we start by reading the data:

df = pd.read_csv("Datasets/kidney_disease.csv", index_col='id')
df.head()

     age    bp     sg   al   su     rbc        pc         pcc          ba    bgr  ...  pcv    wc   rc  htn   dm cad appet   pe  ane classification
id
0   48.0  80.0  1.020  1.0  0.0     NaN    normal  notpresent  notpresent  121.0  ...   44  7800  5.2  yes  yes  no  good   no   no            ckd
1    7.0  50.0  1.020  4.0  0.0     NaN    normal  notpresent  notpresent    NaN  ...   38  6000  NaN   no   no  no  good   no   no            ckd
2   62.0  80.0  1.010  2.0  3.0  normal    normal  notpresent  notpresent  423.0  ...   31  7500  NaN   no  yes  no  poor   no  yes            ckd
3   48.0  70.0  1.005  4.0  0.0  normal  abnormal     present  notpresent  117.0  ...   32  6700  3.9  yes   no  no  poor  yes  yes            ckd
4   51.0  80.0  1.010  2.0  0.0  normal    normal  notpresent  notpresent  106.0  ...   35  7300  4.6   no   no  no  good   no   no            ckd

5 rows × 25 columns  

There are some NaN values. Let’s drop them.

df.dropna(inplace=True)
df.info()

 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 158 entries, 3 to 399
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             158 non-null    float64
 1   bp              158 non-null    float64
 2   sg              158 non-null    float64
 3   al              158 non-null    float64
 4   su              158 non-null    float64
 5   rbc             158 non-null    object 
 6   pc              158 non-null    object 
 7   pcc             158 non-null    object 
 8   ba              158 non-null    object 
 9   bgr             158 non-null    float64
 10  bu              158 non-null    float64
 11  sc              158 non-null    float64
 12  sod             158 non-null    float64
 13  pot             158 non-null    float64
 14  hemo            158 non-null    float64
 15  pcv             158 non-null    object 
 16  wc              158 non-null    object 
 17  rc              158 non-null    object 
 18  htn             158 non-null    object 
 19  dm              158 non-null    object 
 20  cad             158 non-null    object 
 21  appet           158 non-null    object 
 22  pe              158 non-null    object 
 23  ane             158 non-null    object 
 24  classification  158 non-null    object 
dtypes: float64(11), object(14)
memory usage: 32.1+ KB

There are 158 entries.
Each one has the classification target column, plus 24 other features: 11 numeric and 13 nominal (three of the nominal ones, pcv, wc and rc, are actually numeric values stored as text).
See the UCI page for an explanation of each attribute.

# Create some color coded labels; the actual label feature
# will be removed prior to executing PCA, since it's unsupervised.
# we're only labeling by color so you can see the effects of PCA

labelsCol = ['red' if i=='ckd' else 'green' for i in df.classification]

We start by dropping the target column and one-hot encoding the nominal (object) features:

# Drop the target column and one-hot encode the nominal columns
#

df.drop(labels=['classification'], axis=1, inplace=True)

df = pd.get_dummies(df, columns=['rbc', 'pc', 'pcc', 'ba', 'htn',
                                 'dm', 'cad', 'appet', 'pe', 'ane'])

And the last three object columns can be converted to a numeric type:

# Print out and check the dataframe's dtypes.
#

df.wc = pd.to_numeric(df.wc, errors='coerce')
df.rc = pd.to_numeric(df.rc, errors='coerce')
df.pcv = pd.to_numeric(df.pcv, errors='coerce')

print(df.dtypes)

age float64
bp float64
sg float64
al float64
su float64
bgr float64
bu float64
sc float64
sod float64
pot float64
hemo float64
pcv int64
wc int64
rc float64
rbc_abnormal uint8
rbc_normal uint8
pc_abnormal uint8
pc_normal uint8
pcc_notpresent uint8
pcc_present uint8
ba_notpresent uint8
ba_present uint8
htn_no uint8
htn_yes uint8
dm_no uint8
dm_yes uint8
cad_no uint8
cad_yes uint8
appet_good uint8
appet_poor uint8
pe_no uint8
pe_yes uint8
ane_no uint8
ane_yes uint8
dtype: object

After adding in all the other features and properly encoding those that need to be, the separation between CKD and non-CKD patients increases significantly: by taking many secondary features and combining them, machine learning is usually able to come up with more informative descriptions of the data.

PCA operates on variance: the variable with the greatest variance will dominate.
Let’s peek at the data and check the variance of every feature in the dataset.

df.var()

age 2.406297e+02
bp 1.248891e+02
sg 3.023865e-05
al 1.996936e+00
su 6.616141e-01
bgr 4.217182e+03
bu 2.246322e+03
sc 9.471717e+00
sod 5.609143e+01
pot 1.208501e+01
hemo 8.307100e+00
pcv 8.290402e+01
wc 9.777380e+06
rc 1.039104e+00
rbc_abnormal 1.015883e-01
rbc_normal 1.015883e-01
pc_abnormal 1.508103e-01
pc_normal 1.508103e-01
pcc_notpresent 8.127066e-02
pcc_present 8.127066e-02
ba_notpresent 7.062807e-02
ba_present 7.062807e-02
htn_no 1.699589e-01
htn_yes 1.699589e-01
dm_no 1.467387e-01
dm_yes 1.467387e-01
cad_no 6.518584e-02
cad_yes 6.518584e-02
appet_good 1.064662e-01
appet_poor 1.064662e-01
pe_no 1.112634e-01
pe_yes 1.112634e-01
ane_no 9.159074e-02
ane_yes 9.159074e-02
dtype: float64

The features with by far the largest variance are wc, bgr and bu.

df.describe()

              age          bp          sg          al          su         bgr          bu          sc         sod         pot
count  158.000000  158.000000  158.000000  158.000000  158.000000  158.000000  158.000000  158.000000  158.000000  158.000000
mean    49.563291   74.050633    1.019873    0.797468    0.253165  131.341772   52.575949    2.188608  138.848101    4.636709
std     15.512244   11.175381    0.005499    1.413130    0.813397   64.939832   47.395382    3.077615    7.489421    3.476351
min      6.000000   50.000000    1.005000    0.000000    0.000000   70.000000   10.000000    0.400000  111.000000    2.500000
25%     39.250000   60.000000    1.020000    0.000000    0.000000   97.000000   26.000000    0.700000  135.000000    3.700000
50%     50.500000   80.000000    1.020000    0.000000    0.000000  115.500000   39.500000    1.100000  139.000000    4.500000
75%     60.000000   80.000000    1.025000    1.000000    0.000000  131.750000   49.750000    1.600000  144.000000    4.900000
max     83.000000  110.000000    1.025000    4.000000    5.000000  490.000000  309.000000   15.200000  150.000000   47.000000

            dm_no      dm_yes      cad_no     cad_yes  appet_good  appet_poor       pe_no      pe_yes      ane_no     ane_yes
count  158.000000  158.000000  158.000000  158.000000  158.000000  158.000000  158.000000  158.000000  158.000000  158.000000
mean     0.822785    0.177215    0.930380    0.069620    0.879747    0.120253    0.873418    0.126582    0.898734    0.101266
std      0.383065    0.383065    0.255315    0.255315    0.326292    0.326292    0.333562    0.333562    0.302640    0.302640
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
25%      1.000000    0.000000    1.000000    0.000000    1.000000    0.000000    1.000000    0.000000    1.000000    0.000000
50%      1.000000    0.000000    1.000000    0.000000    1.000000    0.000000    1.000000    0.000000    1.000000    0.000000
75%      1.000000    0.000000    1.000000    0.000000    1.000000    0.000000    1.000000    0.000000    1.000000    0.000000
max      1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000

8 rows × 34 columns (the middle columns are elided by pandas)

 

The first thing PCA does is center the dataset about its mean by subtracting the mean from each sample. Looking at the .describe() output of the dataset, particularly the min, max, and mean readings per feature, some features will dominate the X axis (likely wc) and some the Y axis (likely bgr).
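This centring is done internally by SKLearn and is exposed afterwards through the fitted model's mean_ attribute; a quick check with a throw-away model fitted on df:

from sklearn.decomposition import PCA

centreCheck = PCA(n_components=2).fit(df)
print(np.allclose(centreCheck.mean_, df.mean().values))   # True: PCA centres on the feature means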

Now we use PCA from SKLearn:

from sklearn.decomposition import PCA 

# Run PCA on the dataset and reduce it to 2 components

pcaModel = PCA(n_components=2)
pcaModel.fit(df) # fit the model

T = pcaModel.transform(df) # T is an array with 158 rows (one per entry of df) and two columns (the components)
pcaModel.explained_variance_ratio_ 

array([9.99309365e-01, 4.57758701e-04])

explained_variance_ratio_ is the percentage of variance explained by each of the selected components.

singular_values_ is an array containing the singular values corresponding to each of the selected components; they are equal to the 2-norms of the n_components variables in the lower-dimensional space. (The n_components largest eigenvalues of the covariance matrix of X are exposed, instead, as explained_variance_.)

pcaModel.singular_values_

array([39180.18992145, 838.56138993])
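The two attributes are directly related: explained_variance_ (the eigenvalues) equals the squared singular values divided by n_samples - 1, which we can verify:

print(np.allclose(pcaModel.explained_variance_,
                  pcaModel.singular_values_ ** 2 / (len(df) - 1)))   # True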

T[0]

array([-1.77596096e+03, -1.26057430e+00])

Let’s plot the features with their vectors.

First we define a helper function to draw the vectors on a plot:

import math 

def drawVectors(transformed_features, components_, columns, plt):
    # This function projects the *original* features (columns)
    # onto the principal component feature space, so that you can
    # visualise how "important" each one was in the
    # multi-dimensional scaling

    num_columns = len(columns)

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:, 0])
    yvector = components_[1] * max(transformed_features[:, 1])

    # Sort each column by its length. These are the *original*
    # columns, not the principal components.
    important_features = {columns[i]: math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns)}
    important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True)
    print("Features by importance:\n", important_features)

    # Visualise the projections
    ax = plt.axes()

    for i in range(num_columns):
        # Use an arrow to project each original feature as a
        # labelled vector on the principal component axes
        plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
        plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75)

    return ax

And now we can plot:

# Plot the transformed data as a scatter plot. Recall that transforming
# the data will result in a NumPy NDArray. Therefore we convert it to DataFrame and have pandas
# do the plot.
#
# Since we transformed via PCA, we no longer have column names. We know we
# are in P.C. space, so we'll just define the coordinates accordingly:

ax = drawVectors(T, pcaModel.components_, df.columns.values, plt)
T = pd.DataFrame(T)
T.columns = ['PC 1', 'PC 2']
T.plot.scatter(x='PC 1', y='PC 2', marker='o', c=labelsCol, alpha=0.75, ax=ax);

Features by importance:
[(17924.12862704472, 'wc'), (309.72755212441524, 'bgr'), 
(133.4661920624122, 'bu'), (29.873170788201644, 'pcv'), 
(26.924712516320835, 'age'), (15.348887614206896, 'sod'), 
(14.728205659896712, 'bp'), (9.26958992968382, 'hemo'), 
(8.315782432584017, 'sc'), (4.812817492433102, 'al'), 
(3.854866211705058, 'pot'), (2.9506615569902213, 'rc'), 
(2.939116338871963, 'su'), (1.4500345016834413, 'dm_yes'), 
(1.4500345016834406, 'dm_no'), (1.4262548164785005, 'htn_no'), 
(1.4262548164785, 'htn_yes'), (1.0748414625899176, 'pc_normal'), 
(1.0748414625899176, 'pc_abnormal'), (0.8619132020709073, 'appet_poor'), 
(0.861913202070907, 'appet_good'), (0.8592833054679994, 'pe_yes'), 
(0.8592833054679994, 'pe_no'), (0.8504820097620381, 'rbc_abnormal'), 
(0.8504820097620378, 'rbc_normal'), (0.6274850804078085, 'cad_yes'), 
(0.6274850804078085, 'cad_no'), (0.49341600089668086, 'pcc_present'), 
(0.4934160008966808, 'pcc_notpresent'), (0.49150767269010703, 'ane_yes'), 
(0.49150767269010703, 'ane_no'), (0.4774667056822345, 'ba_notpresent'), 
(0.4774667056822342, 'ba_present'), (0.018194804297560947, 'sg')]
[Figure: scatter plot of the two principal components of the unscaled data, with the original feature vectors overlaid; the wc and bgr arrows dominate]

As expected, the two most important features (those with the highest variance) are wc and bgr.
According to the labelling, red points correspond to chronic kidney disease (CKD) patients and green points to non-CKD patients.
Looking at the scatter plot, the two classes are not completely separable: a few records are mixed together.

Now we repeat the PCA, but this time we scale the features first!

We prepare a helper function:

# A Note on SKLearn .transform() calls:
#
# Any time you transform your data, you lose the column header names.
# This actually makes complete sense. There are essentially two types
# of transformations,  those that change the scale of your features,
# and those that change your features entire. Changing the scale would
# be like changing centimeters to inches. Changing the features would
# be like using PCA to reduce 300 columns to 30. In either case, the
# original column's units have been altered or no longer exist, so it's
# up to you to rename your columns after ANY transformation. Due to
# this, SKLearn returns an NDArray from *transform() calls.

# ---------
# Feature scaling is the type of transformation that only changes the
# scale and not number of features, so we'll use the original dataset
# column names. However we'll keep in mind that the _units_ have been
# altered:

from sklearn import preprocessing

def scaleFeatures(usdf):
    scaled = preprocessing.StandardScaler().fit_transform(usdf)
    scaled = pd.DataFrame(scaled, columns=usdf.columns)
    print("New Variances:\n", scaled.var())
    print("New Describe:\n", scaled.describe())
    return scaled

And now we can proceed:

from sklearn import preprocessing
scaled = preprocessing.StandardScaler().fit_transform(df) # returns an array
S = pd.DataFrame(scaled, columns = df.columns) # make it a data frame
print ("New Variances:\n", S.var())

New Variances:
age 1.006369
bp 1.006369
sg 1.006369
al 1.006369
su 1.006369
bgr 1.006369
bu 1.006369
sc 1.006369
sod 1.006369
pot 1.006369
hemo 1.006369
pcv 1.006369
wc 1.006369
rc 1.006369
rbc_abnormal 1.006369
rbc_normal 1.006369
pc_abnormal 1.006369
pc_normal 1.006369
pcc_notpresent 1.006369
pcc_present 1.006369
ba_notpresent 1.006369
ba_present 1.006369
htn_no 1.006369
htn_yes 1.006369
dm_no 1.006369
dm_yes 1.006369
cad_no 1.006369
cad_yes 1.006369
appet_good 1.006369
appet_poor 1.006369
pe_no 1.006369
pe_yes 1.006369
ane_no 1.006369
ane_yes 1.006369
dtype: float64

print ("New Describe:\n", S.describe())

New Describe:
      age          bp             sg            al            su \
count 1.580000e+02 1.580000e+02   1.580000e+02  1.580000e+02  1.580000e+02 
mean  8.432074e-17 5.846238e-16  -1.304161e-15 -1.349132e-16 -2.248553e-17 
std   1.003180e+00 1.003180e+00   1.003180e+00  1.003180e+00  1.003180e+00 
min  -2.817246e+00 -2.158952e+00 -2.713365e+00 -5.661221e-01 -3.122333e-01 
25%  -6.669624e-01 -1.261282e+00  2.309247e-02 -5.661221e-01 -3.122333e-01 
50%   6.057713e-02 5.340564e-01   2.309247e-02 -5.661221e-01 -3.122333e-01 
75%   6.749439e-01 5.340564e-01   9.352451e-01  1.437770e-01 -3.122333e-01 
max   2.162358e+00 3.227064e+00   9.352451e-01  2.273474e+00  5.854375e+00 

      bgr           bu            sc         sod            pot \
count 1.580000e+02  1.580000e+02  158.000000  1.580000e+02  1.580000e+02 
mean -4.497106e-17  8.994212e-17  0.000000    9.893633e-16  5.621382e-17 
std   1.003180e+00  1.003180e+00  1.003180    1.003180e+00  1.003180e+00 
min  -9.475974e-01 -9.011706e-01 -0.583015   -3.730148e+00 -6.165957e-01 
25%  -5.305059e-01 -5.625116e-01 -0.485227   -5.154386e-01 -2.703085e-01 
50%  -2.447210e-01 -2.767680e-01 -0.354843    2.034626e-02 -3.945044e-02 
75%   6.306235e-03 -5.981458e-02 -0.191863    6.900774e-01  7.597862e-02 
max   5.540492e+00  5.427520e+00  4.241194    1.493755e+00  1.222489e+01 

      ... dm_no          dm_yes    cad_no        cad_yes \
count ...  1.580000e+02  158.0000  1.580000e+02  1.580000e+02 
mean  ... -1.349132e-16  0.000000  1.798842e-16 -6.745659e-17 
std   ...  1.003180e+00  1.003180  1.003180e+00  1.003180e+00 
min   ... -2.154729e+00 -0.464095 -3.655631e+00 -2.735506e-01 
25%   ...  4.640955e-01 -0.464095  2.735506e-01 -2.735506e-01 
50%   ...  4.640955e-01 -0.464095  2.735506e-01 -2.735506e-01 
75%   ...  4.640955e-01 -0.464095  2.735506e-01 -2.735506e-01 
max   ...  4.640955e-01  2.154729  2.735506e-01  3.655631e+00 

      appet_good   appet_poor    pe_no      pe_yes    ane_no \
count 1.580000e+02  1.580000e+02  158.0000  158.0000  1.580000e+02 
mean  4.497106e-17 -4.497106e-17  0.000000  0.000000  4.497106e-17 
std   1.003180e+00  1.003180e+00  1.003180  1.003180  1.003180e+00 
min  -2.704772e+00 -3.697170e-01 -2.626785 -0.380693 -2.979094e+00 
25%   3.697170e-01 -3.697170e-01  0.380693 -0.380693  3.356725e-01 
50%   3.697170e-01 -3.697170e-01  0.380693 -0.380693  3.356725e-01 
75%   3.697170e-01 -3.697170e-01  0.380693 -0.380693  3.356725e-01 
max   3.697170e-01  2.704772e+00  0.380693  2.626785  3.356725e-01 

      ane_yes 
count 1.580000e+02 
mean -8.994212e-17 
std   1.003180e+00 
min  -3.356725e-01 
25%  -3.356725e-01 
50%  -3.356725e-01 
75%  -3.356725e-01 
max   2.979094e+00 

[8 rows x 34 columns]

S.head()

        age        bp        sg        al        su       bgr        bu        sc       sod       pot  ...
0 -0.101098 -0.363613 -2.713365  2.273474 -0.312233 -0.221549  0.072474  0.525250 -3.730148 -0.616596  ...
1  0.222253  1.431726  0.023092  0.853676 -0.312233 -0.947597  1.151950  1.633514 -3.328309 -0.270309  ...
2  0.868954 -0.363613 -1.801213  1.563575 -0.312233  3.841231  0.157139  0.166693 -1.051224 -0.126022  ...
3  1.192305  0.534056 -1.801213  1.563575  2.154410  0.396364  0.792125  0.623038 -1.185170  0.508838  ...
4  0.739614  0.534056 -0.889060  0.853676 -0.312233  0.643529  2.019764  0.557846 -0.515439  0.162550  ...

5 rows × 34 columns (only the first 10 are shown; the remaining ones are the standardised dummy columns, dm_no … ane_yes)

# Run PCA on your dataset and reduce it to 2 components
#

pcaModel.fit(S)

TS = pcaModel.transform(S) # TS is an array of length 158 and two components

pcaModel.explained_variance_ratio_

array([0.46548191, 0.08269405])

The variance is now spread much more evenly: the first component no longer accounts for practically all of it, and the second component carries a meaningful share.

pcaModel.singular_values_

array([50.00568791, 21.07682243])

TS[0]

array([ 7.40571153, -5.4338862 ])

# Plot the transformed data as a scatter plot. Recall that transforming
# the data will result in a NumPy NDArray. You can either use MatPlotLib
# to graph it directly, or you can convert it to DataFrame and have pandas
# do it for you.
#
# Since we transformed via PCA, we no longer have column names. We know we
# are in P.C. space, so we'll just define the coordinates accordingly:

ax = drawVectors(TS, pcaModel.components_, S.columns.values, plt)
TS = pd.DataFrame(TS)
TS.columns = ['PC 1', 'PC 2']
TS.plot.scatter(x='PC 1', y='PC 2', marker='o', c=labelsCol, alpha=0.75, ax=ax);

Features by importance: 
[(2.969615039499333, 'ane_yes'), (2.969615039499333, 'ane_no'),
(2.758875925894978, 'bgr'), (2.7155049553940604, 'dm_yes'),
(2.7155049553940604, 'dm_no'), (2.6589320941392414, 'pcv'),
(2.645572100109875, 'hemo'), (2.60211266284851, 'al'),
(2.5934172193046927, 'htn_no'), (2.593417219304692, 'htn_yes'),
(2.5767305659998674, 'su'), (2.4859440233443406, 'cad_yes'),
(2.4859440233443406, 'cad_no'), (2.484132428132544, 'sc'),
(2.4794711137049417, 'pc_normal'), (2.4794711137049417, 'pc_abnormal'),
(2.465109416186055, 'appet_poor'), (2.465109416186054, 'appet_good'),
(2.4575665060850933, 'bu'), (2.393726969612571, 'sg'),
(2.388197579628534, 'rc'), (2.210578246131017, 'pe_yes'),
(2.210578246131017, 'pe_no'), (2.182818062607651, 'sod'),
(2.005926150817528, 'rbc_normal'), (2.005926150817528, 'rbc_abnormal'),
(1.9861731688066884, 'ba_present'), (1.986173168806688, 'ba_notpresent'),
(1.9842911319072, 'pcc_present'), (1.9842911319071999, 'pcc_notpresent'),
(1.2846796771566376, 'age'), (1.054194604216631, 'bp'),
(0.919384598744807, 'wc'), (0.5640412875529991, 'pot')]

[Figure: scatter plot of the two principal components computed on the scaled data, with the original feature vectors overlaid; the two classes now separate more cleanly]

Scaling the features has equalised their variances, so no single feature dominates the decomposition any more. After scaling, the green patients without chronic kidney disease separate much more cleanly from the red patients with chronic kidney disease.
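In practice, the scaling and the decomposition are usually chained in a SKLearn Pipeline, so that the scaler is always fitted together with the PCA and applied consistently to any new data. A minimal sketch on the same df (the variable names are just illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=2)),
])
T_pipe = pipe.fit_transform(df)   # equivalent to scaling and then applying PCA by hand
print(pipe.named_steps['pca'].explained_variance_ratio_)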

Armadillo: From 3D rendering to 2D plot

The goal is to reduce the dimensionality of a 3D scan from three to two using PCA to cast a shadow of the data onto its two most important principal components. Then render the resulting 2D scatter plot.

The scan is of a real-life armadillo sculpture, digitised with a Cyberware 3030 MS 3D scanner at Stanford University. It is available as part of their 3D Scanning Repository, where the PLY file can be downloaded, and is a very dense 3D mesh consisting of 172,974 vertices!

Read the data

We keep one vertex out of every 100 data samples.
If things run too slowly, try increasing this number.
If things run too quickly, try decreasing it… =)

reduce_factor = 100 # this is the number to change

from plyfile import PlyData

# Load up the scanned armadillo
plyfile = PlyData.read('Datasets/stanford_armadillo.ply')
armadillo = pd.DataFrame({
  'x':plyfile['vertex']['z'][::reduce_factor],
  'y':plyfile['vertex']['x'][::reduce_factor],
  'z':plyfile['vertex']['y'][::reduce_factor]
})

Render the Original Armadillo

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_title('Armadillo 3D')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.scatter(armadillo.x, armadillo.y, armadillo.z, c='green', marker='.', alpha=0.75);

[Figure: 3D scatter plot of the decimated armadillo mesh]

Decompose to 2D using PCA

from sklearn.decomposition import PCA

def do_PCA(armadillo):
  #
  # Train a PCA model on the armadillo dataframe, then
  # drop one dimension (reduce it down to 2D) and project the
  # armadillo down to the 2D principal component feature space.
  #
  
  pca = PCA(n_components=2)
  pca.fit(armadillo)

  reducedArmadillo = pca.transform(armadillo)

  return reducedArmadillo

Time the execution of PCA over 5000 runs: PCA is run 5000 times in order to reduce the chance that rogue processes alter the measured speed of execution.

import datetime

t1 = datetime.datetime.now()
for i in range(5000): pca = do_PCA(armadillo)
time_delta = datetime.datetime.now() - t1

Render the newly transformed PCA armadillo!

 

if pca is not None:
  fig = plt.figure()
  ax = fig.add_subplot(111)
  ax.set_title('PCA, build time: ' + str(time_delta))
  ax.scatter(pca[:,0], pca[:,1], c='blue', marker='.', alpha=0.75)

[Figure: 2D scatter plot of the armadillo projected onto its first two principal components]

Just as an example, we can also run PCA with a different solver, specifically the randomised (truncated) SVD solver. To find out how to use it, check out the full docs.

def do_RandomizedPCA(armadillo):

  pca = PCA(n_components=2, svd_solver='randomized')
  pca.fit(armadillo)

  reducedArmadillo = pca.transform(armadillo)

  return reducedArmadillo

Time the execution of randomized PCA 5000x

t1 = datetime.datetime.now()
for i in range(5000): rpca = do_RandomizedPCA(armadillo)
time_delta = datetime.datetime.now() - t1

Render the newly transformed armadillo!

if rpca is not None:
  fig = plt.figure()
  ax = fig.add_subplot(111)
  ax.set_title('RandomizedPCA, build time: ' + str(time_delta))
  ax.scatter(rpca[:,0], rpca[:,1], c='red', marker='.', alpha=0.75)

[Figure: 2D scatter plot of the armadillo produced with the randomised SVD solver]

There are no visual differences between the PCA results and the randomised-PCA results, and in this case the randomised solver was not faster either.

Extra: the Isomap algorithm

Goal: replicate Joshua Tenenbaum’s (the primary creator of the isometric feature mapping algorithm) canonical dimensionality-reduction research experiment for visual perception.
His original dataset from December 2000 consists of 698 samples of 4096-dimensional vectors.
These vectors are the coded brightness values of 64×64-pixel heads that have been rendered facing various directions and lit from many angles.
The dataset, originally stored on Stanford’s website, is now accessible on the Web Archive or on my GitHub.

[Figure: a few sample 64×64 face images from the dataset]

Tasks:
– Apply both PCA and Isomap to the 698 raw images to derive 2D principal components and a 2D embedding of the data’s intrinsic geometric structure.
– Project both onto 2D and 3D scatter plots, with a few face images superimposed on the associated samples.

Read the data

Note: a .mat file is a MATLAB data file.

import scipy.io

mat = scipy.io.loadmat('Datasets/face_data.mat')
df = pd.DataFrame(mat['images']).T
num_images, num_pixels = df.shape
num_pixels = int(math.sqrt(num_pixels))

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 698 entries, 0 to 697
Columns: 4096 entries, 0 to 4095
dtypes: float64(4096)
memory usage: 21.8 MB


The pictures are stored rotated.
Let’s rotate them back, so we don’t have to crane our necks:

for i in range(num_images):
  df.loc[i,:] = df.loc[i,:].values.reshape(num_pixels, num_pixels).T.reshape(-1)

Prepare the function to plot the 2D images representation:

import random, math
# The format is: Plot2D(T, title, x, y, num_to_plot=40):
# T is the transformed data (NDArray)
# title is the chart title
# x is the principal component to display on the x-axis (can be 0 or 1)
# y is the principal component to display on the y-axis (can be 1 or 2)

def Plot2D(T, title, x, y, num_to_plot=40):
  # This method picks a bunch of random samples (images in our case)
  # to plot onto the chart:
  fig = plt.figure()
  ax = fig.add_subplot(111)
    
  ax.set_title(title)
  ax.set_xlabel('Component: {0}'.format(x))
  ax.set_ylabel('Component: {0}'.format(y))

  x_size = (max(T[:,x]) - min(T[:,x])) * 0.08
  y_size = (max(T[:,y]) - min(T[:,y])) * 0.08

  for i in range(num_to_plot):
    img_num = int(random.random() * num_images)
    x0, y0 = T[img_num,x]-x_size/2., T[img_num,y]-y_size/2.
    x1, y1 = T[img_num,x]+x_size/2., T[img_num,y]+y_size/2.
    img = df.iloc[img_num,:].values.reshape(num_pixels, num_pixels)
    ax.imshow(img, aspect='auto', cmap=plt.cm.gray, interpolation='nearest', zorder=100000, extent=(x0, x1, y0, y1))

  # It also plots the full scatter:
  ax.scatter(T[:,x],T[:,y], marker='.',alpha=0.7)

Reduce components using PCA

Reduce the dataframe down to THREE components.

pca = PCA(n_components=3)
pca.fit(df)

T = pca.transform(df)

We plot two of the components:

Plot2D(T, "PCA transformation", 0, 1, num_to_plot=40)


[Figure: PCA projection of the face images onto components 0 and 1, with sample faces superimposed]

Reduce components using Isomap

Again, reduce the dataframe down to THREE components:

from sklearn import manifold

iso = manifold.Isomap(n_neighbors=8, n_components=3)
iso.fit(df)
T_iso = iso.transform(df)  # renamed so the result doesn't shadow the manifold module
Plot2D(T_iso, "ISO transformation", 0, 1, num_to_plot=40)

[Figure: Isomap embedding of the face images onto components 0 and 1, with sample faces superimposed]

Between linear PCA and the non-linear Isomap, the latter algorithm is better able to capture the true nature of the faces dataset when reduced to two components.

Each coordinate axis of the 3D manifold correlates highly with one degree of freedom of the original, underlying data.
In the Isomap plot of the first two components (0 and 1) above, you can see that the left-right head position ‘degree of freedom’ was encoded onto the first component (the x-axis): it is what varies as you move horizontally in the manifold rendering.
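The remaining pose and lighting variations end up on the other components; the same helper can be used to look at them, for example at components 1 and 2 of the Isomap embedding:

Plot2D(T_iso, "ISO transformation, components 1 and 2", 1, 2, num_to_plot=40)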

Look at the code on GitHub for more insights on the Isomap example.

Final thoughts on PCA

  • It is most useful for linear-type models.
  • It can make the predictors harder to interpret (as they are combined).
  • Watch out for outliers!
  • Transform skewed predictors first (e.g. with a log or Box-Cox transform) and plot them to identify problems; see the sketch after this list.
  • A bad use of PCA: trying to prevent overfitting. Use regularisation instead.
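As an illustration of the Box-Cox point above, SKLearn's PowerTransformer can be chained in front of the PCA. This is only a sketch (it is not used anywhere else in this post); Yeo-Johnson is chosen because, unlike Box-Cox, it also accepts zero and negative values:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.decomposition import PCA

skewAwarePCA = Pipeline([
    ('power', PowerTransformer(method='yeo-johnson', standardize=True)),   # un-skew, then zero mean / unit variance
    ('pca', PCA(n_components=2)),
])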
