Random Forest

We know that decision trees have a tendency to overfit. One way of avoiding this is to build and combine multiple decision trees, a technique called random forests.

But before talking about forests, a word about Bootstrap aggregation (or Bagging) and one about Boosting. If you would rather jump straight to the random forest coding example, skip ahead to the “Human Activity Prediction” chapter.

Bootstrap aggregation (or bagging the trees)

Bootstrap aggregation is a general-purpose procedure for reducing the variance of a statistical learning method; it is particularly useful and frequently used in the context of decision trees.

As we said, a random forest needs multiple, different trees, but since our tree-building process is deterministic we need a different technique: rather than training each tree on all the inputs in the training set, we train each tree on different data, so that each tree differs from every other tree.

Recall that given a set of n independent observations Z_1, . . . , Z_n, each with variance \sigma^{2}, the variance of their mean \bar{Z} is given by:

\mathrm{Var}(\bar{Z}) = \frac{\sigma^{2}}{n}
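
This follows because, for independent observations, the variance of a sum is the sum of the variances:

\mathrm{Var}(\bar{Z}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} Z_{i}\right) = \frac{1}{n^{2}}\sum_{i=1}^{n}\mathrm{Var}(Z_{i}) = \frac{n\sigma^{2}}{n^{2}} = \frac{\sigma^{2}}{n}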

In other words, averaging a set of observations reduces variance.

Of course, averaging many models built on separate training sets is not practical because we generally do not have access to multiple training sets.
Instead, we can bootstrap, by taking repeated samples from the (single) training data set.

In this approach we generate B different bootstrapped training data sets. We train our method on the i-th bootstrapped training set to get the prediction at a point x, f_{i}(x), and then average all the predictions to obtain

f(x) = \frac{1}{B}\sum_{i=1}^{B} f_{i}(x)

This is called bagging.
It works great for overfit models (it decreases the variance without changing the bias) but doesn’t help much with underfit or high bias models.

For regression trees: instead of pruning the tree to reduce the variance (which increases the bias) you keep the trees as they are, with low bias, and get rid of the variance by averaging.

For classification trees: for each test observation, we record the class predicted by each of the B trees, and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.
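
As a concrete illustration, here is a minimal sketch of bagging regression trees by hand, using scikit-learn’s DecisionTreeRegressor on a synthetic dataset (the toy data and variable names are mine, purely for illustration):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy regression data, for illustration only
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rng = np.random.default_rng(0)
B = 50                                   # number of bootstrapped trees
trees = []
for _ in range(B):
    # Bootstrap sample: n indices drawn with replacement
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    tree = DecisionTreeRegressor()       # unpruned: low bias, high variance
    tree.fit(X_toy[idx], y_toy[idx])
    trees.append(tree)

# Bagged prediction: average the B individual predictions
# (for classification you would take a majority vote instead)
bagged = np.mean([t.predict(X_toy[:3]) for t in trees], axis=0)
print(bagged)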

On a side note, the term bootstrap is not the same as the one used in computer science meaning to “boot” a computer from a set of core instructions, but its use derives from the phrase “to pull oneself up by one’s bootstraps”, as in the eighteenth century novel “The Surprising Adventures of Baron Munchausen” by Rudolph Erich Raspe:

The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.

Of course, with real data we cannot generate new samples from the original population, so the ideal procedure outlined above cannot be applied directly. However, the bootstrap allows us to use a computer to mimic the process of obtaining new data sets by resampling the single training set we have.

Boosting

Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy and then combining all of the trees in order to create a single predictive model.

Notably, each tree is built on a bootstrap data set, independent of the other trees.
Boosting is a general approach that can be applied to many statistical learning methods for regression or classification. It works in a similar way to bagging, except that the trees are grown sequentially: each tree is grown using information from the previously grown trees. We fit a decision tree to the residuals of the current model, then add this new tree into the fitted function and update the residuals.

Basically, instead of selecting data points randomly, it favours the misclassified points.
By fitting small trees to the residuals, we slowly improve f in areas where it does not perform well. We will not go into detail here, as it is a very specific and complex topic, but you can learn about the details in the great book Elements of Statistical Learning, chapter 10.
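
A minimal sketch of this residual-fitting idea for regression, using small scikit-learn trees on a synthetic dataset (the shrinkage factor of 0.1 and all the names here are illustrative; this is not a full gradient-boosting implementation):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

prediction = np.zeros(len(y_toy))            # start from f(x) = 0
residuals = y_toy.copy()
small_trees = []
for _ in range(100):
    # Fit a small (weak) tree to the current residuals ...
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X_toy, residuals)
    small_trees.append(tree)
    # ... add it to the fitted function and update the residuals
    prediction += 0.1 * tree.predict(X_toy)  # 0.1 acts as a shrinkage / learning rate
    residuals = y_toy - prediction

print(np.mean(residuals ** 2))               # training error shrinks as trees are added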

Ensemble learning

Random forests provide a further improvement over bagged trees by way of a small tweak that de-correlates the trees. This reduces the variance when we average the trees.

As in bagging, we build a number of decision trees on bootstrapped training samples.

Instead of sharing the entire dataset with each decision tree, the forest performs the bagging operation we saw earlier: each tree is trained on its own bootstrap sample of the training data. In this way each tree works on a slightly different subset, and the variation between the trees is controlled.

In addition to the tree bagging of training samples at the forest level, each tree within the forest is allowed to become highly specialised in a specific area but still retains some general knowledge about most areas. When a random forest classifies, it is actually each tree in the forest working together to cast votes on what label they think a specific sample should be assigned.

Each individual decision tree further ‘feature bags’ at each node-branch split. A random selection of M predictors is chosen as split candidates from the full set of P predictors (typically we choose M approximately equal to the square root of the total number of predictors P).
This is helpful because some datasets contain a feature that is very correlated with the target (the ‘y’ label); without feature bagging, that feature would dominate the splits of most trees, making them very similar to one another. By selecting a random subset of features at every split, such a feature shows up on fewer branches and there is more diversity in the features examined.

This is an example of a broader technique called ensemble learning, in which we combine several weak learners (typically high-bias, low-variance models) to build an overall stronger, more robust model that generalises better and is less susceptible to overfitting.

An example of an ensemble where the output of other learning algorithms (‘weak learners’) is combined into a weighted sum that represents the final output of the boosted classifier is the famous AdaBoost, created by Freund & Schapire in the mid-1990s (work that earned them the 2003 Gödel Prize); the authors also have a useful tutorial.
More tutorials are available on boosting.org
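
As a quick illustration of the scikit-learn API (the toy data here is mine): by default AdaBoostClassifier boosts decision stumps, i.e. depth-1 trees, and combines them into a weighted vote.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# 100 boosted decision stumps, combined into a weighted vote
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_toy, y_toy)
print(ada.score(X_toy, y_toy))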

Intuitively, a random forest can be considered as an ensemble of decision trees.

Random Forests

Random forests are one of the most popular and versatile models around.

The random forest algorithm can be summarised in four simple steps:

  1. Draw a random bootstrap sample of size N (randomly choose N samples from the training set with replacement).
  2. Grow a decision tree from the bootstrap sample.
    At each node:

    • Randomly select F features without replacement.
    • Split the node using the feature that provides the best split according to the objective function, for instance, by maximising the information gain.
  3. Repeat steps 1 and 2 K times.
  4. Aggregate the predictions of all the trees and assign the class label by majority vote.

Scikit-learn has an ensemble module that includes a RandomForestClassifier as well as other ensemble methods.
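
As a rough sketch of how the four steps above translate into code (toy data and names are mine; in practice you would simply use RandomForestClassifier, as in the next chapter):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=500, n_features=16, random_state=0)

rng = np.random.default_rng(0)
K = 25                                                   # step 3: number of trees
forest = []
for _ in range(K):
    idx = rng.integers(0, len(X_toy), size=len(X_toy))   # step 1: bootstrap sample of size N
    tree = DecisionTreeClassifier(max_features='sqrt')   # step 2: F = sqrt(P) features per split
    tree.fit(X_toy[idx], y_toy[idx])
    forest.append(tree)

# Step 4: majority vote across the K trees
votes = np.stack([t.predict(X_toy) for t in forest])     # shape (K, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print((majority == y_toy).mean())                        # training accuracy of the ensemble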

Human activity prediction

As an example, we will predict human activity by looking at data from wearables.
For this, we train a random forest on a public domain Human Activity Dataset titled “Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements”, from a study by Ugulino et al. at PUC-Rio (the Pontifical Catholic University of Rio de Janeiro, Brazil).

Within the dataset, there are five target activities:

  • Sitting
  • Sitting Down
  • Standing
  • Standing Up
  • Walking

These activities were captured from four people wearing accelerometers mounted on their waist, left thigh, right arm and right ankle; the dataset contains 165,633 data points.

The following code and the data are also available on my GitHub.

Read the data

The original dataset can be found in the UCI Machine Learning Repository.

A copy can also be found on my GitHub and on Kaggle.

import pandas as pd
import time
# Grab the DLA HAR dataset from the links above;
# we assume it is stored in a local datasets folder

#
# Load up the dataset into dataframe 'X'
#
X = pd.read_csv("../datasets/dataset-har-PUC-rio-ugulino.csv", sep=';', low_memory=False)
X.head(2)
user gender age how_tall_in_meters weight body_mass_index x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 class
0 debora Woman 46 1,62 75 28,6 -3 92 -63 -23 18 -19 5 104 -92 -150 -103 -147 sitting
1 debora Woman 46 1,62 75 28,6 -3 94 -64 -21 18 -18 -14 104 -90 -149 -104 -145 sitting
X.describe()
age weight x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4
count 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000
mean 38.265146 70.819408 -6.649327 88.293667 -93.164611 -87.827504 -52.065047 -175.055200 17.423515 104.517167 -93.881726 -167.641448 -92.625171
std 13.184091 11.296527 11.616238 23.895829 39.409423 169.435194 205.159763 192.816615 52.635388 54.155843 45.389646 38.311342 19.968610
min 28.000000 55.000000 -306.000000 -271.000000 -603.000000 -494.000000 -517.000000 -617.000000 -499.000000 -506.000000 -613.000000 -702.000000 -526.000000
25% 28.000000 55.000000 -12.000000 78.000000 -120.000000 -35.000000 -29.000000 -141.000000 9.000000 95.000000 -103.000000 -190.000000 -103.000000
50% 31.000000 75.000000 -6.000000 94.000000 -98.000000 -9.000000 27.000000 -118.000000 22.000000 107.000000 -90.000000 -168.000000 -91.000000
75% 46.000000 83.000000 0.000000 101.000000 -64.000000 4.000000 86.000000 -29.000000 34.000000 120.000000 -80.000000 -153.000000 -80.000000
max 75.000000 83.000000 509.000000 533.000000 411.000000 473.000000 295.000000 122.000000 507.000000 517.000000 410.000000 -13.000000 86.000000

Pre-processing the data

What we want to do is to predict the activity class based on the accelerometer’s data from the wearables. Can we say if the subject is sitting, standing or walking from the data?

#
# An easy way to show which rows have NaNs in them:
print (X[pd.isnull(X).any(axis=1)])
Empty DataFrame
Columns: [user, gender, age, how_tall_in_meters, weight, body_mass_index, x1, y1, z1, x2, y2, z2, x3, y3, z3, x4, y4, z4, class]
Index: []

Great, no NaNs here. Let’s go on.

#
# Encode the gender column: 0 as male, 1 as female
#
X.gender  = X.gender.map({'Woman':1, 'Man':0})

#
# Clean up any column with commas in it
# so that they're properly represented as decimals instead
#
X.how_tall_in_meters = X.how_tall_in_meters.str.replace(',','.').astype(float)
X.body_mass_index = X.body_mass_index.str.replace(',','.').astype(float)

#
# Check data types
print (X.dtypes)
user object
gender int64
age int64
how_tall_in_meters float64
weight int64
body_mass_index float64
x1 int64
y1 int64
z1 int64
x2 int64
y2 int64
z2 int64
x3 int64
y3 int64
z3 int64
x4 int64
y4 int64
z4 object
class object
dtype: object

Column z4 is type “object”. Something is wrong with the dataset.
We are going to fix the error and convert the column.

#
# Convert column Z4 into numeric
# Use errors='raise'. This will alert you if something ends up being
# problematic
#
# INFO: There is an error raised ... you will find it if you try the method
#
# print (X[pd.isnull(X).any(axis=1)])
# 122076 --> z4 =    -14420-11-2011 04:50:23.713
#
# !! The data point #122076 is a wrong coded record,
# change it or drop it before calling the to_numeric methods:
#
# X.at[122076, 'z4'] = -144   # option 1: change it to the correct value

# I keep this value for later and drop it from the dataset
wrongRow = X.loc[122076]
X.drop(X.index[[122076]], inplace=True)
# Now convert the z4 column
X.z4 = pd.to_numeric(X.z4, errors='raise')

print (X.dtypes)
user object
gender int64
age int64
how_tall_in_meters float64
weight int64
body_mass_index float64
x1 int64
y1 int64
z1 int64
x2 int64
y2 int64
z2 int64
x3 int64
y3 int64
z3 int64
x4 int64
y4 int64
z4 int64
class object
dtype: object

Everything is ok now.

Extract the target values

# Activity to predict is in "class" column
# Encode 'y' value as a dummies version of dataset's "class" column
#
y = pd.get_dummies(X['class'].copy())
# this produces a 5 column wide dummies dataframe as the y value

#
# Get rid of the user and class columns in X
#
X.drop(['class','user'], axis=1, inplace=True) 

print (X.head(2))

  gender age how_tall_in_meters weight body_mass_index x1 y1 z1 x2 \
0 1      46  1.62               75     28.6            -3 92 -63 -23
1 1      46  1.62               75     28.6            -3 94 -64 -21

  y2 z2  x3  y3  z3  x4   y4   z4
0 18 -19 5   104 -92 -150 -103 -147
1 18 -18 -14 104 -90 -149 -104 -145
print (y.head(1))
  sitting sittingdown standing standingup walking
0 1       0           0        0          0

Split the dataset into training and test

#
# Split  data into test / train sets
#
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=7)  

Train the Random Forest model

#
# Create an RForest classifier 'model'
#
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=30, max_depth=20, random_state=0)

You can check the sklearn documentation to see all possible parameters.

Although random forests don’t offer the same level of interpretability as decision trees, a big advantage of random forests is that we don’t have to worry so much about choosing good hyper-parameter values. We typically don’t need to prune the random forest since the ensemble model is quite robust to noise from the individual decision trees.

The only parameter that we really need to care about in practice is the number of trees that we choose for the random forest. Typically, the larger the number of trees, the better the performance of the random forest classifier at the expense of an increased computational cost.

The parameters used here:

  • n_estimators: integer, optional (default=100)
    The number of trees in the forest. Note that the default changed from 10 to 100 in scikit-learn 0.22 (following the progress in computing performance and memory).
  • max_depth: integer or None, optional (default=None)
    The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
    Setting a limit helps with the computing time and memory needed. Not setting a maximum depth leads to unpruned, fully grown trees which – depending on the dataset – can require a large memory footprint.
  • oob_score: bool (default=False)
    Whether to use out-of-bag samples to estimate the generalization accuracy.
  • random_state: int, RandomState instance or None, optional (default=None)
    Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node.

And other useful / important:

  • criterion: string, optional (default=”gini”)
    The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
    Same as for the Trees.
  • bootstrap: boolean, optional (default=True)
    Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
  • max_samples: int or float  (default=None)
    If bootstrap is True, the number of samples to draw from X to train each base estimator. None means that the sample size of the bootstrap sample is chosen to be equal to the number of samples in the original training set, which usually provides a good bias-variance tradeoff.
  • max_features: int, float, string or None, optional (default=”auto”)
    This controls the feature bagging at each branch split that we mentioned previously in the “Ensemble learning” section. For the number of features considered at each split we want a value smaller than the total number of features in the training set; a reasonable default, used by scikit-learn, is the square root of the number of features.
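
Just to illustrate how these parameters fit together, an alternative configuration might look like the following (this model is illustrative only and is not used in the rest of the chapter):

# Illustrative only: a more explicitly configured classifier
altModel = RandomForestClassifier(n_estimators=100,     # more trees, more compute
                                  criterion='gini',     # split-quality measure
                                  max_depth=None,       # grow each tree fully
                                  max_features='sqrt',  # features tried at each split
                                  bootstrap=True,       # bagging of the training samples
                                  oob_score=True,       # out-of-bag estimate (see later)
                                  random_state=0)

Let’s now fit the model we defined earlier:
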
print ("Fitting...")
s = time.time()

model.fit(X_train, y_train)

print("completed in: ", time.time() - s, "seconds")
Fitting...
completed in: 26.671831130981445 seconds

Note that it takes a much longer time to train a forest than a single decision tree.

This is the score based on the test dataset that we split earlier. Note how good it is.

print ("Scoring...")
s = time.time()

score = model.score(X_test, y_test)

print ("Score: ", round(score*100, 3))
print ("Scoring completed in: ", time.time() - s)
Scoring...
Score: 99.286
Scoring completed in: 1.3118188381195068

These are the top 5 features used in the classification.
They are all related to the movements – no gender, no weight, no age among them.

# Extract feature importances
fi = pd.DataFrame({'feature': list(X_train.columns),
                   'importance': model.feature_importances_}).\
                    sort_values('importance', ascending = False)

# Display
fi.head()
feature importance
7 z1 0.174239
12 y3 0.126232
10 z2 0.114018
9 y2 0.110663
6 y1 0.074385

Example prediction

Let’s use the wrong row – that we extracted earlier from the dataset – as a prediction example.
But first we need to correct it:

outputClassPredictionExample = wrongRow['class']

forPredictionExample = wrongRow.drop(labels=['class','user']) # remove class and user
forPredictionExample.z4 = -144 # correct the value

print("We use this example for prediction later:")
print(forPredictionExample)
print("The class shall be: ", outputClassPredictionExample)
We use this example for prediction later:
gender 0
age 75
how_tall_in_meters 1.67
weight 67
body_mass_index 24
x1 -8
y1 101
z1 -120
x2 -13
y2 91
z2 -101
x3 17
y3 123
z3 -108
x4 -207
y4 -82
z4 -144
Name: 122076, dtype: object
The class shall be: standingup
model.predict(forPredictionExample.values.reshape(1, -1))
array([[0, 0, 0, 1, 0]], dtype=uint8)

Remember that these were the categories for the classes:

y_test.iloc[0]
sitting 0
sittingdown 0
standing 0
standingup 0
walking 1
Name: 130833, dtype: uint8

The fourth one is “standing up”, so it seems the model predicted correctly.

Out-of-bag error instead of splitting into train and test

Since each tree within the forest is only trained using a subset of the overall training set, the forest ensemble has the ability to error test itself.
It does this by scoring each tree’s predictions against that tree’s out-of-bag samples. A tree’s out of bag samples are those forest training samples that were withheld from a specific tree during training.

One of the advantages of using the out-of-bag (OOB) error is that it eliminates the need to split your data into training and testing sets before feeding it into the forest model, since that is part of the forest algorithm. However, the OOB error metric often underestimates the actual performance improvement and the optimal number of training iterations.

Let’s see an example.

modelOOB = RandomForestClassifier(n_estimators=30, max_depth=20, random_state=0, oob_score=True)
print ("Fitting...")
s = time.time()

modelOOB.fit(X, y)

print("completed in: ", time.time() - s, "seconds")

Fitting...
completed in: 23.41502809524536 seconds

Time needed is similar.
Let’s check the score:

# Display the OOB Score of data
scoreOOB = modelOOB.oob_score_
print ("OOB Score: ", round(scoreOOB*100, 3))

OOB Score: 99.753

The out-of-bag estimation is not far away from the more precise score estimated from the test dataset. And we didn’t have to split the dataset into training and testing.

Now we predict the same user’s movement. The class output should be “standing up”, the fourth one.

modelOOB.predict(forPredictionExample.values.reshape(1, -1))
array([[0, 0, 0, 1, 0]], dtype=uint8)

Yup!
