We know that decision trees have a tendency to overfit. One way of avoiding this is to build and combine multiple decision trees, a technique called random forests.
But before talking about forests, a word about bootstrap aggregation (or bagging) and one about boosting. If you want to jump straight to the random forest coding example, go to the "Human Activity Prediction" chapter.
Bootstrap aggregation (or bagging the trees)
Bootstrap aggregation is a general-purpose procedure for reducing the variance of a statistical learning method; it is particularly useful and frequently used in the context of decision trees.
As we said, for random forests we need multiple different trees, but since our tree-building process is deterministic, we need a different technique. Rather than training each tree on all the inputs in the training set, we train each tree using different data, so that each tree will be different from every other tree.
Recall that given a set of n independent observations $Z_1, \ldots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given by $\sigma^2 / n$.
In other words, averaging a set of observations reduces variance.
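As a quick numerical sanity check of this claim, here is a minimal sketch (assuming NumPy and a purely synthetic setup): we simulate many samples of n = 10 observations with variance $\sigma^2 = 4$ and verify that the variance of their means is close to $\sigma^2 / n$.

import numpy as np

rng = np.random.default_rng(0)
sigma, n = 2.0, 10

# 100,000 simulated samples, each made of n independent observations with variance sigma^2
means = rng.normal(loc=0.0, scale=sigma, size=(100_000, n)).mean(axis=1)

print(means.var())   # close to sigma**2 / n = 0.4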
Hence, a natural way to reduce the variance of a statistical learning method is to take many training sets from the population, build a separate prediction model on each set and average the resulting predictions. Of course, this is not practical because we generally do not have access to multiple training sets.
Instead, we can bootstrap, by taking repeated samples from the (single) training data set.
In this approach we generate B different bootstrapped training data sets. We then train our method on the i-th bootstrapped training set in order to get the prediction at a point x, $\hat{f}^{*i}(x)$. We then average all the predictions to obtain

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{i=1}^{B} \hat{f}^{*i}(x)$$
This is called bagging.
It works great for overfit models (it decreases the variance without changing the bias) but doesn’t help much with underfit or high bias models.
For regression trees: instead of pruning the tree to reduce the variance (which increases the bias) you keep the trees as they are, with low bias, and get rid of the variance by averaging.
For classification trees: for each test observation, we record the class predicted by each of the B trees, and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.
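To make the procedure concrete, here is a minimal sketch of bagging regression trees by hand, assuming scikit-learn and a synthetic toy dataset (the names here are illustrative; in practice sklearn.ensemble.BaggingRegressor implements essentially this loop):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)

B = 50                           # number of bootstrapped training sets
rng = np.random.RandomState(0)
trees = []
for b in range(B):
    # bootstrap: sample the training set with replacement
    idx = rng.choice(len(X), size=len(X), replace=True)
    tree = DecisionTreeRegressor()   # fully grown, low-bias tree
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# bagged prediction: average the B individual predictions
y_hat = np.mean([tree.predict(X) for tree in trees], axis=0)

For a classification problem, the last line would be replaced by a majority vote over the B predicted labels.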
On a side note, the term bootstrap is not the same as the one used in computer science meaning to “boot” a computer from a set of core instructions, but its use derives from the phrase “to pull oneself up by one’s bootstraps”, as in the eighteenth century novel “The Surprising Adventures of Baron Munchausen” by Rudolph Erich Raspe:
The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.
Well, of course something like the procedure outlined above cannot be applied, because for real data we cannot generate new samples from the original population. However, the bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets.
Boosting
Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy and then combining all of the trees in order to create a single predictive model.
Notably, each tree is built on a bootstrap data set, independent of the other trees.
Boosting is a general approach that can be applied to many statistical learning methods for regression or classification; it works in a similar way to bagging, except that the trees are grown sequentially: each tree is grown using information from the previously grown trees. We fit a decision tree to the residuals of the current model, then add this new decision tree to the fitted function in order to update the residuals.
Basically, instead of selecting data points randomly, it favours the misclassified points.
By fitting small trees to the residuals, we slowly improve f in areas where it does not perform well. We will not go into detail here, as it is a very specific and complex topic, but you can learn about the details in the great book Elements of Statistical Learning, chapter 10.
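To illustrate the idea of fitting small trees to the residuals, here is a minimal sketch assuming scikit-learn and a synthetic dataset (real implementations such as sklearn.ensemble.GradientBoostingRegressor add a configurable loss, shrinkage options and much more):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)

learning_rate = 0.1
n_rounds = 100
prediction = np.zeros(len(y))     # start from the trivial model f(x) = 0
small_trees = []
for _ in range(n_rounds):
    residuals = y - prediction                 # what the model still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)  # a small, weak tree
    tree.fit(X, residuals)
    # add the new tree to the fitted function, which updates the residuals
    prediction += learning_rate * tree.predict(X)
    small_trees.append(tree)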
Ensemble learning
Random forests provide a further improvement over bagged trees by way of a small tweak that de-correlates the trees. This reduces the variance when we average the trees.
As in bagging, we build a number of decision trees on bootstrapped training samples.
Instead of sharing the entire dataset with each decision tree, the forest performs the bagging operation we have seen earlier: each tree is trained on its own bootstrap sample of the training data. In this way each tree sees a slightly different slice of the data, and the variation between trees is kept under control.
Thanks to this bagging of training samples at the forest level, each tree within the forest is allowed to become highly specialised in a specific area while still retaining some general knowledge about most areas. When a random forest classifies a sample, all the trees in the forest work together, each casting a vote on which label it thinks the sample should be assigned.
Each individual decision tree further ‘feature bags’ at each node-branch split. A random selection of M predictors is chosen as split candidates from the full set of P predictors (typically we choose M approximately equal to the square root of the total number of predictors P).
This is helpful because some datasets contain a feature that is strongly correlated with the target (the 'y' label). By selecting a random subset of features at every split, such a feature will not show up in as many branches of the trees, and there will be more diversity in the features examined.
This is an example of a broader technique called ensemble learning, in which we combine several weak learners (typically high-bias, low-variance models) in order to build a stronger, more robust overall model, one with a better generalisation error that is less susceptible to overfitting.
An example of an ensemble where the output of several learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier is the famous AdaBoost, introduced in the 1990s by Freund and Schapire (who received the 2003 Gödel Prize for it); they also wrote a useful tutorial.
More tutorials are available on boosting.org
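As a quick taste, here is a minimal AdaBoost sketch with scikit-learn on a synthetic dataset (by default AdaBoostClassifier uses depth-one decision trees, i.e. decision stumps, as the weak learners and combines their outputs into a weighted vote):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=16, random_state=0)

# 100 weak learners, combined into a weighted vote
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))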
Intuitively, a random forest can be considered as an ensemble of decision trees.
Random Forests
Random forests are one of the most popular and versatile models around.
The random forest algorithm can be summarised in four simple steps:
- Draw a random bootstrap sample of size N (randomly choose N samples from the training set with replacement).
- Grow a decision tree from the bootstrap sample. At each node:
  - Randomly select F features without replacement.
  - Split the node using the feature that provides the best split according to the objective function, for instance by maximising the information gain.
- Repeat steps 1 and 2 K times.
- Aggregate the predictions of the K trees to assign the class label by majority vote (see the sketch below).
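Here is a minimal from-scratch sketch of these four steps, assuming scikit-learn's DecisionTreeClassifier as the base tree (its max_features argument performs the random per-split feature selection of step 2) and a synthetic dataset:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=16, random_state=0)

K = 25                               # number of trees in the forest
rng = np.random.RandomState(0)
forest = []
for _ in range(K):                   # step 3: repeat K times
    # step 1: draw a bootstrap sample of size N with replacement
    idx = rng.choice(len(X), size=len(X), replace=True)
    # step 2: grow a tree, considering sqrt(P) random features at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng)
    tree.fit(X[idx], y[idx])
    forest.append(tree)

# step 4: aggregate the K predictions by majority vote
votes = np.array([tree.predict(X) for tree in forest])   # shape (K, n_samples)
y_hat = np.array([np.bincount(col).argmax() for col in votes.T])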
Scikit-learn has an ensemble module that includes a RandomForestClassifier as well as other ensemble methods.
Human activity prediction
As an example, we will predict human activity by looking at data from wearables.
For this, we train a random forest against a public domain Human Activity Dataset titled “Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements” from a study by the Universities of Genova, Italy and Barcelona, Spain.
Within the dataset, there are five target activities:
- Sitting
- Sitting Down
- Standing
- Standing Up
- Walking
These activities were captured from 30 people wearing accelerometers mounted on their waist, left thigh, right arm and right ankle. It contains 165633 data points.
The following code and the data are also available on my GitHub.
Read the data
The original dataset can be found on the UCI Machine Learning Repository.
A copy can be found also on my GitHub and on Kaggle.
import pandas as pd
import time
#
# Grab the DLA HAR dataset from the links above;
# we assume that it is stored in a datasets folder.
#
# Load up the dataset into dataframe 'X'
#
X = pd.read_csv("../datasets/dataset-har-PUC-rio-ugulino.csv",
                sep=';', low_memory=False)
X.head(2)
X.describe()
Pre-processing the data
What we want to do is to predict the activity class based on the accelerometer’s data from the wearables. Can we say if the subject is sitting, standing or walking from the data?
#
# An easy way to show which rows have NaNs in them:
print(X[pd.isnull(X).any(axis=1)])
Empty DataFrame
Columns: [user, gender, age, how_tall_in_meters, weight, body_mass_index, x1, y1, z1, x2, y2, z2, x3, y3, z3, x4, y4, z4, class]
Index: []
Great, no NaNs here. Let’s go on.
#
# Encode the gender column: 0 as male, 1 as female
#
X.gender = X.gender.map({'Woman': 1, 'Man': 0})

#
# Clean up any column with commas in it
# so that they're properly represented as decimals instead
#
X.how_tall_in_meters = X.how_tall_in_meters.str.replace(',', '.').astype(float)
X.body_mass_index = X.body_mass_index.str.replace(',', '.').astype(float)

#
# Check data types
print(X.dtypes)
user                   object
gender                  int64
age                     int64
how_tall_in_meters    float64
weight                  int64
body_mass_index       float64
x1                      int64
y1                      int64
z1                      int64
x2                      int64
y2                      int64
z2                      int64
x3                      int64
y3                      int64
z3                      int64
x4                      int64
y4                      int64
z4                     object
class                  object
dtype: object
Column z4 is type “object”. Something is wrong with the dataset.
We are going to fix the error and convert the column.
#
# Convert column z4 into numeric.
# Use errors='raise'. This will alert you if something ends up being
# problematic.
#
# INFO: There is an error raised ... you will find it if you try the method:
#
#   print(X[pd.isnull(X).any(axis=1)])
#   122076 --> z4 = -14420-11-2011 04:50:23.713
#
# !! The data point #122076 is a wrongly coded record;
# change it or drop it before calling the to_numeric method:
#
# X.at[122076, 'z4'] = -144   # change to the correct value
#
# I keep this value for later and drop it from the dataset
wrongRow = X.loc[122076]
X.drop(X.index[[122076]], inplace=True)

# Now convert the z4 column
X.z4 = pd.to_numeric(X.z4, errors='raise')

print(X.dtypes)
user                   object
gender                  int64
age                     int64
how_tall_in_meters    float64
weight                  int64
body_mass_index       float64
x1                      int64
y1                      int64
z1                      int64
x2                      int64
y2                      int64
z2                      int64
x3                      int64
y3                      int64
z3                      int64
x4                      int64
y4                      int64
z4                      int64
class                  object
dtype: object
Everything is ok now.
Extract the target values
# Activity to predict is in "class" column
# Encode 'y' value as a dummies version of dataset's "class" column
#
y = pd.get_dummies(X['class'].copy())
# this produces a 5-column-wide dummies dataframe as the y value

#
# Get rid of the user and class columns in X
#
X.drop(['class', 'user'], axis=1, inplace=True)

print(X.head(2))
   gender  age  how_tall_in_meters  weight  body_mass_index  x1  y1  z1  x2  \
0       1   46                1.62      75             28.6  -3  92 -63 -23
1       1   46                1.62      75             28.6  -3  94 -64 -21

   y2  z2  x3   y3  z3   x4   y4   z4
0  18 -19   5  104 -92 -150 -103 -147
1  18 -18 -14  104 -90 -149 -104 -145
print (y.head(1))
   sitting  sittingdown  standing  standingup  walking
0        1            0         0           0        0
Split the dataset into training and test
#
# Split data into test / train sets
#
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=7)
Train the Random Forest model
#
# Create an RForest classifier 'model'
#
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=30, max_depth=20, random_state=0)
You can check the sklearn documentation to see all possible parameters.
Although random forests don’t offer the same level of interpretability as decision trees, a big advantage of random forests is that we don’t have to worry so much about choosing good hyper-parameter values. We typically don’t need to prune the random forest since the ensemble model is quite robust to noise from the individual decision trees.
The only parameter that we really need to care about in practice is the number of trees that we choose for the random forest. Typically, the larger the number of trees, the better the performance of the random forest classifier at the expense of an increased computational cost.
The parameters used here:
- n_estimators: integer, optional (default=100)
  The number of trees in the forest. Note that this default changed from 10 to 100 (following the progress in computing performance and memory).
- max_depth: integer or None, optional (default=None)
  The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Setting a limit helps with the computing time and memory needed. Not setting a max depth will lead to unpruned, fully grown trees which, depending on the dataset, can require a large memory footprint.
- oob_score: bool (default=False)
  Whether to use out-of-bag samples to estimate the generalization accuracy.
- random_state: int, RandomState instance or None, optional (default=None)
  Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node.
And other useful / important ones:
- criterion: string, optional (default="gini")
  The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. Same as for the trees.
- bootstrap: boolean, optional (default=True)
  Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
- max_samples: int or float (default=None)
  If bootstrap is True, the number of samples to draw from X to train each base estimator. None means that the sample size of each bootstrap sample is chosen to be equal to the number of samples in the original training set, which usually provides a good bias-variance trade-off.
- max_features: int, float, string or None, optional (default="auto")
  This controls the per-split feature selection that we mentioned previously in the "Ensemble learning" paragraph. For the number of features at each split, we want to choose a value that is smaller than the total number of features in the training set. A reasonable default, used in scikit-learn, is the square root of the number of features in the training set.
print("Fitting...")
s = time.time()

model.fit(X_train, y_train)

print("completed in: ", time.time() - s, "seconds")
Fitting...
completed in:  26.671831130981445 seconds
Note that it takes a much longer time to train a forest than a single decision tree.
This is the score based on the test dataset that we split earlier. Note how good it is.
print("Scoring...")
s = time.time()

score = model.score(X_test, y_test)

print("Score: ", round(score*100, 3))
print("Scoring completed in: ", time.time() - s)
Scoring...
Score:  99.286
Scoring completed in:  1.3118188381195068
These are the top 5 features used in the classification.
They are all related to the movements: no gender, no weight, no age.
# Extract feature importances
fi = pd.DataFrame({'feature': list(X_train.columns),
                   'importance': model.feature_importances_}).\
        sort_values('importance', ascending=False)

# Display
fi.head()
Example prediction
Let’s use the wrong row – that we extracted earlier from the dataset – as a prediction example.
But first we need to correct it:
outputClassPredictionExample = wrongRow['class']

forPredictionExample = wrongRow.drop(labels=['class', 'user'])  # remove class and user
forPredictionExample.z4 = -144  # correct the value

print("We use this example for prediction later:")
print(forPredictionExample)
print("The class shall be: ", outputClassPredictionExample)
We use this example for prediction later:
gender                   0
age                     75
how_tall_in_meters    1.67
weight                  67
body_mass_index         24
x1                      -8
y1                     101
z1                    -120
x2                     -13
y2                      91
z2                    -101
x3                      17
y3                     123
z3                    -108
x4                    -207
y4                     -82
z4                    -144
Name: 122076, dtype: object
The class shall be:  standingup
model.predict(forPredictionExample.values.reshape(1, -1))
array([[0, 0, 0, 1, 0]], dtype=uint8)
Remember that these were the categories for the classes:
y_test.iloc[0]
sitting        0
sittingdown    0
standing       0
standingup     0
walking        1
Name: 130833, dtype: uint8
The fourth one is “standing up”. Seems that the model predicted correctly.
Out-of-bag error instead of splitting into train and test
Since each tree within the forest is only trained using a subset of the overall training set, the forest ensemble has the ability to error test itself.
It does this by scoring each tree’s predictions against that tree’s out-of-bag samples. A tree’s out of bag samples are those forest training samples that were withheld from a specific tree during training.
One of the advantages of using the out-of-bag (OOB) error is that it eliminates the need to split your data into training and test sets before feeding it into the forest model, since that is part of the forest algorithm. However, the OOB error metric often underestimates the actual performance improvement and the optimal number of training iterations.
Let’s see an example.
modelOOB = RandomForestClassifier(n_estimators=30, max_depth=20, random_state=0,
                                  oob_score=True)
print("Fitting...")
s = time.time()

modelOOB.fit(X, y)

print("completed in: ", time.time() - s, "seconds")
Fitting...
completed in:  23.41502809524536 seconds
Time needed is similar.
Let’s check the score:
# Display the OOB Score of data
scoreOOB = modelOOB.oob_score_
print("OOB Score: ", round(scoreOOB*100, 3))
OOB Score: 99.753
The out-of-bag estimation is not far away from the more precise score estimated from the test dataset. And we didn’t have to split the dataset into training and testing.
Now we predict the same user's movement. The class output should be "standing up", the fourth one.
modelOOB.predict(forPredictionExample.values.reshape(1, -1))
array([[0, 0, 0, 1, 0]], dtype=uint8)
Yup!