Decision Trees

Decision trees are supervised machine learning classifiers that are often used as decision support tools. Like any other classifier, they can predict the label of a sample; they do this by asking a sequence of questions about the sample’s features and following the probabilistic outcomes of those answers.

Decision trees are one of the oldest and most widely used machine learning algorithms, perhaps even pre-dating machine learning itself: following a sequence of cause-and-effect decisions comes very naturally to us.

Why not logistic regression?
Logistic regression models are generally hard to interpret: the coefficients indicate the importance and relative effect of each variable, but they do not give a simple explanation of how a decision is made.

Trees – on the other hand – do not assume a linear model and are interpretable.

Decision trees are a good tool to use when you want backing evidence to support a decision.

As usual, let’s see how trees work with a simple example: we train a decision tree on UCI’s wheat-seeds dataset and look at the decision boundary plots it produces.
We want to classify the seed type based on its characteristics, such as length, width, and so on.

You can find this code on my GitHub repository as well.

Wheat Seed prediction

Read the data

import pandas as pd
#
# Load up the wheat dataset into dataframe 'df'
#
df = pd.read_csv("Datasets/wheat.data", index_col='id')

df.head()

[Output: the first five rows of the wheat dataset]

Pre-processing the data

What we want to do is to predict the wheat type based on the wheat seed characteristics. To simplify the boundary plot, we consider only two of the characteristics and we drop all the others.

# we keep groove and perimeter, plus of course the target: wheat type
wheatSimpler = df.drop(columns = ['compactness', 'width', 'length', 'area', 'asymmetry'])
wheatSimpler.info()
Int64Index: 210 entries, 0 to 209
Data columns (total 3 columns):
perimeter 210 non-null float64
groove 206 non-null float64
wheat_type 210 non-null object
dtypes: float64(2), object(1)
memory usage: 6.6+ KB
# An easy way to show which rows have NaNs in them:

print (wheatSimpler[pd.isnull(wheatSimpler).any(axis=1)])
id  perimeter groove wheat_type

7   14.10     NaN    canadian
60  12.86     NaN    canadian
135 14.66     NaN    canadian
201 13.32     NaN    canadian
#
# Only a few rows. Go ahead and drop any row with a NaN
#
wheatSimpler.dropna(axis=0, inplace=True)

#
# INFO: In the future, you might try setting the NaN values to the
# mean value of that column, groove; the mean should only be calculated for
# the specific class rather than across all classes, in this case for canadian type
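As a minimal sketch of that alternative (to be used instead of the dropna above, while the wheat_type column is still in the dataframe; the variable name imputed is just illustrative):

# fill each missing 'groove' value with the mean groove of seeds
# of the same wheat_type, instead of dropping the rows
imputed = wheatSimpler.copy()
imputed['groove'] = imputed.groupby('wheat_type')['groove'].transform(
    lambda s: s.fillna(s.mean()))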

Extract the target values

#
# Copy the labels out of the dataframe into variable 'labels' , then remove
# them from the dataframe. Encode the labels:
# canadian:0, kama:1, and rosa:2
#
labels = wheatSimpler.wheat_type.copy() # copy “y” values out
wheatSimpler.drop(['wheat_type'], axis=1, inplace=True) # drop output column 

labels = labels.map({'canadian':0, 'kama':1, 'rosa':2}) # encoding

Split the dataset into training and test

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(wheatSimpler, labels, test_size=0.3,
                                                    random_state=7)

Train the tree model

As the name decision tree implies, this kind of model works by making a series of decisions based on questions, to eventually infer the class labels of the training data.

We start at the tree root with one first question and split the data in two based on the feature that optimises a defined objective function. This splitting procedure is repeated at each child node until the data at each node all belong to the same class (aka the tree leaves are pure).

In practice, this can result in a very deep tree with many nodes, which can easily lead to overfitting. Therefore, typically a limit for the maximal depth of the tree is set.

The choice of the objective function is critical. In order to split the nodes at the most informative features, we need to define an objective function that maximises the information gain at each split and that can be optimised via the tree learning algorithm.

The information gain is simply the difference between the impurity of the parent node and the sum of the child node impurities—the lower the impurity of the child nodes, the larger the information gain.

Three impurity measures – or splitting criteria – are commonly used in binary decision trees: Gini impurity, entropy and the classification error. The first two are available through the criterion parameter of the scikit-learn classifier we are going to use.
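As a quick reference, here is a minimal sketch of the three measures for a vector of class proportions (plain NumPy, not part of scikit-learn; the function names are mine):

import numpy as np

def gini(p):
    # Gini impurity: 1 minus the sum of squared class proportions
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Shannon entropy (in bits); empty classes contribute nothing
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def classification_error(p):
    # 1 minus the proportion of the majority class
    return 1.0 - np.max(np.asarray(p, dtype=float))

# a pure node scores 0 for all three measures; a 50/50 node gives
# gini = 0.5, entropy = 1.0 and classification error = 0.5
print(gini([0.5, 0.5]), entropy([0.5, 0.5]), classification_error([0.5, 0.5]))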

#
# We use the SkLearn module tree
#
from sklearn import tree
#
# note that I force the tree depth to a maximum of two (for simplification)
#

model = tree.DecisionTreeClassifier(max_depth=2, random_state=2)
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=2, splitter='best')

Note that by default, the classifier uses Gini as the impurity criterion.

Let’s plot the tree using the plot_tree function built into sklearn.tree, which gives an idea of how the features are used in the model:

tree.plot_tree(model, feature_names=X_train.columns, filled=True)
[Text(167.4, 181.2, 'groove <= 5.576\nentropy = 0.666\nsamples = 144\nvalue = [51, 47, 46]'),
Text(83.7, 108.72, 'perimeter <= 13.54\nentropy = 0.51\nsamples = 96\nvalue = [49, 46, 1]'),
Text(41.85, 36.23999999999998, 'entropy = 0.045\nsamples = 43\nvalue = [42, 1, 0]'),
Text(125.55000000000001, 36.23999999999998, 'entropy = 0.261\nsamples = 53\nvalue = [7, 45, 1]'),
Text(251.10000000000002, 108.72, 'perimeter <= 13.95\nentropy = 0.119\nsamples = 48\nvalue = [2, 1, 45]'),
Text(209.25, 36.23999999999998, 'entropy = 0.0\nsamples = 1\nvalue = [1, 0, 0]'),
Text(292.95, 36.23999999999998, 'entropy = 0.082\nsamples = 47\nvalue = [1, 1, 45]')]

[Figure: the fitted decision tree for the wheat model]

As you can see, the tree consists of a series of splitting rules. It starts at the top with a question (the answer is True on the left branch and False on the right), goes on through “branches” or internal “nodes”, and finally ends in “leaves” or “terminal nodes” after several splits (in this case two, since I limited the depth above with max_depth).

Looking at each box: the first line is the question, followed by the impurity metric (the closer to 0, the purer the node), then the number of samples (note that the root node holds all 144 training samples) and finally value = [...], which tells how many samples at the given node fall into each category.

Basically, each node is a question whose answer splits the data into smaller subsets.
Decision trees are simple and very easy to interpret: you just need to look at the questions.
The first question is: “does this seed have a groove length less than or equal to 5.576?”
This question splits the predictor space in two: one region with 96 samples and one with 48 samples.
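Using the class counts shown in the tree plot above, we can check this first split by hand (note that the plot labels the impurity “entropy”, but with the default Gini criterion the numbers actually match the Gini formula). A minimal sketch of the information gain of the root split, with the counts taken straight from the tree boxes:

import numpy as np

def gini_from_counts(counts):
    # Gini impurity computed from raw class counts
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = np.array([51, 47, 46])   # root: all 144 training samples
left   = np.array([49, 46, 1])    # groove <= 5.576 (96 samples)
right  = np.array([2, 1, 45])     # groove >  5.576 (48 samples)

n = parent.sum()
gain = gini_from_counts(parent) \
       - left.sum() / n * gini_from_counts(left) \
       - right.sum() / n * gini_from_counts(right)
print(round(gain, 3))   # about 0.287: the impurity drops sharply after the split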

Let’s plot the decision boundaries to see how they work.

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

# Predict the class over a grid of (perimeter, groove) pairs.
# Z is not plotted below: instead we draw the split thresholds read
# directly off the tree, which is enough to see the regions.
x, y = np.meshgrid(wheatSimpler.perimeter, wheatSimpler.groove)
Z = model.predict(np.c_[x.ravel(), y.ravel()])
Z = Z.reshape(x.shape)

fig, ax = plt.subplots()
ax.scatter(wheatSimpler.perimeter, wheatSimpler.groove, c=labels)

# the three split thresholds from the tree: groove 5.576, perimeter 13.54 and 13.95
ax.hlines(y=5.6, xmin=12, xmax=17.5, linestyle='-', label="1st question")
ax.vlines(x=13.5, ymin=3.5, ymax=5.6, linestyle='--', label="2nd question")
ax.vlines(x=14, ymin=5.6, ymax=6.9, linestyle=':', label="3rd question")

ax.legend(loc='best')
fig.suptitle("Predictor space")
ax.set_xlabel("Seed Perimeter")
ax.set_ylabel("Seed Groove");

[Figure: scatter plot of the predictor space with the three split thresholds]

You can see that the predictor space has been segmented into a number of simple regions, four to be precise, and the wheat type is predicted according to which region a seed falls into.
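To make this concrete, here is a quick sketch (with two made-up seeds, one on each side of the groove split) showing that the prediction depends only on which region the point falls into:

# two hypothetical seeds: one in the lower-left region, one on the far right
newSeeds = pd.DataFrame({'perimeter': [13.0, 15.5],
                         'groove':    [5.0, 6.2]})
print(model.predict(newSeeds))   # expected [0 2], i.e. canadian and rosa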

Another example: poisonous mushrooms

We will now see a complete tree example, without limitations on its depth: classify a mushroom as edible or poisonous (you can eat it, or better not).
As training data we use the Mushroom Data Set, drawn from the Audubon Society Field Guide to North American Mushrooms (1981). The data set describes mushrooms in terms of many physical characteristics, such as cap size and stalk length, along with a classification as poisonous or edible.

As a standard disclaimer, if you eat a random mushroom you find, you are doing
so at your own risk.

As usual we will first read the data into a Pandas data frame and then process it to remove missing values (the scikit-learn tree implementation does not accept them) and to encode the categorical variables as binary ones.

Read the data

#dataset is here:
#    https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names

#
# Load up the mushroom dataset into dataframe 'X'
# Header information is on the dataset's website at the UCI ML Repo
#
colNames=['label', 'cap-shape','cap-surface','cap-color','bruises','odor',
          'gill-attachment','gill-spacing','gill-size','gill-color','stalk-shape',
          'stalk-root','stalk-surface-above-ring','stalk-surface-below-ring',
          'stalk-color-above-ring','stalk-color-below-ring','veil-type',
          'veil-color','ring-number','ring-type','spore-print-color','population',
          'habitat']
X = pd.read_csv("Datasets/agaricus-lepiota.data", header=None, na_values='?',
                names=colNames)
X.head()
label cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing gill-size gill-color ... stalk-surface-below-ring stalk-color-above-ring stalk-color-below-ring veil-type veil-color ring-number ring-type spore-print-color population habitat
0 p x s n t p f c n k ... s w w p w o p k s u
1 e x s y t a f c b k ... s w w p w o p n n g
2 e b s w t l f c b n ... s w w p w o p n n m
3 p x y w t p f c n n ... s w w p w o p k s u
4 e x s g f n f w b k ... s w w p w o e n a g

5 rows × 23 columns

Pre-process the data

Remove missing data

#
# drop any row with a nan
#
X.dropna(axis=0, inplace=True)
print (X.shape)
(5644, 23)

Separate label

#
# Copy the labels out of the dataset into variable 'y', then remove
# them from X.

y = X[X.columns[0]].copy()
X.drop(X.columns[0], axis=1,inplace=True)

Encode the dataset using dummies

#
# Encode labels poisonous / edible
y = y.map({'p':0, 'e':1})

# Encode the rest
#
X = pd.get_dummies(X)
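For anyone new to one-hot encoding, here is a tiny made-up example of what get_dummies does; every distinct categorical value becomes its own indicator column, which is why the mushroom frame ends up with names like gill-size_b or odor_n:

# a toy single-column frame, purely for illustration
toy = pd.DataFrame({'cap-shape': ['x', 'b', 'x', 'f']})
print(pd.get_dummies(toy))
# -> three indicator columns: cap-shape_b, cap-shape_f, cap-shape_x
#    (one column per distinct value; recent pandas versions may show True/False)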

Split the data into test and training sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=7)

Train the model

#
# Create a DT classifier. No need to set any parameters, let's grow the tree to its full depth
#

model = tree.DecisionTreeClassifier()
#
# train the classifier on the training data / labels:
#
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
tree.plot_tree(model, feature_names=X_train.columns, filled=True)
[Text(218.90769230769232, 199.32, 'spore-print-color_h <= 0.5\nentropy = 0.469\nsamples = 3950\nvalue = [1487, 2463]'),
Text(193.15384615384616, 163.07999999999998, 'gill-size_b <= 0.5\nentropy = 0.235\nsamples = 2851\nvalue = [388, 2463]'),
Text(103.01538461538462, 126.83999999999999, 'odor_n <= 0.5\nentropy = 0.457\nsamples = 479\nvalue = [310, 169]'),
Text(51.50769230769231, 90.6, 'stalk-shape_t <= 0.5\nentropy = 0.296\nsamples = 365\nvalue = [299, 66]'),
Text(25.753846153846155, 54.359999999999985, 'entropy = 0.0\nsamples = 299\nvalue = [299, 0]'),
Text(77.26153846153846, 54.359999999999985, 'entropy = 0.0\nsamples = 66\nvalue = [0, 66]'),
Text(154.52307692307693, 90.6, 'population_c <= 0.5\nentropy = 0.174\nsamples = 114\nvalue = [11, 103]'),
Text(128.76923076923077, 54.359999999999985, 'entropy = 0.0\nsamples = 103\nvalue = [0, 103]'),
Text(180.27692307692308, 54.359999999999985, 'entropy = 0.0\nsamples = 11\nvalue = [11, 0]'),
Text(283.2923076923077, 126.83999999999999, 'ring-number_o <= 0.5\nentropy = 0.064\nsamples = 2372\nvalue = [78, 2294]'),
Text(257.53846153846155, 90.6, 'habitat_p <= 0.5\nentropy = 0.401\nsamples = 108\nvalue = [78, 30]'),
Text(231.7846153846154, 54.359999999999985, 'stalk-color-below-ring_n <= 0.5\nentropy = 0.093\nsamples = 82\nvalue = [78, 4]'),
Text(206.03076923076924, 18.119999999999976, 'entropy = 0.0\nsamples = 78\nvalue = [78, 0]'),
Text(257.53846153846155, 18.119999999999976, 'entropy = 0.0\nsamples = 4\nvalue = [0, 4]'),
Text(283.2923076923077, 54.359999999999985, 'entropy = 0.0\nsamples = 26\nvalue = [0, 26]'),
Text(309.04615384615386, 90.6, 'entropy = 0.0\nsamples = 2264\nvalue = [0, 2264]'),
Text(244.66153846153847, 163.07999999999998, 'entropy = 0.0\nsamples = 1099\nvalue = [1099, 0]')]

[Figure: the full decision tree for the mushroom model]

RESULT:
the top features to consider when deciding whether a mushroom is edible or not are its odor and gill size (together with the spore print colour used at the very first split).
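If you want a numeric backing for this, the fitted classifier exposes feature_importances_ (the impurity-based importance of each one-hot column); a quick sketch to list the most relevant ones:

importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head())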

# score the classifier on the testing data / labels:
# Returns the mean accuracy on the given test data and labels.

score = model.score(X_test, y_test)

print ("High-Dimensionality Score: ", round((score*100), 3))
High-Dimensionality Score: 100.0
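A perfect score on a single hold-out split deserves a double check; here is a minimal sketch of a confusion matrix on the test set (confusion_matrix comes from sklearn.metrics):

from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
# rows are the true classes (0 = poisonous, 1 = edible), columns the predicted
# ones; a perfect classifier has non-zero entries only on the diagonal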

Conclusion

Decision trees can be applied to both regression and classification problems and are among the more classical approaches to these tasks:

  • Trees are simple and very easy to explain. Useful for interpretation.
  • Often decision trees more closely mirror human decision-making than do the regression and classification approaches seen previously.
  • Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
  • Trees can easily handle qualitative predictors without the need to create dummy variables.

Unfortunately, trees generally do not reach the same level of predictive accuracy as some other regression and classification approaches: they are typically not competitive with the best supervised learning methods in terms of prediction accuracy.

Additionally, trees can be very non-robust. In other words, a small change in the data can cause a large change in the final estimated tree.

However, by aggregating many decision trees, using methods like bagging, random forests and boosting, the predictive performance of trees can be substantially improved, at the expense of some loss in interpretability.
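As a teaser for those ensemble methods, here is a minimal sketch of a random forest trained on the same mushroom data (RandomForestClassifier is part of scikit-learn; the number of trees is an arbitrary choice):

from sklearn.ensemble import RandomForestClassifier

# an ensemble of 100 trees, each grown on a bootstrap sample of the
# training data and on random subsets of the features
forest = RandomForestClassifier(n_estimators=100, random_state=7)
forest.fit(X_train, y_train)
print("Random Forest score:", round(forest.score(X_test, y_test) * 100, 3))

On this particular dataset the single tree already scores 100%, so the ensemble cannot do better here; the benefit shows up on noisier problems.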
