Decision Trees

Decision trees are supervised machine learning classifiers, often used as decision support tools. Like any other classifier, they can predict the label of a sample; they do this by examining the outcomes of a sequence of tests on the sample's features.

Decision trees are one of the oldest and most widely used machine learning algorithms, perhaps even pre-dating machine learning itself: following through with sequential cause-and-effect decisions comes very naturally to us.

Why not logistic regression?
Logistic regression models are generally not very interpretable. The model coefficients indicate the importance and relative effect of each variable, but do not give a simple explanation of how a decision is made.

Trees, on the other hand, do not assume a linear model and are interpretable.

Decision trees are a good tool to use when you want backing evidence to support a decision.

As usual, let's see how trees work with a simple example. We train a decision tree on UCI's wheat-seeds dataset and look at the decision boundary plots it produces.
We want to classify the seed type based on its characteristics, such as length, width, etc.

You can find this code on my GitHub repository as well.

Wheat Seed prediction

Read the data

import pandas as pd
# Load up the wheat dataset into dataframe 'df'
df = pd.read_csv("Datasets/", index_col='id')



Pre-processing the data

What we want to do is to predict the wheat type based on the wheat seed characteristics. To simplify the boundary plot, we consider only two of the characteristics and we drop all the others.

# we keep groove and perimeter, plus of course the target: wheat type
wheatSimpler = df.drop(columns = ['compactness', 'width', 'length', 'area', 'asymmetry'])
Int64Index: 210 entries, 0 to 209
Data columns (total 3 columns):
perimeter 210 non-null float64
groove 206 non-null float64
wheat_type 210 non-null object
dtypes: float64(2), object(1)
memory usage: 6.6+ KB
# An easy way to show which rows have NaNs in them:

print (wheatSimpler[pd.isnull(wheatSimpler).any(axis=1)])
id  perimeter groove wheat_type

7   14.10     NaN    canadian
60  12.86     NaN    canadian
135 14.66     NaN    canadian
201 13.32     NaN    canadian
# Only a few rows. Go ahead and drop any row with a NaN
wheatSimpler.dropna(axis=0, inplace=True)

# INFO: In the future, you might try setting the NaN values to the
# mean value of that column, groove; the mean should only be calculated for
# the specific class rather than across all classes, in this case for canadian type

Extract the target values

# Copy the labels out of the dataframe into variable 'labels' , then remove
# them from the dataframe. Encode the labels:
# canadian:0, kama:1, and rosa:2
labels = wheatSimpler.wheat_type.copy() # copy “y” values out
wheatSimpler.drop(['wheat_type'], axis=1, inplace=True) # drop output column 

labels ={'canadian':0, 'kama':1, 'rosa':2}) # encoding

Split the dataset into training and test

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(wheatSimpler, labels, test_size=0.3)

Train the tree model

As the name decision tree implies, this kind of model works by making a series of decisions based on questions, to eventually infer the class labels of the training data.

We start at the tree root with one first question and split the data in two based on the feature that optimises a defined objective function. This splitting procedure is repeated at each child node until the data at each node all belong to the same class (aka the tree leaves are pure).

In practice, this can result in a very deep tree with many nodes, which can easily lead to overfitting. Therefore, typically a limit for the maximal depth of the tree is set.

The choice of the objective function is critical. In order to split the nodes at the most informative features, we need to define an objective function that maximises the information gain at each split and that can be optimised via the tree learning algorithm.

The information gain is simply the difference between the impurity of the parent node and the sum of the child node impurities—the lower the impurity of the child nodes, the larger the information gain.

Three impurity measures – or splitting criteria – are commonly used in binary decision trees: Gini impurity, entropy and the classification error; they can all be selected in the scikit-learn module we are going to use.
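Before using the library, it may help to see the impurity measures and the information gain written out directly. The sketch below implements the formulas from the paragraph above in plain Python (the function names are mine, not scikit-learn's):

```python
import math

def gini(class_counts):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def entropy(class_counts):
    """Entropy: -sum(p_i * log2(p_i)), skipping empty classes."""
    n = sum(class_counts)
    return -sum((c / n) * math.log2(c / n) for c in class_counts if c > 0)

def information_gain(parent, children, impurity=gini):
    """Impurity of the parent minus the weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# A pure node has impurity 0; a 50/50 node is maximally impure
print(gini([50, 0]))    # 0.0
print(gini([25, 25]))   # 0.5
# A split separating the classes perfectly gains the full parent impurity
print(information_gain([25, 25], [[25, 0], [0, 25]]))  # 0.5
```

Note how a perfect split drives the child impurities to zero, so the information gain equals the parent's impurity, exactly as described above.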

# We use the SkLearn module tree
from sklearn import tree
# note that I force the tree depth to a maximum of two (for simplification)

model = tree.DecisionTreeClassifier(max_depth=2, random_state=2), y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=2, splitter='best')

Note that by default, the classifier uses Gini as the impurity criterion.

Let’s plot the tree using the sklearn.tree internal feature, which gives an idea of how the features are considered in the model

tree.plot_tree(model, feature_names=X_train.columns, filled=True)
[Text(167.4, 181.2, 'groove <= 5.576\nentropy = 0.666\nsamples = 144\nvalue = [51, 47, 46]'),
Text(83.7, 108.72, 'perimeter <= 13.54\nentropy = 0.51\nsamples = 96\nvalue = [49, 46, 1]'),
Text(41.85, 36.23999999999998, 'entropy = 0.045\nsamples = 43\nvalue = [42, 1, 0]'),
Text(125.55000000000001, 36.23999999999998, 'entropy = 0.261\nsamples = 53\nvalue = [7, 45, 1]'),
Text(251.10000000000002, 108.72, 'perimeter <= 13.95\nentropy = 0.119\nsamples = 48\nvalue = [2, 1, 45]'),
Text(209.25, 36.23999999999998, 'entropy = 0.0\nsamples = 1\nvalue = [1, 0, 0]'),
Text(292.95, 36.23999999999998, 'entropy = 0.082\nsamples = 47\nvalue = [1, 1, 45]')]


As you can see, the tree consists of a series of splitting rules. It starts at the top with a question (the answer is True on the left and False on the right), continues through “branches” or “nodes”, and finally ends in “leaves” or “terminal nodes” after several splits (in this case two, since I limited the depth above using max_depth).

Looking at each box: the first line is the question, followed by the impurity metric (the closer to 0, the purer the node), then the number of samples (note that the root node has all 144 training samples) and finally the value list, which tells how many samples at the given node fall into each class.

Basically each node is a question whose answer splits the data into smaller subsets.
Decision trees are simple and very easy to interpret, you just need to look at the questions.
The first question is: “does this seed have a groove less than or equal to 5.576?”
This question in fact splits the predictor space in two: one with 96 samples and one with 48 samples.
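If the diagram is hard to read, the same splitting rules can also be printed as indented plain text with scikit-learn's `tree.export_text`. A minimal sketch on made-up stand-in data (the wheat dataframe itself is not reproduced in this snippet):

```python
from sklearn import tree

# Made-up (perimeter, groove) pairs standing in for the wheat data
X_toy = [[13.0, 5.0], [13.2, 5.1], [14.8, 6.0], [15.0, 6.2]]
y_toy = [0, 0, 1, 1]

toy_model = tree.DecisionTreeClassifier(max_depth=2, random_state=2)
toy_model.fit(X_toy, y_toy)

# One line per node; the indentation mirrors the tree depth
rules = tree.export_text(toy_model, feature_names=['perimeter', 'groove'])
print(rules)
```

Each `|---` line is one question of the tree, and the leaves report the predicted class.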

Let’s plot the decision boundaries to see how they work.

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
x, y = np.meshgrid(wheatSimpler.perimeter, wheatSimpler.groove)
Z = model.predict(np.c_[x.ravel(), y.ravel()])
Z = Z.reshape(x.shape)
fig, ax = plt.subplots()
ax.contourf(x, y, Z, alpha=0.2)  # shade the predicted regions
ax.scatter(wheatSimpler.perimeter, wheatSimpler.groove, c=labels)

ax.hlines(y=5.6, xmin=12, xmax=17.5, linestyle='-', label="1st question")
ax.vlines(x=13.5, ymin=3.5, ymax=5.6, linestyle='--', label="2nd question")
ax.vlines(x=14, ymin=5.6, ymax=6.9, linestyle=':', label="3rd question")

fig.suptitle("Predictor space")
ax.set_xlabel("Seed Perimeter")
ax.set_ylabel("Seed Groove");


You can see that the predictor space has been segmented in a number of simple regions, four to be precise; and a wheat type is predicted according to its position in one of the regions.

Another example: poisonous mushrooms

We will now see a complete tree example, without limiting its depth: classifying a mushroom as edible or poisonous (you can eat it, or better not).
We use as training dataset the Mushroom Data Set, drawn from the Audubon Society Field Guide to North American Mushrooms (1981). The data set details mushrooms described in terms of many physical characteristics, such as cap size and stalk length, along with a classification of poisonous or edible.

As a standard disclaimer, if you eat a random mushroom you find, you are doing
so at your own risk.

As usual we will first read the data into a pandas data frame and then process it: remove missing values (trees are very sensitive to them) and encode the categorical variables as binary ones.

Read the data

#dataset is here:

# Load up the mushroom dataset into dataframe 'X'
# Header information is on the dataset's website at the UCI ML Repo
colNames=['label', 'cap-shape','cap-surface','cap-color','bruises','odor',
          'gill-attachment','gill-spacing','gill-size','gill-color',
          'stalk-shape','stalk-root','stalk-surface-above-ring',
          'stalk-surface-below-ring','stalk-color-above-ring',
          'stalk-color-below-ring','veil-type','veil-color','ring-number',
          'ring-type','spore-print-color','population','habitat']
X = pd.read_csv("Datasets/", header=None, na_values='?',
                names=colNames)
label cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing gill-size gill-color stalk-surface-below-ring stalk-color-above-ring stalk-color-below-ring veil-type veil-color ring-number ring-type spore-print-color population habitat
0 p x s n t p f c n k s w w p w o p k s u
1 e x s y t a f c b k s w w p w o p n n g
2 e b s w t l f c b n s w w p w o p n n m
3 p x y w t p f c n n s w w p w o p k s u
4 e x s g f n f w b k s w w p w o e n a g

5 rows × 23 columns

Pre-process the data

remove missing data

# drop any row with a nan
X.dropna(axis=0, inplace=True)
print (X.shape)
(5644, 23)

Separate label

# Copy the labels out of the dataset into variable 'y', then remove
# them from X.

y = X[X.columns[0]].copy()
X.drop(X.columns[0], axis=1,inplace=True)

Encode the dataset using dummies

# Encode labels poisonous / edible
y ={'p':0, 'e':1})

# Encode the rest
X = pd.get_dummies(X)
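`pd.get_dummies` replaces each categorical column with one 0/1 indicator column per distinct value. A minimal sketch on a made-up mini-frame (the column name and values here are illustrative only):

```python
import pandas as pd

# Hypothetical frame with a single categorical feature
toy = pd.DataFrame({'odor': ['p', 'a', 'p']})
encoded = pd.get_dummies(toy)

# One indicator column per category, named <column>_<value>
print(encoded.columns.tolist())  # ['odor_a', 'odor_p']
print(encoded.shape)             # (3, 2)
```

This is the same naming scheme ('gill-size_b', 'odor_n', ...) that appears in the tree plotted below.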

Split the data into test and training sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Train the model

# Create a DT classifier. No need to set any parameters, let's get the full depth

model = tree.DecisionTreeClassifier()
# train the classifier on the training data / labels:, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
tree.plot_tree(model, feature_names=X_train.columns, filled=True)
[Text(218.90769230769232, 199.32, 'spore-print-color_h <= 0.5\nentropy = 0.469\nsamples = 3950\nvalue = [1487, 2463]'),
Text(193.15384615384616, 163.07999999999998, 'gill-size_b <= 0.5\nentropy = 0.235\nsamples = 2851\nvalue = [388, 2463]'),
Text(103.01538461538462, 126.83999999999999, 'odor_n <= 0.5\nentropy = 0.457\nsamples = 479\nvalue = [310, 169]'),
Text(51.50769230769231, 90.6, 'stalk-shape_t <= 0.5\nentropy = 0.296\nsamples = 365\nvalue = [299, 66]'),
Text(25.753846153846155, 54.359999999999985, 'entropy = 0.0\nsamples = 299\nvalue = [299, 0]'),
Text(77.26153846153846, 54.359999999999985, 'entropy = 0.0\nsamples = 66\nvalue = [0, 66]'),
Text(154.52307692307693, 90.6, 'population_c <= 0.5\nentropy = 0.174\nsamples = 114\nvalue = [11, 103]'),
Text(128.76923076923077, 54.359999999999985, 'entropy = 0.0\nsamples = 103\nvalue = [0, 103]'),
Text(180.27692307692308, 54.359999999999985, 'entropy = 0.0\nsamples = 11\nvalue = [11, 0]'),
Text(283.2923076923077, 126.83999999999999, 'ring-number_o <= 0.5\nentropy = 0.064\nsamples = 2372\nvalue = [78, 2294]'),
Text(257.53846153846155, 90.6, 'habitat_p <= 0.5\nentropy = 0.401\nsamples = 108\nvalue = [78, 30]'),
Text(231.7846153846154, 54.359999999999985, 'stalk-color-below-ring_n <= 0.5\nentropy = 0.093\nsamples = 82\nvalue = [78, 4]'),
Text(206.03076923076924, 18.119999999999976, 'entropy = 0.0\nsamples = 78\nvalue = [78, 0]'),
Text(257.53846153846155, 18.119999999999976, 'entropy = 0.0\nsamples = 4\nvalue = [0, 4]'),
Text(283.2923076923077, 54.359999999999985, 'entropy = 0.0\nsamples = 26\nvalue = [0, 26]'),
Text(309.04615384615386, 90.6, 'entropy = 0.0\nsamples = 2264\nvalue = [0, 2264]'),
Text(244.66153846153847, 163.07999999999998, 'entropy = 0.0\nsamples = 1099\nvalue = [1099, 0]')]


The top two features to consider when deciding whether a mushroom is edible or not are odor and gill size.

# score the classifier on the testing data / labels:
# Returns the mean accuracy on the given test data and labels.

score = model.score(X_test, y_test)

print ("High-Dimensionality Score: ", round((score*100), 3))
High-Dimensionality Score: 100.0


Decision trees can be applied to both regression and classification problems, and are among the more classical approaches for regression and classification.

  • Trees are simple and very easy to explain. Useful for interpretation.
  • Often decision trees more closely mirror human decision-making than do the regression and classification approaches seen previously.
  • Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
  • Trees can easily handle qualitative predictors without the need to create dummy variables.

Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches seen in this book. They typically are not competitive with the best supervised learning approaches in terms of prediction accuracy.

Additionally, trees can be very non-robust. In other words, a small change in the data can cause a large change in the final estimated tree.

However, by aggregating many decision trees, using methods like bagging, random forests and boosting, the predictive performance of trees can be substantially improved, at the expense of some loss of interpretability.
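As a quick illustration of such an ensemble, here is a hedged sketch using scikit-learn's RandomForestClassifier on synthetic data (not the wheat or mushroom datasets above; `make_classification` generates made-up samples just to show the API):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic two-class data, for demonstration only
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# 100 trees, each trained on a bootstrap sample; predictions are aggregated
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```

The interface is the same as for a single DecisionTreeClassifier, which makes it easy to swap one for the other and compare.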

Classification metrics and Naive Bayes

We have seen how classification via logistic regression works; here we will look into a special classifier called Naive Bayes, and at the metrics used in classification problems, all through a text classification example.

We build an analytics model using text as our data, specifically trying to understand the sentiment of tweets about the company Apple. This is  a special classification problem, often called Sentiment Analysis.

The challenge is to see if we can correctly classify tweets as being negative, positive, or neutral about Apple.

The code is available as a Python notebook on GitHub.

Get the data

Twitter data is publicly available and you can collect it by using their API; in this case, only the content of the tweets, not sender nor date, is used. Language of the tweets is English.

The set is available in the Git repo, together with the Python code.
Here are other twitter data if you want to experiment more:

As usual we start by reading the data into a pandas data frame:

import pandas as pd  # Start by importing the tweets data
X = pd.read_csv('tweets.csv')
(1181, 2)
Index(['Tweet', 'Avg'], dtype='object')
RangeIndex: 1181 entries, 0 to 1180
Data columns (total 2 columns):
Tweet 1181 non-null object
Avg 1181 non-null float64
dtypes: float64(1), object(1)
memory usage: 18.5+ KB
Tweet Avg
0 I have to say, Apple has by far the best custo… 2.0
1 iOS 7 is so fricking smooth & beautiful!! #Tha… 2.0
3 Thank you @apple, loving my new iPhone 5S!!!!!… 1.8
4 .@apple has the best customer service. In and … 1.8

Data contains 1181 tweets (as text) and one sentiment label.
The sentiment label has been applied manually, as strongly negative, negative, neutral, positive or strongly positive (a discrete number on a scale from -2 to 2); each tweet's score is the average of five different ratings (which is why it ends up being a real number).


2 means very positive, 0 is neutral and -2 is very negative


The graph shows the distribution of the number of tweets classified into each of the categories.
We can see that the majority of tweets were classified as neutral (score = zero), with a small number classified as strongly negative or strongly positive.

Now we have a set of tweets that are labeled with their sentiment.
But how do we build independent features just from the text to be used to predict the sentiment?

Number of tweets by their sentiment average score

Process the data

Fully understanding text is difficult, but the Bag of Words (BoW) algorithm provides a very simple approach: it just counts the number of times each word appears in the text and uses these counts as the independent features.
For example, in the sentence, “This phone model is great. I would recommend it to my friends anytime” the word recommend is seen once, as well as the word great, etcetera.
In Bag of Words, there’s one feature for each “meaningful” word.
This is a very simple approach, but is often very effective and is commonly used as the  baseline in text analytics projects.
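In its barest form, Bag of Words needs nothing more than a word counter. A sketch on the example sentence above, using only the standard library:

```python
from collections import Counter

sentence = "This phone model is great. I would recommend it to my friends anytime"
# Lowercase and strip the full stop before counting
tokens = sentence.lower().replace('.', '').split()
bow = Counter(tokens)

print(bow['great'])      # 1
print(bow['recommend'])  # 1
```

Each distinct word becomes a feature, and its count becomes the feature value; the scikit-learn CountVectorizer used later does essentially this, at scale and with more options.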

I wrote that the words used for the features should be meaningful. Preprocessing the text can dramatically improve the performance of the Bag of Words method.
As we have seen before, routine preprocessing tasks clean up irregularities: removing the mixture of uppercase and lowercase letters, punctuation and unhelpful terms (you can refer to the links above for an explanation of all these Natural Language Processing terms).

First of all, we clean the tweets by lower-casing all the letters and removing punctuation and stop words, and finally by tokenising and stemming them.

corpusTweets = X.Tweet.tolist() # get a list of all tweets; it is then easier to apply preprocessing to each item

# Convert to lower-case
corpusLowered = [s.lower() for s in corpusTweets]

corpusLowered[0:5]  # check
['i have to say, apple has by far the best customer care service i have ever received! @apple @appstore',
 'ios 7 is so fricking smooth & beautiful!! #thanxapple @apple',
 'love u @apple',
 'thank you @apple, loving my new iphone 5s!!!!!  #apple #iphone5s',
 '.@apple has the best customer service. in and out with a new phone in under 10min!']

As you can see from the first five tweets, the capital letters are now gone; we will keep using these five tweets as a check, comparing them with the originals in the table above.

Note the variable name corpusTweets.
One of the NLP concepts is that of a corpus. A corpus is a collection of documents.
For our example, the corpus is our tweets.

Now we remove the punctuation, using a Regular Expression (RE):

# Remove punctuation

import re
corpusNoPunct = [re.sub(r'([^\s\w_]|_)+', ' ', s.strip()) for s in corpusLowered]
['i have to say apple has by far the best customer care service i have ever received apple appstore',
'ios 7 is so fricking smooth beautiful thanxapple apple',
'love u apple',
'thank you apple loving my new iphone 5s apple iphone5s pic twitter com xmhjcu4pcb',
' apple has the best customer service in and out with a new phone in under 10min ']

And then the stop words, which are common words such as articles or conjunctions that do not add much value.
First we define which common words are to be removed:

import os

def readStopwords():
    """
    Returns the stopwords as a list of strings.

    !! Assumes that a file called "stopwords.txt"
    exists in the folder
    """
    filename = "stopwords.txt"
    path = os.path.join("", filename)
    file = open(path, 'r')
    return  # splitlines is used to remove newlines

stopWords = set(readStopwords())
"the" in stopWords  # quick test

Let’s also remove other words that are common in this context (e.g. Apple or iPhone) but do not tell us anything about the sentiment:

stopWords.add("apple")
stopWords.add("iphone")

print ("apple" in stopWords)
print ("google" in stopWords)

To remove a word from the corpus if that word is contained in our stop words set, we need first to tokenise the corpus (i.e., split it into words or tokens):

# tokenise
corpusTokens = [s.split() for s in corpusNoPunct]
corpusTokens[0:3] # just the first three tweets as example
['love', 'u', 'apple']]

Lastly, an important preprocessing step is called stemming.
This step is motivated by the desire to represent words with different endings
as the same word.
We probably do not need to draw a distinction between argue, argued, argues, and arguing.
They could all be represented by a common stem: argu.
The algorithmic process of performing this reduction is called stemming.
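The real Porter algorithm (used below via NLTK) applies dozens of carefully ordered rules, but the core idea can be sketched with a single made-up suffix-stripping rule. This toy function is mine, for illustration only, and is not the actual Porter stemmer:

```python
def toy_stem(word):
    """Strip one common suffix, keeping a stem of at least 3 letters.
    A toy illustration only -- not the real Porter algorithm."""
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)]
    return word

# All four variants collapse to the same stem
print([toy_stem(w) for w in ["argue", "argued", "argues", "arguing"]])
# ['argu', 'argu', 'argu', 'argu']
```

The point is only that different endings of the same word map to one shared feature, shrinking the vocabulary.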

We use the Porter’s version from the NLTK library and we apply it, while removing the stop words at the same time:

# clean stop words and stem the corpus
from nltk import PorterStemmer
porter = PorterStemmer()

corpus = []
for tweet in corpusTokens:
    cleanTokens = [token for token in tweet if token not in stopWords] # a list of tokens
    stemmedTokens = [porter.stem(token) for token in cleanTokens]
    cleanTweet = ' '.join(stemmedTokens)
    corpus.append(cleanTweet)

['say far best custom care servic ever receiv appstor',
'7 frick smooth beauti thanxappl',
'love u',
'thank love new 5s iphone5 pic twitter com xmhjcu4pcb',
'best custom servic new phone 10min']

Now we can see that we have significantly fewer words, and we’re ready to extract the word frequencies to be used in our prediction problem.

Create a Document-Term matrix

In text mining, an important step is to create the document-term matrix (DTM) of the corpus we are interested in. A DTM is basically a matrix with the documents (in our case the tweets) designated by rows and the single words by columns, and where the matrix elements are the counts of those words in each tweet. I.e., if the word great appears 3 times in the tweet n, then the matrix will contain 3 as element in the row n and the column identifying the word great.

The scikit-learn package provides a function that generates the DT matrix where the rows correspond to documents and the columns correspond to words.

Let’s go ahead and generate this matrix.
We use the function CountVectorizer. It has several parameters; we will use the default values except for lowercase (we don't need it, we have already lowered our text) and we will limit the features extracted to the 500 most frequent words.
Limiting the features is a normal step to avoid a too complex model and usually only the most frequent words will help with the prediction.
The number of terms is an issue for two main reasons.
One is computational: more terms means more independent variables, which means it takes longer to build the model.
The other is that, in building models, the ratio of independent variables to observations will affect how well the model generalises.
As an alternative to limiting the number of features, one could also bound the minimum or maximum document frequency of the counted words (see the sklearn documentation).

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(lowercase=False, max_features=500)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=, encoding='utf-8', input='content',
lowercase=False, max_df=1.0, max_features=500, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
'apple' in cv.vocabulary_  # a quick test
['09', '10', '18xc8dk', '1za', '20', '2013', '20th', '244tsuyoponzu',
 '3g', '4sq', '5c', '5s', '64', '7evenstarz', 'act', 'actual', 'ad', 
 'adambain', 'add', 'again']

Above are the first twenty features (words) in alphabetical order counted in the corpus.

Now we use the vectoriser to transform the corpus into a sparse matrix (called bagOfWords) whose elements count how many times each feature (word) appears in each tweet.

bagOfWords = cv.transform(corpus)

<1181x500 sparse matrix of type '' with 5893 stored elements in Compressed Sparse Row format>

The matrix has 1181 rows (the number of tweets) and 500 columns (because we limited the features), and there are only 5893 non-zero elements in it. This is what we call sparse data: there are many zeros in our matrix.
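The sparsity follows directly from those numbers: 5893 stored elements out of 1181 × 500 possible cells means about 99% of the matrix is zeros.

```python
stored = 5893           # non-zero elements reported by the sparse matrix
cells = 1181 * 500      # 590,500 possible entries
sparsity = 1 - stored / cells
print(f"{sparsity:.1%} of the matrix is zeros")  # 99.0%
```

This is why a sparse representation pays off here: storing only the non-zero counts takes a tiny fraction of the memory a dense matrix would need.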

We can look at what the most popular terms are:

sum_words = bagOfWords.toarray().sum(axis=0)

words_freq = [(word, sum_words[idx]) for word, idx in cv.vocabulary_.items()]

words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)

words_freq[:10]  # top 10
[('new', 113),
('ly', 112),
('com', 109),
('twitter', 108),
('ipodplayerpromo', 102),
('5s', 97),
('phone', 91),
('pic', 84),
('get', 75),
('5c', 63)]

Now we need one last step before training the model with a classifier: convert the sparse matrix into a data frame that we’ll be able to use for our predictive models.

df = pd.DataFrame(bagOfWords.toarray())

(1181, 500)
RangeIndex: 1181 entries, 0 to 1180
Columns: 500 entries, 0 to 499
dtypes: int64(500)
memory usage: 4.5 MB


0 1 2 3 4 5 496 497 498 499
0 0 0 0 0 0 0 0 0 0
1 rows × 500 columns

Pretty boring data frame but we just need it to train the classifier so we don’t bother to rename columns, etc.

We start by splitting the tweets into training and test sets, as usual.
Note that the target variable (which was a real number, the average of the tweet ratings) is rounded to the nearest integer, so it is now reduced to exactly five classes: -2, -1, 0, 1, 2.

from sklearn.model_selection import train_test_split

X.Avg = [int(round(a)) for a in X.Avg] # cluster target into 5 classes

# note: pass random_state to make the split reproducible
# (seeding Python's random module would not affect sklearn)
X_train, X_test, y_train, y_test = train_test_split(df, X.Avg, test_size=0.25,
                                                    random_state=100)

(296, 500)

We have 296 tweets in the test dataset.
Our data is now ready, and we can build our predictive model.

In the next chapter I explain a bit how the Naive Bayes classifier works, but feel free to skip it if you just want to use it.

Naive Bayes classifier

Naive Bayes classifiers are built on rules derived from the Bayes theorem, which is an equation describing the relationship of conditional probabilities of statistical quantities.

Let’s start by taking a look at the Bayes equation, when we are interested in finding the probability of a label L given a word W,  in a set of documents, which we can write as:

P(L | W) = \frac{P( W | L) \cdot P(L) }{P(W)}

  • The term P(L) is our original belief, i.e., the prior probability of the document's label being positive or negative (in terms of sentiment). It is known as the Prior.
  • The term P(L|W) is the probability of label L given a word W in the documents. This term is also known as the Posterior (Latin for "after").
  • P(W|L) is the probability of a document containing a word W given a label L. It is known as the Likelihood.
  • P(W) is the probability that a given document has the word W.

You can think of the term Posterior as your updated rule or updated belief obtained by multiplying Prior and Likelihood.

But how do we find out P(W|L) and P(L)? This is exactly where bag of words will come in handy.

P(L) is basically concerned with this question : “How often does this label occur?”

For example, you want to know if a tweet is positive.
You have 100 documents or tweets in the training set and you have only two words in the corpus: “innovative” and “awesome”.
Out of these 100 documents, 60 documents are labeled as positive and the remaining 40 are labeled as negative.
So P(+) = 0.6 and P(−) = 0.4

Further, out of 60 positively labeled documents, 48 of them contain both words  “innovative” and “awesome”. While the remaining twelve tweets do not have both words.
So, the probability P(W|+) = P(innovative AND awesome | +) = 48 / 60

Finally, only 4 tweets out of the 40 negative-labelled tweets have both the words awesome and innovative and the remaining thirty-six negative tweets do NOT have both words.
Therefore, the probability P(W|-) = 4 / 40

It only remains to find P(W), the probability that a tweet has both words.
Well, 52 of the 100 tweets have both words: the 48 positive-labelled ones plus the 4 negative ones, so P(W) = 52/100.

Our Bayes equation can be therefore written as:

P(+|W) = \frac{P(W|+) \cdot P(+) }{P(W)}

= \frac{48/60 \cdot 60/100 }{ 52/100 }  = 48/52 = 0.92

that tells us a (new) tweet with the words innovative and awesome has a 92% probability of having a positive sentiment.
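The arithmetic above is easy to check in a couple of lines:

```python
p_w_given_pos = 48 / 60   # P(W|+): positive tweets containing both words
p_pos = 60 / 100          # P(+): prior for the positive label
p_w = 52 / 100            # P(W): tweets containing both words

# Bayes rule: P(+|W) = P(W|+) * P(+) / P(W)
posterior = p_w_given_pos * p_pos / p_w
print(round(posterior, 2))  # 0.92
```

Note that the numerator simplifies to 48/100 (the fraction of tweets that are both positive and contain the two words), so the posterior is just 48/52.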

This example considered only two words. Remember that you have to compute the likelihood probabilities for all the words. So, in a scenario where you have 1000 total words and each document contains 100 words on average, the total number of combinations will be 1000^100, which is an insanely large number!

Here is where the Naive Bayes classifier comes to rescue. Its Conditional Independence assumption states that the value of a particular feature is independent of the value of any other feature, given the class variable, i.e. the feature probabilities P(xi|cj) – not P(x1), P(x2) but P(x1|cj), P(x2|cj) – are independent of each other.

For example, a fruit may be considered to be an apple if it is red, round and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness and diameter features.

This assumption concretely means that we can exchange the following expressions:

P(x_{1}, ..., x_{n} | c) = P(x_{1} | c)\cdot P(x_{2} | c) \cdot ... \cdot P(x_{n} | c)

So, naturally, the X^n combinations get reduced to X·n per-class word probabilities, which is exponentially less.

Naive Bayes has two advantages:

  • Reduced number of parameters.
  • Linear time complexity as opposed to exponential time complexity.

Moreover, the classifier needs to assume a hypothetical random process that generates the data. For example, in the Gaussian naive Bayes classifier, the assumption is that data from each label is drawn from a simple Gaussian distribution.

Another useful example is the multinomial naive Bayes, the one we will use, where the features are assumed to be generated from a simple multinomial distribution. The multinomial distribution describes the probability of observing counts among a number of categories, and thus multinomial naive Bayes is most appropriate for features that represent counts or count rates and is often used is in text classification, where the features are related to word counts or frequencies within the documents to be classified.
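To make the multinomial model concrete, here is a minimal from-scratch sketch with made-up toy tweets and Laplace smoothing. This is an illustration of the idea only, not the sklearn implementation used below:

```python
from collections import Counter, defaultdict
import math

# Hypothetical labelled tweets, for illustration only
train_docs = [("love great phone", "pos"), ("great awesome", "pos"),
              ("hate bad phone", "neg"), ("bad awful", "neg")]

def train_mnb(docs):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    """Pick the label maximising log P(label) + sum of log P(word|label)."""
    n_docs = sum(class_counts.values())
    best_label, best_logp = None, float("-inf")
    for label, count in class_counts.items():
        logp = math.log(count / n_docs)          # log prior
        total = sum(word_counts[label].values())
        for word in text.split():
            if word in vocab:                    # ignore unseen words
                # Laplace smoothing avoids zero probabilities
                logp += math.log((word_counts[label][word] + 1)
                                 / (total + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

model = train_mnb(train_docs)
print(predict("great phone", *model))  # pos
```

Working in log space turns the product of the factorised probabilities into a sum, which avoids numerical underflow on long documents.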

Train and test the classifier

Now we train the sklearn MultinomialNB classifier (a Naive Bayes  with multiple classes):

from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB(), y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
predictions = classifier.predict(X_test)
array([ 0,  0,  1,  0, -1,  0,  0, -1,  0,  0,  0,  1,  0,  0, -1, -1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0, -1,  0, -1, -1, -1,  0,
        0, -1, -1,  0,  0,  0,  0, -1,  1,  1,  0,  0,  0, -1, -1,  1,  0,
        0,  1, -1,  0,  0, -1, -1,  0,  0, -1, -1,  0,  0,  0, -1, -1,  0,
       -2,  1,  0,  0, -1,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0, -1,  1,
        0,  1,  0,  1,  0,  0,  0, -1,  1,  0,  0, -1, -1,  0, -1])

Metrics: accuracy and confusion matrix

We can check the quality of the predictions by using the scikit-learn metrics, specifically the accuracy (we will see in a moment what it means):

from sklearn import metrics

# Model Accuracy, how often is the classifier correct?
print("Accuracy: {:.2}".format(metrics.accuracy_score(y_test, predictions)))
Accuracy: 0.64

The classifier was correct 64% of times (not only if a tweet was negative but also if it was strongly negative or moderately negative).
A very useful metric is the confusion matrix that displays the predictions and the actual values in a table:

mat = metrics.confusion_matrix(y_test, predictions)
array([[  4,   5,   3,   0,   0],
       [  2,  42,  28,   5,   1],
       [  1,  26, 128,   9,   2],
       [  0,   8,  16,  14,   0],
       [  0,   1,   1,   0,   0]])

It’s more clear if we visualise it as a heat map:

import matplotlib.pyplot as plt

labels = ['strongly neg.', 'negative', 'neutral', 'positive', 'strongly pos.']
fig = plt.figure()
ax = fig.add_subplot(111)
cm = ax.matshow(mat)
# plot the title, use y to leave some space before the labels
plt.title("Confusion matrix - Tweets arranged by sentiment", y=1.2)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.setp(ax.get_xticklabels(), rotation=-30, ha="right",
         rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
for i in range(len(mat)):
    for j in range(len(mat)):
        text = ax.text(j, i, mat[i, j],
                       ha="center", va="center", color="w")
# Create colorbar
fig.colorbar(cm)

On the diagonal: the correct predictions

The numbers in the diagonal are all the times when the predicted sentiment for a tweet was the same as the actual sentiment.
Now we can define accuracy as the sum of all the values in the diagonal (these are the observations we predicted correctly) divided by the total number of observations in the table.
The best accuracy would be 1.0 when all values are on the diagonal (no errors!), whereas the worst is 0.0 (nothing correct!):

correctPredictions = sum(mat[i][i] for i in range(len(mat)))
print("Accuracy: {:.2}".format(correctPredictions / len(y_test)))
Accuracy: 0.64

Which is the same value as above.

A simple baseline

Now, how good is this accuracy?
Let’s compare it to a simple baseline model that always predicts neutral (the most common class of tweets in the test dataset).

neutralTweets = sum(1 for sentiment in y_test if sentiment == 0)  # neutral tweets in Test dataset
len(y_test) - neutralTweets

This tells us that our test dataset has 166 observations with neutral sentiment and 130 tweets that are positive or negative.
So the accuracy of a baseline model that always predicts neutral would be:

print("Accuracy baseline: {:.2}".format(neutralTweets / len(y_test)))
Accuracy baseline: 0.56

So our Naive Bayesian model does better than the simple baseline.

By using a bag-of-words approach and a Naive Bayes model, we can reasonably predict sentiment with a relatively small data set of tweets.

Predict the sentiment of a new tweet

The classifier can be applied to new tweets, of course, to predict their sentiment:

# for simplicity, it re-uses the vectoriser and the classifier without passing them
# as arguments. Industrialising it would mean creating a pipeline with
# vectoriser > classifier > label string
def predictSentiment(t):
    bow = cv.transform([t])
    prediction = classifier.predict(bow)
    if prediction[0] == 0:
        return "Neutral"
    elif prediction[0] > 0:
        return "Positive"
    else:
        return "Negative"

predictSentiment("I don't know what to think about apple!")

OK. We now try two more tweets, one positive and one negative, and see what we get:

predictSentiment("I love apple, its products are always the best, really!")
predictSentiment("Apple lost its mojo, I will never buy again an iphone better an Android")

So far so good. One thing we can note from the confusion matrix – specifically the diagonal – is that the strongly positive or strongly negative sentiments are harder to predict; in general you get better results with fewer classes.

Binary Classification

Now, this is a more generic case, where we have 5 classes as target.
The case you see more often is the binary one – with only two classes – which has some special characteristics and metrics.
Let’s convert our target into a binary one: a tweet can be either negative or not negative (i.e., positive or neutral).

First of all, we need to transform our original dataset to reduce the sentiment classes to only two classes:

X.loc[X.Avg < 0, 'Avg'] = -1 # negative sentiment
X.loc[X.Avg >= 0, 'Avg'] = 1 # NON-negative sentiment

We need then to re-apply the classifier:

X_train, X_test, y_train, y_test = train_test_split(df, X.Avg, test_size=0.25)
classifier.fit(X_train, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
predictionsTwo = classifier.predict(X_test)

array([ 1,  1,  1, -1, -1,  1,  1,  1, -1,  1, -1,  1,  1,  1,  1, -1,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1, -1,  1,  1, -1,  1,  1, -1,  1,
       -1,  1,  1,  1, -1,  1,  1, -1,  1,  1, -1,  1, -1,  1,  1,  1, -1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1, -1,  1,  1,  1,
       -1,  1,  1,  1,  1,  1, -1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1])

As you can see, there are no more classes 2, 0 or -2 now.

# Model Accuracy, how often is the classifier correct?
print("Accuracy: {:.2}".format(metrics.accuracy_score(y_test, predictionsTwo)))
Accuracy: 0.79

Of course it is better now: with fewer classes to predict there are fewer errors to make.
Let’s see what the confusion matrix for a binary classification looks like:

matBinary = metrics.confusion_matrix(y_test, predictionsTwo)
array([[ 45,  40],
       [ 22, 189]])
labels = ['negative', 'NOT negative']
fig = plt.figure()
ax = fig.add_subplot(111)
cm = ax.matshow(matBinary)
# plot the title, use y to leave some space before the labels
plt.title("Confusion matrix - Tweets arranged by sentiment", y=1.2)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.setp(ax.get_xticklabels(), rotation=-30, ha="right",
         rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
for i in range(len(matBinary)):
    for j in range(len(matBinary)):
        text = ax.text(j, i, matBinary[i, j],
                       ha="center", va="center", color="w")
# Create colorbar
fig.colorbar(cm)

Now with two classes

In a two-class problem, we often want to discriminate observations with a specific outcome from normal observations: disease versus no disease, or spam versus no-spam.
One class is the positive event and the other is the no-event, the negative event.

In our case, let’s say the negative event is the negative tweet and the positive event is the NON-negative tweet.

These are basic terms used in binary classification:

  • “true positive” for correctly predicted event values (in our scenario the non-negative tweets: positive or neutral).
  • “true negative” for correctly predicted no-event values (in our scenario the negative tweets).
  • “false positive” for incorrectly predicted event values. In Hypothesis Testing it is also known as a Type 1 error: the incorrect rejection of the Null Hypothesis.
  • “false negative” for incorrectly predicted no-event values. It is also known as a Type 2 error: the failure to reject a false Null Hypothesis.
tn, fp, fn, tp = matBinary.ravel()

print("True Negatives: ",tn)
print("False Positives: ",fp)
print("False Negatives: ",fn)
print("True Positives: ",tp)
True Negatives: 45
False Positives: 40
False Negatives: 22
True Positives: 189

Accuracy can be re-formulated as the ratio between the correct predictions (true positives and true negatives) and the total number of predictions:

Accuracy = (tn+tp)/(tp+tn+fp+fn)
print("Accuracy: {:.2f}".format(Accuracy))
Accuracy: 0.79

Accuracy is not a reliable metric for the real performance of a classifier, because it will yield misleading results if the data set is unbalanced (that is, when the numbers of observations in different classes vary greatly).

Then you may consider additional metrics such as Precision, Recall and the F score (a combined metric):
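As a quick sketch of how these metrics behave (with made-up labels, not our tweet predictions), scikit-learn computes them directly:

```python
# Hypothetical true labels and predictions (1 = positive event, 0 = negative)
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

print(precision_score(y_true, y_pred))  # tp / (tp + fp) -> 0.8
print(recall_score(y_true, y_pred))     # tp / (tp + fn) -> 0.8
print(f1_score(y_true, y_pred))         # harmonic mean of the two -> 0.8
```

We will now define each of these metrics in turn.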

Sensitivity or Recall

Recall, also known as Sensitivity or the True Positive Rate, measures ‘completeness’: the ability of the model to identify all relevant instances.
Imagine a scenario where your focus is to have the least False Negatives, for example a medical screening test where you do not want to miss any actual disease cases. Then Sensitivity comes to your rescue:

Sensitivity = tp/(tp+fn)
print("Sensitivity {:0.2f}".format(Sensitivity))
Sensitivity 0.90

Sensitivity is a real number between 0 and 1. A sensitivity of 1 means that ALL the Positive cases have been correctly classified.

Specificity is the counterpart for the negative class (the True Negative Rate): the ability of the model to correctly identify the no-events:

Specificity = tn/(tn+fp)
print("Specificity {:0.2f}".format(Specificity))
Specificity 0.53

ROC (Receiver Operating Characteristic) curve

Until now, we have seen classification problems where we predict the target class directly.

Sometimes it can be more insightful or flexible to predict the probabilities for each class instead. On one side you get an idea of how confident the classifier is for each class; on the other side you can use the probabilities to calibrate the threshold at which a prediction is interpreted as one class or the other.

For example, in a binary classifier the default is to use a threshold of 0.5, meaning that a probability less than 0.5 is a negative outcome and a probability equal or over 0.5 is a positive outcome.
But this threshold can be adjusted to tune the behaviour of the model for the specific problem, e.g. to reduce more of one or another type of error, as we have seen above.
Think about a classifier that predicts whether an event is a nuclear attack or not: clearly you want as few false alarms as possible!
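As a minimal sketch (with hypothetical probabilities, not the classifier above), adjusting the threshold just means comparing the predicted probabilities against a different cut-off:

```python
import numpy as np

# hypothetical predicted probabilities for the positive class
proba_positive = np.array([0.30, 0.45, 0.55, 0.80, 0.95])

default = (proba_positive >= 0.5).astype(int)   # standard 0.5 threshold
cautious = (proba_positive >= 0.7).astype(int)  # stricter: fewer positive calls

print(default)   # [0 0 1 1 1]
print(cautious)  # [0 0 0 1 1]
```

Raising the threshold trades false positives for false negatives, and vice versa.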

A diagnostic tool that helps in choosing the right threshold is the ROC curve.

This is the plot of the ‘True Positive Rate’ (Sensitivity) on the y-axis against the ‘False Positive Rate’ (1 minus Specificity) on the x-axis, at different classification thresholds between 0 and 1.

It captures all the thresholds simultaneously, and the area under the ROC curve measures how well the classifier can distinguish between the two groups.
A threshold of 1 is at the axis origin (0,0), while a threshold of 0 is at the top right end of the curve.

  • High threshold: means high specificity and low sensitivity
  • Low threshold: means low specificity and high sensitivity

Put another way, it plots the false alarm rate versus the hit rate.
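To make this concrete, here is a small sketch (with hypothetical labels and probabilities) computing the hit rate (TPR) and the false alarm rate (FPR) at a few thresholds:

```python
import numpy as np

# hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.2, 0.4, 0.6, 0.5, 0.7, 0.9])

for thr in (0.3, 0.5, 0.7):
    pred = (proba >= thr).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fn = np.sum((pred == 0) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    fp = np.sum((pred == 1) & (y_true == 0))
    print(thr, tp / (tp + fn), fp / (fp + tn))  # threshold, TPR, FPR
```

Note how raising the threshold lowers both the false alarm rate and the hit rate: exactly the trade-off the ROC curve visualises.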

Let’s see an example using our binary classification above.
First, we need probabilities to create the ROC curve.

probs = classifier.predict_proba(X_test) # get the probabilities

preds = probs[:,1]  # keep probabilities for the positive outcome only
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)  # calculate roc
roc_auc = metrics.auc(fpr, tpr)  # calculate AUC

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.plot([0, 1], [0, 1],'r--')  # plot random guessing

plt.legend(loc = 'lower right')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

ROC curve in blue, in red the baseline

The ROC curve is a useful tool for a few reasons:

  • The curves of different models can be compared directly in general or for different thresholds.
  • The area under the curve (AUC) can be used as a summary of the model skill.

A random guessing classifier (the red line above) has an Area Under the Curve (often referred to as AUC) of 0.5, while the AUC for a perfect classifier is equal to 1.
In general, an AUC above 0.8 is considered “good”.
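The AUC can also be computed directly with scikit-learn; a tiny sketch with made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of the positive class

print(roc_auc_score(y_true, scores))  # 0.75: better than random guessing (0.5)
```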

Looking at the ROC curve you can choose a threshold that gives a desirable balance between the:

  • cost of failing to detect positive
  • cost of raising false alarms

Precision and F1 score metrics

Precision, also called Positive Predictive Power, measures the ‘exactness’ of the model: its ability to return only relevant instances. If your use case/problem statement involves minimising the False Positives, then Precision is what you need:

# Precision
Precision = tp/(tp+fp)
print("Precision or Positive Predictive Power: {:0.2f}".format(Precision))
Precision or Positive Predictive Power: 0.83

Similarly, you can calculate the Negative Predictive Power:

# Negative Predictive Value
print("Negative predictive Power: {:0.2f}".format(tn / (tn+fn)))
Negative predictive Power: 0.67

The F1 score is the harmonic mean of the Precision & Sensitivity, and is used to indicate a balance between them. It ranges from 0 to 1; F1 Score reaches its best value at 1 (perfect precision & sensitivity) and worst at 0.

# F1 Score
f1 = (2 * Precision * Sensitivity) / (Precision + Sensitivity)
print("F1 Score {:0.2f}".format(f1))
F1 Score 0.86

What do we use the ROC for?

We can use this insight to tune the classifier, for example by shifting the class priors to give more weight to the negative class and so improve the specificity:

classifierTuned = MultinomialNB(class_prior=[.4, .6])  # try to max specificity
classifierTuned.fit(X_train, y_train)
predictionsTuned = classifierTuned.predict(X_test)
matTuned = metrics.confusion_matrix(y_test, predictionsTuned)
array([[ 53,  32],
       [ 36, 175]])
tn, fp, fn, tp = matTuned.ravel()

Accuracy = (tn+tp)/(tp+tn+fp+fn)
print("Accuracy: {:.2f}".format(Accuracy))  # it was 0.79
Accuracy: 0.77

Sensitivity = tp/(tp+fn)
print("Sensitivity {:0.2f}".format(Sensitivity))  # it was 0.90
Sensitivity 0.83

Specificity = tn/(tn+fp)
print("Specificity {:0.2f}".format(Specificity))  # it was 0.53
Specificity 0.62

We have greatly improved the specificity at the cost of a smaller decrease of the sensitivity and the accuracy.

And the metrics for multiple classes?

In a 2×2, once you have picked one category as positive, the other is automatically negative. With 5 categories, you basically have 5 different sensitivities, depending on which of the five categories you pick as “positive”. You could still calculate their metrics by collapsing to a 2×2, i.e. Class1 versus not-Class1, then Class2 versus not-Class2, and so on, as we did above.

You can actually compute sensitivity and specificity regardless of the number of classes. The only difference is that you get one specificity, sensitivity, accuracy and F1-score for each of the classes; to report a single number, you can average these values.

We do these calculations for each class separately, then average the measures to get the average precision and the average recall. I leave this as an exercise for you.
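As a starting point for that exercise, here is a sketch (with hypothetical labels for a three-class problem) of per-class and macro-averaged metrics in scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

print(precision_score(y_true, y_pred, average=None))   # one value per class
print(recall_score(y_true, y_pred, average='macro'))   # averaged over classes
```

`average=None` returns the metric for each class separately, while `average='macro'` takes their unweighted mean.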

Note: this post is part of a series about Machine Learning with Python.

MonteCarlo and Pi

This is a post to introduce a couple of probability concepts that are useful for machine learning. To make it more interesting, I am mixing in it some MonteCarlo simulation ideas too!

To see examples in Python we need first to introduce the concept of random numbers.

Stochastic vs deterministic numbers

The English word stochastic is an adjective describing something that was randomly determined. It originally came from Greek στόχος (stokhos), meaning ‘aim, guess’.

On the other hand, deterministic means that the outcome – given the same input – will always be the same. There is no unpredictability.

Random number

Random number generation is a big area by itself; for our scope it is enough to say that randomness is the lack of pattern or predictability in events. A random sequence of events therefore has no order and does not follow an intelligible pattern.

Individual random events are by definition unpredictable, but in many cases the frequency of different outcomes over a large number of events is predictable.
And this is what interests us: if I throw a six-faced die thousands of times, what percentage of the time should I expect to see the face number six?
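We can already sketch the answer with a quick simulation; the observed frequency should approach 1/6 ≈ 0.167:

```python
import random

random.seed(42)  # make the run reproducible

throws = 100_000
sixes = sum(1 for _ in range(throws) if random.randint(1, 6) == 6)
print(sixes / throws)  # close to 1/6
```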

Let’s see some practical examples with Python. As usual they are also available in a notebook. Continue reading “MonteCarlo and Pi”

Multi-class logistic regression

We have seen several examples of binary logistic regression where the outcomes that we wanted to predict had two classes, such as a model predicting if a student will be admitted to the University (Yes or No) based on the previous exam results, or if a random Titanic passenger will survive or not.

Binary classifications such as these are very common, but you can also encounter classification problems where the outcome has more than two classes: for example whether tomorrow’s weather will be sunny, cloudy or rainy; or whether an incoming email shall be tagged as work, family, friends or hobby.

We now see a couple of approaches to handle such classification problems with a practical example: classifying a sky object based on a set of observed variables.
Data is from the Sloan Digital Sky Survey (Release 14).
For the example that we are using the sky object to be classified can be one of three classes: Star, Galaxy or Quasar.

The code is available also on a notebook in GitHub. Data is also available on GitHub. Continue reading “Multi-class logistic regression”

Introduction to time series – Part II: an example

Exploring a milk production Time Series

Time series models are used in a wide range of applications, particularly for forecasting, which is the goal of this example, performed in four steps:

– Explore the characteristics of the time series data.
– Decompose the time series into trend, seasonal components, and remainder components.
– Apply time series models.
– Forecast the production for a 12 month period.

Part I of this mini-series explored the basic terms of time series analysis.

Load and clean the data

The dataset is the production amount of several dairy products in California, month by month, for 18 years.
Our goal: forecast the next year production for one of those products: milk.

You can follow along with the associated notebook in GitHub. Continue reading “Introduction to time series – Part II: an example”

Introduction to time series – Part I: the basics

A Time series is a data set collected through time.

What makes it different from other datasets that we used for regular regression problems are two things:

  1. It is time dependent. So the basic assumption of a linear regression model that the observations are independent doesn’t hold in this case.
  2. Most time series have some form of trend – either an increasing or decreasing trend – or some kind of seasonality pattern, i.e. variations specific to a particular time frame.

Basically, this means that the present is correlated with the past.
A value at time T is correlated with the value at T minus 1, but it may also be correlated with the value at time T minus 2, maybe not quite as much as with T minus 1.
Even 20 time steps behind, we could still know something about the value at T, because they are still correlated, depending on the kind of time series.
And this obviously is not true with normal random data.
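A quick sketch illustrates this with synthetic data: a series with a trend is strongly correlated with its own past, while pure random noise is not (comparing the lag-1 autocorrelation of both):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(200)
trended = pd.Series(0.5 * t + rng.normal(0, 1, 200))  # upward trend + noise
noise = pd.Series(rng.normal(0, 1, 200))              # pure random data

print(trended.autocorr(lag=1))  # close to 1
print(noise.autocorr(lag=1))    # close to 0
```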

Time series are everywhere, for example in:

  • Financial data (stocks, currency exchange rates, interest rates)
  • Marketing (click-through rates for web advertising)
  • Economics (sales and demand forecasts)
  • Natural phenomenon (water flow, temperature, precipitation, wind speed, animal species abundance, heart rate)
  • Demographics and population data, and so on.

What might you want to do with time series?

  • Smoothing – extract an underlying signal (a trend) from the noise.
  • Modelling – explain how the time series arose, for intervention.
  • Forecasting – predict the values of the time series in the future.

We first see here which specific characteristics the Time Series (TS) have, and will then see in a second part a concrete example of TS analysis (smoothing + modelling + forecasting).

You can follow along with the associated notebook in GitHub. Continue reading “Introduction to time series – Part I: the basics”