We have seen what is linear regression, how to make models and algorithms for estimating the parameters of such models, how to measure the loss.

Now we see how to assess how well the considered method should perform in predicting new data, how to select amongst possible models to choose the best performing.

We will first explore the concept of training and test error, how they vary with model complexity and how they might be utilised to form a valid assessment of predictive performance. This leads directly to an important bias-variance tradeoff, which is fundamental to machine learning.

Variables can be described as either quantitative or qualitative.
Quantitative variables have a numerical value, e.g. a person’s income, or the price of a house. Qualitative variables have a values taken from one of different classes or categories. E.g., a person’s gender (male or female), the type of house purchased (villa, flat, penthouse, …) the colour of the eye (brown, blue, green) or a cancer diagnosis.

Linear regression predicts a continuous variable but sometime we want to predict a categorical variable, i.e. a variable with a small number of possible discrete outcomes, usually unordered (there is no order among the outcomes).

This kind of problems are called Classification.

Classification

Given a feature vector X and a qualitative response y taking values from one fixed set, the classification task is to build a function f(X) that takes as input the feature vector X and predicts its value for y.
Often we are interested also (or even more) in estimating the probabilities that X belongs to each category in C.
For example, it is more valuable to have the probability that an insurance claim is fraudulent, than if a classification is fraudulent or not.

There are many possible classification techniques, or classifiers, available to predict a qualitative response.

I recently watched parts of both Google and Microsoft developer conferences (respectively Build 2017 and I/O 2017).
As expected, there was big emphasis on Artificial Intelligence but, in all, I liked more the Microsoft’s one while the Google’s felt too heterogeneous and without real meat (the new capabilities from Google Lens have been available e.g. at Baidu since years).

Got this morning the first draft 12 chapters of Prof. Andrew Ng‘s new book titled “Machine Learning Yearning – Technical strategy for AI engineers, in the era of deep learning”.

The book aims to help readers to quickly become better at building AI systems by gaining the practical skills needed when organising a machine learning project.
The book assumes readers are already familiar with machine learning concepts and does not go into technical explanations of how they work.

These first chapters look great, I think this book will help to close the gap between machine learning knowledge and proper execution.

My favourite chapter is the Ninth: Optimizing and satisficing metrics which suggests how to handle the problem of establishing a single-number evaluation metric when the interesting metrics are not compatible:

Suppose you care about both the accuracy and the running time of a learning algorithm.

It seems unnatural to derive a single metric by putting accuracy and running time into a single formula, such as:

Accuracy – 0.5*RunningTime

Here’s what you can do instead:
First, define what is an “acceptable” running time. Let’s say anything that runs in 100ms is acceptable.
Then, maximize accuracy, subject to your classifier meeting the running time criteria.
Here, running time is a “satisficing metric”— your classifier just has to be “good enough” on this metric, in the sense that it should take at most 100ms. Accuracy is the “optimizing metric.”

P.S. I know, satisficing is not a common word and is marked wrong by spelling checkers but it really exists ! I had to look it myself but here is the definition:
To satisfice means to choose or adopt the first option fulfilling all requirements that one comes across (in contrast to look for the optimal one).

In the previous post we have seen how an artificial neuron works to classify simple functions and we have also seen its limitations when the space function to be classified is not linear, e.g. it fails to compute correctly the XOR function.

We have also seen that the way to overcome this limitation is to combine more neurons together in a network but one thing we did not yet seen was how to compute the “weights” of the network (they were hard-coded in the previous example).

As a reminder, here is the perceptron able to compute an OR function:

The input values are multiplied with specific weights and combined together into an activation function which outputs the OR function.
The “artificial neural network” (ANN) is really just this matrix of weights. Here the matrix has only one column (the OR function is very simple) but we may have “layers” and layers of neurons and their weights.
The values calculated are just transient values based on the input dataset. We don’t save them, they are forgotten.
All of the learning is stored in the weight matrix.

Like similar supervised models, the ANNs can be trained to learn the weights using some labeled training data (the “supervised” data) and once the weights matrix is learnt, we can use it to compute the function that we trained for.

Sigmoid neurons

Before seeing how the weights can be learned, we will slightly tune the activation function (the red circle in the figure above) in order to make the learning phase easier.
In the perceptron example, this activation function was a simple threshold one, outputting 0 if the input was below a threshold and 1 otherwise. Instead, we will use a sigmoid function which has the advantage that small changes in the weights and bias cause only a small change in their output. The sigmoid function – often called the logistic function – is defined by:

The sigmoid activation function is always positive, bounded between 0 and 1 and strictly increasing.

And this is its Python implementation:

import numpy as np
def sigmoid(z):
"""
Compute the sigmoid function
Argument:
z: float number
Returns:
float
"""
return 1 / (1 + np.exp(-z))

The sigmoid function can be thought as a “squashing function”, because it takes the input and squashes it to be between zero and one: very negative values are squashed towards zero and positive values get squashed towards one.
For example we have:

sigmoid(-5) = 0.007; sigmoid(5) = 0.993

The output of a sigmoid neuron

The output of a sigmoid neuron is obviously not 0 or 1 but is a real number between 0 and 1. This can be useful, for example, if we want to use the output value to represent the intensity of the pixels in an image input to a neural network. But sometimes it is not, in fact to model an OR gate we need to have as output 0 or 1, as it was in the perceptron.
In practice we can use a convention to deal with this, for example, by deciding to interpret any output of at least 0.5 as indicating a “1”, and any output less than 0.5 as indicating a “0”.

We can quickly verify that the original perceptron simulating the OR function is still working with this new activation function:

def sigmoidActivation(z):
"""
Compute the activation function for boolean operators using sigmoid
Since a sigmoid function is always between 0 and 1,
we just cut off at 0.5
Arguments:
z: input data points including bias, array
Returns:
integer (0 or 1)
"""
if sigmoid(z) > 0.5:
return 1
else:
return 0
def predictOneLayer(X, weights):
"""
Compute the binary output of an Artificial Neural Network with single layer,
given its neuron weights and an input dataset
Arguments:
X: input data points without bias, list
weights: the single-layer weights, array
Returns:
integer (0 or 1)
"""
X = [1] + X # add bias to the input list
# forward propagation
y = np.dot(X, weights).sum()
return sigmoidActivation(y)
def forwardOR(operands):
"""
Simulate an OR function using a simple forward neural network
(single layer)
Arguments:
operands: input data points without bias, list
Returns:
integer (0 or 1)
"""
m = len(operands) # number of training examples
ORweights = [-0.5] + [1]*m # hard-coded weights
return predictOneLayer(operands, ORweights)
print("*** Simulation of OR using one forward neuron ***")
print("X1 | X2 | Y = X1 OR X2")
print("-----------------------")
for x1 in (0,1):
for x2 in (0,1):
print(x1," | ",x2, " | ", forwardOR([x1,x2]))

And here is its output:

*** Simulation of OR using one forward neuron ***
X1 | X2 | Y = X1 OR X2
-----------------------
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 1

The OR function still works correctly using sigmoid neurons.

As you notice, the weights in this OR example above are hard-coded because we can get them manually very quick since the example is pretty straightforward. But this is obviously not scalable as the complexity of a function is growing.
Therefore we will see now how we can find the appropriate weights to model a simple function like the OR.

Training the network

What we need is an algorithm that will find the appropriate weights (remember: they ARE the artificial neural network) programmatically for us so that the output from the network approximates the expected y(x) for all training inputs x.

Cost function

To quantify how well we’re achieving this goal we will define a cost function, sometimes referred to as a loss or objective function.
We have already seen examples of cost functions in the regression models and it can be a complex function, depending on the domain – a common choice is the categorical cross-entropy loss – but for this simple example we just stick to the difference between the network output and the training output:

error = actual output(label) – predicted output

Learning the weights

The output of the ANN is:

output = h(W * X + w0)

where X is the input, w0 is the bias, W are the weights and h is the non-linear activation function.

Learning the parameters for our network means finding weights that minimise the error on our training data. By finding parameters that minimise the error we maximise the likelihood of our training data.
Designing and training a neural network is not much diﬀerent from training any other machine learning model with gradient descent.

Sigmoid derivative

The gradient descent works by using the derivative function. The derivatives of the sigmoid with respect to the inputs and the weights are very simple:

therefore its Python implementation is clear-cut:

def sigmoidDeriv(s):
"""
Compute the derivative of a sigmoid function
Argument:
s: float number
Returns:
float
"""
return s * (1-s)

Now we have everything to explore the learning algorithm for artificial neural network.

The learning algorithm

The general idea is to forward-feed the input dataset to an initial set of weights to compute the output and the error, then back-propagate the error using the gradient (this is the “core” part!) and update the weights until the error is acceptable:

0. start with X and y (input and output training data)
1. initialise weights Wij randomly
2. loop until error < tolerance or max iterations:
3. for each training example Xi:
4. forward: out = theta(Wij * Xi) # theta is the activation function
5. compute the error
6. backward: compute gradient
7. update all weights Wij using the gradient
8. return the final weights

We will see it in details line by line line but here is the entire working Python implementation in the case of one single layer of neurons:

def learnWeightsOneLayer(X, y, tolerance = 0.1, max_iterations = 1000):
"""
Learn the weights for an artificial neural network with single layer,
given a training set of input and output values.
Uses a backpropagation algorithm.
Arguments:
X: input data points without bias, matrix
y: output data points, array
tolerance: minimum error to stop the training, float. Optional (default=0.1)
max_iterations: maximum number of iterations performed by the algorithm.
Optional. Default=1000
Returns:
weights, array
"""
# input and output training sizes
n_inputs = X.shape[1]
n_outputs = y.shape[1]
n_examples = X.shape[0]
# add bias
bias = np.ones(n_examples)
X = np.c_[bias, X] # add bias column to X
# initialization
converged = False
i = 0 # iteration counter
# initialize weights randomly with mean 0
w = 2 * np.random.random((n_inputs+1, n_outputs)) - 1
# main loop until converged or max iterations reached
while not converged and i <= max_iterations:
# note: all training examples are used together
# forward propagation
pred_y = sigmoid(np.dot(X, w)) # predicted output
# how much did we miss? Compute the error
error = y - pred_y
# backpropagation: multiply how much we missed by the
# slope of the sigmoid at the values in pred_y
delta = error * sigmoidDeriv(pred_y)
# update weights
w += np.dot(X.T, delta)
# did we converge?
maxError = max(error) # biggest error among training examples
if abs(maxError) < tolerance:
converged = True # yes !
print ("** converged after iterations: ",i)
i += 1 # no convergence yet: next iteration
# end of loop
if not converged:
print("** Not converged!")
return w

And this is how it works with a simple two-values input and an OR output:

# seed random numbers to make calculation
# deterministic (just a good practice when testing / debugging)
np.random.seed(1)
print("*** Simulation of OR gate using neural network ***")
# input dataset: 4 training examples
X_train = np.array([ [0,0], [0,1], [1,0], [1,1] ])
# output dataset: the expected results for each training input
y_train = np.array([[0,1,1,1]]).T
# learn the weights through training!
ORweights = learnWeightsOneLayer(X_train, y_train)
# print the truth table
print("X1 | X2 | Y = X1 OR X2")
print("-----------------------")
for x1 in (0,1):
for x2 in (0,1):
print(x1," | ",x2, " | ", predictOneLayer([x1,x2], ORweights))

And this is the output:

*** Simulation of OR gate using neural network ***
** converged after iterations: 149
X1 | X2 | Y = X1 OR X2
-----------------------
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 1

Well, it works!

Let’s explore the code piece by piece.

Training data and bias

# input and output training sizes
n_inputs = X.shape[1]
n_examples = X.shape[0]

The input dataset is a numpy matrix. Each row is a single “training example”. Each column corresponds to one of our input nodes.
In our OR example above, we have 2 input nodes (X1 and X2) and 4 training examples.
Similarly the number of nodes in the output layer (the “labels” of our dataset) is determined by the number of classes, in this case we have only one output node for two classes: 0 and 1. That is the number of column for the y matrix while the rows are determined again by the number of training examples.

# add bias
bias = np.ones(n_examples)
X = np.c_[bias, X] # add bias column to X

This will add a “bias” neuron, i.e. an additional column of weights. It’s always a good idea to add a bias to a neural network.
The bias node in a neural network is a node that is always ‘on’. That is, its value is set to 1. It is analogous to the intercept in a regression model, and serves the same function: allows you to shift the activation function to the left or right, which may be critical for successful learning.

For example, assume that you want a neuron to fire y≈1 when all the inputs are x≈0. If there is no bias no matter what weights W you have, given the equation y = σ(WX) the neuron will always fire y≈0.5.

Initial weights

# initialise weights randomly with mean 0
w = 2 * np.random.random((n_inputs+1, n_outputs)) - 1

This creates our weight matrix for the neural network.
And it initialises the weights to a random number, using the numpy function random() which returns a random float number or array.
The size of the output is controlled by the parameter. In our example it’s a tuple (n_inputs+1, n_outputs) which means it will return a matrix.

We can choose the dimensionality of the hidden layer.
The more nodes we put into the hidden layer the more complex functions the neural network will be able to model but higher dimensionality means more computation is required to learn the network weights.

Since we only have one layer (the input values) we only need a matrix of weights with one column, to connect them. We keep its dimension here minimal: (3,1) because we have two inputs plus the bias and one output.

Results from the random() function are from the “continuous uniform” distribution over the half-open interval [0.0, 1.0). It’s a best practice to sample the weights from a uniform distribution with mean zero: to sample Unif[-1,1) we just have to multiply the output of random() by 1-(-1)=2 and add -1.

Iterate

# initialisation
converged = False
i = 0 # iteration counter
# main loop until converged or max iterations reached
while not converged and i <= max_iterations:

This is the loop to train the network weights.
It iterates over the training code until we have reached the wished error (converged) or until we have reached the maximum number of iterations.
Both the max_iterations and the error tolerance are optional inputs of this function.

This is the forward propagation prediction step: the network predicts the output given the input and the current weights by matrix multiplication and applying the activation function.

y = σ(Xw)

Note that X contains all training examples (input) and we process all of them at the same time in this implementation. This is known as “full batch” training.

The X matrix is multiplied with the current weights matrix w, using the numpy function dot() and its result is then going through the sigmoid squashing function, our activation function.
The result will be also a matrix, in this case of 4 rows (the four training examples) and one column (the predicted output):

(4 x 3) dot (3 x 1) = (4 x 1) # 4 examples, 3 input and 1 layer of weights

Each output corresponds with the network’s prediction for a given training input. This would work independently on the number of training examples, i.e. if I train the network with more examples, the matrix will be bigger but no changes in code are required.

Loss function (the error)

# Compute the error
error = y - pred_y

We can now compare how well it did by subtracting the true answer (y from the training dataset) from the prediction (pred_y).
The error is an array: the difference between actual and prediction for each training example.
Again: this is a super-simple loss function just for illustration. For more complex examples you will use e.g. a quadratic function like we have seen in regression algorithms.

Back-propagation

One popular approach is an algorithm called back-propagation that has similarities to the gradient descent algorithm we looked at earlier. The back-propagation algorithm (Rumelhart et al., 1986), often simply called backprop, allows the error information to ﬂow backwards through the network, in order to compute the gradient. Once we have the gradient of this error as a function of the neuron’s weights, we can propagate these output errors backward to infer errors for the other layers and adjust its weights in the direction that most decreases the error.

The term back-propagation is often misunderstood as meaning the whole learning algorithm for neural networks but actually, it refers only to the method for computing the gradient, while another algorithm, such as stochastic or batch gradient descent, is used to perform learning using this gradient.

We use the gradient to find the minimum; this means we first need to compute the gradient. The gradient is just made up of the derivatives of all the inputs concatenated in a vector (remember that a derivative of a function measures the sensitivity to change):

for all i,j,l

where w are the weights and e(w) is the error on the example X,y.

Back-propagation is the key algorithm that makes training deep models computationally tractable, basically is just a clever “trick” to efficiently calculate the gradients starting from the output. I am not going into details here but it’s an old trick and you can find the generic implementation under the name “reverse-mode differentiation”:

Classical methods are slow at computing the partial derivatives of a function with respect to many inputs, as is needed for gradient-based optimization algorithms. Automatic differentiation solves all of these problems […]
In reverse accumulation, one first fixes the dependent variable to be differentiated and computes the derivative with respect to each sub-expression recursively. In a pen-and-paper calculation, one can perform the equivalent by repeatedly substituting the derivative of the outer functions in the chain rule:

The chain rule of calculus is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. The chain rule very simply states that the right thing to do is to simply multiply the gradients together to chain them.
Reverse-mode differentiation is an algorithm that computes the chain rule, with a speciﬁc order of operations that is highly eﬃcient.

I won’t go into more details here but there are many excellent explanations in the web, such as the Andrew Ng or Yaser Abu-Mostafa courses.

Bottom line: back-propagation tracks how every neuron affects one output.

# backpropagation: multiply how much we missed by the
# slope of the sigmoid at the values in pred_y
delta = error * sigmoidDeriv(pred_y)

The first thing we need is a function that computes the gradient of the sigmoid function we created earlier. This is sigmoidDeriv(). The derivative will give back the slope of the predictions.

The variable error is the total error across the whole data, calculated by running the data plus current parameters through the “network” (the forward-propagate function) and comparing the output to the true labels. This is the part we discussed earlier as the cost function.

When we multiply the “slopes” by the error, we are reducing the error of high confidence predictions.
The multiplication is essentially answering the question “how can I adjust my parameters to reduce the error the next time I run through the network”?
It does this by computing the contributions at each layer to the total error and adjusting appropriately by coming up with a gradient matrix (or, how much to change each parameter and in what direction).

Update the weights

# update weights
w += np.dot(X.T, delta)

We are now ready to update our network! This line computes the weight updates for each weight for each training example, sums them, and updates the weights, all in a simple line. A small error and a small slope means a VERY small update. However, because we’re using a “full batch” configuration, we’re doing the above step on all four training examples.

Check if converged

# did we converge?
maxError = max(error) # biggest error among training examples
if abs(maxError) < tolerance:
converged = True # yes !
print ("** converged after iterations: ",i)
i += 1 # next iteration (to check if max is reached)

The last step is to check if we have reached the end of our optimisation, either because the error is smaller than our tolerance or because we reached the maximum number of iterations.

In our example above, the convergence (i.e. the error was less than 0.1) was reached after 149 iterations.

We can also print out the weights and we can see that they are different than the hard-coded ones (but follow the same pattern: the bias is negative, the inputs are similar and positive).

Weights for ANN modelling OR function:
[[-1.63295267]
[ 3.84814604]
[ 3.83626687]]

There are several combinations of weights that can bring to the same result and there are other equivalent solutions that gradient descent could also ﬁnd.
The convergence point of gradient descent depends on the initial values of the parameters: the initial set of weights (that is random chosen), the gradient descent meta-parameters (not used in this simple example) and in minor part also on the error tolerance defined.

It is also possible to demonstrate that multiplying the resulting weights for a constant C will keep the ANN to output the same result but this is out of the scope for this post. You can try it yourself if you are curious.

An OR gate with 3 inputs

Let´s see if the ANN modelling an OR gate can work with 3 inputs:

print("*** Simulation of OR gate with 3 inputs using ANN ***")
# input dataset
X_train = np.array([ [0,0,0], [0,0,1], [0,1,0], [1,0,0] ])
# output dataset
y_train = np.array([[0,1,1,1]]).T
ORweights = learnWeightsOneLayer(X_train, y_train)
print("X1 | X2 | X3 | Y = X1 OR X2 OR X3")
print("----------------------------------")
inputs = ((x1,x2,x3) for x1 in (0,1) for x2 in (0,1) for x3 in (0,1))
for x1,x2,x3 in inputs:
print (x1," | ",x2, " | ", x3, " | ",
predictOneLayer([x1,x2,x3], ORweights))

Note that the training set X_train is now a matrix with 3 columns for the 3 inputs and that you do not need to list ALL combinations as training examples.
For an OR function with 3 operators there would be a total of 2^3 = 8 possible combinations but here we train the network with only 4 examples.
Choosing the right training dataset (large and representative enough) is a science for itself but I will not dig more here. You can refer to the previous posts about overfitting and cross-validations.

This is the output:

*** Simulation of OR gate with 3 inputs using ANN ***
** converged after iterations: 169
X1 | X2 | X3 | Y = X1 OR X2 OR X3
----------------------------------
0 | 0 | 0 | 0
0 | 0 | 1 | 1
0 | 1 | 0 | 1
0 | 1 | 1 | 1
1 | 0 | 0 | 1
1 | 0 | 1 | 1
1 | 1 | 0 | 1
1 | 1 | 1 | 1

It works perfectly for all 8 combinations and convergence has been reached after 169 iterations, pretty fast.

Non-linear example: XOR

The real proof of truth is to model a non-linear function, such as the XOR function, a sort of Hello World for Neural Networks.
We have seen in the previous post that a simple perceptron (one neuron with only one single layer of input) cannot model it. But it’s possible by combining together perceptrons that model each one different functions, such as OR, AND and NOT:

x1 XOR x2 = (x1 AND NOT x2) OR (NOT x1 AND x2)

The resulting multi-layer network of perceptrons is basically an artificial neural network with one additional layer – often called the hidden layer:

To model such a network we need to add one layer of weights, i.e. one column to the weights matrix and to compute the hidden layer weights in the training algorithm.

XOR with two inputs

Note: To keep the example more clear I will keep the weights of the two layers separately.

def learnWeightsTwoLayers(X, y, tolerance = 0.1, max_iterations = 1000):
"""
Learn the weights for an artificial neural network with two layers,
given a training set of input and output values.
Uses a backpropagation algorithm.
Arguments:
X: input data points without bias, matrix
y: output data points, array
tolerance: minimum error to stop the training, float. Optional (default=0.1)
max_iterations: maximum number of iterations performed by the algorithm.
Optional. Default=1000
Returns:
weights for input layer, array
weights for hidden layer, array
"""
# input and output training sizes
n_inputs = X.shape[1]
n_outputs = y.shape[1]
n_examples = X.shape[0]
# add bias
bias = np.ones(n_examples)
X = np.c_[bias, X] # add bias column to X
# initialize weights randomly with mean 0
w1 = 2 * np.random.random((n_inputs+1, n_inputs)) - 1 # input layer
w2 = 2 * np.random.random((n_inputs+1, n_outputs)) - 1 # hidden layer
# initialization
converged = False
i = 0 # iteration counter
# main loop until converged or max iterations reached
while not converged and i <= max_iterations:
# note: all training examples together (batch)
# forward propagation
hiddenLayer = sigmoid(np.dot(X, w1))
hiddenLayer = np.c_[bias, hiddenLayer] # add bias
pred_y = sigmoid(np.dot(hiddenLayer, w2)) # predicted output
# Compute the error
error_y = y - pred_y
# backpropagation: multiply how much we missed by the
# slope of the sigmoid at the values in y_pred
delta_y = error_y * sigmoidDeriv(pred_y)
# repeat for hidden layer: error and delta
# how much did each value of hidden layer contribute
# to the final error (according to the weights)?
error_l1 = delta_y.dot(w2.T)
delta_l1 = error_l1 * sigmoidDeriv(hiddenLayer)
# the first column refers to the bias: remove it
delta_l1 = np.delete(delta_l1, 0, axis=1)
# update weights
w2 += np.dot(hiddenLayer.T, delta_y)
w1 += np.dot(X.T, delta_l1)
# did we converge?
maxError = max(abs(error_y))
if abs(maxError) < tolerance:
converged = True # yes !
print ("** converged after iterations: ",i)
i += 1 # next iteration (to check if max is reached)
if not converged:
print("** Not converged!")
return w1, w2

The mainly difference is that we have now two arrays of weights w1 and w2, one storing the weights for the input layer and one for the hidden layer.
Then accordingly the code is changed to take in considerations both sets of weights.

Let´s see it in action:

def predictTwoLayers(X, w1, w2):
"""
Compute the binary output of an Artificial Neural Network with two layers,
given its neuron weights and an input dataset
Arguments:
X: input data points without bias, list
w1: the input-layer weights, array
w2: the hidden-layer weights, array
Returns:
integer (0 or 1)
"""
bias = np.ones(1)
X = np.append(bias, X, 0) # add bias column to X
# forward propagation
hiddenLayer = sigmoid(np.dot(X, w1))
hiddenLayer = np.append(bias, hiddenLayer, 0) # add bias
y = np.dot(hiddenLayer, w2)
return sigmoidActivation(y)
print("*** Simulation of XOR gate using ANN ***")
print("***")
# input dataset
X_train = np.array([ [0,0], [0,1], [1,0], [1,1] ])
# output dataset
y_train = np.array([[0,1,1,0]]).T
# note that we use tolerance = 0.4
XORweights1, XORweights2 = learnWeightsTwoLayers(X_train, y_train, 0.4)
print("X1 | X2 | Y = X1 XOR X2")
print("------------------------")
for x1 in (0,1):
for x2 in (0,1):
print(x1," | ",x2, " | ", predictTwoLayers(np.array([x1,x2]),
XORweights1, XORweights2))

I used a higher tolerance, since the output of a XOR gate is binary so we don’t need that much approximation. This will make the training algorithm faster.

And here is the output:

*** Simulation of XOR gate using ANN ***
***
** converged after iterations: 967
X1 | X2 | Y = X1 XOR X2
------------------------
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

It works but note that it takes now 967 iterations to converge.
Adding layers to a network will normally make it slower and harder to train. Normally deeper neural network will be more optimised, e.g. will use better gradient descent algorithms, regularisations and so on.

XOR with 3 inputs

And it works with 3 inputs too:

print("*** Simulation of XOR gate using ANN, now with 3 inputs ***")
print("***")
# input dataset
X_train = np.array([ [0,0,0], [0,1,0], [1,0,1], [0,1,1], [1,0,0], [1,1,1]])
# output dataset
y_train = np.array([[0,1,1,1,1,0]]).T
XORweights1, XORweights2 = learnWeightsTwoLayers(X_train, y_train, 0.4)
print("X1 | X2 | X3 | Y = X1 XOR X2 XOR X3")
print("---------------------------------------")
inputs = ((x1,x2,x3) for x1 in (0,1) for x2 in (0,1) for x3 in (0,1))
for x1,x2,x3 in inputs:
print (x1," | ",x2, " | ", x3, " | ", predictTwoLayers([x1,x2,x3],
XORweights1, XORweights2))

The perceptron can model linear functions but cannot model non-linear functions, unless you combine perceptrons together. Nevertheless the perceptron convergence procedure works by ensuring that every time the weights change, they get closer to every “generously feasible” set of weights. This type of guarantee cannot be extended to more complex networks in which the average of two good solutions may be a bad solution (non-convex problems). Therefore the perceptron learning procedure cannot be generalised to hidden layers.

Instead of showing the weights get closer to a good set of weights, the Artificial Neural Networks learn the weights so that the actual output values get closer the target values. The aim of learning is to minimise the error summed over all training cases.
Sigmoid neurons give a real-valued output that is a smooth and bounded function of their total input; they have nice derivatives which make learning easy.

Such an Artificial Neural Network can represent non-linear functions.
The Universal Approximation theorem states that a simple feed-forward neural network with a single hidden layer is enough to represent a wide variety of continuous functions.

This is a bare example. Deep neural network have dozens of layers and use many additional optimisations to improve the learning: different activation functions, regularisations to prevent overfitting ((L1 and L2 regularization, dropout), better choice of cost function (squared difference, cross-entropy), tuned weight initialisation and choice of hyper-parameters (learning rate, number of layers and neurons, …), sophisticated gradient descent algorithms. But the principle is the same.

ANNs can represent efficiently almost any kind of function and studies are constantly progressing.

We have seen in the hello world of nlp, how to count words in a text or from a file but what happen if the file is very large? Sure, we can read it line by line and avoiding to overflow the memory of our machine but it can take long time.

There are better ways. We can distribute the processing on multiple machines, even distributed machines.

In this tutorial, we will write code that find out the most common words in the Complete Works of William Shakespeare. This could also be scaled to larger applications, such as finding the most common words in Wikipedia.

This tutorial will use Apache Spark, a framework for large-scale data processing, within a notebook. Many traditional frameworks were designed to be run on a single computer. However, many datasets today are too large to be stored on a single computer, and even when a dataset can be stored on one computer (such as the datasets in this tutorial), the dataset can often be processed much more quickly using multiple computers. Spark has efficient implementations of a number of transformations and actions that can be composed together to perform data processing and analysis. Spark excels at distributing these operations across a cluster while abstracting away many of the underlying implementation details. Spark has been designed with a focus on scalability and efficiency. With Spark you can begin developing your solution on your laptop, using a small dataset, and then use that same code to process terabytes or even petabytes across a distributed cluster.

The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data.
We continue from the word counting example and in this notebook, we will write code that calculates the most common words in the Complete Works of William Shakespeare retrieved from Project Gutenberg. This could also be scaled to larger applications, such as finding the most common words in Wikipedia.

During this example we will cover:
Part 0: What is Apache Spark
Part 1: Creating a base DataFrame and performing operations
Part 2: Counting with Spark SQL and DataFrames
Part 3: Finding unique words and a mean value
Part 4: Apply word count to a file

Note that for reference, you can look up the details of the relevant methods in Spark’s Python API.

Part 0: Spark

An introduction to using Apache Spark with the PySpark SQL API running in a notebook

What is Spark

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
Spark it is a fast and general engine for large-scale data processing. Databricks is a company founded by the creators of Apache Spark, that aims to help clients with cloud-based big data processing using Spark.

Traditional analysis tools like R and Python Pandas run on a single machine but data are growing faster than computation speed.
The Opportunity: Cloud computing is a game-changer.
It provides access to low-cost computing and storage.
Distributing data over cluster of machines means lots of hard drives, lots of CPUs but also lots of memory !
Storage is getting cheaper but stalling CPU speeds are the bottlenecks.
A new Opportunity: Keep more data in-memory!
In-memory can make a big difference, up to 100x faster.
Spark is a new distributed execution engine that leverages the in-memory paradigm.

The challenge with cloud computing has always been programming the resources.
Spark is developed in Scala and – besides Scala itself – supports other languages such as Java and Python.
We are using for this example the Python programming interface to Spark (pySpark).
pySpark provides an easy-to-use programming abstraction and parallel runtime: “Here’s an operation, run it on all of the data”.

Spark Context

In Spark, communication occurs between a driver and executors. The driver has Spark jobs that it needs to run and these jobs are split into tasks that are submitted to the executors for completion. Executor programs run on cluster nodes or in local threads. The results from these tasks are delivered back to the driver.
Where does code run?
– Locally, in the driver
– Distributed at the executors (Executors run in parallel and have much more memory)
– Both at the driver and the executors
Problems with cheap hardware are: failures, network speeds versus shared memory, much more latency, network slower than storage, uneven performance.
How do we deal with machine failures? We launch another task.
How do we deal with slow tasks? We launch another task.

When running Spark, you start a new Spark application by creating a SparkContext.
SparkContext tells Spark how and where to access a cluster.
The program next creates a SQLContext object which is used to create and manage the DataFrames.
When the `SparkContext` is created, it asks the master for some cores to use to do work. The master sets these cores aside just for you; they won’t be used for other applications.
When using Databricks, both a `SparkContext` and a `SQLContext` are created for you automatically. `sc` is your `SparkContext`, and `sqlContext` is your `SQLContext`.

# Display the type of the Spark sqlContext
type(sqlContext)
Out[1]: pyspark.sql.context.HiveContext

Note that the type is `HiveContext`. This means we’re working with a version of Spark that has Hive support. Compiling Spark with Hive support is a good idea, even if you don’t have a Hive metastore. As the Spark Programming Guide states, a `HiveContext` “provides a superset of the functionality provided by the basic `SQLContext`. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs [user-defined functions], and the ability to read data from Hive tables.”

Part 1: Creating a base DataFrame and performing operations

DataFrames (DF) are the key concept, the primary abstraction in Spark.
Similar to Python Pandas dataframe, they are immutable once constructed and enable operations on collection of elements in parallel.
You construct DataFrames by parallelizing existing Python collections (lists), by transforming an existing Spark or pandas DFs or from files in HDFS or any other storage system.
DataFrames support two types of operations: transformations and actions.
Transformations are lazy (not computed immediately). A transformed DF is executed when action runs on it.
A transformation to a DataFrame is for example select. Actions to a DataFrame are for example show and count.

Spark Program Lifecycle

Create DataFrames from external data or create DataFrame from a collection in driver program

Lazily transform them into new DataFrames

cache() some DataFrames for reuse (optional)

Perform actions to execute parallel computation and produce results

Most of Python code runs in driver, except for code passed to transformations. Transformations run at executors. Actions can run both at executors and driver.

In this part, we will explore creating a base DataFrame with `sqlContext.createDataFrame` and using DataFrame operations to count words.

Create a DataFrame

We’ll start by generating a base DataFrame using a Python list of tuples and the `sqlContext.createDataFrame` method. Then we’ll print out the type and schema of the DataFrame. The Python API has several examples for using the ‘createDataFrame` method.

As you can see, DataFrame is a class of pyspark.sql

Create a new DataFrame from an existing one

This use lazy evaluation: results are not computed right away – Spark remembers the set of transformations applied to the base DataFrame.
Think of this as a recipe for creating result.

Spark Actions like show(), collect() or count() then cause Spark to execute the recipe to transform the source. It is the mechanism for getting results out of Spark.

Length of each word

You can create a new DataFrame from our base DF `wordsDF` by calling the `select` DataFrame function and pass in the appropriate recipe: we can use the SQL `length` function to find the number of characters in each word.

The `length` function is found in the `pyspark.sql.functions` module.

Now, let’s count the number of times a particular word appears in the ‘word’ column. There are multiple ways to perform the counting, but some are much less efficient than others.

A naive approach would be to call `collect` on all of the elements and count them in the driver program. While this approach could work for small datasets, we want an approach that will work for any size dataset including terabyte- or petabyte-sized datasets. In addition, performing all of the work in the driver program is slower than performing it in parallel in the workers. For these reasons, we will use data parallel operations.

Using ‘groupBy’ and ‘count’

Using DataFrames, we can preform aggregations by grouping the data using the `groupBy` function on the DataFrame. Using `groupBy` returns a `GroupedData` object and we can use the functions available for `GroupedData` to aggregate the groups. For example, we can call `avg` or `count` on a `GroupedData` object to obtain the average of the values in the groups or the number of occurrences in the groups, respectively.

To find the counts of words, we group by the words and then use the `count` function to find the number of times that words occur.

You can see that without using alias(), the column gets simply the function name (e.g., “count” in this case).
Let’s also add some unit testing.

# Load in the testing code
# If incorrect it will report back '1 test failed' for each failed test
from databricks_test_helper import Test
# TEST groupBy and count
Test.assertEquals(wordCountsDF.collect(), [('tutorial', 1), ('spark', 2),
('look', 2), ('python', 1)], 'incorrect counts for wordCountsDF')
1 test passed.

Part 3: Finding unique words and a mean value

Unique words

Calculate the number of unique words in `wordsDF`.

from spark_notebook_helpers import printDataFrames
# This function returns all the DataFrames in the notebook and their
# corresponding column names.
printDataFrames(True)
uniqueWordsCount = wordCountsDF.count()
print uniqueWordsCount
Out[2]: 4
# TEST Unique words
Test.assertEquals(uniqueWordsCount, 4, 'incorrect count of unique words')
1 test passed.

Means of groups using DataFrames

Find the mean number of occurrences of words in `wordCountsDF`.

We can use the `mean` GroupedData method to accomplish this. Note that when you use `groupBy` you don’t need to pass in any columns. A call without columns just prepares the DataFrame so that aggregation functions like `mean` can be applied.

averageCount = (wordCountsDF
.groupBy().mean('count')).collect()[0][0]
print averageCount
Out[3]: 1.5
# TEST Means of groups using DataFrames
Test.assertEquals(round(averageCount, 2), 1.5, 'incorrect value of
averageCount')
1 test passed.

Part 4: Apply word count to a file

In this section we will finish developing our word count application. We’ll have to build the `wordCount` function, deal with real world problems like capitalization and punctuation, load in our data source, and compute the word count on the new data.

The ‘wordCount’ function

First, we define a function for word counting.
This function takes in a DataFrame that is a list of words like `wordsDF` and returns a DataFrame that has all of the words and their associated counts.

# the words count function
def wordCount(wordListDF):
"""Creates a DataFrame with word counts.
Args:
wordListDF (DataFrame of str): A DataFrame consisting of one string
column called 'word'.
Returns:
DataFrame of (str, int): A DataFrame containing 'word' and 'count'
columns.
"""
return (wordListDF
.groupBy('word').count())
# apply the new function to the words DataFrame, it should get the same
# result
wordCount(wordsDF).show()
+--------+-----+
| word|count|
+--------+-----+
|tutorial| 1|
| spark| 2|
| look| 2|
| python| 1|
+--------+-----+

Capitalization and punctuation

Real world files are more complicated than the data we have been using until now. Some of the issues we have to address are:

+ Words should be counted independent of their capitialization (e.g., Spark and spark should be counted as the same word).
+ All punctuation should be removed.
+ Any leading or trailing spaces on a line should be removed.

We now define the function `removePunctuation` that converts all text to lower case, removes any punctuation, and removes leading and trailing spaces. We use the Python regexp_replace module to remove any text that is not a letter, number, or space and the `trim` and `lower` functions found in pyspark.sql.functions.

from pyspark.sql.functions import regexp_replace, trim, col, lower
def removePunctuation(column):
"""Removes punctuation, changes to lower case, and strips leading and
trailing spaces.
Note:
Only spaces, letters, and numbers should be retained. Other characters
should should be eliminated (e.g. it's becomes its). Leading and
trailing spaces should be removed after punctuation is removed.
Args:
column (Column): A Column containing a sentence.
Returns:
Column: A Column named 'sentence' with clean-up operations applied.
"""
return trim(lower(regexp_replace(column, '([^\s\w_]|_)+', '')))
.alias('sentence')
sentenceDF = sqlContext.createDataFrame([('Hi, you',),
(' Look! No under_score!',),
(' * Remove punctuation then spaces * ',)],
['sentence'])
# display first original sentence
sentenceDF.show(truncate=False)
# then sentence with punctuation removed
(sentenceDF
.select(removePunctuation(col('sentence')))
.show(truncate=False))
+------------------------------------------+
|sentence |
+------------------------------------------+
|Hi, you |
| Look! No under_score! |
| * Remove punctuation then spaces * |
+------------------------------------------+
+------------------------------+
|sentence |
+------------------------------+
|hi you |
|look no underscore |
|remove punctuation then spaces|
+------------------------------+
# TEST Capitalization and punctuation
testPunctDF = sqlContext.createDataFrame([(" The Elephant's 4 cats. ",)])
Test.assertEquals( testPunctDF.select( removePunctuation(col('_1')))
.first()[0], 'the elephants 4 cats',
'incorrect definition for removePunctuation function')
1 test passed.

Load a text file

For the next part, we will use the Complete Works of William Shakespeare from Project Gutenberg. To convert a text file into a DataFrame, we use the `sqlContext.read.text()` method. We also apply the recently defined `removePunctuation()` function using a `select()` transformation to strip out the punctuation and change all text to lower case. Since the file is large we use `show(15)`, so that we only print 15 lines.

fileName = "dbfs:/datasets/shakespeare.txt"
shakespeareDF = sqlContext.read.text(fileName)
.select(removePunctuation(col('value')))
shakespeareDF.show(15, truncate=False)
+-------------------------------------------------+
|sentence |
+-------------------------------------------------+
|1609 |
| |
|the sonnets |
| |
|by william shakespeare |
| |
| |
| |
|1 |
|from fairest creatures we desire increase |
|that thereby beautys rose might never die |
|but as the riper should by time decease |
|his tender heir might bear his memory |
|but thou contracted to thine own bright eyes |
|feedst thy lights flame with selfsubstantial fuel|
+-------------------------------------------------+
only showing top 15 rows

Words from lines

Before we can use the `wordcount()` function, we have to address two issues with the format of the DataFrame:
+ The first issue is that that we need to split each line by its spaces.
+ The second issue is we need to filter out empty lines or words.

We apply a transformation that will split each ‘sentence’ in the DataFrame by its spaces, and then transform from a DataFrame that contains lists of words into a DataFrame with each word in its own row. To accomplish these two tasks you can use the `split` and `explode` functions found in pyspark.sql.functions.

Once we have a DataFrame with one word per row we can apply the DataFrame operation `where` to remove the rows that contain ”.

from pyspark.sql.functions import split, explode
shakeWordsSplitDF = (shakespeareDF
.select(split(shakespeareDF.sentence, '\s+').alias('split')))
shakeWordsSingleDF = (shakeWordsSplitDF
.select(explode(shakeWordsSplitDF.split).alias('word')))
shakeWordsDF = shakeWordsSingleDF.where(shakeWordsSingleDF.word <> '')
shakeWordsDF.show()
shakeWordsDFCount = shakeWordsDF.count()
print shakeWordsDFCount
+-----------+
| word |
+-----------+
| 1609 |
| the |
| sonnets |
| by |
| william |
|shakespeare|
| 1 |
| from |
| fairest |
| creatures |
| we |
| desire |
| increase |
| that |
| thereby |
| beautys |
| rose |
| might |
| never |
| die |
+-----------+
only showing top 20 rows
882996
# TEST Remove empty elements
Test.assertEquals(shakeWordsDF.count(), 882996,
'incorrect value for shakeWordCount')
Test.assertEquals(shakeWordsDF.columns, ['word'],
"shakeWordsDF should only contain the Column 'word'")
1 test passed.
1 test passed.

Count the words

We now have a DataFrame that is only words. Next, let’s apply the `wordCount()` function to produce a list of word counts. We can view the first 20 words by using the `show()` action; however, we’d like to see the words in descending order of count, so we’ll need to apply the `orderBy` DataFrame method to first sort the DataFrame that is returned from `wordCount()`.

You’ll notice that many of the words are common English words. These are called stopwords. We will see how to eliminate them from the results.

Stopwords are common (English) words that do not contribute much to the content or meaning of a document (e.g., “the”, “a”, “is”, “to”, etc.). Stopwords add noise to bag-of-words comparisons, so they are usually excluded.
Using the included file “stopwords.txt”, implement `tokenize`, an improved tokenizer that does not emit stopwords.

In Python, we can test membership in a set as follows:

“`
my_set = set([‘a’, ‘b’, ‘c’])
‘a’ in my_set # returns True
‘d’ in my_set # returns False
‘a’ not in my_set # returns False
“`
Within `tokenize()`, first tokenize the string using `simpleTokenize()`. Then, remove stopwords. To remove stop words, consider using a loop, a Python list comprehension, or the built-in Python filter() function.

import os
data_dir = 'datasets'
STOPWORDS_PATH = 'stopwords.txt'
stopfile = os.path.join(data_dir, STOPWORDS_PATH)
stopwords = set(sc.textFile(stopfile).collect())
print 'These are the stopwords: %s' % stopwords
type(stopwords)
Out[4]: set

We first test the approach on a simple string.

import re
quickbrownfox = 'A quick brown fox jumps over the lazy dog.'
split_regex = r'\W+'
def simpleTokenise(string):
""" A simple implementation of input string tokenization
Args:
string (str): input string
Returns:
list: a list of tokens
"""
return filter(None, re.split(split_regex, string.lower()))
print simpleTokenise(quickbrownfox)
# Should give ['a', 'quick', 'brown', ... ]
Out[5]:['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
def removeStopWords(listOfTokens):
return [token for token in listOfTokens if token not in stopwords]
def tokeniseAndRemoveStopwords(string):
""" An implementation of input string tokenization that excludes stopwords
Args:
string (str): input string
Returns:
list: a list of tokens without stopwords
"""
tmpLista = simpleTokenise(string)
return removeStopWords(tmpLista)
print tokeniseAndRemoveStopwords(quickbrownfox)
# Should give ['quick', 'brown', ... ]
Out[6]: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

We use the User Defined Function (UDF) to remove the stopwords.