Recalling the perceptron
In the previous post we have seen how an artificial neuron works to classify simple functions and we have also seen its limitations when the space function to be classified is not linear, e.g. it fails to compute correctly the XOR function.
We have also seen that the way to overcome this limitation is to combine more neurons together in a network but one thing we did not yet seen was how to compute the “weights” of the network (they were hard-coded in the previous example).
As a reminder, here is the perceptron able to compute an OR function:
The input values are multiplied with specific weights and combined together into an activation function which outputs the OR function.
The “artificial neural network” (ANN) is really just this matrix of weights. Here the matrix has only one column (the OR function is very simple) but we may have “layers” and layers of neurons and their weights.
The values calculated are just transient values based on the input dataset. We don’t save them, they are forgotten.
All of the learning is stored in the weight matrix.
Like similar supervised models, the ANNs can be trained to learn the weights using some labeled training data (the “supervised” data) and once the weights matrix is learnt, we can use it to compute the function that we trained for.
Before seeing how the weights can be learned, we will slightly tune the activation function (the red circle in the figure above) in order to make the learning phase easier.
In the perceptron example, this activation function was a simple threshold one, outputting 0 if the input was below a threshold and 1 otherwise. Instead, we will use a sigmoid function which has the advantage that small changes in the weights and bias cause only a small change in their output. The sigmoid function – often called the logistic function – is defined by:
The sigmoid activation function is always positive, bounded between 0 and 1 and strictly increasing.
And this is its Python implementation:
import numpy as np def sigmoid(z): """ Compute the sigmoid function Argument: z: float number Returns: float """ return 1 / (1 + np.exp(-z))
The sigmoid function can be thought as a “squashing function”, because it takes the input and squashes it to be between zero and one: very negative values are squashed towards zero and positive values get squashed towards one.
For example we have:
sigmoid(-5) = 0.007; sigmoid(5) = 0.993
The output of a sigmoid neuron
The output of a sigmoid neuron is obviously not 0 or 1 but is a real number between 0 and 1. This can be useful, for example, if we want to use the output value to represent the intensity of the pixels in an image input to a neural network. But sometimes it is not, in fact to model an OR gate we need to have as output 0 or 1, as it was in the perceptron.
In practice we can use a convention to deal with this, for example, by deciding to interpret any output of at least 0.5 as indicating a “1”, and any output less than 0.5 as indicating a “0”.
We can quickly verify that the original perceptron simulating the OR function is still working with this new activation function:
def sigmoidActivation(z): """ Compute the activation function for boolean operators using sigmoid Since a sigmoid function is always between 0 and 1, we just cut off at 0.5 Arguments: z: input data points including bias, array Returns: integer (0 or 1) """ if sigmoid(z) > 0.5: return 1 else: return 0 def predictOneLayer(X, weights): """ Compute the binary output of an Artificial Neural Network with single layer, given its neuron weights and an input dataset Arguments: X: input data points without bias, list weights: the single-layer weights, array Returns: integer (0 or 1) """ X =  + X # add bias to the input list # forward propagation y = np.dot(X, weights).sum() return sigmoidActivation(y) def forwardOR(operands): """ Simulate an OR function using a simple forward neural network (single layer) Arguments: operands: input data points without bias, list Returns: integer (0 or 1) """ m = len(operands) # number of training examples ORweights = [-0.5] + *m # hard-coded weights return predictOneLayer(operands, ORweights) print("*** Simulation of OR using one forward neuron ***") print("X1 | X2 | Y = X1 OR X2") print("-----------------------") for x1 in (0,1): for x2 in (0,1): print(x1," | ",x2, " | ", forwardOR([x1,x2]))
And here is its output:
*** Simulation of OR using one forward neuron *** X1 | X2 | Y = X1 OR X2 ----------------------- 0 | 0 | 0 0 | 1 | 1 1 | 0 | 1 1 | 1 | 1
The OR function still works correctly using sigmoid neurons.
As you notice, the weights in this OR example above are hard-coded because we can get them manually very quick since the example is pretty straightforward. But this is obviously not scalable as the complexity of a function is growing.
Therefore we will see now how we can find the appropriate weights to model a simple function like the OR.
Training the network
What we need is an algorithm that will find the appropriate weights (remember: they ARE the artificial neural network) programmatically for us so that the output from the network approximates the expected y(x) for all training inputs x.
To quantify how well we’re achieving this goal we will define a cost function, sometimes referred to as a loss or objective function.
We have already seen examples of cost functions in the regression models and it can be a complex function, depending on the domain – a common choice is the categorical cross-entropy loss – but for this simple example we just stick to the difference between the network output and the training output:
error = actual output(label) – predicted output
Learning the weights
The output of the ANN is:
output = h(W * X + w0)
where X is the input, w0 is the bias, W are the weights and h is the non-linear activation function.
Learning the parameters for our network means finding weights that minimise the error on our training data. By finding parameters that minimise the error we maximise the likelihood of our training data.
Designing and training a neural network is not much diﬀerent from training any other machine learning model with gradient descent.
The gradient descent works by using the derivative function. The derivatives of the sigmoid with respect to the inputs and the weights are very simple:
therefore its Python implementation is clear-cut:
def sigmoidDeriv(s): """ Compute the derivative of a sigmoid function Argument: s: float number Returns: float """ return s * (1-s)
Now we have everything to explore the learning algorithm for artificial neural network.
The learning algorithm
The general idea is to forward-feed the input dataset to an initial set of weights to compute the output and the error, then back-propagate the error using the gradient (this is the “core” part!) and update the weights until the error is acceptable:
0. start with X and y (input and output training data) 1. initialise weights Wij randomly 2. loop until error < tolerance or max iterations: 3. for each training example Xi: 4. forward: out = theta(Wij * Xi) # theta is the activation function 5. compute the error 6. backward: compute gradient 7. update all weights Wij using the gradient 8. return the final weights
We will see it in details line by line line but here is the entire working Python implementation in the case of one single layer of neurons:
def learnWeightsOneLayer(X, y, tolerance = 0.1, max_iterations = 1000): """ Learn the weights for an artificial neural network with single layer, given a training set of input and output values. Uses a backpropagation algorithm. Arguments: X: input data points without bias, matrix y: output data points, array tolerance: minimum error to stop the training, float. Optional (default=0.1) max_iterations: maximum number of iterations performed by the algorithm. Optional. Default=1000 Returns: weights, array """ # input and output training sizes n_inputs = X.shape n_outputs = y.shape n_examples = X.shape # add bias bias = np.ones(n_examples) X = np.c_[bias, X] # add bias column to X # initialization converged = False i = 0 # iteration counter # initialize weights randomly with mean 0 w = 2 * np.random.random((n_inputs+1, n_outputs)) - 1 # main loop until converged or max iterations reached while not converged and i <= max_iterations: # note: all training examples are used together # forward propagation pred_y = sigmoid(np.dot(X, w)) # predicted output # how much did we miss? Compute the error error = y - pred_y # backpropagation: multiply how much we missed by the # slope of the sigmoid at the values in pred_y delta = error * sigmoidDeriv(pred_y) # update weights w += np.dot(X.T, delta) # did we converge? maxError = max(error) # biggest error among training examples if abs(maxError) < tolerance: converged = True # yes ! print ("** converged after iterations: ",i) i += 1 # no convergence yet: next iteration # end of loop if not converged: print("** Not converged!") return w
And this is how it works with a simple two-values input and an OR output:
# seed random numbers to make calculation # deterministic (just a good practice when testing / debugging) np.random.seed(1) print("*** Simulation of OR gate using neural network ***") # input dataset: 4 training examples X_train = np.array([ [0,0], [0,1], [1,0], [1,1] ]) # output dataset: the expected results for each training input y_train = np.array([[0,1,1,1]]).T # learn the weights through training! ORweights = learnWeightsOneLayer(X_train, y_train) # print the truth table print("X1 | X2 | Y = X1 OR X2") print("-----------------------") for x1 in (0,1): for x2 in (0,1): print(x1," | ",x2, " | ", predictOneLayer([x1,x2], ORweights))
And this is the output:
*** Simulation of OR gate using neural network *** ** converged after iterations: 149 X1 | X2 | Y = X1 OR X2 ----------------------- 0 | 0 | 0 0 | 1 | 1 1 | 0 | 1 1 | 1 | 1
Well, it works!
Let’s explore the code piece by piece.
Training data and bias
# input and output training sizes n_inputs = X.shape n_examples = X.shape
The input dataset is a numpy matrix. Each row is a single “training example”. Each column corresponds to one of our input nodes.
In our OR example above, we have 2 input nodes (X1 and X2) and 4 training examples.
Similarly the number of nodes in the output layer (the “labels” of our dataset) is determined by the number of classes, in this case we have only one output node for two classes: 0 and 1. That is the number of column for the y matrix while the rows are determined again by the number of training examples.
# add bias bias = np.ones(n_examples) X = np.c_[bias, X] # add bias column to X
This will add a “bias” neuron, i.e. an additional column of weights. It’s always a good idea to add a bias to a neural network.
The bias node in a neural network is a node that is always ‘on’. That is, its value is set to 1. It is analogous to the intercept in a regression model, and serves the same function: allows you to shift the activation function to the left or right, which may be critical for successful learning.
For example, assume that you want a neuron to fire y≈1 when all the inputs are x≈0. If there is no bias no matter what weights W you have, given the equation y = σ(WX) the neuron will always fire y≈0.5.
# initialise weights randomly with mean 0 w = 2 * np.random.random((n_inputs+1, n_outputs)) - 1
This creates our weight matrix for the neural network.
And it initialises the weights to a random number, using the numpy function random() which returns a random float number or array.
The size of the output is controlled by the parameter. In our example it’s a tuple (n_inputs+1, n_outputs) which means it will return a matrix.
We can choose the dimensionality of the hidden layer.
The more nodes we put into the hidden layer the more complex functions the neural network will be able to model but higher dimensionality means more computation is required to learn the network weights.
Since we only have one layer (the input values) we only need a matrix of weights with one column, to connect them. We keep its dimension here minimal: (3,1) because we have two inputs plus the bias and one output.
Results from the random() function are from the “continuous uniform” distribution over the half-open interval [0.0, 1.0). It’s a best practice to sample the weights from a uniform distribution with mean zero: to sample Unif[-1,1) we just have to multiply the output of random() by 1-(-1)=2 and add -1.
# initialisation converged = False i = 0 # iteration counter # main loop until converged or max iterations reached while not converged and i <= max_iterations:
This is the loop to train the network weights.
It iterates over the training code until we have reached the wished error (converged) or until we have reached the maximum number of iterations.
Both the max_iterations and the error tolerance are optional inputs of this function.
# forward propagation pred_y = sigmoid(np.dot(X, w)) # predicted output
This is the forward propagation prediction step: the network predicts the output given the input and the current weights by matrix multiplication and applying the activation function.
y = σ(Xw)
Note that X contains all training examples (input) and we process all of them at the same time in this implementation. This is known as “full batch” training.
The X matrix is multiplied with the current weights matrix w, using the numpy function dot() and its result is then going through the sigmoid squashing function, our activation function.
The result will be also a matrix, in this case of 4 rows (the four training examples) and one column (the predicted output):
(4 x 3) dot (3 x 1) = (4 x 1) # 4 examples, 3 input and 1 layer of weights
Each output corresponds with the network’s prediction for a given training input. This would work independently on the number of training examples, i.e. if I train the network with more examples, the matrix will be bigger but no changes in code are required.
Loss function (the error)
# Compute the error error = y - pred_y
We can now compare how well it did by subtracting the true answer (y from the training dataset) from the prediction (pred_y).
The error is an array: the difference between actual and prediction for each training example.
Again: this is a super-simple loss function just for illustration. For more complex examples you will use e.g. a quadratic function like we have seen in regression algorithms.
One popular approach is an algorithm called back-propagation that has similarities to the gradient descent algorithm we looked at earlier. The back-propagation algorithm (Rumelhart et al., 1986), often simply called backprop, allows the error information to ﬂow backwards through the network, in order to compute the gradient. Once we have the gradient of this error as a function of the neuron’s weights, we can propagate these output errors backward to infer errors for the other layers and adjust its weights in the direction that most decreases the error.
The term back-propagation is often misunderstood as meaning the whole learning algorithm for neural networks but actually, it refers only to the method for computing the gradient, while another algorithm, such as stochastic or batch gradient descent, is used to perform learning using this gradient.
We use the gradient to find the minimum; this means we first need to compute the gradient. The gradient is just made up of the derivatives of all the inputs concatenated in a vector (remember that a derivative of a function measures the sensitivity to change):
for all i,j,l
where w are the weights and e(w) is the error on the example X,y.
Back-propagation is the key algorithm that makes training deep models computationally tractable, basically is just a clever “trick” to efficiently calculate the gradients starting from the output. I am not going into details here but it’s an old trick and you can find the generic implementation under the name “reverse-mode differentiation”:
Classical methods are slow at computing the partial derivatives of a function with respect to many inputs, as is needed for gradient-based optimization algorithms. Automatic differentiation solves all of these problems […]
In reverse accumulation, one first fixes the dependent variable to be differentiated and computes the derivative with respect to each sub-expression recursively. In a pen-and-paper calculation, one can perform the equivalent by repeatedly substituting the derivative of the outer functions in the chain rule:
The chain rule of calculus is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. The chain rule very simply states that the right thing to do is to simply multiply the gradients together to chain them.
Reverse-mode differentiation is an algorithm that computes the chain rule, with a speciﬁc order of operations that is highly eﬃcient.
I won’t go into more details here but there are many excellent explanations in the web, such as the Andrew Ng or Yaser Abu-Mostafa courses.
Bottom line: back-propagation tracks how every neuron affects one output.
# backpropagation: multiply how much we missed by the # slope of the sigmoid at the values in pred_y delta = error * sigmoidDeriv(pred_y)
The first thing we need is a function that computes the gradient of the sigmoid function we created earlier. This is sigmoidDeriv(). The derivative will give back the slope of the predictions.
The variable error is the total error across the whole data, calculated by running the data plus current parameters through the “network” (the forward-propagate function) and comparing the output to the true labels. This is the part we discussed earlier as the cost function.
When we multiply the “slopes” by the error, we are reducing the error of high confidence predictions.
The multiplication is essentially answering the question “how can I adjust my parameters to reduce the error the next time I run through the network”?
It does this by computing the contributions at each layer to the total error and adjusting appropriately by coming up with a gradient matrix (or, how much to change each parameter and in what direction).
Update the weights
# update weights w += np.dot(X.T, delta)
We are now ready to update our network! This line computes the weight updates for each weight for each training example, sums them, and updates the weights, all in a simple line. A small error and a small slope means a VERY small update. However, because we’re using a “full batch” configuration, we’re doing the above step on all four training examples.
Check if converged
# did we converge? maxError = max(error) # biggest error among training examples if abs(maxError) < tolerance: converged = True # yes ! print ("** converged after iterations: ",i) i += 1 # next iteration (to check if max is reached)
The last step is to check if we have reached the end of our optimisation, either because the error is smaller than our tolerance or because we reached the maximum number of iterations.
In our example above, the convergence (i.e. the error was less than 0.1) was reached after 149 iterations.
We can also print out the weights and we can see that they are different than the hard-coded ones (but follow the same pattern: the bias is negative, the inputs are similar and positive).
Weights for ANN modelling OR function: [[-1.63295267] [ 3.84814604] [ 3.83626687]]
There are several combinations of weights that can bring to the same result and there are other equivalent solutions that gradient descent could also ﬁnd.
The convergence point of gradient descent depends on the initial values of the parameters: the initial set of weights (that is random chosen), the gradient descent meta-parameters (not used in this simple example) and in minor part also on the error tolerance defined.
It is also possible to demonstrate that multiplying the resulting weights for a constant C will keep the ANN to output the same result but this is out of the scope for this post. You can try it yourself if you are curious.
An OR gate with 3 inputs
Let´s see if the ANN modelling an OR gate can work with 3 inputs:
print("*** Simulation of OR gate with 3 inputs using ANN ***") # input dataset X_train = np.array([ [0,0,0], [0,0,1], [0,1,0], [1,0,0] ]) # output dataset y_train = np.array([[0,1,1,1]]).T ORweights = learnWeightsOneLayer(X_train, y_train) print("X1 | X2 | X3 | Y = X1 OR X2 OR X3") print("----------------------------------") inputs = ((x1,x2,x3) for x1 in (0,1) for x2 in (0,1) for x3 in (0,1)) for x1,x2,x3 in inputs: print (x1," | ",x2, " | ", x3, " | ", predictOneLayer([x1,x2,x3], ORweights))
Note that the training set X_train is now a matrix with 3 columns for the 3 inputs and that you do not need to list ALL combinations as training examples.
For an OR function with 3 operators there would be a total of 2^3 = 8 possible combinations but here we train the network with only 4 examples.
Choosing the right training dataset (large and representative enough) is a science for itself but I will not dig more here. You can refer to the previous posts about overfitting and cross-validations.
This is the output:
*** Simulation of OR gate with 3 inputs using ANN *** ** converged after iterations: 169 X1 | X2 | X3 | Y = X1 OR X2 OR X3 ---------------------------------- 0 | 0 | 0 | 0 0 | 0 | 1 | 1 0 | 1 | 0 | 1 0 | 1 | 1 | 1 1 | 0 | 0 | 1 1 | 0 | 1 | 1 1 | 1 | 0 | 1 1 | 1 | 1 | 1
It works perfectly for all 8 combinations and convergence has been reached after 169 iterations, pretty fast.
Non-linear example: XOR
The real proof of truth is to model a non-linear function, such as the XOR function, a sort of Hello World for Neural Networks.
We have seen in the previous post that a simple perceptron (one neuron with only one single layer of input) cannot model it. But it’s possible by combining together perceptrons that model each one different functions, such as OR, AND and NOT:
x1 XOR x2 = (x1 AND NOT x2) OR (NOT x1 AND x2)
The resulting multi-layer network of perceptrons is basically an artificial neural network with one additional layer – often called the hidden layer:
To model such a network we need to add one layer of weights, i.e. one column to the weights matrix and to compute the hidden layer weights in the training algorithm.
XOR with two inputs
Note: To keep the example more clear I will keep the weights of the two layers separately.
def learnWeightsTwoLayers(X, y, tolerance = 0.1, max_iterations = 1000): """ Learn the weights for an artificial neural network with two layers, given a training set of input and output values. Uses a backpropagation algorithm. Arguments: X: input data points without bias, matrix y: output data points, array tolerance: minimum error to stop the training, float. Optional (default=0.1) max_iterations: maximum number of iterations performed by the algorithm. Optional. Default=1000 Returns: weights for input layer, array weights for hidden layer, array """ # input and output training sizes n_inputs = X.shape n_outputs = y.shape n_examples = X.shape # add bias bias = np.ones(n_examples) X = np.c_[bias, X] # add bias column to X # initialize weights randomly with mean 0 w1 = 2 * np.random.random((n_inputs+1, n_inputs)) - 1 # input layer w2 = 2 * np.random.random((n_inputs+1, n_outputs)) - 1 # hidden layer # initialization converged = False i = 0 # iteration counter # main loop until converged or max iterations reached while not converged and i <= max_iterations: # note: all training examples together (batch) # forward propagation hiddenLayer = sigmoid(np.dot(X, w1)) hiddenLayer = np.c_[bias, hiddenLayer] # add bias pred_y = sigmoid(np.dot(hiddenLayer, w2)) # predicted output # Compute the error error_y = y - pred_y # backpropagation: multiply how much we missed by the # slope of the sigmoid at the values in y_pred delta_y = error_y * sigmoidDeriv(pred_y) # repeat for hidden layer: error and delta # how much did each value of hidden layer contribute # to the final error (according to the weights)? error_l1 = delta_y.dot(w2.T) delta_l1 = error_l1 * sigmoidDeriv(hiddenLayer) # the first column refers to the bias: remove it delta_l1 = np.delete(delta_l1, 0, axis=1) # update weights w2 += np.dot(hiddenLayer.T, delta_y) w1 += np.dot(X.T, delta_l1) # did we converge? maxError = max(abs(error_y)) if abs(maxError) < tolerance: converged = True # yes ! print ("** converged after iterations: ",i) i += 1 # next iteration (to check if max is reached) if not converged: print("** Not converged!") return w1, w2
The mainly difference is that we have now two arrays of weights w1 and w2, one storing the weights for the input layer and one for the hidden layer.
Then accordingly the code is changed to take in considerations both sets of weights.
Let´s see it in action:
def predictTwoLayers(X, w1, w2): """ Compute the binary output of an Artificial Neural Network with two layers, given its neuron weights and an input dataset Arguments: X: input data points without bias, list w1: the input-layer weights, array w2: the hidden-layer weights, array Returns: integer (0 or 1) """ bias = np.ones(1) X = np.append(bias, X, 0) # add bias column to X # forward propagation hiddenLayer = sigmoid(np.dot(X, w1)) hiddenLayer = np.append(bias, hiddenLayer, 0) # add bias y = np.dot(hiddenLayer, w2) return sigmoidActivation(y) print("*** Simulation of XOR gate using ANN ***") print("***") # input dataset X_train = np.array([ [0,0], [0,1], [1,0], [1,1] ]) # output dataset y_train = np.array([[0,1,1,0]]).T # note that we use tolerance = 0.4 XORweights1, XORweights2 = learnWeightsTwoLayers(X_train, y_train, 0.4) print("X1 | X2 | Y = X1 XOR X2") print("------------------------") for x1 in (0,1): for x2 in (0,1): print(x1," | ",x2, " | ", predictTwoLayers(np.array([x1,x2]), XORweights1, XORweights2))
I used a higher tolerance, since the output of a XOR gate is binary so we don’t need that much approximation. This will make the training algorithm faster.
And here is the output:
*** Simulation of XOR gate using ANN *** *** ** converged after iterations: 967 X1 | X2 | Y = X1 XOR X2 ------------------------ 0 | 0 | 0 0 | 1 | 1 1 | 0 | 1 1 | 1 | 0
It works but note that it takes now 967 iterations to converge.
Adding layers to a network will normally make it slower and harder to train. Normally deeper neural network will be more optimised, e.g. will use better gradient descent algorithms, regularisations and so on.
XOR with 3 inputs
And it works with 3 inputs too:
print("*** Simulation of XOR gate using ANN, now with 3 inputs ***") print("***") # input dataset X_train = np.array([ [0,0,0], [0,1,0], [1,0,1], [0,1,1], [1,0,0], [1,1,1]]) # output dataset y_train = np.array([[0,1,1,1,1,0]]).T XORweights1, XORweights2 = learnWeightsTwoLayers(X_train, y_train, 0.4) print("X1 | X2 | X3 | Y = X1 XOR X2 XOR X3") print("---------------------------------------") inputs = ((x1,x2,x3) for x1 in (0,1) for x2 in (0,1) for x3 in (0,1)) for x1,x2,x3 in inputs: print (x1," | ",x2, " | ", x3, " | ", predictTwoLayers([x1,x2,x3], XORweights1, XORweights2))
and the output is:
*** Simulation of XOR gate using ANN, now with 3 inputs *** *** ** converged after iterations: 360 X1 | X2 | X3 | Y = X1 XOR X2 XOR X3 --------------------------------------- 0 | 0 | 0 | 0 0 | 0 | 1 | 1 0 | 1 | 0 | 1 0 | 1 | 1 | 1 1 | 0 | 0 | 1 1 | 0 | 1 | 1 1 | 1 | 0 | 1 1 | 1 | 1 | 0
The perceptron can model linear functions but cannot model non-linear functions, unless you combine perceptrons together. Nevertheless the perceptron convergence procedure works by ensuring that every time the weights change, they get closer to every “generously feasible” set of weights. This type of guarantee cannot be extended to more complex networks in which the average of two good solutions may be a bad solution (non-convex problems). Therefore the perceptron learning procedure cannot be generalised to hidden layers.
Instead of showing the weights get closer to a good set of weights, the Artificial Neural Networks learn the weights so that the actual output values get closer the target values. The aim of learning is to minimise the error summed over all training cases.
Sigmoid neurons give a real-valued output that is a smooth and bounded function of their total input; they have nice derivatives which make learning easy.
Such an Artificial Neural Network can represent non-linear functions.
The Universal Approximation theorem states that a simple feed-forward neural network with a single hidden layer is enough to represent a wide variety of continuous functions.
This is a bare example. Deep neural network have dozens of layers and use many additional optimisations to improve the learning: different activation functions, regularisations to prevent overfitting ((L1 and L2 regularization, dropout), better choice of cost function (squared difference, cross-entropy), tuned weight initialisation and choice of hyper-parameters (learning rate, number of layers and neurons, …), sophisticated gradient descent algorithms. But the principle is the same.
ANNs can represent efficiently almost any kind of function and studies are constantly progressing.