Cross-validation

We have seen previously how learning the parameters of a prediction function and testing it on the same data would yield a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice to hold out part of the available data as a test set.

The test error is then the average error that results when predicting the response on a new observation, one that was not used in training the learning method.
In contrast, the training error is calculated by applying the learning method to the observations used in its training.
But the training error rate is often quite different from the test error rate, and in particular it can dramatically underestimate the test error.

The best solution would be to use a large designated test set, which is often not available.

Here we see a class of methods that estimate the test error by holding out a subset of the training observations from the fitting process, and then applying the learning method to those held out observations.
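As a minimal sketch of the hold-out idea and of cross-validation (illustrative only: the iris dataset and logistic regression are arbitrary choices, not part of the original example):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# hold out 30% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))

# 5-fold cross-validation: average the held-out score over 5 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("CV accuracy:", scores.mean())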


Introduction to NLTK

Text is everywhere: approximately 80% of all data is estimated to be unstructured text/rich data (web pages, social networks, search queries, documents, …), and text data is growing fast: an estimated 2.5 exabytes every day!

We have seen how to do some basic text processing in Python; now we introduce an open source framework for natural language processing that can further help to work with human languages: NLTK (Natural Language ToolKit).

Tokenise a text

Let's start with a basic NLP task: the usual tokenisation (splitting a text into tokens or words).

You can follow along with a notebook on GitHub.
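As a small taste of what the post covers, here is a minimal tokenisation sketch (standard NLTK API; the sample sentence is invented for illustration):

import nltk
nltk.download('punkt') # tokeniser models, needed once

from nltk.tokenize import word_tokenize

text = "NLTK makes it easy to split a text into tokens."
print(word_tokenize(text))
# ['NLTK', 'makes', 'it', 'easy', 'to', 'split', 'a', 'text', 'into', 'tokens', '.']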

NLP3o: A web app for simple text analysis

We have seen earlier an introduction to a few NLP (Natural Language Processing) topics, specifically how to tokenise a text, remove the punctuation and its stop words (the most common words, such as conjunctions) and how to find the top used words.

Now we can see – as an example – how to put all of this inside a simple web app.

The entire source code is available on GitHub, so you can better follow along; and you can see it in action in a Heroku dyno.
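As a rough sketch of the pipeline behind the app (my own minimal reconstruction with NLTK and the standard library; the sample text is invented, and the real app may differ):

import string
from collections import Counter

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "The quick brown fox jumps over the lazy dog, and the dog sleeps."
tokens = [t.lower() for t in word_tokenize(text)]
# drop punctuation and stop words, then count the top words
skip = set(stopwords.words('english')) | set(string.punctuation)
words = [t for t in tokens if t not in skip]
print(Counter(words).most_common(3))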


Logistic regression with Python statsmodels

We have seen an introduction to logistic regression, with a simple example of how to predict a student's admission to university based on past exam results.
This was done using plain Python, the sigmoid function and gradient descent.

We can now see how to solve the same example using the statsmodels library, specifically its Logit model, which implements logistic regression. The package contains an optimised and efficient algorithm to find the correct regression parameters.
You can follow along from the Python notebook on GitHub.
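As a minimal sketch of what the post walks through (standard statsmodels API; the toy exam scores below are invented for illustration):

import numpy as np
import statsmodels.api as sm

# past exam scores (two exams per student) and admission outcome (0/1)
X = np.array([[34.6, 78.0], [30.3, 43.9], [35.8, 72.9],
              [60.2, 86.3], [79.0, 75.3], [45.1, 56.3]])
y = np.array([0, 0, 1, 1, 1, 0])

X = sm.add_constant(X)        # add the intercept column
result = sm.Logit(y, X).fit() # optimised fit of the parameters
print(result.params)          # the learned regression coefficients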


Back-propagation for neural network

Recalling the perceptron

In the previous post we have seen how an artificial neuron can classify simple functions, and we have also seen its limitations when the function to be classified is not linear, e.g. it fails to compute the XOR function correctly.

We have also seen that the way to overcome this limitation is to combine more neurons together in a network, but one thing we have not yet seen is how to compute the "weights" of the network (they were hard-coded in the previous example).

As a reminder, here is the perceptron able to compute an OR function:

[Figure: a perceptron computing the OR function]

The input values are multiplied by specific weights and combined together in an activation function which outputs the OR function.
The "artificial neural network" (ANN) is really just this matrix of weights. Here the matrix has only one column (the OR function is very simple) but we may have "layers" and layers of neurons and their weights.
The values calculated along the way are just transient values based on the input dataset: we don't save them, they are forgotten.
All of the learning is stored in the weight matrix.

Like other supervised models, ANNs can be trained to learn the weights using some labelled training data (the "supervised" data), and once the weight matrix is learnt, we can use it to compute the function that we trained for.

Sigmoid neurons

Before seeing how the weights can be learned, we will slightly tune the activation function (the red circle in the figure above) in order to make the learning phase easier.
In the perceptron example, this activation function was a simple threshold, outputting 0 if the input was below the threshold and 1 otherwise. Instead, we will use a sigmoid function, which has the advantage that small changes in the weights and bias cause only a small change in the output. The sigmoid function – often called the logistic function – is defined by:

\sigma(z) = \frac{1}{1+e^{-z}}

The sigmoid activation function is always positive,  bounded between 0 and 1 and strictly increasing.

[Figure: the sigmoid function]

And this is its Python implementation:

import numpy as np

def sigmoid(z): 
  """
  Compute the sigmoid function

  Argument:
    z: float number

  Returns:
    float
  """
 
  return 1 / (1 + np.exp(-z))

The sigmoid function can be thought of as a "squashing function", because it takes the input and squashes it to be between zero and one: very negative values are squashed towards zero and large positive values get squashed towards one.
For example we have:

sigmoid(-5) = 0.007;  sigmoid(5) = 0.993

The output of a sigmoid neuron

The output of a sigmoid neuron is obviously not 0 or 1 but a real number between 0 and 1. This can be useful, for example, if we want to use the output value to represent the intensity of the pixels in an image input to a neural network. But sometimes it is not: to model an OR gate we need the output to be 0 or 1, as it was in the perceptron.
In practice we can use a convention to deal with this, for example, by deciding to interpret any output of at least 0.5 as indicating a “1”, and any output less than 0.5 as indicating a “0”.

We can quickly verify that the original perceptron simulating the OR function is still working with this new activation function:

def sigmoidActivation(z):
  """
  Compute the activation function for boolean operators using sigmoid
  Since a sigmoid function is always between 0 and 1, 
  we just cut off at 0.5

  Arguments:
    z: input data points including bias, array

  Returns:
    integer (0 or 1)
  """

  if sigmoid(z) > 0.5:
    return 1
  else:
    return 0


def predictOneLayer(X, weights):
  """
  Compute the binary output of an Artificial Neural Network with single layer,
  given its neuron weights and an input dataset

  Arguments:
    X: input data points without bias, list
    weights: the single-layer weights, array

  Returns:
    integer (0 or 1)
  """
  X = [1] + X # add bias to the input list

  # forward propagation
  y = np.dot(X, weights).sum()

  return sigmoidActivation(y)
 

def forwardOR(operands):
  """
  Simulate an OR function using a simple forward neural network 
  (single layer)

  Arguments:
    operands: input data points without bias, list

  Returns:
    integer (0 or 1)
  """
 
  m = len(operands) # number of inputs (operands)
  ORweights = [-0.5] + [1]*m  # hard-coded weights
  return predictOneLayer(operands, ORweights)


print("*** Simulation of OR using one forward neuron ***")
print("X1 | X2 | Y = X1 OR X2")
print("-----------------------")
for x1 in (0,1):
  for x2 in (0,1):
    print(x1," | ",x2, " | ", forwardOR([x1,x2]))

And here is its output:

*** Simulation of OR using one forward neuron ***
X1 | X2 | Y = X1 OR X2
-----------------------
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 1

The OR function still works correctly using sigmoid neurons.

As you may notice, the weights in the OR example above are hard-coded: we can work them out manually very quickly since the example is pretty straightforward. But this is obviously not scalable as the complexity of the function grows.
Therefore we will see now how we can find the appropriate weights to model a simple function like the OR.

Training the network

What we need is an algorithm that will find the appropriate weights (remember: they ARE the artificial neural network) programmatically for us so that the output from the network approximates the expected y(x) for all training inputs x.

Cost function

To quantify how well we’re achieving this goal we will define a cost function, sometimes referred to as a loss or objective function.
We have already seen examples of cost functions in the regression models and it can be a complex function, depending on the domain – a common choice is the categorical cross-entropy loss  – but for this simple example we just stick to the difference between the network output and the training output:

error = actual output (label) - predicted output

Learning the weights

The output of the ANN is:

output = h(W * X + w0)

where X is the input, w0 is the bias, W are the weights and h is the non-linear activation function.

Learning the parameters for our network means finding weights that minimise the error on our training data.  By finding parameters that minimise the error we maximise the likelihood of our training data.
Designing and training a neural network is not much different from training any other machine learning model with gradient descent.

Sigmoid derivative

Gradient descent works by using the derivative. The derivative of the sigmoid with respect to its input is very simple:

\sigma(z) = \frac{1}{1+e^{-z}} \rightarrow \frac{d\sigma(z)}{dz} = \sigma(z)\,(1-\sigma(z))

therefore its Python implementation is clear-cut:

def sigmoidDeriv(s):
  """
  Compute the derivative of the sigmoid function,
  given the already-computed sigmoid value s = sigmoid(z)

  Argument:
    s: the sigmoid value sigma(z) itself (not z), float

  Returns:
    float
  """

  return s * (1-s)

Now we have everything we need to explore the learning algorithm for an artificial neural network.

The learning algorithm

The general idea is to forward-feed the input dataset to an initial set of weights to compute the output and the error, then back-propagate the error using the gradient (this is the “core” part!) and update the weights until the error is acceptable:

0. start with X and y (input and output training data)
1. initialise weights Wij randomly
2. loop until error < tolerance or max iterations:
  3. for each training example Xi:
    4. forward: out = theta(Wij * Xi) # theta is the activation function
    5. compute the error
    6. backward: compute gradient
    7. update all weights Wij using the gradient
8. return the final weights

We will look at it in detail line by line, but here is the entire working Python implementation for the case of a single layer of neurons:

def learnWeightsOneLayer(X, y, tolerance = 0.1, max_iterations = 1000):
  """
  Learn the weights for an artificial neural network with single layer,
  given a training set of input and output values.
  Uses a backpropagation algorithm.

  Arguments:
    X: input data points without bias, matrix
    y: output data points, array
    tolerance: minimum error to stop the training, float. Optional (default=0.1)
    max_iterations: maximum number of iterations performed by the algorithm.
                    Optional. Default=1000

  Returns:
    weights, array
  """

  # input and output training sizes
  n_inputs = X.shape[1]
  n_outputs = y.shape[1]
  n_examples = X.shape[0]
 
  # add bias
  bias = np.ones(n_examples)
  X = np.c_[bias, X] # add bias column to X
 
  # initialization
  converged = False
  i = 0 # iteration counter


  # initialize weights randomly with mean 0
  w = 2 * np.random.random((n_inputs+1, n_outputs)) - 1 
 
  # main loop until converged or max iterations reached
  while not converged and i <= max_iterations:
 
    # note: all training examples are used together
 
    # forward propagation
    pred_y = sigmoid(np.dot(X, w)) # predicted output
 
 
    # how much did we miss? Compute the error
    error = y - pred_y
 
    # backpropagation: multiply how much we missed by the
    # slope of the sigmoid at the values in pred_y
    delta = error * sigmoidDeriv(pred_y)
 
    # update weights
    w += np.dot(X.T, delta) 

    # did we converge?
    maxError = max(abs(error)) # biggest absolute error among training examples
    if maxError < tolerance:
      converged = True # yes!
      print("** converged after iterations: ", i)

    i += 1 # next iteration

  # end of loop
  if not converged:
    print("** Not converged!")
 
  return w

And this is how it works with a simple two-value input and an OR output:

# seed random numbers to make the calculation
# deterministic (just a good practice when testing / debugging)
np.random.seed(1)

print("*** Simulation of OR gate using neural network ***")
# input dataset: 4 training examples
X_train = np.array([ [0,0], [0,1], [1,0], [1,1] ])
# output dataset: the expected results for each training input
y_train = np.array([[0,1,1,1]]).T

# learn the weights through training!
ORweights = learnWeightsOneLayer(X_train, y_train)

# print the truth table
print("X1 | X2 | Y = X1 OR X2")
print("-----------------------")
for x1 in (0,1):
  for x2 in (0,1):
    print(x1," | ",x2, " | ", predictOneLayer([x1,x2], ORweights))

And this is the output:

*** Simulation of OR gate using neural network ***
** converged after iterations: 149
X1 | X2 | Y = X1 OR X2
-----------------------
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 1

Well, it works!

Let’s explore the code piece by piece.

Training data and bias

 # input and output training sizes
 n_inputs = X.shape[1]
 n_examples = X.shape[0]

The input dataset is a numpy matrix. Each row is a single “training example”. Each column corresponds to one of our input nodes.
In our OR example above, we have 2 input nodes (X1 and X2) and 4 training examples.
Similarly, the number of nodes in the output layer (the "labels" of our dataset) is determined by the number of classes; in this case we have only one output node for the two classes 0 and 1. That is the number of columns of the y matrix, while its rows are again determined by the number of training examples.

# add bias
  bias = np.ones(n_examples)
  X = np.c_[bias, X] # add bias column to X

This will add a "bias" neuron, i.e. an extra input column that is always 1, with its own weight. It's always a good idea to add a bias to a neural network.
The bias node in a neural network is a node that is always 'on'. That is, its value is set to 1. It is analogous to the intercept in a regression model, and serves the same function: it allows you to shift the activation function to the left or right, which may be critical for successful learning.

For example, assume that you want a neuron to fire y ≈ 1 when all the inputs are x ≈ 0. If there is no bias then, no matter what weights W you have, given the equation y = σ(WX) the neuron will always fire y ≈ 0.5.
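A quick numeric check, using the sigmoid defined above (the weight values are arbitrary):

# with input x = [0, 0] and no bias, the pre-activation is 0,
# so the neuron outputs sigmoid(0) = 0.5 whatever the weights are
print(sigmoid(np.dot([0, 0], [5.0, -3.0])))         # 0.5
# with a bias weight of 3 the same input can fire y close to 1
print(sigmoid(np.dot([1, 0, 0], [3.0, 5.0, -3.0]))) # ~0.95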

Initial weights

   # initialise weights randomly with mean 0
 w = 2 * np.random.random((n_inputs+1, n_outputs)) - 1

This creates our weight matrix for the neural network.
It initialises the weights to random numbers, using the numpy function random(), which returns a random float or array of floats.
The size of the output is controlled by its parameter; in our example it's a tuple (n_inputs+1, n_outputs), which means it will return a matrix.

In general we can choose the dimensionality of each hidden layer.
The more nodes we put into a hidden layer, the more complex the functions the neural network will be able to model, but higher dimensionality means more computation is required to learn the network weights.

Since here we only have one layer (the input values), we only need a weight matrix with one column to connect it to the output. We keep its dimensions minimal: (3,1), because we have two inputs plus the bias, and one output.

Results from the random() function are from the “continuous uniform” distribution over the half-open interval [0.0, 1.0). It’s a best practice to sample the weights from a uniform distribution with mean zero: to sample Unif[-1,1) we just have to multiply the output of random() by 1-(-1)=2 and add -1.

Iterate

   # initialisation
 converged = False
 i = 0 # iteration counter
 
   # main loop until converged or max iterations reached
 while not converged and i <= max_iterations:

This is the loop to train the network weights.
It iterates over the training code until we have reached the desired error tolerance (converged) or until we have reached the maximum number of iterations.
Both the max_iterations and the error tolerance are optional inputs of this function.

Feed-forward

 # forward propagation
 pred_y = sigmoid(np.dot(X, w)) # predicted output

This is the forward propagation prediction step: the network predicts the output given the input and the current weights by matrix multiplication and applying the activation function.

y = σ(Xw)

Note that X contains all training examples (input) and we process all of them at the same time in this implementation. This is known as “full batch” training.

The X matrix is multiplied by the current weight matrix w, using the numpy function dot(), and the result then goes through the sigmoid squashing function, our activation function.
The result will also be a matrix, in this case with 4 rows (the four training examples) and one column (the predicted output):

(4 x 3) dot (3 x 1) = (4 x 1)  # 4 examples, 3 inputs and 1 layer of weights

Each output corresponds to the network's prediction for a given training input. This works independently of the number of training examples: if I train the network with more examples, the matrices will be bigger, but no changes in the code are required.
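As a quick sanity check of these dimensions (a small sketch reusing the X_train and learned ORweights variables from the example above):

# shape check for the full-batch forward pass
X_b = np.c_[np.ones(4), X_train]   # (4, 3): 4 examples, bias + 2 inputs
print(X_b.shape, ORweights.shape)  # (4, 3) (3, 1)
pred = sigmoid(np.dot(X_b, ORweights))
print(pred.shape)                  # (4, 1): one prediction per example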

Loss function (the error)

 # Compute the error
 error = y - pred_y

We can now compare how well it did by subtracting the prediction (pred_y) from the true answer (y from the training dataset).
The error is an array: the difference between actual and prediction for each training example.
Again: this is a super-simple loss function, just for illustration. For more complex examples you would use e.g. a quadratic function like the ones we have seen in the regression algorithms.

Back-propagation

One popular approach is an algorithm called back-propagation, which has similarities to the gradient descent algorithm we looked at earlier. The back-propagation algorithm (Rumelhart et al., 1986), often simply called backprop, allows the error information to flow backwards through the network in order to compute the gradient. Once we have the gradient of the error as a function of the neuron weights, we can propagate these output errors backward to infer errors for the other layers and adjust their weights in the direction that most decreases the error.

[Figure: the error flowing backwards through a neuron]

The term back-propagation is often misunderstood as meaning the whole learning algorithm for neural networks but actually, it refers only to the method for computing the gradient, while another algorithm, such as stochastic or batch gradient descent, is used to perform learning using this gradient.

We use the gradient to find the minimum; this means we first need to compute the gradient. The gradient is just the vector of the partial derivatives of the error with respect to all the weights (remember that the derivative of a function measures its sensitivity to change):

\nabla e(w) = \frac{\partial e(w)}{\partial w^{(l)}_{ij}} \quad \text{for all } i, j, l

where w are the weights and e(w) is the error on the example X,y.

Back-propagation is the key algorithm that makes training deep models computationally tractable; basically it is just a clever "trick" to efficiently calculate the gradients starting from the output. I am not going into details here, but it is an old technique and you can find the generic implementation under the name "reverse-mode differentiation":

Classical methods are slow at computing the partial derivatives of a function with respect to many inputs, as is needed for gradient-based optimization algorithms. Automatic differentiation solves all of these problems […]
In reverse accumulation, one first fixes the dependent variable to be differentiated and computes the derivative with respect to each sub-expression recursively. In a pen-and-paper calculation, one can perform the equivalent by repeatedly substituting the derivative of the outer functions in the chain rule:

\frac{\partial y}{\partial x} = \frac{\partial y}{\partial w_{1}} \cdot \frac{\partial w_{1}}{\partial x}

The chain rule of calculus is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. It states that the right way to chain the derivatives is simply to multiply them together.
Reverse-mode differentiation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.

I won't go into more details here, but there are many excellent explanations on the web, such as the Andrew Ng or Yaser Abu-Mostafa courses.

Bottom line: back-propagation tracks how every neuron affects one output.

 # backpropagation: multiply how much we missed by the
 # slope of the sigmoid at the values in pred_y
delta = error * sigmoidDeriv(pred_y)

The first thing we need is a function that computes the gradient of the sigmoid function we created earlier. This is sigmoidDeriv(). The derivative will give back the slope of the predictions.

The variable error holds the error for each training example, calculated by running the data with the current parameters through the "network" (the forward-propagation step) and comparing the output to the true labels. This is the part we discussed earlier as the cost function.

When we multiply the error by the "slopes", we scale down the updates for high-confidence predictions: where the sigmoid is nearly flat, the network was confident, so its weights are barely changed.
The multiplication is essentially answering the question "how can I adjust my parameters to reduce the error the next time I run through the network?"
It does this by computing the contribution of each layer to the total error and coming up with a gradient matrix (or: how much to change each parameter, and in what direction).

Update the weights

 # update weights
 w += np.dot(X.T, delta)

We are now ready to update our network! This line computes the weight update for each weight for each training example, sums them, and updates the weights, all in a single line. A small error and a small slope mean a VERY small update. Because we're using a "full batch" configuration, we do the above step on all four training examples at once.

Check if converged

# did we converge?
maxError = max(abs(error)) # biggest absolute error among training examples
if maxError < tolerance:
  converged = True # yes!
  print("** converged after iterations: ", i)

i += 1 # next iteration

The last step is to check if we have reached the end of our optimisation, either because the error is smaller than our tolerance or because we reached the maximum number of iterations.

In our example above, the convergence (i.e. the error was less than 0.1) was reached after 149 iterations.

We can also print out the weights, and we can see that they are different from the hard-coded ones (but follow the same pattern: the bias is negative, the input weights are similar and positive).

Weights for ANN modelling OR function: 
[[-1.63295267]
 [ 3.84814604]
 [ 3.83626687]]

There are several combinations of weights that lead to the same result, and gradient descent could also find other equivalent solutions.
The convergence point of gradient descent depends on the initial values of the parameters: the initial set of weights (which is randomly chosen), the gradient descent meta-parameters (not used in this simple example) and, to a minor extent, the error tolerance defined.

It is also possible to demonstrate that multiplying the resulting weights by a positive constant C will keep the ANN outputting the same result, but this is out of scope for this post. You can try it yourself if you are curious, e.g. with the sketch below.
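Here is a minimal sketch of that experiment (reusing ORweights and predictOneLayer from above): scale the learned weights by a positive constant C and check that the truth table is unchanged, since the sign of the pre-activation, and hence the 0.5 cut-off, is preserved.

C = 10 # any positive constant
scaledWeights = C * ORweights
for x1 in (0,1):
  for x2 in (0,1):
    assert predictOneLayer([x1,x2], ORweights) == \
           predictOneLayer([x1,x2], scaledWeights)
print("Same truth table with the scaled weights")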

An OR gate with 3 inputs

Let's see if the ANN modelling an OR gate can work with 3 inputs:

print("*** Simulation of OR gate with 3 inputs using ANN ***")

# input dataset
X_train = np.array([ [0,0,0], [0,0,1], [0,1,0], [1,0,0] ])
# output dataset
y_train = np.array([[0,1,1,1]]).T

ORweights = learnWeightsOneLayer(X_train, y_train)

print("X1 | X2 | X3 | Y = X1 OR X2 OR X3")
print("----------------------------------")
inputs = ((x1,x2,x3) for x1 in (0,1) for x2 in (0,1) for x3 in (0,1))
for x1,x2,x3 in inputs:
  print (x1," | ",x2, " | ", x3, " | ",
         predictOneLayer([x1,x2,x3], ORweights))

Note that the training set X_train is now a matrix with 3 columns for the 3 inputs and that you do not need to list ALL combinations as training examples.
For an OR function with 3 operands there would be a total of 2^3 = 8 possible combinations, but here we train the network with only 4 examples.
Choosing the right training dataset (large and representative enough) is a science in itself, but I will not dig deeper here. You can refer to the previous posts about overfitting and cross-validation.

This is the output:

*** Simulation of OR gate with 3 inputs using ANN ***
** converged after iterations: 169
X1 | X2 | X3 | Y = X1 OR X2 OR X3
----------------------------------
0  | 0  | 0  | 0
0  | 0  | 1  | 1
0  | 1  | 0  | 1
0  | 1  | 1  | 1
1  | 0  | 0  | 1
1  | 0  | 1  | 1
1  | 1  | 0  | 1
1  | 1  | 1  | 1

It works perfectly for all 8 combinations, and convergence was reached after 169 iterations: pretty fast.

Non-linear example: XOR

The real test is to model a non-linear function, such as the XOR function, a sort of Hello World for neural networks.
We have seen in the previous post that a simple perceptron (one neuron with only a single layer of inputs) cannot model it. But it is possible by combining perceptrons that each model a different function, such as OR, AND and NOT:

x1 XOR x2 = (x1 AND NOT x2) OR (NOT x1 AND x2)
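We can quickly sanity-check this identity in plain Python, before building the network:

# check the boolean identity over the whole truth table
for x1 in (0,1):
  for x2 in (0,1):
    xor = (x1 and not x2) or (not x1 and x2)
    assert int(xor) == (x1 ^ x2) # ^ is Python's bitwise XOR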

The resulting multi-layer network of perceptrons is basically an artificial neural network with one additional layer – often called the hidden layer:

[Figure: a multi-layer network of perceptrons computing the XOR function]

To model such a network we need to add one more layer of weights, i.e. a second weight matrix, and to compute the hidden-layer weights in the training algorithm.

XOR with two inputs

Note: to keep the example clearer I will keep the weights of the two layers separate.

def learnWeightsTwoLayers(X, y, tolerance = 0.1, max_iterations = 1000):
  """
  Learn the weights for an artificial neural network with two layers,
  given a training set of input and output values.
  Uses a backpropagation algorithm.

  Arguments:
    X: input data points without bias, matrix
    y: output data points, array
    tolerance: minimum error to stop the training, float. Optional (default=0.1)
    max_iterations: maximum number of iterations performed by the algorithm.
                    Optional. Default=1000

  Returns:
    weights for input layer, array
    weights for hidden layer, array
  """  

  # input and output training sizes
  n_inputs = X.shape[1]
  n_outputs = y.shape[1]
  n_examples = X.shape[0]

  # add bias
  bias = np.ones(n_examples)
  X = np.c_[bias, X] # add bias column to X

  # initialize weights randomly with mean 0
  w1 = 2 * np.random.random((n_inputs+1, n_inputs)) - 1  # input layer
  w2 = 2 * np.random.random((n_inputs+1, n_outputs)) - 1 # hidden layer

  # initialization
  converged = False
  i = 0 # iteration counter

  # main loop until converged or max iterations reached
  while not converged and i <= max_iterations:

    # note: all training examples together (batch)

    # forward propagation
    hiddenLayer = sigmoid(np.dot(X, w1))
    hiddenLayer = np.c_[bias, hiddenLayer] # add bias
    pred_y = sigmoid(np.dot(hiddenLayer, w2)) # predicted output
 
 
    # compute the error
    error_y = y - pred_y

    # backpropagation: multiply how much we missed by the
    # slope of the sigmoid at the values in pred_y
    delta_y = error_y * sigmoidDeriv(pred_y)

    # repeat for the hidden layer: error and delta.
    # How much did each value of the hidden layer contribute
    # to the final error (according to the weights)?
    error_l1 = delta_y.dot(w2.T)
    delta_l1 = error_l1 * sigmoidDeriv(hiddenLayer)
    # the first column refers to the bias: remove it
    delta_l1 = np.delete(delta_l1, 0, axis=1)

    # update weights
    w2 += np.dot(hiddenLayer.T, delta_y)
    w1 += np.dot(X.T, delta_l1)

    # did we converge?
    maxError = max(abs(error_y)) # biggest absolute error
    if maxError < tolerance:
      converged = True # yes!
      print("** converged after iterations: ", i)

    i += 1 # next iteration

  if not converged:
    print("** Not converged!")
 
  return w1, w2

The main difference is that we now have two arrays of weights, w1 and w2, one storing the weights for the input layer and one for the hidden layer.
The code is changed accordingly to take both sets of weights into consideration.

Let's see it in action:

def predictTwoLayers(X, w1, w2):
  """
  Compute the binary output of an Artificial Neural Network with two layers,
  given its neuron weights and an input dataset

  Arguments:
    X: input data points without bias, list
    w1: the input-layer weights, array
    w2: the hidden-layer weights, array

  Returns:
    integer (0 or 1)
  """ 
  bias = np.ones(1)
  X = np.append(bias, X, 0) # prepend the bias to the input

  # forward propagation
  hiddenLayer = sigmoid(np.dot(X, w1))
  hiddenLayer = np.append(bias, hiddenLayer, 0) # add bias
  y = np.dot(hiddenLayer, w2)

  return sigmoidActivation(y)

print("*** Simulation of XOR gate using ANN ***")
print("***")

# input dataset
X_train = np.array([ [0,0], [0,1], [1,0], [1,1] ])
# output dataset
y_train = np.array([[0,1,1,0]]).T
# note that we use tolerance = 0.4
XORweights1, XORweights2 = learnWeightsTwoLayers(X_train, y_train, 0.4)

 
print("X1 | X2 | Y = X1 XOR X2")
print("------------------------")
for x1 in (0,1):
  for x2 in (0,1):
    print(x1," | ",x2, " | ", predictTwoLayers(np.array([x1,x2]), 
                                               XORweights1, XORweights2))
 

I used a higher tolerance: since the output of a XOR gate is binary, we do not need that much precision, and this makes the training algorithm faster.

And here is the output:

*** Simulation of XOR gate using ANN ***
***
** converged after iterations: 967
X1 | X2 | Y = X1 XOR X2
------------------------
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 0

It works, but note that it now takes 967 iterations to converge.
Adding layers to a network will normally make it slower and harder to train; that is why deeper neural networks are normally more optimised, e.g. they use better gradient descent algorithms, regularisation and so on.

XOR with 3 inputs

And it works with 3 inputs too (note that here the 3-input "XOR" is interpreted as outputting 1 when the inputs are not all equal, which matches the training labels below, rather than as odd parity):

print("*** Simulation of XOR gate using ANN, now with 3 inputs ***")
print("***")

# input dataset
X_train = np.array([ [0,0,0], [0,1,0], [1,0,1], [0,1,1], [1,0,0], [1,1,1] ])
# output dataset
y_train = np.array([[0,1,1,1,1,0]]).T

XORweights1, XORweights2 = learnWeightsTwoLayers(X_train, y_train, 0.4)

print("X1 | X2 | X3 | Y = X1 XOR X2 XOR X3")
print("---------------------------------------")
inputs = ((x1,x2,x3) for x1 in (0,1) for x2 in (0,1) for x3 in (0,1))
for x1,x2,x3 in inputs:
  print (x1," | ",x2, " | ", x3, " | ", predictTwoLayers([x1,x2,x3], 
                                         XORweights1, XORweights2))

and the output is:

*** Simulation of XOR gate using ANN, now with 3 inputs ***
***
** converged after iterations: 360
X1 | X2 | X3 | Y = X1 XOR X2 XOR X3
---------------------------------------
0  | 0  | 0  | 0
0  | 0  | 1  | 1
0  | 1  | 0  | 1
0  | 1  | 1  | 1
1  | 0  | 0  | 1
1  | 0  | 1  | 1
1  | 1  | 0  | 1
1  | 1  | 1  | 0

Conclusion

The perceptron can model linear functions but cannot model non-linear functions, unless you combine perceptrons together. Nevertheless the perceptron convergence procedure works by ensuring that every time the weights change, they get closer to every “generously feasible” set of weights. This type of guarantee cannot be extended to more complex networks in which the average of two good solutions may be a bad solution (non-convex problems). Therefore the perceptron learning procedure cannot be generalised to hidden layers.

Instead of showing that the weights get closer to a good set of weights, artificial neural networks learn the weights so that the actual output values get closer to the target values. The aim of learning is to minimise the error summed over all training cases.
Sigmoid neurons give a real-valued output that is a smooth and bounded function of their total input; they have nice derivatives which make learning easy.

Such an Artificial Neural Network can represent non-linear functions.
The Universal Approximation theorem states that a simple feed-forward neural network with a single hidden layer is enough to represent a wide variety of continuous functions.

This is a bare-bones example. Deep neural networks have dozens of layers and use many additional optimisations to improve the learning: different activation functions, regularisation to prevent overfitting (L1 and L2 regularisation, dropout), better choices of cost function (squared difference, cross-entropy), tuned weight initialisation and choice of hyper-parameters (learning rate, number of layers and neurons, …), sophisticated gradient descent algorithms. But the principle is the same.

ANNs can efficiently represent almost any kind of function, and research is constantly progressing.