Machines “think” differently but it’s not a problem (maybe)

Yet another article about the interpretability problem of many AI algorithms, this time in the MIT Technology Review, May/June 2017 issue.

The issue is clear: many of the most successful recent AI technologies revolve around deep learning: complex artificial neural networks – with so many layers of so many neurons transforming so many variables – that behave like “black boxes” for us.
We can no longer comprehend the model; we don’t know how or why the outcome for a specific input is obtained.
Is it scary?

In the film Dekalog 1 by Krzysztof Kieślowski – the first of ten short films inspired by the Ten Commandments, the first of which is “I am the Lord your God; you shall have no other gods before me” – Krzysztof lives alone with Paweł, his 12-year-old and highly intelligent son, and introduces him to the world of personal computers. Continue reading “Machines “think” differently but it’s not a problem (maybe)”


Big data, data science and machine learning explained

Data are considered the new secret sauce: they are everywhere and have been the cornerstone of the success of many high-tech companies, from Google to Facebook.

But we have always used data; there are examples from ancient times, dating back thousands of years.
In recent centuries data started to find more and more practical applications, thanks to the emergence of statistics and, later, of Business Intelligence. The earliest known use of the term “Business Intelligence” is by Richard Millar Devens in 1865. Devens used the term to describe how a banker gained profit by receiving and acting upon information about his environment before his competitors did.

Big Data

It is after WWII that the practice of using data-based systems to improve business decision-making – surely driven by advances in automatic computing systems and storage – started to take off and be used widely. Digital storage became more cost-effective than paper and, since then, an unbelievable amount of data has been collected and organised in data warehouses, initially in structured formats. The term Big Data started to be used to mean simply a lot of data.

In a 2001 research report and related lectures, analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing

  1. volume (the amount of data reached peaks that could be handled only by specific systems)
  2. velocity (speed of data in and out, including the emergence of real-time data)
  3. variety (the range of data types and sources, often in unstructured formats)

Gartner, and with it much of the industry, quickly picked up this “3Vs” model for describing Big Data and, a decade later, these have become the generally accepted three defining dimensions of big data.

Continue reading “Big data, data science and machine learning explained”

Back-propagation for neural network

Recalling the perceptron

In the previous post we saw how an artificial neuron works to classify simple functions, and we also saw its limitations when the function space to be classified is not linear, e.g. it fails to compute the XOR function correctly.

We also saw that the way to overcome this limitation is to combine more neurons together in a network, but one thing we had not yet seen is how to compute the “weights” of the network (they were hard-coded in the previous example).

As a reminder, here is the perceptron able to compute an OR function:

[Figure: a perceptron computing the OR function]

The input values are multiplied by specific weights and combined together in an activation function, which outputs the OR function.
The “artificial neural network” (ANN) is really just this matrix of weights. Here the matrix has only one column (the OR function is very simple), but we may have “layers” and layers of neurons and their weights.
The calculated values are just transient values based on the input dataset: we don’t save them, they are forgotten.
All of the learning is stored in the weight matrix.

Like other supervised models, ANNs can be trained to learn the weights using labelled training data (the “supervised” data); once the weight matrix is learnt, we can use it to compute the function we trained for.

Sigmoid neurons

Before seeing how the weights can be learned, we will slightly tune the activation function (the red circle in the figure above) in order to make the learning phase easier.
In the perceptron example, this activation function was a simple threshold, outputting 0 if the input was below a threshold and 1 otherwise. Instead, we will use a sigmoid function, which has the advantage that small changes in the weights and bias cause only a small change in the output. The sigmoid function – often called the logistic function – is defined by:

\frac{1}{(1+e^{-z})}

The sigmoid activation function is always positive,  bounded between 0 and 1 and strictly increasing.

[Figure: the sigmoid function]

And this is its Python implementation:

import numpy as np

def sigmoid(z): 
  """
  Compute the sigmoid function

  Argument:
    z: float number

  Returns:
    float
  """
 
  return 1 / (1 + np.exp(-z))

The sigmoid function can be thought of as a “squashing function”, because it takes the input and squashes it to be between zero and one: very negative values are squashed towards zero and positive values get squashed towards one.
For example we have:

sigmoid(-5) = 0.007;  sigmoid(5) = 0.993

The output of a sigmoid neuron

The output of a sigmoid neuron is obviously not just 0 or 1 but a real number between 0 and 1. This can be useful, for example, if we want the output value to represent the intensity of the pixels in an image fed into a neural network. But sometimes it is not: to model an OR gate we need the output to be 0 or 1, as it was in the perceptron.
In practice we can use a convention to deal with this, for example by deciding to interpret any output of at least 0.5 as indicating a “1”, and any output less than 0.5 as indicating a “0”.

We can quickly verify that the original perceptron simulating the OR function is still working with this new activation function:

def sigmoidActivation(z):
  """
  Compute the activation function for boolean operators using sigmoid
  Since a sigmoid function is always between 0 and 1, 
  we just cut off at 0.5

  Arguments:
    z: weighted sum of the inputs including bias, float

  Returns:
    integer (0 or 1)
  """

  if sigmoid(z) > 0.5:
    return 1
  else:
    return 0


def predictOneLayer(X, weights):
  """
  Compute the binary output of an Artificial Neural Network with single layer,
  given its neuron weights and an input dataset

  Arguments:
    X: input data points without bias, list
    weights: the single-layer weights, array

  Returns:
    integer (0 or 1)
  """
  X = [1] + X # add bias to the input list

    # forward propagation
  y = np.dot(X, weights).sum()

  return sigmoidActivation(y)
 

def forwardOR(operands):
  """
  Simulate an OR function using a simple forward neural network 
  (single layer)

  Arguments:
    operands: input data points without bias, list

  Returns:
    integer (0 or 1)
  """
 
  m = len(operands) # number of operands (inputs)
  ORweights = [-0.5] + [1]*m  # hard-coded weights
  return predictOneLayer(operands, ORweights)


print("*** Simulation of OR using one forward neuron ***")
print("X1 | X2 | Y = X1 OR X2")
print("-----------------------")
for x1 in (0,1):
  for x2 in (0,1):
    print(x1," | ",x2, " | ", forwardOR([x1,x2]))

And here is its output:

*** Simulation of OR using one forward neuron ***
X1 | X2 | Y = X1 OR X2
-----------------------
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 1

The OR function still works correctly using sigmoid neurons.

As you notice, the weights in this OR example are hard-coded, because we can derive them manually very quickly since the example is pretty straightforward. But this is obviously not scalable as the complexity of a function grows.
Therefore we will now see how we can find the appropriate weights to model a simple function like the OR.

Training the network

What we need is an algorithm that will find the appropriate weights (remember: they ARE the artificial neural network) programmatically for us so that the output from the network approximates the expected y(x) for all training inputs x.

Cost function

To quantify how well we’re achieving this goal we will define a cost function, sometimes referred to as a loss or objective function.
We have already seen examples of cost functions in the regression models and it can be a complex function, depending on the domain – a common choice is the categorical cross-entropy loss  – but for this simple example we just stick to the difference between the network output and the training output:

error = actual output (label) – predicted output

Learning the weights

The output of the ANN is:

output = h(W * X + w0)

where X is the input, w0 is the bias, W are the weights and h is the non-linear activation function.

[Figure: a single artificial neuron]

Learning the parameters for our network means finding weights that minimise the error on our training data.  By finding parameters that minimise the error we maximise the likelihood of our training data.
Designing and training a neural network is not much different from training any other machine learning model with gradient descent.

Sigmoid derivative

Gradient descent works by using derivatives. The derivative of the sigmoid with respect to its input is very simple:

\sigma = \frac{1}{(1+e^{-z})} \rightarrow \frac{\partial \sigma(z)}{\partial z} = \sigma(z) * (1-\sigma(z))

therefore its Python implementation is clear-cut:

def sigmoidDeriv(s):
  """
  Compute the derivative of the sigmoid function,
  expressed in terms of the sigmoid output: if s = sigmoid(z),
  then the derivative with respect to z is s * (1 - s)

  Argument:
    s: the sigmoid output sigmoid(z), float

  Returns:
    float
  """

  return s * (1-s)
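
We can quickly sanity-check this formula with a central finite difference (a small sketch; the point z = 0.7 and the step size are arbitrary). Note that sigmoidDeriv expects the sigmoid output s = sigmoid(z), not z itself:

z = 0.7
h = 1e-6
numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference
analytical = sigmoidDeriv(sigmoid(z))  # pass the sigmoid output, not z
print(numerical, analytical)  # both approximately 0.2217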

Now we have everything we need to explore the learning algorithm for an artificial neural network.

The learning algorithm

The general idea is to forward-feed the input dataset to an initial set of weights to compute the output and the error, then back-propagate the error using the gradient (this is the “core” part!) and update the weights until the error is acceptable:

0. start with X and y (input and output training data)
1. initialise weights Wij randomly
2. loop until error < tolerance or max iterations:
  3. for each training example Xi:
    4. forward: out = theta(Wij * Xi) # theta is the activation function
    5. compute the error
    6. backward: compute gradient
    7. update all weights Wij using the gradient
8. return the final weights

We will go through it in detail, line by line, but here is the entire working Python implementation for the case of a single layer of neurons:

def learnWeightsOneLayer(X, y, tolerance = 0.1, max_iterations = 1000):
  """
  Learn the weights for an artificial neural network with single layer,
  given a training set of input and output values.
  Uses a backpropagation algorithm.

  Arguments:
    X: input data points without bias, matrix
    y: output data points, array
    tolerance: error threshold below which training stops, float. Optional (default=0.1)
    max_iterations: maximum number of iterations performed by the algorithm.
                    Optional. Default=1000

  Returns:
    weights, array
  """

  # input and output training sizes
  n_inputs = X.shape[1]
  n_outputs = y.shape[1]
  n_examples = X.shape[0]
 
  # add bias
  bias = np.ones(n_examples)
  X = np.c_[bias, X] # add bias column to X
 
  # initialization
  converged = False
  i = 0 # iteration counter


  # initialize weights randomly with mean 0
  w = 2 * np.random.random((n_inputs+1, n_outputs)) - 1 
 
  # main loop until converged or max iterations reached
  while not converged and i <= max_iterations:
 
    # note: all training examples are used together
 
    # forward propagation
    pred_y = sigmoid(np.dot(X, w)) # predicted output
 
 
    # how much did we miss? Compute the error
    error = y - pred_y
 
    # backpropagation: multiply how much we missed by the
    # slope of the sigmoid at the values in pred_y
    delta = error * sigmoidDeriv(pred_y)
 
    # update weights
    w += np.dot(X.T, delta) 

    # did we converge?
    maxError = max(error) # biggest error among training examples
    if abs(maxError) < tolerance:
      converged = True # yes !
      print ("** converged after iterations: ",i)

    i += 1 # no convergence yet: next iteration

  # end of loop
  if not converged:
    print("** Not converged!")
 
  return w

And this is how it works with a simple two-value input and an OR output:

  # seed random numbers to make calculation
  # deterministic (just a good practice when testing / debugging)
np.random.seed(1)

print("*** Simulation of OR gate using neural network ***")
  # input dataset: 4 training examples
X_train = np.array([ [0,0], [0,1], [1,0], [1,1] ])
  # output dataset: the expected results for each training input
y_train = np.array([[0,1,1,1]]).T

  # learn the weights through training!
ORweights = learnWeightsOneLayer(X_train, y_train)

  # print the truth table
print("X1 | X2 | Y = X1 OR X2")
print("-----------------------")
for x1 in (0,1):
  for x2 in (0,1):
    print(x1," | ",x2, " | ", predictOneLayer([x1,x2], ORweights))

And this is the output:

*** Simulation of OR gate using neural network ***
** converged after iterations: 149
X1 | X2 | Y = X1 OR X2
-----------------------
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 1

Well, it works!

Let’s explore the code piece by piece.

Training data and bias

 # input and output training sizes
 n_inputs = X.shape[1]
 n_examples = X.shape[0]

The input dataset is a numpy matrix. Each row is a single “training example”. Each column corresponds to one of our input nodes.
In our OR example above, we have 2 input nodes (X1 and X2) and 4 training examples.
Similarly, the number of nodes in the output layer (the “labels” of our dataset) is determined by the number of classes; in this case we have only one output node for two classes: 0 and 1. That is the number of columns of the y matrix, while its rows are again determined by the number of training examples.

# add bias
  bias = np.ones(n_examples)
  X = np.c_[bias, X] # add bias column to X

This adds a “bias” neuron, i.e. an extra column of ones in the input (with a corresponding extra row of weights). It’s always a good idea to add a bias to a neural network.
The bias node in a neural network is a node that is always ‘on’: its value is set to 1. It is analogous to the intercept in a regression model and serves the same function: it allows you to shift the activation function to the left or right, which may be critical for successful learning.

For example, assume that you want a neuron to fire y ≈ 1 when all the inputs are x ≈ 0. Without a bias, no matter what weights W you have, the equation y = σ(WX) gives y ≈ 0.5, since WX ≈ 0.
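
A minimal illustration (the weight values here are arbitrary):

w = np.array([4.2, -1.7])  # any weights
x = np.array([0, 0])       # all inputs at zero
print(sigmoid(np.dot(w, x)))      # 0.5, whatever the weights are
print(sigmoid(np.dot(w, x) + 3))  # ~0.95: a positive bias lets the neuron fire y ~ 1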

Initial weights

   # initialise weights randomly with mean 0
 w = 2 * np.random.random((n_inputs+1, n_outputs)) - 1

This creates our weight matrix for the neural network.
It initialises the weights to random numbers using the numpy function random(), which returns random floats (a single number or an array).
The shape of the output is controlled by its argument: in our example it is the tuple (n_inputs+1, n_outputs), which means it returns a matrix.

In general we can choose the dimensionality of a hidden layer.
The more nodes we put into the hidden layer, the more complex the functions the neural network will be able to model, but higher dimensionality also means more computation is required to learn the network weights.

Since here we only have one layer (connected directly to the input values), we only need a weight matrix with one column. We keep its dimension minimal: (3, 1), because we have two inputs plus the bias, and one output.

Results from the random() function come from the “continuous uniform” distribution over the half-open interval [0.0, 1.0). It is good practice to sample the weights from a uniform distribution with mean zero: to sample Unif[-1, 1) we just multiply the output of random() by 1-(-1)=2 and add -1.

Iterate

   # initialisation
 converged = False
 i = 0 # iteration counter
 
   # main loop until converged or max iterations reached
 while not converged and i <= max_iterations:

This is the loop that trains the network weights.
It iterates over the training code until we have reached the desired error (converged) or until we have reached the maximum number of iterations.
Both max_iterations and the error tolerance are optional arguments of this function.

Feed-forward

 # forward propagation
 pred_y = sigmoid(np.dot(X, w)) # predicted output

This is the forward propagation prediction step: the network predicts the output given the input and the current weights by matrix multiplication and applying the activation function.

y = σ(Xw)

Note that X contains all training examples (input) and we process all of them at the same time in this implementation. This is known as “full batch” training.

The X matrix is multiplied by the current weight matrix w, using the numpy function dot(), and the result then goes through the sigmoid squashing function, our activation function.
The result is also a matrix, in this case with 4 rows (the four training examples) and one column (the predicted output):

(4 x 3) dot (3 x 1) = (4 x 1)  # 4 examples, 3 inputs (with bias) and 1 layer of weights

Each output corresponds to the network’s prediction for a given training input. This works independently of the number of training examples, i.e. if I train the network with more examples the matrices will be bigger, but no changes to the code are required.
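
We can check the shapes directly (a small sketch, reusing X_train from the example above):

X = np.c_[np.ones(4), X_train]        # (4 x 3): bias column plus the two inputs
w = 2 * np.random.random((3, 1)) - 1  # (3 x 1): one weight per input node plus the bias
print(sigmoid(np.dot(X, w)).shape)    # (4, 1): one prediction per training example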

Loss function (the error)

 # Compute the error
 error = y - pred_y

We can now see how well the network did by subtracting the prediction (pred_y) from the true answer (y from the training dataset).
The error is an array: the difference between actual and predicted values for each training example.
Again: this is a super-simple loss function, just for illustration. For more complex examples you would use e.g. a quadratic function like the ones we have seen in the regression algorithms.

Back-propagation

One popular approach is an algorithm called back-propagation, which works together with the gradient descent algorithm we looked at earlier. The back-propagation algorithm (Rumelhart et al., 1986), often simply called backprop, allows the error information to flow backwards through the network in order to compute the gradient. Once we have the gradient of the error as a function of the neuron’s weights, we can propagate the output errors backward to infer errors for the other layers and adjust the weights in the direction that most decreases the error.

[Figure: the error propagating backwards through the neuron]

The term back-propagation is often misunderstood as meaning the whole learning algorithm for neural networks, but actually it refers only to the method for computing the gradient, while another algorithm, such as stochastic or batch gradient descent, uses this gradient to perform the learning.

We use the gradient to find the minimum; this means we first need to compute the gradient. The gradient is just made up of the derivatives of all the inputs concatenated in a vector (remember that a derivative of a function measures the sensitivity to change):

\nabla e(w) = \frac{\partial e(w)}{\partial w^{(l)}_{ij}}  for all i,j,l

where w are the weights and e(w) is the error on the example X,y.

Back-propagation is the key algorithm that makes training deep models computationally tractable; basically it is just a clever “trick” to efficiently calculate the gradients starting from the output. I am not going into details here, but it is an old trick and you can find the generic implementation under the name “reverse-mode differentiation”:

Classical methods are slow at computing the partial derivatives of a function with respect to many inputs, as is needed for gradient-based optimization algorithms. Automatic differentiation solves all of these problems […]
In reverse accumulation, one first fixes the dependent variable to be differentiated and computes the derivative with respect to each sub-expression recursively. In a pen-and-paper calculation, one can perform the equivalent by repeatedly substituting the derivative of the outer functions in the chain rule:

\frac{\partial y}{\partial x} = \frac{\partial y}{\partial w_{1}} \cdot \frac{\partial w_{1}}{\partial x}

The chain rule of calculus is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. The chain rule very simply states that the right thing to do is to simply multiply the gradients together to chain them.
Reverse-mode differentiation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.
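
As a tiny illustration in our setting, take a single sigmoid neuron y = σ(w·x): the chain rule gives dy/dw = σ'(w·x)·x, the outer derivative times the inner one. A sketch with arbitrary numbers, checked against a finite difference:

x, w = 0.5, -1.2
grad_w = sigmoidDeriv(sigmoid(w * x)) * x  # chain rule: dsigma/dz * dz/dw
h = 1e-6
numerical = (sigmoid((w + h) * x) - sigmoid((w - h) * x)) / (2 * h)
print(grad_w, numerical)  # both approximately 0.114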

I won’t go into more details here but there are many excellent explanations on the web, such as the Andrew Ng or Yaser Abu-Mostafa courses.

Bottom line: back-propagation tracks how every neuron affects one output.

 # backpropagation: multiply how much we missed by the
 # slope of the sigmoid at the values in pred_y
delta = error * sigmoidDeriv(pred_y)

The first thing we need is a function that computes the gradient of the sigmoid function we created earlier: this is sigmoidDeriv(). The derivative gives back the slope of the sigmoid at the predicted values.

The variable error holds the error for each training example, calculated by running the data with the current parameters through the “network” (the forward-propagation step) and comparing the output to the true labels. This is the part we discussed earlier as the cost function.

When we multiply the “slopes” by the error, we are reducing the size of the update for high-confidence predictions: where the sigmoid is nearly flat (outputs close to 0 or 1), the slope is small and the weights change little.
The multiplication is essentially answering the question “how should I adjust my parameters to reduce the error the next time I run through the network?”
It does this by computing the contribution at each layer to the total error and coming up with a gradient matrix (or, how much to change each parameter and in what direction).
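
A small numeric illustration with arbitrary values: suppose the true label is 1; for a confident (and nearly correct) prediction both the error and the sigmoid slope are small, so the resulting delta, and hence the weight update, is tiny.

for p in (0.99, 0.7):  # two hypothetical predictions for a label of 1
  err = 1 - p
  print(err * sigmoidDeriv(p))  # ~0.0001 for 0.99, ~0.063 for 0.7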

Update the weights

 # update weights
 w += np.dot(X.T, delta)

We are now ready to update our network! This line computes the weight update for each weight and each training example, sums them, and updates the weights, all in one simple line (X.T has shape (3 x 4) and delta has shape (4 x 1), so the update has the same (3 x 1) shape as w). A small error and a small slope mean a VERY small update. Because we are using a “full batch” configuration, this step is done on all four training examples at once.

Check if converged

   # did we converge?
 maxError = max(error) # biggest error among training examples
 if abs(maxError) < tolerance:
   converged = True # yes !
   print ("** converged after iterations: ",i)

 i += 1 # next iteration (to check if max is reached)

The last step is to check if we have reached the end of our optimisation, either because the error is smaller than our tolerance or because we reached the maximum number of iterations.

In our example above, the convergence (i.e. the error was less than 0.1) was reached after 149 iterations.

We can also print out the learned weights and see that they are different from the hard-coded ones (but follow the same pattern: the bias is negative, the input weights are similar and positive).

Weights for ANN modelling OR function: 
[[-1.63295267]
 [ 3.84814604]
 [ 3.83626687]]

There are several combinations of weights that lead to the same result, and there are other equivalent solutions that gradient descent could also find.
The convergence point of gradient descent depends on the initial values of the parameters: the initial set of weights (which is randomly chosen), the gradient descent meta-parameters (not used in this simple example) and, to a lesser extent, the error tolerance defined.

It is also possible to show that multiplying the resulting weights by a positive constant C keeps the ANN producing the same result; the proof is out of the scope of this post, but you can quickly verify it yourself, as in the sketch below.
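
A quick check (reusing the learned ORweights from above; any positive constant C works, since it preserves the sign of the pre-activation and therefore the thresholded output):

C = 10
for x1 in (0,1):
  for x2 in (0,1):
    print(x1," | ",x2, " | ", predictOneLayer([x1,x2], ORweights * C))  # same truth table as before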

An OR gate with 3 inputs

Let’s see if the ANN modelling an OR gate can work with 3 inputs:

 print("*** Simulation of OR gate with 3 inputs using ANN ***")
 
   # input dataset
 X_train = np.array([ [0,0,0], [0,0,1], [0,1,0], [1,0,0] ])
   # output dataset 
 y_train = np.array([[0,1,1,1]]).T
 
 ORweights = learnWeightsOneLayer(X_train, y_train)

 print("X1 | X2 | X3 | Y = X1 OR X2 OR X3")
 print("----------------------------------")
 inputs = ((x1,x2,x3) for x1 in (0,1) for x2 in (0,1) for x3 in (0,1))
 for x1,x2,x3 in inputs:
   print (x1," | ",x2, " | ", x3, " | ", 
          predictOneLayer([x1,x2,x3], ORweights))

Note that the training set X_train is now a matrix with 3 columns for the 3 inputs, and that you do not need to list ALL combinations as training examples.
For an OR function with 3 operands there are a total of 2^3 = 8 possible combinations, but here we train the network with only 4 examples.
Choosing the right training dataset (large and representative enough) is a science in itself and I will not dig deeper here; you can refer to the previous posts about overfitting and cross-validation.

This is the output:

*** Simulation of OR gate with 3 inputs using ANN ***
** converged after iterations: 169
X1 | X2 | X3 | Y = X1 OR X2 OR X3
----------------------------------
0  | 0  | 0  | 0
0  | 0  | 1  | 1
0  | 1  | 0  | 1
0  | 1  | 1  | 1
1  | 0  | 0  | 1
1  | 0  | 1  | 1
1  | 1  | 0  | 1
1  | 1  | 1  | 1

It works perfectly for all 8 combinations and convergence has been reached after 169 iterations, pretty fast.

Non-linear example: XOR

The real test is to model a non-linear function, such as the XOR function, a sort of Hello World for neural networks.
We saw in the previous post that a simple perceptron (one neuron with only a single layer of inputs) cannot model it. But it is possible by combining perceptrons that each model a different function, such as OR, AND and NOT:

x1 XOR x2 = (x1 AND NOT x2) OR (NOT x1 AND x2)
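
We can verify this identity with plain Python booleans before building the network (a quick sketch):

for x1 in (0,1):
  for x2 in (0,1):
    xor = (x1 and not x2) or (not x1 and x2)
    print(x1, x2, int(xor), x1 ^ x2)  # the last two columns always match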

The resulting multi-layer network of perceptrons is basically an artificial neural network with one additional layer – often called the hidden layer:

[Figure: a multi-layer network of perceptrons computing XOR]

To model such a network we need to add one more layer of weights, i.e. a second weight matrix, and to compute the hidden-layer weights in the training algorithm. For the two-input XOR below, this means a (3 x 2) matrix w1 connecting the (biased) inputs to two hidden neurons and a (3 x 1) matrix w2 connecting the (biased) hidden layer to the output.

XOR with two inputs

Note: to keep the example clearer, I will keep the weights of the two layers separate.

def learnWeightsTwoLayers(X, y, tolerance = 0.1, max_iterations = 1000):
  """
  Learn the weights for an artificial neural network with two layers,
  given a training set of input and output values.
  Uses a backpropagation algorithm.

  Arguments:
    X: input data points without bias, matrix
    y: output data points, array
    tolerance: error threshold below which training stops, float. Optional (default=0.1)
    max_iterations: maximum number of iterations performed by the algorithm.
                    Optional. Default=1000

  Returns:
    weights for input layer, array
    weights for hidden layer, array
  """  

    # input and output training sizes
  n_inputs = X.shape[1]
  n_outputs = y.shape[1]
  n_examples = X.shape[0]
 
    # add bias
  bias = np.ones(n_examples)
  X = np.c_[bias, X] # add bias column to X
 
    # initialize weights randomly with mean 0
  w1 = 2 * np.random.random((n_inputs+1, n_inputs)) - 1 # input layer
  w2 = 2 * np.random.random((n_inputs+1, n_outputs)) - 1 # hidden layer
 
   # initialization
  converged = False
  i = 0 # iteration counter

   # main loop until converged or max iterations reached
  while not converged and i <= max_iterations:
 
    # note: all training examples together (batch)
 
       # forward propagation
    hiddenLayer = sigmoid(np.dot(X, w1)) 
    hiddenLayer = np.c_[bias, hiddenLayer] # add bias 
    pred_y = sigmoid(np.dot(hiddenLayer, w2))  # predicted output
 
 
       # Compute the error
    error_y = y - pred_y 
 
       # backpropagation: multiply how much we missed by the
       # slope of the sigmoid at the values in pred_y
    delta_y = error_y * sigmoidDeriv(pred_y)
 
       # repeat for hidden layer: error and delta
       # how much did each value of hidden layer contribute 
       # to the final error (according to the weights)?
    error_l1 = delta_y.dot(w2.T)
    delta_l1 = error_l1 * sigmoidDeriv(hiddenLayer)
       # the first column refers to the bias: remove it
    delta_l1 = np.delete(delta_l1, 0, axis=1)
 
      # update weights
    w2 += np.dot(hiddenLayer.T, delta_y) 
    w1 += np.dot(X.T, delta_l1) 

      # did we converge?
    maxError = max(abs(error_y))
    if abs(maxError) < tolerance:
      converged = True # yes !
      print ("** converged after iterations: ",i)

    i += 1 # next iteration (to check if max is reached)

  if not converged:
    print("** Not converged!")
 
  return w1, w2

The main difference is that we now have two arrays of weights, w1 and w2: one storing the weights for the input layer and one for the hidden layer.
The code is changed accordingly to take both sets of weights into consideration.

Let’s see it in action:

def predictTwoLayers(X, w1, w2):
  """
  Compute the binary output of an Artificial Neural Network with two layers,
  given its neuron weights and an input dataset

  Arguments:
    X: input data points without bias, list
    w1: the input-layer weights, array
    w2: the hidden-layer weights, array

  Returns:
    integer (0 or 1)
  """ 
  bias = np.ones(1)
  X = np.append(bias, X, 0) # add bias column to X

    # forward propagation
  hiddenLayer = sigmoid(np.dot(X, w1)) 
  hiddenLayer = np.append(bias, hiddenLayer, 0) # add bias 
  y = np.dot(hiddenLayer, w2) 

  return sigmoidActivation(y)

print("*** Simulation of XOR gate using ANN ***")
print("***")

  # input dataset
X_train = np.array([ [0,0], [0,1], [1,0], [1,1] ])
  # output dataset 
y_train = np.array([[0,1,1,0]]).T
   # note that we use tolerance = 0.4
XORweights1, XORweights2 = learnWeightsTwoLayers(X_train, y_train, 0.4)

 
print("X1 | X2 | Y = X1 XOR X2")
print("------------------------")
for x1 in (0,1):
  for x2 in (0,1):
    print(x1," | ",x2, " | ", predictTwoLayers(np.array([x1,x2]), 
                                               XORweights1, XORweights2))
 

I used a higher tolerance since the output of a XOR gate is binary, so we don’t need that much precision. This makes the training algorithm faster.

And here is the output:

*** Simulation of XOR gate using ANN ***
***
** converged after iterations: 967
X1 | X2 | Y = X1 XOR X2
------------------------
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 0

It works, but note that it now takes 967 iterations to converge.
Adding layers to a network will normally make it slower and harder to train. Deeper neural networks therefore normally use more optimisations, e.g. better gradient descent algorithms, regularisation and so on.

XOR with 3 inputs

And it works with 3 inputs too:

print("*** Simulation of XOR gate using ANN, now with 3 inputs ***")
print("***")

  # input dataset
X_train = np.array([ [0,0,0], [0,1,0], [1,0,1], [0,1,1], [1,0,0], [1,1,1]])
  # output dataset 
y_train = np.array([[0,1,1,1,1,0]]).T
 
XORweights1, XORweights2 = learnWeightsTwoLayers(X_train, y_train, 0.4)

print("X1 | X2 | X3 | Y = X1 XOR X2 XOR X3")
print("---------------------------------------")
inputs = ((x1,x2,x3) for x1 in (0,1) for x2 in (0,1) for x3 in (0,1))
for x1,x2,x3 in inputs:
  print (x1," | ",x2, " | ", x3, " | ", predictTwoLayers([x1,x2,x3], 
                                         XORweights1, XORweights2))

and the output is:

*** Simulation of XOR gate using ANN, now with 3 inputs ***
***
** converged after iterations: 360
X1 | X2 | X3 | Y = X1 XOR X2 XOR X3
---------------------------------------
0  | 0  | 0  | 0
0  | 0  | 1  | 1
0  | 1  | 0  | 1
0  | 1  | 1  | 1
1  | 0  | 0  | 1
1  | 0  | 1  | 1
1  | 1  | 0  | 1
1  | 1  | 1  | 0

Conclusion

The perceptron can model linear functions but cannot model non-linear functions, unless you combine perceptrons together. Nevertheless the perceptron convergence procedure works by ensuring that every time the weights change, they get closer to every “generously feasible” set of weights. This type of guarantee cannot be extended to more complex networks in which the average of two good solutions may be a bad solution (non-convex problems). Therefore the perceptron learning procedure cannot be generalised to hidden layers.

Instead of showing that the weights get closer to a good set of weights, Artificial Neural Networks learn the weights so that the actual output values get closer to the target values. The aim of learning is to minimise the error summed over all training cases.
Sigmoid neurons give a real-valued output that is a smooth and bounded function of their total input; they have nice derivatives which make learning easy.

Such an Artificial Neural Network can represent non-linear functions.
The Universal Approximation theorem states that a simple feed-forward neural network with a single hidden layer is enough to represent a wide variety of continuous functions.

This is a bare-bones example. Deep neural networks have dozens of layers and use many additional optimisations to improve learning: different activation functions, regularisation to prevent overfitting (L1 and L2 regularisation, dropout), better choices of cost function (squared difference, cross-entropy), tuned weight initialisation, careful choice of hyper-parameters (learning rate, number of layers and neurons, …) and sophisticated gradient descent algorithms. But the principle is the same.

ANNs can efficiently represent almost any kind of function, and research is constantly progressing.

Perceptron, an artificial neuron

The origins of Artificial Neural Networks lie in algorithms that try to mimic the brain and its neurons, dating back to the 1940s.
They were widely used in the 80s and early 90s, but their popularity diminished in the late 90s when they failed to keep up with their promises.
Their recent resurgence is due to increased computation power (which allows bigger and deeper networks) and the availability of data (which allows proper training).
Nowadays they are used in many cognitive applications (they are the state of the art for computer vision, speech recognition, automatic translation and many more) thanks to their self-learning ability and feature detection not possible with conventional systems.

Neurons

To describe the neural networks, we will begin by describing the simplest possible neural network, one which comprises a single “neuron”. Continue reading “Perceptron, an artificial neuron”

Inference statistics for linear regression

We have seen how we can fit a model to existing data using linear regression. Now we want to assess how well the model describes those data points (every Outcome = Model + Error) and will use some statistics for it.

The following code is also available as a Notebook in GitHub.

As an example we access some available diamond ring data (from the Journal of Statistics Education): prices in Singapore dollars and weights in carats (the standard measure of diamond mass, equal to 0.2 g). 

import pandas as pd

diamondData = pd.read_csv("diamond.dat.txt", delim_whitespace=True,
                          header=None, names=["carats","price"])
In [1]: diamondData.head()
Out[1]:
   carats  price
0    0.17    355
1    0.16    328
2    0.17    350
3    0.18    325
4    0.25    642

Fit a model

Is there a relationship between the diamond price and its weight?

Our first goal should be to determine whether the data provide evidence of an association between price and carats. If the evidence is weak, then one might argue that bigger diamonds are not better!

To evaluate the model we will use a special Python package, statsmodels, based on the original statistics module of SciPy (Scientific Python) by Jonathan Taylor – later removed from SciPy – then corrected, improved, tested and released as a new package during the Google Summer of Code 2009.


Since statsmodels also offers functions to fit a linear regression model, we do not need to import and use sklearn to fit the model; we can do everything with statsmodels. Continue reading “Inference statistics for linear regression”

Linear regression – an introduction

Introduction: the price of wine

There are differences in price and quality of wine from year to year that are sometimes very significant: wine tastes better when it is older. So there is an incentive to store young wines until they are mature.
Wine experts taste the wine and then predict which ones will be the best later.
But it is hard to determine the quality of the wine when it is so young just by tasting it, since the taste will change significantly by the time it is consumed.

We have seen that on March 4, 1990, the Princeton Professor of Economics Orley Ashenfelter announced that he could predict the quality of Bordeaux wine without tasting a single drop, just by using a mathematical model.

Ashenfelter used a method called linear regression.

The method predicts an outcome variable – the wine price – using one or more independent variables – such as the wine age or the weather.

The initial reaction was one of skepticism.
Robert Parker, the world’s most influential wine expert, said: “Ashenfelter is an absolute total sham. He is like a movie critic who never goes to see the movie but tells you how good it is based on the actors and the director”, but the method proved its validity very soon.

Let’s see how Ashenfelter was able to predict reliably the wine price. Continue reading “Linear regression – an introduction”