Artificial neural networks originated in algorithms that try to mimic the brain and its neurons, dating back to the 1940s.

They were widely used in the 1980s and early 1990s, but their popularity diminished in the late 1990s when they failed to live up to their promise.

Their recent resurgence is due to increased computing power (which allows bigger and deeper networks) and the availability of data (which allows proper training).

Nowadays they are used in many cognitive applications (they are the state of the art in computer vision, speech recognition, automatic translation and many more) thanks to their ability to learn from data and to detect features, which is not possible with conventional systems.

# Neurons

To describe neural networks, we begin with the simplest possible one, which comprises a single “neuron”.

This artificial “neuron” is a computational unit that takes as input x1, x2, …, xn (plus a +1 intercept term), each with a weight w0, w1, …, wn (w0 being the weight of the intercept term), and outputs the value of an activation function h_w(x) based on the inputs and weights.

Thus, when the activation function is the sigmoid (logistic) function, our single neuron corresponds exactly to the input-output mapping defined by logistic regression.
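As a minimal sketch of this mapping, assuming the activation is the sigmoid function; the names `sigmoid` and `neuron_output` and the example weights are illustrative and do not come from the original text:

```python
import math

def sigmoid(z):
    # logistic function: squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w):
    # w[0] is the weight of the +1 intercept term, w[1:] pair up with x
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return sigmoid(z)

# illustrative weights chosen by hand so that the weighted sum is exactly 0
print(neuron_output([1.0, 2.0], [0.5, -1.0, 0.25]))  # prints 0.5
```

The output can be read as a probability, which is exactly how logistic regression interprets it.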

The effect of each input line on the neuron is controlled by a weight that can be positive or negative.

The weights can be adapted so that the neuron learns to give the correct output.

Just as a comparison, our brain has about 10^11 neurons, each with about 10^4 weights. They are all connected together: the output of one neuron is the input of another. This huge number of weights can affect the computation in a very short time, which gives a much better bandwidth than a workstation.

Our brain's neural network enables us to recognise objects, understand language, make plans and control the body.

Why mimic biological neurons?

To model things we have to idealise them. Idealisation removes complicated details that are not essential for understanding the main principles.

Once we understand the basic principles, it’s easy to add complexity to make the model more faithful.

There exist several models for the artificial neuron: linear, binary threshold, rectified, sigmoid, stochastic, … all inspired by a seminal paper entitled “A Logical Calculus of the Ideas Immanent in Nervous Activity” (1943) by W. McCulloch and W. Pitts.

They described a brain nerve cell as conceptually similar to a binary logic gate: multiple signals arrive at the input bodies (“dendrites”) of the cell and, when they exceed a certain threshold, the cell generates an output signal and sends it out via the output body (“axon”) of the cell.

# The perceptron

We now look at a type of neuron called the **perceptron**, described by the scientist Frank Rosenblatt in 1957 (“The Perceptron, a Perceiving and Recognizing Automaton”).

Rosenblatt was the first to introduce the weights as real numbers expressing the importance of the respective inputs to the output, and the perceptron comes with a simple but powerful learning algorithm for such weights.

He proposed a simple rule to compute the output.

The neuron’s binary output, 0 or 1, is determined by whether the weighted sum of the inputs is less than or greater than some threshold value.

Just like the weights, the threshold is a real number which is a parameter of the neuron. Moved to the other side of the inequality, it becomes the bias, equal to minus the threshold (the bias unit in the picture above).
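The rule can be sketched in a few lines of Python. The function names and the example numbers are made up for illustration; the second function is the same neuron with the threshold folded into a bias term (bias = -threshold).

```python
def perceptron_threshold(x, w, threshold):
    # fire (1) when the weighted sum of the inputs exceeds the threshold;
    # a sum exactly equal to the threshold gives 0 here, which is a convention
    total = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if total > threshold else 0

def perceptron_bias(x, w, bias):
    # same rule written with a bias term: sum + bias > 0
    total = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if total > 0 else 0

w = [2.0, 1.0]  # hand-picked weights for this example
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    # a threshold of 1.5 and a bias of -1.5 describe the same neuron
    print(x, perceptron_threshold(x, w, 1.5), perceptron_bias(x, w, -1.5))
```

Writing the threshold as a bias is convenient because it can then be treated as one more weight, attached to an input that is always +1.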

# Forward propagation

Let’s see some simple practical examples. Perceptrons are best suited to classification, so we will see how to create a neuron in Python that can compute the basic boolean (binary) functions.

This is the NOT truth table:

| X | NOT X |
|---|-------|
| 0 | 1     |
| 1 | 0     |

It just negates its input. I use 0 and 1 here (read them as false and true), but you can also use other distinct values such as -1 and 1 (negative and positive).

First of all we need **an activation function**. We will stick to the simplest one: if the weighted sum of the inputs is positive, it outputs 1, otherwise 0:

```python
import numpy as np  # for the dot product

def booleanActivation(input, weights):
    # input and weights are lists of values
    input = [1] + input  # add the bias input
    z = np.dot(input, weights)  # weighted sum, the argument of h_w(x)
    if z > 0:
        return 1
    else:
        return 0
```

The input array (we can have *n* features as input) is multiplied element by element with the weights array (they must have the same length once the bias input has been added) and the products are summed.

Now we need to find out the correct weights which will calculate the NOT function.

The way to learn them is:

- start with random or zero values for the initial weights.
- calculate the actual output using these weights and the above activation function.
- update the weights: to each weight add the error (the difference between the expected output and the calculated output) multiplied by the corresponding input, optionally scaled by a learning rate.
- repeat steps 2 and 3 until the error is less than a pre-specified tolerance or a pre-specified number of iterations has been reached. Without the iteration limit the algorithm may never stop.
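The steps above can be sketched as follows. This is a minimal illustration and not the back-propagation algorithm: it assumes the classic perceptron update (each weight moves by the error times its input), a learning rate of 1 and a step activation that outputs 1 for a positive sum. The names `predict` and `train_perceptron` are hypothetical.

```python
def predict(inputs, weights):
    # weights[0] belongs to the bias input (+1)
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if z > 0 else 0

def train_perceptron(samples, n_inputs, rate=1.0, max_epochs=100):
    weights = [0.0] * (n_inputs + 1)          # start from zero weights
    for _ in range(max_epochs):               # cap on the number of iterations
        errors = 0
        for inputs, expected in samples:
            error = expected - predict(inputs, weights)  # expected minus actual
            if error != 0:
                errors += 1
                weights[0] += rate * error               # the bias input is 1
                for i, x in enumerate(inputs):
                    weights[i + 1] += rate * error * x   # error times input
        if errors == 0:                       # every sample is correct: stop
            break
    return weights

NOT_samples = [([0], 1), ([1], 0)]
w = train_perceptron(NOT_samples, n_inputs=1)
print(w, [predict(x, w) for x, _ in NOT_samples])  # prints [1.0, -1.0] [1, 0]
```

The loop stops early once a full pass over the samples produces no error; `max_epochs` is the safety net for data that is not linearly separable.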

As you can see, the principle is similar to that of the gradient descent optimisation algorithm.

In a later post we will look at the error back-propagation algorithm in detail. For the moment we hard-code the weights:

```python
def forwardNOT(operand):
    NOTweights = [1, -1]  # includes the weight for the bias unit
    return booleanActivation(operand, NOTweights)
```

The NOT function needs only one input parameter and the weights for its neuron are hard-coded as 1 for the bias and -1 for the only input.

```python
print("*** Simulation of NOT neuron ***")
print("X | Y = NOT X")
print("-------------")

for x in (0, 1):
    print(x, " | ", forwardNOT([x]))
```

This is the output:

```
*** Simulation of NOT neuron ***
X | Y = NOT X
-------------
0  |  1
1  |  0
```

## An example: the OR function

Let’s see a more complex function: the OR.

The OR accepts multiple operands and here is its space:

The OR function is False only when ALL its inputs are also False. Therefore it’s linearly separable: as you can see from the picture above it is possible to fit a straight line that correctly separates all the cases into two distinct classes.

Again we hard-code the weights. This time the bias weight is -0.5 and each input (there can be more than two) has a weight of 1.

```python
def forwardOR(operands):
    m = len(operands)  # number of inputs
    ORweights = [-0.5] + [1]*m  # hard-coded
    return booleanActivation(operands, ORweights)
```

Let’s try with two inputs:

```python
print("*** Simulation of OR neuron ***")
print("X1 | X2 | Y = X1 OR X2")
print("-----------------------")

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, " | ", x2, " | ", forwardOR([x1, x2]))
```

This is the output:

```
*** Simulation of OR neuron ***
X1 | X2 | Y = X1 OR X2
-----------------------
0  |  0  |  0
0  |  1  |  1
1  |  0  |  1
1  |  1  |  1
```

And it works for more than two inputs:

```python
print("*** Simulation of OR neuron ***")
print("X1 | X2 | X3 | Y = X1 OR X2 OR X3")
print("----------------------------------")

inputs = ((x1, x2, x3) for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1))

for x1, x2, x3 in inputs:
    print(x1, " | ", x2, " | ", x3, " | ", forwardOR([x1, x2, x3]))
```

This is the output:

```
*** Simulation of OR neuron ***
X1 | X2 | X3 | Y = X1 OR X2 OR X3
----------------------------------
0  |  0  |  0  |  0
0  |  0  |  1  |  1
0  |  1  |  0  |  1
0  |  1  |  1  |  1
1  |  0  |  0  |  1
1  |  0  |  1  |  1
1  |  1  |  0  |  1
1  |  1  |  1  |  1
```

Note that we used a Python generator expression here in place of the nested loops.

A generator is a function that returns an iterator, producing its values lazily, one at a time. In this simple case we get the same effect with a generator expression, which is a list comprehension with the square brackets replaced by round parentheses.
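To make the difference concrete, here is a small sketch comparing the two forms. `itertools.product` from the standard library is shown as an alternative way to enumerate the same combinations; it is not used in the original code.

```python
from itertools import product

# list comprehension: all eight tuples are built immediately
as_list = [(x1, x2, x3) for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)]

# generator expression: tuples are produced one at a time, on demand
as_gen = ((x1, x2, x3) for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1))

print(len(as_list))        # 8
print(next(as_gen))        # (0, 0, 0)
print(len(list(as_gen)))   # 7: a generator can be consumed only once

# the standard library yields the same combinations in the same order
print(list(product((0, 1), repeat=3)) == as_list)   # True
```

For eight tuples the memory saving is irrelevant, but the single-use behaviour matters: looping over `inputs` a second time would print nothing.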

## Another example: the AND function

Now we can see how to simulate an AND boolean function.

Again, the weights have been hard-coded: still weight 1 for each input but this time -1.5 for the bias input:

```python
def forwardAND(operands):
    m = len(operands)  # number of inputs
    ANDweights = [-1.5] + [1]*m  # hard-coded
    return booleanActivation(operands, ANDweights)
```

```python
print("*** Simulation of AND neuron ***")
print("X1 | X2 | Y = X1 AND X2")
print("-----------------------")

inputs = ((x1, x2) for x1 in (0, 1) for x2 in (0, 1))

for x1, x2 in inputs:
    print(x1, " | ", x2, " | ", forwardAND([x1, x2]))
```

This is the output:

```
*** Simulation of AND neuron ***
X1 | X2 | Y = X1 AND X2
-----------------------
0  |  0  |  0
0  |  1  |  0
1  |  0  |  0
1  |  1  |  1
```

The problem is that this AND function does not work when there are more than two inputs:

```
In : forwardAND([0,1,1])
Out: 1
```

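We need to change the weights. Reducing each input weight to 0.6 would fix the three-input case (two active inputs then give 1.2 - 1.5, which is negative), but for the same reason it would break the two-input case, which the XOR network later relies on. A choice that works for any number of inputs m keeps each input weight at 1 and sets the bias weight to -(m - 0.5), so that the weighted sum is positive only when all m inputs are 1.

Now it also works for three inputs:

```python
def forwardAND(operands):
    m = len(operands)  # number of inputs
    ANDweights = [-(m - 0.5)] + [1]*m  # new weights: the bias depends on m
    return booleanActivation(operands, ANDweights)

print("*** Simulation of AND neuron ***")
print("X1 | X2 | X3 | Y = X1 AND X2 AND X3")
print("-----------------------------------")

inputs = ((x1, x2, x3) for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1))

for x1, x2, x3 in inputs:
    print(x1, " | ", x2, " | ", x3, " | ", forwardAND([x1, x2, x3]))
```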

This is the output:

```
*** Simulation of AND neuron ***
X1 | X2 | X3 | Y = X1 AND X2 AND X3
-----------------------------------
0  |  0  |  0  |  0
0  |  0  |  1  |  0
0  |  1  |  0  |  0
0  |  1  |  1  |  0
1  |  0  |  0  |  0
1  |  0  |  1  |  0
1  |  1  |  0  |  0
1  |  1  |  1  |  1
```
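A quick way to check a hard-coded weight choice is to compare the neuron against Python's built-in `all` and `any` on every input combination. The sketch below is self-contained and makes its own assumptions: a step activation that fires on a positive sum, input weights of 1, and a bias weight of -0.5 for OR and -(m - 0.5) for AND, where m is the number of inputs. The helper names are hypothetical.

```python
from itertools import product

def step_neuron(inputs, weights):
    # weights[0] is the bias weight; the output is 1 only for a positive sum
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if z > 0 else 0

def or_neuron(inputs):
    return step_neuron(inputs, [-0.5] + [1] * len(inputs))

def and_neuron(inputs):
    m = len(inputs)
    return step_neuron(inputs, [-(m - 0.5)] + [1] * m)

# exhaustive check over every combination of 1 to 5 binary inputs
for m in range(1, 6):
    for inputs in product((0, 1), repeat=m):
        assert or_neuron(inputs) == int(any(inputs))
        assert and_neuron(inputs) == int(all(inputs))

print("OR and AND neurons agree with any() and all() for 1 to 5 inputs")
```

An exhaustive check like this is only feasible for a handful of inputs, since the number of combinations doubles with each one.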

Of course, hard-coding the weights and fine-tuning them case by case is not really scalable. This is where the back-propagation learning algorithm (one of the most important in machine learning) becomes useful.

But first, let’s see another limitation of the perceptron: it does not work when the function space is not linearly separable!

## An example of non-linearly separable function

A counter-example is the XOR function, also called Exclusive-OR.

Here is its truth table:

| X1 | X2 | Y = X1 XOR X2 |
|----|----|---------------|
| 0  | 0  | 0             |
| 0  | 1  | 1             |
| 1  | 0  | 1             |
| 1  | 1  | 0             |

It gets the name “exclusive OR” because the meaning of OR is ambiguous when both operands are true; the exclusive OR operator excludes that case.

This is sometimes thought of as “one or the other but not both”.

As you can see from its space, the XOR function is not linearly separable:

There is no way we can find a straight line separating the two groups of True and False.

We would need either a curved line or two straight lines.

This means that no choice of weights lets a single perceptron compute the XOR function correctly.
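A brute-force search illustrates the point (it is evidence, not a proof). The sketch below tries every combination of three weights on a coarse grid, an arbitrary choice made for this example, and counts how many of them reproduce a given truth table with a step activation. The helper names are hypothetical.

```python
from itertools import product

def step_neuron(inputs, weights):
    # weights[0] is the bias weight; the output is 1 only for a positive sum
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if z > 0 else 0

def count_solutions(truth_table, grid):
    # count the weight triples (bias, w1, w2) that reproduce the whole table
    return sum(
        all(step_neuron(x, w) == y for x, y in truth_table)
        for w in product(grid, repeat=3)
    )

grid = [i / 2 for i in range(-8, 9)]   # -4.0, -3.5, ..., 4.0
OR_table = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_table = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("OR :", count_solutions(OR_table, grid) > 0)   # True
print("XOR:", count_solutions(XOR_table, grid))      # 0
```

The reason is visible in the inequalities the XOR table imposes on the weights: w0 must not be positive, w0 + w1 and w0 + w2 must both be positive, and w0 + w1 + w2 must not be positive. Adding the two middle conditions gives 2·w0 + w1 + w2 > 0, hence w0 + w1 + w2 > -w0 ≥ 0, which contradicts the last condition.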

The way to solve this is **to add an additional layer**, in this case leveraging the property that an XOR function can be written as:

x1 XOR x2 = (x1 AND NOT x2) OR (NOT x1 AND x2)

## Neural network intuition

A **neural network** is put together by hooking together many of our simple “neurons”, so that the output of one neuron can be the input of another.

Such networks can learn non-linear features.

As an example, a network of neurons calculating the XOR function could look like this:

We can see that such a network works just by calling the previously implemented neurons in the correct order:

```python
def forwardXOR(x1, x2):
    # this is a network combining existing neurons,
    # no weights to prepare
    # (forwardAND must be correct for two inputs here)
    x1not = forwardNOT([x1])
    x2not = forwardNOT([x2])

    z1 = forwardAND([x1, x2not])
    z2 = forwardAND([x1not, x2])

    return forwardOR([z1, z2])

print("*** Simulation of XOR network of neurons ***")
print("X1 | X2 | Y = X1 XOR X2")
print("------------------------")

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, " | ", x2, " | ", forwardXOR(x1, x2))
```

And here is the output:

```
*** Simulation of XOR network of neurons ***
X1 | X2 | Y = X1 XOR X2
------------------------
0  |  0  |  0
0  |  1  |  1
1  |  0  |  1
1  |  1  |  0
```
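The same kind of network can also be written with explicit weights per layer, which is closer to how neural networks are usually represented. This is a sketch with hand-picked weights (an assumption of this example, not taken from the code above): each hidden unit computes one of the two AND terms directly, folding the NOT into a negative weight, and the output unit is an OR.

```python
import numpy as np

def step(z):
    # element-wise step activation: 1 where z > 0, else 0
    return (z > 0).astype(int)

# each row holds [bias weight, weight of first input, weight of second input]
hidden_weights = np.array([[-0.5,  1.0, -1.0],    # x1 AND NOT x2
                           [-0.5, -1.0,  1.0]])   # NOT x1 AND x2
output_weights = np.array([[-0.5, 1.0, 1.0]])     # OR of the two hidden units

def layer(inputs, weights):
    with_bias = np.concatenate(([1], inputs))     # prepend the bias input
    return step(weights @ with_bias)

def xor_network(x1, x2):
    hidden = layer(np.array([x1, x2]), hidden_weights)
    return int(layer(hidden, output_weights)[0])

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_network(x1, x2))
```

Each layer is a matrix product followed by the activation, so adding units or layers only means adding rows or matrices.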

# The limitation of the perceptrons

Still these networks are very limited in the input-output mappings they can learn to model.

More layers of linear units do not help. It’s still linear.
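This can be checked numerically. In the sketch below (random matrices, with names chosen for this example), two stacked linear layers give the same outputs as a single linear layer whose weight matrix is their product.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first linear layer: 3 inputs -> 4 units
W2 = rng.normal(size=(2, 4))   # second linear layer: 4 units -> 2 outputs
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)     # apply the layers one after the other
one_layer = (W2 @ W1) @ x      # a single layer with the combined weights

print(np.allclose(two_layers, one_layer))   # True
```

With bias terms the argument is the same: W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), which is again a single linear layer with a bias.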

Fixed output non-linearities are not enough.

We need multiple layers of adaptive, non-linear hidden units.

And we need an efficient way of adapting all the weights, not just the last layer. This is hard.

Learning the weights going into hidden units is equivalent to learning features.

This is difficult because nobody is telling us directly what the hidden units should do.

A neural network is just that: layers and layers of linear models and non-linear transformations.

In 1969, Minsky and Papert published a book called “Perceptrons” that analysed what perceptrons could do and showed their limitations.

Many people thought these limitations applied to all neural network models, and this slowed down their development for several years, until they saw a big resurgence in the last few years, showing impressive accuracy on several benchmark problems such as visual and speech recognition.

The biggest improvements powering this resurgence were the increased computing power (GPUs), together with more efficient back-propagation learning algorithms and the availability of huge datasets for the learning phase, which made it possible to increase the number of layers in the networks.

Nowadays it is not uncommon to have several hundred layers; indeed, this has developed into a promising approach to machine learning called **deep learning**.

The perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.
