We have seen the concept of regularisation and how is applied to linear regression, let’s see now another example for logistic regression done with artificial neural networks.

The question to answer is to recognise hand-written digits and is a classic one-vs-all logistic regression problem.

The dataset contains 5000 training examples of handwritten digits and is a subset of the MNIST handwritten digit dataset.

Each training example is a 20 pixel by 20 pixel grayscale image of the digit. Each pixel is represented by a floating point number indicating the grayscale intensity at that location.

The 20 by 20 grid of pixels is unrolled into a 400-dimensional vector. Each of these training examples becomes a single row in our data matrix X. This gives us a 5000 by 400 matrix X where every row is a training example for a handwritten digit image.

Let’s get more familiar with the dataset.

You can follow along on the associated notebook.

## 1 – Overview of the data set

from scipy.io import loadmat dataset = loadmat('datasets/mnist-data.mat') # comes as dictionary dataset.keys()

dict_keys(['__version__', '__header__', 'y', '__globals__', 'X'])

Each line of X is an array representing an image. You can visualize an example by running the following code. Feel free also to change the *indexImage* value and re-run to see other images.

import matplotlib.pyplot as plt import numpy as np # Example of a picture indexImage = 4000 # try any index between 0 and 4999. # They are sorted, from 1 to 10 renderImage = np.reshape(dataset['X'][indexImage], (20,20)) labelImage = dataset['y'][indexImage] plt.imshow(renderImage, cmap='gray') print ("Label: this is a ", labelImage)

Label: this is a [8]

## 2 – Data preprocessing

- Figure out the dimensions and shapes of the problem
- Split the dataset into training a test subsets
- “Standardise” the data

X = dataset['X'] # the images X.shape

(5000, 400)

The second part of the training set is a 5000-dimensional vector y that contains labels for the training set.

y = dataset['y'] # the labels y.shape

(5000, 1)

y[499]

array([10], dtype=uint8)

One problem: The label representing the digit 0 (zero) is coded as ten (10). Change this.

list_y = [0 if i == 10 else i for i in y] # apply to each item in y y = np.asarray(list_y) y = y.reshape(-1,1) y.shape

(5000, 1)

y[0:10] # verify that the label is now zero

array([[0], [0], [0], [0], [0], [0], [0], [0], [0], [0]])

### One hot encoding

Another problem: the original labels (in the variable y) are a number between 0, 1, 2, …, 9 therefore the output is a multi-class and not a binary variable.

For the purpose of training a neural network, we need to recode the labels as vectors containing only binary values 0 or 1.

Will do this by transforming y from one to 10 “dummy” variables: each one takes the binary value 0 or 1 to indicate the absence or presence of the respective categorical label.

n_classes = 10 # 10 digits = 10 classes/labels # np.eye(n) creates an identity matrix of shape (n,n) OHE_y = np.eye(n_classes)[y.reshape(-1)] OHE_y.shape

(5000, 10)

OHE_y[1000] # this is the new encoding for e.g. label = 2

array([ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])

### Split into train and test sets

Split the dataset into 20% of test and 80% of train sets.

from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, OHE_y, test_size=0.2, random_state=7) input_layer_size = X.shape[1] num_px = np.sqrt(input_layer_size) # 400 = 20x20 Input Images of Digits n_y = y_train.shape[1] m_train = X_train.shape[0] m_test = X_test.shape[0] print ("Dataset dimensions:") print ("Number of training examples = " + str(m_train)) print ("Number of testing examples = " + str(m_test)) print ("Height/Width of each image: num_px = " + str(num_px)) print ("Each image is of size: <" + str(num_px) + ", " + str(num_px) + ">") print ("X train shape: " + str(X_train.shape)) print ("y train shape: " + str(y_train.shape))

Dataset dimensions: Number of training examples = 4000 Number of testing examples = 1000 Height/Width of each image: num_px = 20.0 Each image is of size: <20.0, 20.0> X train shape: (4000, 400) y train shape: (4000, 10)

## Deep Neural Network for Image Classification

Now we will build and apply a deep neural network to the problem.

The main steps for building an Artificial Neural Network are as usual:

- Define the model structure (such as number and size of layers) and the hyperparameters
- Initialize the model’s weights
- Loop for the number of iterations:

– Calculate current loss (forward propagation)

– Calculate current gradient (backward propagation)

– Update parameters (gradient descent) - Use the trained weights to predict the labels

### Defining the neural network structure

Our neural network has 3 layers: an input layer, a hidden layer and an output layer.

Recall that our inputs are pixel values of digit images. Since the images are of size 20×20, this gives us 400 input layer units (excluding the extra bias unit which always outputs +1).

There are 25 units in the second layer and 10 output units (corresponding to the 10 digit classes).

### CONSTANTS DEFINING THE MODEL #### # we define a neural network with total 3 layers, x, y and 1 hidden: n_h = 25 nn_layers = [input_layer_size, n_h, n_y] # length is 3 (layers)

### Build the 3-layer neural network

We will re-use all the helper functions defined previously to build the neural network, such as the linear forward and the backward propagation.

Please refer to the Jupyter notebook for the details.

Now we can put together all the functions to build an L-layer neural network with this structure:

nn_layers

[400, 25, 10]

np.random.seed(1) train_set_x = X_train.T train_set_x.shape

(400, 4000)

# y is the original output array, with labels # train_set_y is that set, one-hot-encoded train_set_y = y_train.T train_set_y.shape

(10, 4000)

# FUNCTION: L_layer_model def simpleNN_model(X, Y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost=False): """ Implements a L-layer neural network. Arguments: X -- data, numpy array of shape (number of examples, num_px * num_px) Y -- true "label" vector (containing 0 or 1), of shape (10, number of examples) layers_dims -- list containing the input size and each layer size, of length (number of layers + 1). learning_rate -- learning rate of the gradient descent update rule num_iterations -- number of iterations of the optimisation loop print_cost -- if True, it prints the cost every 200 steps Returns: parameters -- parameters learnt by the model. They can then be used to predict. """ costs = [] # keep track of cost iterations2cost = 200 # Print the cost every these iterations # Parameters initialization. parameters = initialise_parameters(layers_dims) # Loop (gradient descent) for i in range(0, num_iterations): # Forward propagation AL, caches = L_model_forward(X, parameters) # Compute cost. cost = compute_cost(AL, Y) # Backward propagation. grads = L_model_backward(AL, Y, caches) # Update parameters. parameters = update_parameters(parameters, grads, learning_rate) # Print the cost every iterations2cost training example if print_cost and i % iterations2cost == 0: print ("Cost after iteration %i: %f" %(i, cost)) if print_cost and i % iterations2cost == 0: costs.append(cost) if print_cost: # plot the cost fig, ax = plt.subplots(1,1) plt.plot(np.squeeze(costs)) ticks = ax.get_xticks() ax.locator_params(axis='x', nticks=len(costs)) ax.set_xticklabels([int(x*iterations2cost) for x in ticks]) plt.ylabel('cost') plt.xlabel('iterations') plt.title("Learning rate =" + str(learning_rate)) plt.show() return parameters

We will now train the model as a 3-layer neural network.

fit_params = simpleNN_model(train_set_x, train_set_y, nn_layers, learning_rate = 0.3, num_iterations = 3500, print_cost = False)

## 4. Results analysis

Now we can check the performance of the trained network by predicting the results of the test set and comparing them with the actual labels.

Note that the *predict()* function has been adapted to cope with the multi-class labels.

def predict(X, yOHE, parameters): """ This function is used to predict the results of a L-layer neural network. It also checks them against the true labels and print the accuracy Arguments: X -- data set of examples you would like to label yOHE -- the true labels, as multi-class vectors parameters -- parameters of the trained model Returns: p -- predictions (the label) for the given dataset X """ m = X.shape[1] nLabels = yOHE.shape[1] n = len(parameters) // 2 # number of layers in the neural network p = np.zeros((1, m)) # the predicted output, initialised to zero y = np.zeros((1, m)) # the actual output # Forward propagation probas, caches = L_model_forward(X, parameters) # probas is a matrix of shape [nLabels, m] (one-hot-encoded) assert (probas.shape[1] == m) for i in range(0, m): # convert probs to label predictions: # just take the label with max prob p[0,i] = np.argmax(probas[:,i]) # convert expected results into label: takes the value with one y[0,i] = np.argmax(yOHE[:,i]) # print results print("Accuracy: " + str(np.sum((p == y)/m))) return p

print ("On the training set:") predictions_train = predict(train_set_x, train_set_y, fit_params) print ("On the test set:") predictions_test = predict(X_test.T, y_test.T, fit_params)

On the training set: Accuracy: 0.99425 On the test set: Accuracy: 0.929

The accuracy on the test set s very good: 93% of the times did predict correctly the digit (and this with just 3 layers) but the very large training set accuracy (almost 100%) could be an indication of over-fitting.

Let’s apply a couple of optimisations to our network, indlufing regularisation but first how we initialise the weights.

## 5 – Initialising parameters

There are two types of parameters to initialise in a neural network:

– the weight matrices

– the bias vectors

The weight matrix is initialised with random values (to break symmetry as we have seen previously).

Of course, different initialisations lead to different results and poor initialisation can slow down the optimisation algorithm.

One good practice is not to initialise to values that are too large, instead what bring good results are the so-called Xavier or the He (for ReLU activation) initialisations.

Finally, we try here the “He Initialisation”; this is named for the first author of He et al., 2015.

This function is similar to the previous random initialisation.

The only difference is that instead of multiplying the random number by 10, it will multiply it by , which is what He initialisation recommends for layers with a ReLU activation.

# FUNCTION: initialize_parameters def initialise_parameters_he(layer_dims): """ Arguments: layer_dims -- python array (list) containing the dimensions of each layer in our network Returns: parameters -- python dictionary containing the parameters "W1", "b1", ..., "WL", "bL": Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1]) bl -- bias vector of shape (layer_dims[l], 1) """ parameters = {} L = len(layer_dims) # number of layers in the network for l in range(1, L): parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1])*np.sqrt(2./layer_dims[l-1]) parameters['b' + str(l)] = np.zeros((layer_dims[l], 1)) # unit tests assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1])) assert(parameters['b' + str(l)].shape == (layer_dims[l], 1)) return parameters

## 6 – L2 Regularization

The standard way to avoid overfitting is called **L2 regularisation**. It consists of appropriately modifying your cost function, from:

To:

This is implemented in the below function `compute_cost_with_regularisation()` which computes the cost given by formula.

# GRADED FUNCTION: compute_cost_with_regularization def compute_cost_with_regularisation(A3, Y, parameters, lambdaHyper): """ Implement the cost function with L2 regularization. See formula (2) above. Arguments: A3 -- post-activation, output of forward propagation, of shape (output size, number of examples) Y -- "true" labels vector, of shape (output size, number of examples) parameters -- python dictionary containing parameters of the model lambdaHyper -- the lambda regularisation hyper-parameter. Returns: cost - value of the regularized loss function (formula (2)) """ # This gives you the cross-entropy part of the cost cross_entropy_cost = compute_cost(A3, Y) sum_regularization_cost = 0 m = Y.shape[1] L = len(parameters) // 2 # number of layers (2 because we have W and b) for i in range(1, L+1): W_i = parameters['W' + str(i)] sum_regularization_cost += np.sum(np.square(W_i)) regularization_cost = (1/m)*(lambdaHyper/2)*(sum_regularization_cost) cost = cross_entropy_cost + regularization_cost return cost

Of course, because we changed the cost function, we have to change backward propagation as well.

All the gradients have to be computed with respect to this new cost: add the regularisation term’s gradient.

def backward_propagation_with_regularisation(X, Y, Yhat, caches, lambdaHyper): """ Implements the backward propagation of our baseline model to which we added an L2 regularization. Arguments: X -- input dataset, of shape (input size, number of examples) Yhat -- "true" labels vector, of shape (output size, number of examples) caches -- cache output from forward_propagation() lambdaHyper -- regularization hyperparameter, scalar Returns: gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables """ m = X.shape[1] L = len(caches) # the number of layers gradients = {} last_layer_cache = caches[L-1] ((A, W, b), Z) = last_layer_cache assert (Yhat.shape == Y.shape) dZ = Yhat - Y gradients["dZ" + str(L)] = dZ for i in reversed(range(L-1)): current_layer_cache = caches[i] ((A_prev, W_prev, b_prev), Z_prev) = current_layer_cache dW_entropy = 1./m * np.dot(dZ, A.T) dW_reg = (lambdaHyper/m)*W dW = dW_entropy + dW_reg db = 1./m * np.sum(dZ, axis=1, keepdims = True) dA_prev = np.dot(W.T, dZ) dZ_prev = np.multiply(dA_prev, np.int64(A &gt; 0)) gradients["dW" + str(i + 2)] = dW gradients["db" + str(i + 2)] = db gradients["dA" + str(i + 1)] = dA_prev gradients["dZ" + str(i + 1)] = dZ_prev ((A, W, b), Z) = ((A_prev, W_prev, b_prev), Z_prev) dZ = dZ_prev # finally add the gradients for the first layer dW_prev = 1./m * np.dot(dZ_prev, X.T) + (lambdaHyper/m)*W_prev db_prev = 1./m * np.sum(dZ_prev, axis=1, keepdims = True) gradients["dW1"] = dW_prev gradients["db1"] = db_prev return gradients

### Putting all together

def NN_model(X, Y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost=False, lambdaHyper = 0, init="standard"): """ Implements a L-layer neural network. Arguments: X -- data, numpy array of shape (number of examples, num_px * num_px * 3) Y -- true "label" vector (containing 0 if cat, 1 if non-cat), of shape (1, number of examples) layers_dims -- list containing the input size and each layer size, of length (number of layers + 1). learning_rate -- learning rate of the gradient descent update rule num_iterations -- number of iterations of the optimization loop print_cost -- if True, it prints the cost every 100 steps lambdaHyper -- regularisation hyperparameter, scalar init -- type of initialisation: standard or He. Returns: parameters -- parameters learnt by the model. They can then be used to predict. """ costs = [] # keep track of cost iterations2cost = 200 # Print the cost every these iterations # Parameters initialization. if init == "he": parameters = initialise_parameters_he(layers_dims) else: parameters = initialise_parameters(layers_dims) # Loop (gradient descent) for i in range(0, num_iterations): # Forward propagation. Yhat, caches = L_model_forward(X, parameters) # Compute cost. if lambdaHyper == 0: cost = compute_cost(Yhat, Y) else: cost = compute_cost_with_regularisation(Yhat, Y, parameters, lambdaHyper) # Backward propagation. if lambdaHyper == 0: grads = L_model_backward(Yhat, Y, caches) else: grads = backward_propagation_with_regularisation(X, Y, Yhat, caches, lambdaHyper) # Update parameters. parameters = update_parameters(parameters, grads, learning_rate) # Print the cost every 200 training example if print_cost and i % iterations2cost == 0: print ("Cost after iteration %i: %f" %(i, cost)) if print_cost and i % iterations2cost == 0: costs.append(cost) if print_cost: # plot the cost fig, ax = plt.subplots(1,1) plt.plot(np.squeeze(costs)) ticks = ax.get_xticks() ax.locator_params(axis='x', nticks=len(costs)) ax.set_xticklabels([int(x*iterations2cost) for x in ticks]) plt.ylabel('cost') plt.xlabel('iterations') plt.title("Learning rate =" + str(learning_rate)) plt.show() return parameters

fit_params_reg = NN_model(train_set_x, train_set_y, nn_layers, learning_rate = 0.3, num_iterations = 3500, print_cost = False, lambdaHyper = 5, init="he")

Let’s check the new accuracy values:

print ("On the training set:") predictions_train = predict(train_set_x, train_set_y, fit_params_reg) print ("On the test set:") predictions_test = predict(X_test.T, y_test.T, fit_params_reg)

On the training set: Accuracy: 0.9855 On the test set: Accuracy: 0.939

It slightly improved the test accuracy, now almost 94%.

Note also that the training accuracy has been reduced.

## 7 – Dropout

Finally, **dropout** is a widely used regularisation technique that is specific to deep learning.

**It randomly shuts down some neurons in each iteration.**

When you shut some neurons down, you actually modify the model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of the neurons. With dropout, the neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.

def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5): """ Implements the forward propagation with dropout Arguments: X -- input dataset, of shape (2, number of examples) parameters -- python dictionary containing the parameters for a 3-layers network keep_prob - probability of keeping a neuron active during drop-out, scalar Returns: A2 -- last activation value, output of the forward propagation, of shape (1,1) cache -- tuple, information stored for computing the backward propagation """ L = len(parameters) // 2 # number of layers in the neural network # retrieve parameters W1 = parameters["W1"] b1 = parameters["b1"] W2 = parameters["W2"] b2 = parameters["b2"] Z1 = np.dot(W1, X) + b1 A1,cache_temp = relu(Z1) D1 = np.random.rand(A1.shape[0], A1.shape[1]) # Step 1: initialize matrix D1 D1 = D1 < keep_prob # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold) A1 = A1*D1 # Step 3: shut down some neurons of A1 A1 = A1 / keep_prob # Step 4: scale the value of neurons that haven't been shut down Z2 = np.dot(W2, A1) + b2 A2, cache_temp = sigmoid(Z2) caches = (Z1, D1, A1, W1, b1, Z2, A2, W2, b2) return A2, caches

And the backward propagation with dropout:

def backward_propagation_with_dropout(X, Y, cache, keep_prob): """ Implements the backward propagation of our baseline model to which we added dropout. Arguments: X -- input dataset, of shape (2, number of examples) Y -- "true" labels vector, of shape (output size, number of examples) cache -- cache output from forward_propagation_with_dropout() keep_prob - probability of keeping a neuron active during drop-out, scalar Returns: gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables """ m = X.shape[1] (Z1, D1, A1, W1, b1, Z2, A2, W2, b2) = cache dZ2 = A2 - Y dW2 = 1./m * np.dot(dZ2, A1.T) db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True) dA1 = np.dot(W2.T, dZ2) dA1 = dA1*D1 # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation dA1 = dA1 / keep_prob # Step 2: Scale the value of neurons that haven't been shut down dZ1 = np.multiply(dA1, np.int64(A1 > 0)) dW1 = 1./m * np.dot(dZ1, X.T) db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True) gradients = {"dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1} return gradients

Finally we adapt the model to the dropout:

def NN_model_drop(X, Y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost=False, keep_prob = 1, init="standard"): """ Implements a L-layer neural network Arguments: X -- data, numpy array of shape (number of examples, num_px * num_px * 3) Y -- true "label" vector (containing 0 if cat, 1 if non-cat), of shape (1, number of examples) layers_dims -- list containing the input size and each layer size, of length (number of layers + 1). learning_rate -- learning rate of the gradient descent update rule num_iterations -- number of iterations of the optimization loop print_cost -- if True, it prints the cost every 100 steps Returns: parameters -- parameters learnt by the model. They can then be used to predict. """ costs = [] # keep track of cost iterations2cost = 200 # Print the cost every these iterations # Parameters initialization. if init == "he": parameters = initialise_parameters_he(layers_dims) else: parameters = initialise_parameters(layers_dims) # Loop (gradient descent) for i in range(0, num_iterations): # Forward propagation: Yhat, caches = forward_propagation_with_dropout(X, parameters, keep_prob) # Compute cost. cost = compute_cost(Yhat, Y) # Backward propagation. grads = backward_propagation_with_dropout(X, Y, caches, keep_prob) # Update parameters. parameters = update_parameters(parameters, grads, learning_rate) # Print the cost every 200 training example if print_cost and i % iterations2cost == 0: print ("Cost after iteration %i: %f" %(i, cost)) if print_cost and i % iterations2cost == 0: costs.append(cost) if print_cost: # plot the cost fig, ax = plt.subplots(1,1) plt.plot(np.squeeze(costs)) <span data-mce-type="bookmark" id="mce_SELREST_start" data-mce-style="overflow:hidden;line-height:0" style="overflow:hidden;line-height:0" ></span> ticks = ax.get_xticks() ax.locator_params(axis='x', nticks=len(costs)) ax.set_xticklabels([int(x*iterations2cost) for x in ticks]) plt.ylabel('cost') plt.xlabel('iterations') plt.title("Learning rate =" + str(learning_rate)) plt.show() return parameters

The new accuracy values:

fit_params_drop = NN_model_drop(train_set_x, train_set_y, nn_layers, learning_rate = 0.3, num_iterations = 3500, print_cost = False, keep_prob = 0.8, init="he") print ("On the train set:") predictions_train = predict(train_set_x, train_set_y, fit_params_drop) print ("On the test set:") predictions_test = predict(X_test.T, y_test.T, fit_params_drop)

On the train set: Accuracy: 0.98675 On the test set: Accuracy: 0.928

Even higher accuracy could be obtained, by systematically searching for better hyper-parameters (learning_rate, layers_dims, num_iterations).

This post is part of a series about Machine Learning with Python.