# Regularisation in neural networks

We have seen the concept of regularisation and how is applied to linear regression, let’s see now another example for logistic regression done with artificial neural networks.

The question to answer is to recognise hand-written digits and is a classic one-vs-all logistic regression problem.

The dataset  contains 5000 training examples of handwritten digits and is a subset of the MNIST handwritten digit dataset.

Each training example is a 20 pixel by 20 pixel grayscale image of the digit. Each pixel is represented by a floating point number indicating the grayscale intensity at that location.

The 20 by 20 grid of pixels is unrolled into a 400-dimensional vector. Each of these training examples becomes a single row in our data matrix X. This gives us a 5000 by 400 matrix X where every row is a training example for a handwritten digit image.

Let’s get more familiar with the dataset.
You can follow along on the associated notebook.

## 1 – Overview of the data set

from scipy.io import loadmat

dataset = loadmat('datasets/mnist-data.mat') # comes as dictionary
dataset.keys()

dict_keys(['__version__', '__header__', 'y', '__globals__', 'X'])

Each line of X is an array representing an image. You can visualize an example by running the following code. Feel free also to change the indexImage value and re-run to see other images.

import matplotlib.pyplot as plt
import numpy as np
# Example of a picture
indexImage = 4000  # try any index between 0 and 4999.
# They are sorted, from 1 to 10
renderImage = np.reshape(dataset['X'][indexImage], (20,20))
labelImage = dataset['y'][indexImage]

plt.imshow(renderImage, cmap='gray')
print ("Label: this is a ", labelImage)

Label: this is a [8]

## 2 – Data preprocessing

• Figure out the dimensions and shapes of the problem
• Split the dataset into training a test subsets
• “Standardise” the data
X = dataset['X'] # the images
X.shape

(5000, 400)

The second part of the training set is a 5000-dimensional vector y that contains labels for the training set.

y = dataset['y'] # the labels
y.shape

(5000, 1)
y[499]

array([10], dtype=uint8)

One problem: The label representing the digit 0 (zero) is coded as ten (10). Change this.

list_y = [0 if i == 10 else i for i in y] # apply to each item in y
y = np.asarray(list_y)
y = y.reshape(-1,1)
y.shape

(5000, 1)
y[0:10]  # verify that the label is now zero

array([[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0],
[0]])

### One hot encoding

Another problem: the original labels (in the variable y) are a number between 0, 1, 2, …, 9 therefore the output is a multi-class and not a binary variable.
For the purpose of training a neural network, we need to recode the labels as vectors containing only binary values 0 or 1.
Will do this by transforming y from one to 10 “dummy” variables: each one takes the binary value 0 or 1 to indicate the absence or presence of the respective categorical label.

n_classes = 10 # 10 digits = 10 classes/labels

# np.eye(n) creates an identity matrix of shape (n,n)
OHE_y = np.eye(n_classes)[y.reshape(-1)]
OHE_y.shape

(5000, 10)
OHE_y[1000] # this is the new encoding for e.g. label = 2

array([ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])

### Split into train and test sets

Split the dataset into 20% of test and 80% of train sets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test =
train_test_split(X, OHE_y, test_size=0.2, random_state=7)

input_layer_size  = X.shape[1]
num_px = np.sqrt(input_layer_size) # 400 = 20x20 Input Images of Digits
n_y = y_train.shape[1]
m_train = X_train.shape[0]
m_test = X_test.shape[0]

print ("Dataset dimensions:")
print ("Number of training examples = " + str(m_train))
print ("Number of testing examples = " + str(m_test))
print ("Height/Width of each image: num_px = " + str(num_px))
print ("Each image is of size: &lt;" + str(num_px) + ", " + str(num_px) + "&gt;")
print ("X train shape: " + str(X_train.shape))
print ("y train shape: " + str(y_train.shape))

Dataset dimensions:
Number of training examples = 4000
Number of testing examples = 1000
Height/Width of each image: num_px = 20.0
Each image is of size: <20.0, 20.0>
X train shape: (4000, 400)
y train shape: (4000, 10)

## Deep Neural Network for Image Classification

Now we will build and apply a deep neural network to the problem.

The main steps for building an Artificial  Neural Network are as usual:

1. Define the model structure (such as number and size of layers) and the hyperparameters
2. Initialize the model’s weights
3. Loop for the number of iterations:
– Calculate current loss (forward propagation)
– Calculate current gradient (backward propagation)
4. Use the trained weights to predict the labels

### Defining the neural network structure

Our neural network has 3 layers: an input layer, a hidden layer and an output layer.
Recall that our inputs are pixel values of digit images. Since the images are of size 20×20, this gives us 400 input layer units (excluding the extra bias unit which always outputs +1).
There are 25 units in the second layer and 10 output units (corresponding to the 10 digit classes).

### CONSTANTS DEFINING THE MODEL ####
# we define a neural network with total 3 layers, x, y and 1 hidden:
n_h = 25
nn_layers = [input_layer_size, n_h, n_y]  # length is 3 (layers)


### Build the 3-layer neural network

We will re-use all the helper functions defined previously to build the neural network, such as the linear forward and the backward propagation.
Please refer to the Jupyter notebook for the details.

Now we can put together all the functions to build an L-layer neural network with this structure:

nn_layers

[400, 25, 10]
np.random.seed(1)

train_set_x = X_train.T
train_set_x.shape

(400, 4000)
# y is the original output array, with labels
# train_set_y is that set, one-hot-encoded
train_set_y = y_train.T
train_set_y.shape

(10, 4000)
#  FUNCTION: L_layer_model

def simpleNN_model(X, Y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost=False):
"""
Implements a L-layer neural network.

Arguments:
X -- data, numpy array of shape (number of examples, num_px * num_px)
Y -- true "label" vector (containing 0 or 1), of shape (10, number of examples)
layers_dims -- list containing the input size and each layer size, of length (number of layers + 1).
learning_rate -- learning rate of the gradient descent update rule
num_iterations -- number of iterations of the optimisation loop
print_cost -- if True, it prints the cost every 200 steps

Returns:
parameters -- parameters learnt by the model. They can then be used to predict.
"""

costs = []      # keep track of cost
iterations2cost = 200 # Print the cost every these iterations

# Parameters initialization.
parameters = initialise_parameters(layers_dims)

for i in range(0, num_iterations):

# Forward propagation
AL, caches = L_model_forward(X, parameters)

# Compute cost.
cost = compute_cost(AL, Y)

# Backward propagation.

# Update parameters.

# Print the cost every iterations2cost training example
if print_cost and i % iterations2cost == 0:
print ("Cost after iteration %i: %f" %(i, cost))
if print_cost and i % iterations2cost == 0:
costs.append(cost)

if print_cost:
# plot the cost
fig, ax = plt.subplots(1,1)
plt.plot(np.squeeze(costs))

ticks = ax.get_xticks()
ax.locator_params(axis='x', nticks=len(costs))
ax.set_xticklabels([int(x*iterations2cost) for x in ticks])
plt.ylabel('cost')
plt.xlabel('iterations')
plt.title("Learning rate =" + str(learning_rate))
plt.show()

return parameters


We will now train the model as a 3-layer neural network.

fit_params = simpleNN_model(train_set_x, train_set_y, nn_layers, learning_rate = 0.3, num_iterations = 3500, print_cost = False)


## 4. Results analysis

Now we can check the performance of the trained network by predicting the results of the test set and comparing them with the actual labels.
Note that the predict() function has been adapted to cope with the multi-class labels.

def predict(X, yOHE, parameters):
"""
This function is used to predict the results of a  L-layer neural network.
It also checks them against the true labels and print the accuracy

Arguments:
X -- data set of examples you would like to label
yOHE -- the true labels, as multi-class vectors
parameters -- parameters of the trained model

Returns:
p -- predictions (the label) for the given dataset X
"""

m = X.shape[1]
nLabels = yOHE.shape[1]
n = len(parameters) // 2 # number of layers in the neural network
p = np.zeros((1, m)) # the predicted output, initialised to zero
y = np.zeros((1, m)) # the actual output

# Forward propagation
probas, caches = L_model_forward(X, parameters)

# probas is a matrix of shape [nLabels, m] (one-hot-encoded)
assert (probas.shape[1] == m)

for i in range(0, m):
# convert probs to label predictions:
# just take the label with max prob
p[0,i] = np.argmax(probas[:,i])

# convert expected results into label: takes the value with one
y[0,i] = np.argmax(yOHE[:,i])

# print results
print("Accuracy: "  + str(np.sum((p == y)/m)))

return p

print ("On the training set:")
predictions_train = predict(train_set_x, train_set_y, fit_params)
print ("On the test set:")
predictions_test = predict(X_test.T, y_test.T, fit_params)

On the training set:
Accuracy: 0.99425
On the test set:
Accuracy: 0.929

The accuracy on the test set s very good: 93% of the times did predict correctly the digit (and this with just 3 layers) but the very large training set accuracy (almost 100%) could be an indication of over-fitting.

Let’s apply a couple of optimisations to our network, indlufing regularisation but first how we initialise the weights.

## 5 – Initialising parameters

There are two types of parameters to initialise in a neural network:
– the weight matrices
– the bias vectors

The weight matrix is initialised with random values (to break symmetry as we have seen previously).
Of course, different initialisations lead to different results and poor initialisation can slow down the optimisation algorithm.

One good practice is not to initialise to values that are too large, instead what bring good results are the so-called Xavier or the He (for ReLU activation) initialisations.

Finally, we try here the “He Initialisation”; this is named for the first author of He et al., 2015.

This function is similar to the previous random initialisation.
The only difference is that instead of multiplying the random number by 10, it will multiply it by $\sqrt{\frac{2}{\text{dimension of the previous layer}}}$, which is what He initialisation recommends for layers with a ReLU activation.

#  FUNCTION: initialize_parameters

def initialise_parameters_he(layer_dims):
"""
Arguments:
layer_dims -- python array (list) containing the dimensions of each layer in our network

Returns:
parameters -- python dictionary containing the parameters "W1", "b1", ..., "WL", "bL":
Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
bl -- bias vector of shape (layer_dims[l], 1)
"""

parameters = {}
L = len(layer_dims)            # number of layers in the network

for l in range(1, L):
parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1])*np.sqrt(2./layer_dims[l-1])
parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))

# unit tests
assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))

return parameters


## 6 – L2 Regularization

The standard way to avoid overfitting is called L2 regularisation. It consists of appropriately modifying your cost function, from:
$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)}$
To:
$J_{regularised} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularisation cost}$

This is implemented in the below function compute_cost_with_regularisation() which computes the cost given by formula.

# GRADED FUNCTION: compute_cost_with_regularization

def compute_cost_with_regularisation(A3, Y, parameters, lambdaHyper):
"""
Implement the cost function with L2 regularization. See formula (2) above.

Arguments:
A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
Y -- "true" labels vector, of shape (output size, number of examples)
parameters -- python dictionary containing parameters of the model
lambdaHyper -- the lambda regularisation hyper-parameter.

Returns:
cost - value of the regularized loss function (formula (2))
"""
# This gives you the cross-entropy part of the cost
cross_entropy_cost = compute_cost(A3, Y)

sum_regularization_cost = 0
m = Y.shape[1]
L = len(parameters) // 2    # number of layers (2 because we have W and b)

for i in range(1, L+1):
W_i = parameters['W' + str(i)]
sum_regularization_cost += np.sum(np.square(W_i))

regularization_cost = (1/m)*(lambdaHyper/2)*(sum_regularization_cost)

cost = cross_entropy_cost + regularization_cost

return cost


Of course, because we changed the cost function, we have to change backward propagation as well.
All the gradients have to be computed with respect to this new cost: add the regularisation term’s gradient.

def backward_propagation_with_regularisation(X, Y, Yhat, caches, lambdaHyper):
"""
Implements the backward propagation of our baseline model to which we added an L2 regularization.

Arguments:
X -- input dataset, of shape (input size, number of examples)
Yhat -- "true" labels vector, of shape (output size, number of examples)
caches -- cache output from forward_propagation()
lambdaHyper -- regularization hyperparameter, scalar

Returns:
gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
"""

m = X.shape[1]

L = len(caches) # the number of layers

last_layer_cache = caches[L-1]
((A, W, b), Z) = last_layer_cache

assert (Yhat.shape == Y.shape)

dZ = Yhat - Y

for i in reversed(range(L-1)):
current_layer_cache = caches[i]
((A_prev, W_prev, b_prev), Z_prev) = current_layer_cache

dW_entropy = 1./m * np.dot(dZ, A.T)
dW_reg = (lambdaHyper/m)*W
dW = dW_entropy + dW_reg

db = 1./m * np.sum(dZ, axis=1, keepdims = True)

dA_prev = np.dot(W.T, dZ)
dZ_prev = np.multiply(dA_prev, np.int64(A &amp;gt; 0))

gradients["dW" + str(i + 2)] = dW
gradients["db" + str(i + 2)] = db

gradients["dA" + str(i + 1)] = dA_prev
gradients["dZ" + str(i + 1)] = dZ_prev

((A, W, b), Z) = ((A_prev, W_prev, b_prev), Z_prev)
dZ = dZ_prev

dW_prev = 1./m * np.dot(dZ_prev, X.T) + (lambdaHyper/m)*W_prev
db_prev = 1./m * np.sum(dZ_prev, axis=1, keepdims = True)



### Putting all together

def NN_model(X, Y, layers_dims, learning_rate = 0.0075,
num_iterations = 3000, print_cost=False,
lambdaHyper = 0, init="standard"):
"""
Implements a L-layer neural network.

Arguments:
X -- data, numpy array of shape (number of examples, num_px * num_px * 3)
Y -- true "label" vector (containing 0 if cat, 1 if non-cat), of shape (1, number of examples)
layers_dims -- list containing the input size and each layer size, of length (number of layers + 1).
learning_rate -- learning rate of the gradient descent update rule
num_iterations -- number of iterations of the optimization loop
print_cost -- if True, it prints the cost every 100 steps
lambdaHyper -- regularisation hyperparameter, scalar
init -- type of initialisation: standard or He.
Returns:
parameters -- parameters learnt by the model. They can then be used to predict.
"""

costs = []  # keep track of cost
iterations2cost = 200 # Print the cost every these iterations

# Parameters initialization.
if init == "he":
parameters = initialise_parameters_he(layers_dims)
else:
parameters = initialise_parameters(layers_dims)

for i in range(0, num_iterations):

# Forward propagation.
Yhat, caches = L_model_forward(X, parameters)

# Compute cost.
if lambdaHyper == 0:
cost = compute_cost(Yhat, Y)
else:
cost = compute_cost_with_regularisation(Yhat, Y, parameters, lambdaHyper)

# Backward propagation.
if lambdaHyper == 0:
else:
grads = backward_propagation_with_regularisation(X, Y, Yhat, caches, lambdaHyper)

# Update parameters.

# Print the cost every 200 training example
if print_cost and i % iterations2cost == 0:
print ("Cost after iteration %i: %f" %(i, cost))
if print_cost and i % iterations2cost == 0:
costs.append(cost)

if print_cost:
# plot the cost
fig, ax = plt.subplots(1,1)
plt.plot(np.squeeze(costs))

ticks = ax.get_xticks()
ax.locator_params(axis='x', nticks=len(costs))
ax.set_xticklabels([int(x*iterations2cost) for x in ticks])

plt.ylabel('cost')
plt.xlabel('iterations')
plt.title("Learning rate =" + str(learning_rate))
plt.show()

return parameters

fit_params_reg = NN_model(train_set_x, train_set_y, nn_layers,
learning_rate = 0.3, num_iterations = 3500, print_cost = False,
lambdaHyper = 5, init="he")


Let’s check the new accuracy values:

print ("On the training set:")
predictions_train = predict(train_set_x, train_set_y, fit_params_reg)
print ("On the test set:")
predictions_test = predict(X_test.T, y_test.T, fit_params_reg)

On the training set:
Accuracy: 0.9855
On the test set:
Accuracy: 0.939

It slightly improved the test accuracy, now almost 94%.

Note also that the training accuracy has been reduced.

## 7 – Dropout

Finally, dropout is a widely used regularisation technique that is specific to deep learning.

It randomly shuts down some neurons in each iteration.

When you shut some neurons down, you actually modify the model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of the neurons. With dropout, the neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.

def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
"""
Implements the forward propagation with dropout
Arguments:
X -- input dataset, of shape (2, number of examples)
parameters -- python dictionary containing the parameters for a 3-layers network
keep_prob - probability of keeping a neuron active during drop-out, scalar

Returns:
A2 -- last activation value, output of the forward propagation, of shape (1,1)
cache -- tuple, information stored for computing the backward propagation
"""
L = len(parameters) // 2                  # number of layers in the neural network

# retrieve parameters
W1 = parameters["W1"]
b1 = parameters["b1"]
W2 = parameters["W2"]
b2 = parameters["b2"]

Z1 = np.dot(W1, X) + b1
A1,cache_temp = relu(Z1)

D1 = np.random.rand(A1.shape[0], A1.shape[1])     # Step 1: initialize matrix D1
D1 = D1 &lt; keep_prob     # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)     A1 = A1*D1                                 # Step 3: shut down some neurons of A1     A1 = A1 / keep_prob                   # Step 4: scale the value of neurons that haven't been shut down      Z2 = np.dot(W2, A1) + b2      A2, cache_temp = sigmoid(Z2)      caches = (Z1, D1, A1, W1, b1, Z2, A2, W2, b2)      return A2, caches  

And the backward propagation with dropout:

 def backward_propagation_with_dropout(X, Y, cache, keep_prob):     """     Implements the backward propagation of our baseline model to which we added dropout.          Arguments:     X -- input dataset, of shape (2, number of examples)     Y -- "true" labels vector, of shape (output size, number of examples)     cache -- cache output from forward_propagation_with_dropout()     keep_prob - probability of keeping a neuron active during drop-out, scalar          Returns:     gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables     """          m = X.shape[1]     (Z1, D1, A1, W1, b1, Z2, A2, W2, b2) = cache          dZ2 = A2 - Y          dW2 = 1./m * np.dot(dZ2, A1.T)     db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)          dA1 = np.dot(W2.T, dZ2)     dA1 = dA1*D1              # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation     dA1 = dA1 / keep_prob       # Step 2: Scale the value of neurons that haven't been shut down     dZ1 = np.multiply(dA1, np.int64(A1 &gt; 0))
dW1 = 1./m * np.dot(dZ1, X.T)
db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)

gradients = {"dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
"dZ1": dZ1, "dW1": dW1, "db1": db1}



Finally we adapt the model to the dropout:

def NN_model_drop(X, Y, layers_dims, learning_rate = 0.0075,
num_iterations = 3000, print_cost=False,
keep_prob = 1, init="standard"):
"""
Implements a L-layer neural network

Arguments:
X -- data, numpy array of shape (number of examples, num_px * num_px * 3)
Y -- true "label" vector (containing 0 if cat, 1 if non-cat), of shape (1, number of examples)
layers_dims -- list containing the input size and each layer size, of length (number of layers + 1).
learning_rate -- learning rate of the gradient descent update rule
num_iterations -- number of iterations of the optimization loop
print_cost -- if True, it prints the cost every 100 steps

Returns:
parameters -- parameters learnt by the model. They can then be used to predict.
"""

costs = []  # keep track of cost
iterations2cost = 200 # Print the cost every these iterations

# Parameters initialization.
if init == "he":
parameters = initialise_parameters_he(layers_dims)
else:
parameters = initialise_parameters(layers_dims)

for i in range(0, num_iterations):

# Forward propagation:
Yhat, caches = forward_propagation_with_dropout(X, parameters, keep_prob)

# Compute cost.
cost = compute_cost(Yhat, Y)

# Backward propagation.
grads = backward_propagation_with_dropout(X, Y, caches, keep_prob)

# Update parameters.

# Print the cost every 200 training example
if print_cost and i % iterations2cost == 0:
print ("Cost after iteration %i: %f" %(i, cost))
if print_cost and i % iterations2cost == 0:
costs.append(cost)

if print_cost:
# plot the cost
fig, ax = plt.subplots(1,1)
plt.plot(np.squeeze(costs))

&lt;span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			&gt;&lt;/span&gt;
ticks = ax.get_xticks()
ax.locator_params(axis='x', nticks=len(costs))
ax.set_xticklabels([int(x*iterations2cost) for x in ticks])
plt.ylabel('cost')
plt.xlabel('iterations')
plt.title("Learning rate =" + str(learning_rate))
plt.show()
return parameters


The new accuracy values:

fit_params_drop = NN_model_drop(train_set_x, train_set_y, nn_layers,
learning_rate = 0.3, num_iterations = 3500, print_cost = False,
keep_prob = 0.8, init="he")

print ("On the train set:")
predictions_train = predict(train_set_x, train_set_y, fit_params_drop)
print ("On the test set:")
predictions_test = predict(X_test.T, y_test.T, fit_params_drop)


On the train set:
Accuracy: 0.98675
On the test set:
Accuracy: 0.928

Even higher accuracy could be obtained, by systematically searching for better hyper-parameters (learning_rate, layers_dims, num_iterations).

This post is part of a series about Machine Learning with Python.