Recover audio using linear regression

In this example, we will use linear regression to recover, or ‘fill in’, a completely deleted portion of an audio file!
For this, we use the Free Spoken Digit Dataset (FSDD), an audio dataset put together by Zohar Jackson:

cleaned-up audio samples (no dead space, roughly the same length, same bitrate, same samples-per-second rate, same speaker, etc.) ready for machine learning.
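As a rough sketch of the idea (not the post's actual code), suppose we have several aligned recordings and delete a stretch of one of them; a linear regression fitted on the intact samples can then predict the missing ones. The synthetic data below is purely a stand-in for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data: three aligned "other" recordings as features
# and one target recording that is a noisy linear mix of them.
rng = np.random.default_rng(0)
n = 4000                                   # number of audio samples
others = rng.normal(size=(n, 3))           # stand-ins for other recordings
target = others @ np.array([0.5, 0.3, 0.2]) + 0.01 * rng.normal(size=n)

deleted = slice(1500, 2500)                # the "completely deleted" portion
keep = np.ones(n, dtype=bool)
keep[deleted] = False

# Fit on the intact samples, then predict (recover) the deleted stretch.
model = LinearRegression().fit(others[keep], target[keep])
recovered = model.predict(others[deleted])
```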

You can follow along with the associated notebook on GitHub. Continue reading “Recover audio using linear regression”


Regularisation in neural networks

We have seen the concept of regularisation and how it is applied to linear regression; let’s now see another example: logistic regression done with artificial neural networks.

The task is to recognise handwritten digits, a classic one-vs-all logistic regression problem.

The dataset contains 5000 training examples of handwritten digits and is a subset of the MNIST handwritten digit dataset.

Each training example is a 20 pixel by 20 pixel grayscale image of the digit. Each pixel is represented by a floating point number indicating the grayscale intensity at that location.

The 20 by 20 grid of pixels is unrolled into a 400-dimensional vector. Each of these training examples becomes a single row in our data matrix X. This gives us a 5000 by 400 matrix X where every row is a training example for a handwritten digit image.
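In NumPy the unrolling is a single reshape; a minimal sketch with random stand-in data in place of the real images:

```python
import numpy as np

# Stand-in for the real data: 5000 grayscale images of 20x20 pixels each.
images = np.random.rand(5000, 20, 20)

# Unroll each 20x20 grid into a 400-dimensional row vector.
X = images.reshape(5000, 400)
print(X.shape)  # (5000, 400)
```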

Let’s get more familiar with the dataset.
You can follow along in the associated notebook.
Continue reading “Regularisation in neural networks”


The basic idea of regularisation is to penalise or shrink the large coefficients of a regression model.
This can help with the bias/variance trade-off (shrinking the coefficient estimates can significantly reduce their variance and thus improve the prediction error) and can help with model selection by automatically removing irrelevant features (that is, by setting the corresponding coefficient estimates to zero).
The main drawback is that this approach can be computationally demanding.
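For example, ridge regression (one of the models introduced just below) adds a penalty on the size of the coefficients to the usual residual sum of squares, with a tuning parameter λ controlling the amount of shrinkage:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$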

There are several ways to perform the shrinkage; the regularisation models that we will see are Ridge regression and the Lasso (a minimal sketch of both follows below). Continue reading “Regularisation”
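As a quick preview, here is a minimal scikit-learn sketch on synthetic data (not the post's own code), showing how the lasso can zero out irrelevant coefficients while ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 10 features, of which only the first 3 are relevant.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)  # shrinks all coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)  # can set irrelevant coefficients exactly to zero

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))     # expect (near-)zeros on the 7 noise features
```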



We have seen previously how learning the parameters of a prediction function and evaluating it on the same data yields a perfect score but fails to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice to hold out part of the available data as a test set.
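A minimal hold-out split with scikit-learn (the iris dataset here is just a convenient stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Keep 25% of the data aside as a test set; fit only on the rest.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_train, y_train))  # training accuracy (often optimistic)
print(model.score(X_test, y_test))    # test accuracy (more honest estimate)
```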

The test error is then the average error that results from predicting the response on a new observation, one that was not used in training the learning method.
In contrast, the training error is calculated by applying the learning method to the observations used in its training.
But the training error rate is often quite different from the test error rate; in particular, it can dramatically underestimate the generalisation error.

The best solution would be to use a large designated test set, but one is often not available.

Here we look at a class of methods that estimate the test error by holding out a subset of the training observations from the fitting process and then applying the learning method to those held-out observations (a minimal sketch follows below). Continue reading “Cross-validation”
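For instance, k-fold cross-validation fits the model k times, each time holding out a different fold and scoring on it; a minimal scikit-learn sketch on a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: each fold is held out once while the model
# is fitted on the other four; the mean score estimates the test error.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```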


Cat or not cat?

This post shows two simple image-recognition algorithms that can correctly classify pictures as cat or non-cat.
The first is a classic logistic regression, while the second, more accurate, one is a deep neural network.
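The logistic-regression approach boils down to flattening each image into a vector of pixel intensities and fitting a binary classifier on it. A minimal sketch with random stand-in data and hypothetical labels (the real post works on an actual labelled cat/non-cat image set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: 200 flattened 64x64 RGB images with hypothetical labels
# (1 = cat, 0 = non-cat); real images would replace this random array.
rng = np.random.default_rng(2)
X = rng.random((200, 64 * 64 * 3))
y = rng.integers(0, 2, size=200)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:5]))  # predicted cat / non-cat labels
```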

You can also follow along using the notebook on GitHub. This post is part of a series about Machine Learning with Python. Continue reading “Cat or not cat?”


Introduction to NLTK

Text is everywhere: approximately 80% of all data is estimated to be unstructured text/rich data (web pages, social networks, search queries, documents, …), and text data is growing fast, by an estimated 2.5 exabytes every day!

We have seen how to do some basic text processing in Python; now we introduce an open-source framework for natural language processing that can further help us work with human languages: NLTK (the Natural Language Toolkit).

Tokenise a text

Let’s see it first with a basic NLP task, the usual tokenisation (splitting a text into tokens, or words).
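A minimal example (the tokeniser models need a one-off download; the sample sentence is just an illustration):

```python
import nltk

# One-off download of the tokeniser models (newer NLTK versions
# use "punkt_tab" instead of "punkt"; downloading both is harmless).
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

from nltk.tokenize import word_tokenize

text = "Text is everywhere, and NLTK helps us work with it."
print(word_tokenize(text))
# ['Text', 'is', 'everywhere', ',', 'and', 'NLTK', 'helps', 'us', 'work', 'with', 'it', '.']
```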

You can follow along with a notebook on GitHub. Continue reading “Introduction to NLTK”