We have seen previously that learning the parameters of a prediction function and testing it on the same data can produce a perfect score while failing to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice to hold out part of the available data as a test set.

The test error is then the average error that results from predicting the response on a new observation, one that was not used in training the learning method.
In contrast, the training error is calculated by applying the learning method to the observations used in its training.
But the training error rate is often quite different from the test error rate, and in particular it can dramatically underestimate it.

The best solution would be to use a large designated test set, but one is often not available.

Here we look at cross-validation, a class of methods that estimate the test error by holding out a subset of the training observations from the fitting process, and then applying the learning method to those held-out observations.
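The hold-out idea can be sketched in a few lines of plain Python. This is only a minimal illustration of k-fold cross-validation, not the code from the post: the helper names (`k_fold_indices`, `fit_line`, `cross_val_mse`), the straight-line model and the synthetic data are all assumptions made for the example.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns the pair (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given observations."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_val_mse(xs, ys, k=5, seed=0):
    """Estimate the test MSE by averaging the error over k held-out folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        # Fit only on the observations that are NOT in the held-out fold.
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit_line(train_x, train_y)
        # Score on the held-out fold, which the fit has never seen.
        scores.append(mse(model,
                          [xs[i] for i in held_out],
                          [ys[i] for i in held_out]))
    return sum(scores) / k


# Synthetic data (an assumption): a noisy straight line.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

cv_error = cross_val_mse(xs, ys, k=5)
train_error = mse(fit_line(xs, ys), xs, ys)
print(f"training MSE: {train_error:.3f}  cross-validated MSE: {cv_error:.3f}")
```

Every observation is held out exactly once, so the averaged score uses all the data without ever scoring a model on points it was fitted on.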


Cat or not cat?

This post shows two simple image-recognition algorithms that can classify pictures as cat or non-cat.
The first is a classic logistic regression, while the second, which is more accurate, is a deep neural network.
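To give a flavour of the first approach, here is a minimal sketch of logistic regression trained with plain gradient descent. It is not the notebook's code: the four-value "images", the labels (1 standing for cat) and the function names are invented for illustration, whereas a real image would be flattened into thousands of pixel values.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.5, epochs=1000):
    """Batch gradient descent on the average cross-entropy loss."""
    n_features = len(X[0])
    m = len(X)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the loss with respect to z
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def predict(w, b, x):
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0


# Toy "images" (an assumption): four pixel intensities in [0, 1].
X = [[0.9, 0.8, 0.9, 0.7],
     [0.8, 0.9, 0.7, 0.9],
     [0.1, 0.2, 0.1, 0.3],
     [0.2, 0.1, 0.3, 0.1]]
y = [1, 1, 0, 0]  # hypothetical labels: 1 = cat, 0 = non-cat

w, b = train_logistic(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

A deep neural network stacks several such weighted sums with non-linearities in between, which is what lets it capture patterns that a single linear boundary cannot.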

You can also follow along using the notebook on GitHub. This post is part of a series about Machine Learning with Python.

Introduction to NLTK

Text is everywhere: approximately 80% of all data is estimated to be unstructured text or rich data (web pages, social networks, search queries, documents, …), and text data is growing fast, by an estimated 2.5 exabytes every day!

We have seen how to do some basic text processing in Python; now we introduce an open-source framework for natural language processing that can further help when working with human languages: NLTK (Natural Language Toolkit).

Tokenise a text

Let’s start with a basic NLP task, the usual tokenisation (splitting a text into tokens or words).

You can follow along with a notebook on GitHub.

The overfitting problem and the bias vs. variance dilemma

We have seen what linear regression is, how to build models and algorithms for estimating their parameters, and how to measure the loss.

Now we see how to assess how well a given method should perform when predicting new data, and how to select the best-performing model among the candidates.

We will first explore the concepts of training and test error, how they vary with model complexity, and how they can be used to form a valid assessment of predictive performance. This leads directly to the bias-variance tradeoff, which is fundamental to machine learning.

The concepts described in this post are key to all machine learning problems, well beyond the regression setting.
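As a small illustration of how training and test error move with model complexity, the sketch below uses a k-nearest-neighbours regressor on synthetic data. The model choice, the sine-shaped data and all the names are assumptions made for this example, not the method of the post. A small k gives a very flexible model, a large k a very rigid one.

```python
import math
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(train, data, k):
    """MSE on `data` of a k-NN model that memorises `train`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)


rng = random.Random(0)


def sample(n):
    """Draw n noisy observations of sin(x) on [0, 2*pi] (synthetic data)."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 2 * math.pi)
        points.append((x, math.sin(x) + rng.gauss(0, 0.3)))
    return points


train, test = sample(60), sample(60)

# k = 1 is the most complex model, k = 60 simply predicts the training mean.
errors = {}
for k in (1, 5, 20, 60):
    errors[k] = (mean_squared_error(train, train, k),
                 mean_squared_error(train, test, k))
    print(f"k={k:2d}  training MSE={errors[k][0]:.3f}  test MSE={errors[k][1]:.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training error is exactly zero even though the test error is not: the training error alone says little about performance on new data.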

Effective military corps are Agile

I have never read it, but I have been told that the Marine Corps “Warfighting” manual has several similarities with how Agile projects and teams should work.
If you also don’t feel like reading the dense 100+ page document, here is my shorter take, inspired by what I learnt during my mandatory military service (I was not in the Marines, but I was in a rapid response force).

Embrace changes

Traditionally you associate the military with huge command-and-control structures, but that is not always the case.


NLP3o: A web app for simple text analysis

Earlier we saw an introduction to a few NLP (Natural Language Processing) topics, specifically how to tokenise a text, remove its punctuation and stop words (the most common words, such as conjunctions), and find the most used words.
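As a reminder of those steps, here is a minimal sketch using only the Python standard library. It is not the code of the web app: the function names are made up for the example and the stop-word list is a tiny illustrative assumption, whereas real lists are much longer.

```python
import string
from collections import Counter

# A tiny illustrative stop-word list (an assumption; real lists are longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "or", "but", "on"}


def tokenise(text):
    """Lower-case the text, strip ASCII punctuation and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def top_words(text, n=3):
    """Return the n most frequent words that are not stop words."""
    words = [word for word in tokenise(text) if word not in STOP_WORDS]
    return Counter(words).most_common(n)


sample_text = "The cat sat on the mat. The cat and the dog: the cat won!"
print(top_words(sample_text, 1))  # [('cat', 3)]
```

Splitting on whitespace after stripping punctuation is the simplest possible tokeniser; a library tokeniser handles contractions, typographic quotes and other edge cases far better.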

Now we can see, as an example, how to put all of these together in a simple web app.

The entire source code is available on GitHub, so you can follow along more easily, and you can see it in action on a Heroku dyno.
