Introduction to NLTK

Text is everywhere: approximately 80% of all data is estimated to be unstructured text/rich data (web pages, social networks, search queries, documents, …), and text data is growing fast: an estimated 2.5 exabytes every day!

We have seen how to do some basic text processing in Python; now we introduce an open-source framework for natural language processing that can further help us work with human languages: NLTK (the Natural Language ToolKit).

Tokenise a text

Let’s start with a basic NLP task: tokenisation, i.e. splitting a text into tokens (words).
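
As a taste of what this looks like, here is a minimal sketch using NLTK’s word_tokenize (the notebook covers the details):

import nltk
nltk.download("punkt")  # tokeniser models, needed on the first run only

from nltk.tokenize import word_tokenize

text = "NLTK makes it easy to split a text into tokens!"
print(word_tokenize(text))
# ['NLTK', 'makes', 'it', 'easy', 'to', 'split', 'a', 'text', 'into', 'tokens', '!']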

You can follow along with a notebook on GitHub. Continue reading “Introduction to NLTK”

The overfitting problem and the bias vs. variance dilemma

We have seen what linear regression is, how to build models and algorithms for estimating the parameters of such models, and how to measure the loss.

Now we see how to assess how well the considered method should perform in predicting new data, and how to select the best-performing model among the possible candidates.

We will first explore the concept of training and test error, how they vary with model complexity and how they might be utilised to form a valid assessment of predictive performance. This leads directly to an important bias-variance tradeoff, which is fundamental to machine learning.
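
As a rough illustration of that behaviour, here is a minimal sketch on synthetic data (not taken from the post), comparing training and test error for polynomial fits of increasing degree:

import numpy as np

rng = np.random.default_rng(0)

# synthetic noisy data from a smooth underlying function
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)

# simple train/test split
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

# training error keeps decreasing as model complexity (the degree) grows,
# while test error eventually increases again: overfitting
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")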

The concepts described in this post are key to all machine learning problems, well beyond the regression setting. Continue reading “The overfitting problem and the bias vs. variance dilemma”

NLP3o: A web app for simple text analysis

We have seen earlier an introduction to a few NLP (Natural Language Processing) topics, specifically how to tokenise a text, remove punctuation and stop words (the most common words, such as conjunctions), and how to find the most used words.

Now we can see, as an example, how to put all of these steps inside a simple web app.
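
The core text-processing pipeline behind such an app might look like this minimal NLTK sketch (an illustration, not the app’s actual code, which is on GitHub):

from collections import Counter
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")       # tokeniser models
nltk.download("stopwords")   # stop-word lists

def top_words(text, n=5):
    """Tokenise, drop punctuation and stop words, return the n most used words."""
    tokens = word_tokenize(text.lower())
    words = [t for t in tokens
             if t not in string.punctuation
             and t not in stopwords.words("english")]
    return Counter(words).most_common(n)

print(top_words("The quick brown fox jumps over the lazy dog. The dog sleeps."))
# [('dog', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('jumps', 1)]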

The entire source code is available on GitHub, so you can better follow along, and you can see it in action on a Heroku dyno.

Continue reading “NLP3o: A web app for simple text analysis”

Logistic regression with Python statsmodels

We have seen an introduction to logistic regression with a simple example of how to predict a student’s admission to university based on past exam results.
This was done using Python, the sigmoid function and gradient descent.

We can now see how to solve the same example using the statsmodels library, specifically its Logit model, which is for logistic regression. The library contains an optimised and efficient algorithm to find the correct regression parameters.
You can follow along from the Python notebook on GitHub.
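
In short, the fit boils down to something like this minimal sketch (with made-up scores, not the notebook’s data):

import numpy as np
import statsmodels.api as sm

# made-up exam scores and admission outcomes (0 = rejected, 1 = admitted)
scores = np.array([25.0, 35.0, 45.0, 45.0, 52.0, 55.0, 65.0, 75.0])
admitted = np.array([0, 0, 0, 1, 1, 0, 1, 1])

X = sm.add_constant(scores)           # add the intercept column
result = sm.Logit(admitted, X).fit()  # maximum-likelihood fit

print(result.params)                  # intercept and slope
print(result.predict([[1.0, 50.0]]))  # estimated P(admitted) for a score of 50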

Continue reading “Logistic regression with Python statsmodels”

An introduction to logistic regression

Variables can be described as either quantitative or qualitative.
Quantitative variables have a numerical value, e.g. a person’s income, or the price of a house.
Qualitative variables take values from one of a number of different classes or categories, e.g. a person’s gender (male or female), the type of house purchased (villa, flat, penthouse, …), eye colour (brown, blue, green) or a cancer diagnosis.

Linear regression predicts a continuous variable, but sometimes we want to predict a categorical variable, i.e. a variable with a small number of possible discrete outcomes, usually unordered (there is no order among the outcomes).

This kind of problem is called classification.

Classification

Given a feature vector X and a qualitative response y taking values from a fixed set C, the classification task is to build a function f(X) that takes as input the feature vector X and predicts a value for y.
Often we are also (or even more) interested in estimating the probability that X belongs to each category in C.
For example, it is more valuable to have the probability that an insurance claim is fraudulent than a plain fraudulent-or-not classification.

There are many possible classification techniques, or classifiers, available to predict a qualitative response.

We will now see one called logistic regression.
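
At its heart is the logistic (sigmoid) function, which squashes a linear score into a probability between 0 and 1. A minimal sketch:

import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# a linear score w·x + b becomes a probability
print(sigmoid(0.0))    # 0.5  -- on the decision boundary
print(sigmoid(3.0))    # ~0.95 -- likely in the positive class
print(sigmoid(-3.0))   # ~0.05 -- likely in the negative class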

Note: this post is part of a series about Machine Learning with Python.
Continue reading “An introduction to logistic regression”

Machines “think” differently but it’s not a problem (maybe)

Yet another article about the interpretability problem of many AI algorithms, this time in the MIT Technology Review, May/June 2017 issue.

The issue is clear: many of the most successful recent AI technologies revolve around deep learning, complex artificial neural networks with so many layers of so many neurons transforming so many variables that they behave like “black boxes” for us.
We can no longer comprehend the model; we don’t know how or why the output for a specific input is obtained.
Is it scary?

In the film Dekalog 1 by Krzysztof Kieślowski, the first of ten short films inspired by the Ten Commandments (the first being “I am the Lord your God; you shall have no other gods before me”), Krzysztof lives alone with Paweł, his highly intelligent 12-year-old son, and introduces him to the world of personal computers. Continue reading “Machines “think” differently but it’s not a problem (maybe)”