Introduction to time series – Part II: an example

Exploring a milk production time series

Time series models are used in a wide range of applications, particularly for forecasting. That is the goal of this example, which proceeds in four steps:

– Explore the characteristics of the time series data.
– Decompose the time series into trend, seasonal, and remainder components.
– Apply time series models.
– Forecast the production for a 12-month period.

Part I of this mini-series explored the basic terms of time series analysis.

Load and clean the data

The dataset is the production amount of several dairy products in California, month by month, over 18 years.
Our goal: forecast the next year's production for one of those products: milk.
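As a minimal sketch of this step (the file name, the column names, and the numeric month encoding are assumptions for illustration; the notebook has the exact details), the loading, cleaning, and decomposition could look like this:

    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Load the monthly California dairy production data
    # (file and column names are illustrative, not necessarily the notebook's)
    df = pd.read_csv('cadairydata.csv')

    # Build a monthly DatetimeIndex from the Year and Month columns
    # (assuming Month is numeric, 1-12)
    df.index = pd.to_datetime(df['Year'].astype(str) + '-' + df['Month'].astype(str))

    # Keep only the milk production series
    milk = df['Milk.Prod']

    # Decompose into trend, seasonal, and remainder components
    decomposition = seasonal_decompose(milk, model='multiplicative', period=12)
    decomposition.plot()
    plt.show()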

You can follow along with the associated notebook on GitHub.


Merry GDPR everyone

The GDPR (the European Union's General Data Protection Regulation) came into force on 25 May 2018 (here is a funny top 10 of the worst ways to apply the GDPR).

Now that the spam madness of GDPR e-mails has luckily passed its peak, I would like to break a lance in its favour.

Let's start with the name: "Data Protection" is not fully representative; the regulation's goal is broader than that.
In a nutshell, it gives each European citizen the right:

  • to know which kinds of personal and sensitive data are collected about them
  • to know why those data are collected and what they will be used for
  • to refuse the collection and processing of those data (deny consent)
  • to access the data (see them, edit them if needed, obtain them and port them somewhere else)
  • to be forgotten (data erased, collection and processing stopped)
  • to be promptly informed if a security breach has involved their data

Now, imagine a not-so-distant future – algorithmic pricing is already an established methodology – when you enter a shop (online or offline) to buy a new mobile phone, but there are no price tags.

Introduction to NLTK

Text is everywhere: approximately 80% of all data is estimated to be unstructured text or rich data (web pages, social networks, search queries, documents, …), and text data is growing fast, by an estimated 2.5 exabytes every day!

We have seen how to do some basic text processing in Python; now we introduce an open-source framework for natural language processing that can further help us work with human languages: NLTK (the Natural Language ToolKit).

Tokenise a text

Let's see it first with a basic NLP task: the usual tokenisation (splitting a text into tokens, or words).
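A minimal sketch (the sample text is just an illustration):

    import nltk

    nltk.download('punkt')  # the tokeniser models, needed only once

    text = "NLTK makes natural language processing easier. Let's tokenise this text!"

    # Split the text into sentences, then into word tokens
    sentences = nltk.sent_tokenize(text)
    tokens = nltk.word_tokenize(text)

    print(sentences)  # two sentences
    print(tokens)     # note how "Let's" is split into the tokens 'Let' and "'s"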

You can follow along with a notebook on GitHub.

The overfitting problem and the bias vs. variance dilemma

We have seen what linear regression is, how to build models and algorithms to estimate their parameters, and how to measure the loss.

Now we will see how to assess how well a given method should perform in predicting new data, and how to select the best-performing one among candidate models.

We will first explore the concepts of training and test error, how they vary with model complexity, and how they can be used to form a valid assessment of predictive performance. This leads directly to the important bias-variance tradeoff, which is fundamental to machine learning.
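As a hedged illustration on synthetic data (not the examples from this post): fitting polynomials of increasing degree shows the training error shrinking steadily with complexity, while the test error eventually rises again, the signature of overfitting.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import mean_squared_error

    # A noisy sine wave as synthetic data
    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Increase model complexity and compare training vs. test error
    for degree in (1, 3, 9, 15):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        train_mse = mean_squared_error(y_train, model.predict(X_train))
        test_mse = mean_squared_error(y_test, model.predict(X_test))
        print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")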

The concepts described in this post are key to all machine learning problems, well beyond the regression setting.

NLP3o: A web app for simple text analysis

We have seen earlier an introduction to a few NLP (Natural Language Processing) topics, specifically how to tokenise a text, remove the punctuation and the stop words (the most common words, such as conjunctions), and how to find the most frequently used words.
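A minimal sketch of that pipeline with NLTK (the function and variable names here are illustrative, not the app's actual code):

    import string
    from collections import Counter

    import nltk
    from nltk.corpus import stopwords

    nltk.download('punkt')      # tokeniser models
    nltk.download('stopwords')  # stop word lists

    def top_words(text, n=10):
        # Tokenise and lower-case, then drop punctuation and English stop words
        stops = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(text.lower())
        words = [t for t in tokens
                 if t not in string.punctuation and t not in stops]
        return Counter(words).most_common(n)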

Now we can see – as an example – how to put all of this inside a simple web app.

The entire source code is available on GitHub, so you can follow along, and you can see it in action in a Heroku dyno.

[Screenshot of the NLP3o web app]

Logistic regression with Python statsmodels

We have seen an introduction to logistic regression, with a simple example: predicting a student's admission to university based on past exam results.
This was done in plain Python, using the sigmoid function and gradient descent.

We can now see how to solve the same example using the statsmodels library, specifically its Logit model for logistic regression. The library contains an optimised and efficient algorithm to find the regression parameters.
You can follow along with the Python notebook on GitHub.
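In essence the statsmodels version boils down to a few lines; here is a hedged sketch (the file and column names are illustrative, the notebook has the exact ones):

    import pandas as pd
    import statsmodels.api as sm

    # Two exam scores as features, admission (0 or 1) as the outcome
    data = pd.read_csv('exam_scores.csv')  # illustrative file name
    X = sm.add_constant(data[['exam1', 'exam2']])  # add the intercept column
    y = data['admitted']

    model = sm.Logit(y, X)  # the logistic regression model
    result = model.fit()    # fitted by maximum likelihood
    print(result.summary())

    # Predicted probability of admission for a new (made-up) student
    new_student = [[1.0, 45.0, 85.0]]  # intercept, exam1, exam2
    print(result.predict(new_student))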
