Introduction to time series – Part I: the basics

A time series is a data set collected through time.

Two things make it different from the datasets we have used for regular regression problems:

  1. It is time dependent. So the basic assumption of a linear regression model that the observations are independent doesn’t hold in this case.
  2. Most time series have some form of trend – either an increasing or decreasing trend – or some kind of seasonality pattern, i.e. variations specific to a particular time frame.

Basically, this means that the present is correlated with the past.
A value at time T is correlated with the value at T minus 1, but it may also be correlated with the value at time T minus 2, maybe not quite as much as with T minus 1.
And even 20 time steps back, we could still know something about the value at T because the two are still correlated, depending on which kind of time series it is.
This is obviously not true of ordinary random data.
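
To make the idea concrete, here is a minimal sketch that measures this correlation with the past (the autocorrelation) at a few lags. It compares a made-up series in which each value depends on the previous one with independent random noise. The `autocorrelation` helper, the 0.9 coefficient and the series themselves are assumptions for illustration only; they are not taken from the associated notebook.

```python
import random


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance


random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical series: each value is 0.9 times the previous one plus noise.
ar_series = [0.0]
for _ in range(1999):
    ar_series.append(0.9 * ar_series[-1] + random.gauss(0, 1))

# Independent random data for comparison.
noise = [random.gauss(0, 1) for _ in range(2000)]

for lag in (1, 2, 20):
    print(
        f"lag {lag:2d}: dependent series = {autocorrelation(ar_series, lag):+.2f}, "
        f"noise = {autocorrelation(noise, lag):+.2f}"
    )
```

For the dependent series the correlation should be strong at lag 1 and fade as the lag grows, while for the noise it should stay close to zero at every lag.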

Time series are everywhere, for example in:

  • Financial data (stocks, currency exchange rates, interest rates)
  • Marketing (click-through rates for web advertising)
  • Economics (sales and demand forecasts)
  • Natural phenomena (water flow, temperature, precipitation, wind speed, animal species abundance, heart rate)
  • Demographics and population, and so on.

What might you want to do with time series?

  • Smoothing – extract an underlying signal (a trend) from noise.
  • Modelling – explain how the time series arose, for intervention.
  • Forecasting – predict the values of the time series in the future.
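
As a small illustration of the first item, smoothing, here is a sketch of a simple moving average, one of the most basic ways to pull a trend out of noise. The `moving_average` helper and the toy series are invented for this example; other smoothers may suit real data better.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values in `series`."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i : i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Toy series: a rising trend (0, 1, 2, ...) plus noise alternating between +1 and -1.
noisy = [t + (1 if t % 2 == 0 else -1) for t in range(10)]

# Averaging each pair of neighbours cancels the alternating noise.
smoothed = moving_average(noisy, window=2)

print("noisy:   ", noisy)
print("smoothed:", smoothed)
```

The smoothed series is shorter than the input by `window - 1` points, because only full windows are averaged.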

Here we first look at the specific characteristics of time series (TS); in a second part we will work through a concrete example of TS analysis (smoothing + modelling + forecasting).

You can follow along with the associated notebook in Git. Continue reading “Introduction to time series – Part I: the basics”

Merry GDPR everyone

The GDPR (the General Data Protection Regulation of the European Union) came into force on the 25th of May (here is a funny top 10 of the worst ways to apply the GDPR).

Now that the spam-madness of GDPR mails has luckily passed its peak, I would like to break a lance in its favour.

Let’s start with the name: Data Protection is not fully representative; its goal is more than that.
In a nutshell, it gives each European citizen the right:

  • to know which kind of personal and sensitive data are collected about them
  • to know why those data are collected and how they will be used
  • to refuse the collection and processing of those data (deny consent)
  • to access the data (see them, edit if needed, obtain and port them somewhere else)
  • to be forgotten (data erased, collection and processing stopped)
  • to be promptly informed if a security breach has involved their data

Now, imagine a not-too-distant future – algorithmic pricing is already an established methodology – in which you enter a shop (online or offline) to buy a new mobile phone, but the phones have no price tags. Continue reading “Merry GDPR everyone”

Logistic regression using SKlearn

We have seen an introduction to logistic regression, with a simple example of how to predict a student’s admission to university based on past exam results.
This was done in Python, defining the sigmoid function and the gradient descent from scratch, and we have also seen the same example using the statsmodels library.
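
As a reminder of what that from-scratch approach looks like, here is a minimal sketch with a single feature. It is not the code from the earlier post: the `fit_logistic` helper, the learning rate, the number of epochs and the toy exam data are all assumptions made for this illustration.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every observation with the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean log-loss with respect to w and b.
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Invented toy data: exam score (on a 0-10 scale) and outcome (1 = admitted).
scores = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
admitted = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(scores, admitted)
print(f"P(admitted | score=2) = {sigmoid(w * 2 + b):.2f}")
print(f"P(admitted | score=8) = {sigmoid(w * 8 + b):.2f}")
```

A library implementation takes care of this optimisation loop for us, along with regularisation and multiple features.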

Now we are going to see how to solve a logistic regression problem using the popular scikit-learn library, specifically its LogisticRegression class.

The example this time is to predict survival on the Titanic (the ship that sank after hitting an iceberg).
It’s a basic learning competition on the ML platform Kaggle, a simple introduction to machine learning concepts, specifically binary classification (survived / not survived).
Here we are looking into how to apply Logistic Regression to the Titanic dataset.

You can follow along with the Python notebook on GitHub or the Python kernel on Kaggle. Continue reading “Logistic regression using SKlearn”

Azure Machine Learning Studio

Azure ML (Machine Learning) Studio is an interactive environment from Microsoft to build predictive analytics solutions.
You can upload your data or use the Azure cloud, and via a drag-and-drop interface you can combine existing machine learning algorithms – or your own scripts, in several languages – to build and test a data science pipeline.
Eventually, the final model can be deployed as a web service, e.g. for Excel or custom apps.

If you are already using the Azure solutions, it offers a valuable add-on for machine learning, especially if you need a quick way to analyse a dataset and evaluate a model.

This is what Gartner says about Azure ML Studio in its 2018 “Magic Quadrant for Data Science and Machine-Learning Platforms”:

Microsoft remains a Visionary.
Its position in this regard is attributable to low scores for market responsiveness and product viability, as Azure Machine Learning Studio’s cloud-only nature limits its usability for the many advanced analytic use cases that require an on-premises option.

Note: I have no affiliation with Microsoft, nor am I paid by them. I am just looking into the main tools available for machine learning.

We will see how to create and build a regression model based on the Autos dataset that we already used earlier.

You can follow along with this experiment directly in Azure ML Studio. Continue reading “Azure Machine Learning Studio”

What makes an effective team?

Google has always been a data-driven company, not only for its businesses but also for nearly every aspect of its employees’ professional lives.

A few years ago, Project Aristotle was started to assess what it takes to build the perfect team. A dedicated work group of researchers measured the effectiveness of many of Google’s teams using a combination of qualitative evaluations and quantitative metrics, with the goal of finding a comprehensive definition of team effectiveness.

You can read the story in this New York Times article; here are the results of the project.

A quick summary

The researchers soon found that what really mattered was NOT WHO is on the team, BUT HOW the team worked together. This makes sense: teams are highly interdependent – team members need one another to get work done.

Continue reading “What makes an effective team?”

Recover audio using linear regression

In this example, we will use linear regression to recover or ‘fill out’ a completely deleted portion of an audio file!
For this, we use the Free Spoken Digit Dataset (FSDD), an audio dataset put together by Zohar Jackson:

cleaned up audio (no dead-space, roughly same length, same bitrate, same samples-per-second rate, same speaker, etc) samples ready for machine learning.

You can follow along with the associated notebook in GitHub. Continue reading “Recover audio using linear regression”