Multi-class logistic regression

We have seen several examples of binary logistic regression, where the outcome to predict has two classes: for example, a model predicting whether a student will be admitted to a university (yes or no) based on previous exam results, or whether a given Titanic passenger will survive.

Binary classification problems such as these are very common, but you can also encounter problems where the outcome has more than two classes: for example, whether tomorrow's weather will be sunny, cloudy or rainy, or whether an incoming email should be tagged as work, family, friends or hobby.

We now look at a couple of approaches to handle such multi-class problems with a practical example: classifying a sky object based on a set of observed variables.
The data is from the Sloan Digital Sky Survey (Release 14).
In this example, the sky object to be classified belongs to one of three classes: Star, Galaxy or Quasar.
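As a rough illustration of what "a couple of approaches" can look like, the sketch below compares two common multi-class strategies in scikit-learn: one-vs-rest (one binary model per class) and a single multinomial (softmax) model. This assumes those are the approaches meant; the excerpt does not name them. The data is synthetic, generated with `make_classification`, and only stands in for the three sky-object classes. It is not the SDSS data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the sky-object data (assumption): 3 classes, 6 numeric features.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Approach 1: one-vs-rest, i.e. one binary logistic regression per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

# Approach 2: a single model over all classes. With the default lbfgs solver,
# recent scikit-learn versions fit a multinomial (softmax) model here.
softmax = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("one-vs-rest accuracy:", ovr.score(X_test, y_test))
print("multinomial accuracy:", softmax.score(X_test, y_test))
print("class probabilities, first test object:", softmax.predict_proba(X_test[:1]))
```

Both models expose the same `predict` and `predict_proba` interface, so switching between the two strategies is mostly a one-line change.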

The code is also available as a notebook on GitHub. The data is also available on GitHub.

Introduction to time series – Part II: an example

Exploring a milk production Time Series

Time series models are used in a wide range of applications, particularly for forecasting, which is the goal of this example, performed in four steps:

– Explore the characteristics of the time series data.
– Decompose the time series into trend, seasonal, and remainder components.
– Apply time series models.
– Forecast the production for a 12-month period.
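To make the decomposition step concrete, here is a minimal pure-Python sketch of a classical additive decomposition (centred moving average for the trend, per-month averages for the seasonal part). It is an assumption that an additive model fits; the full post may use a different method or a library routine. The function name `decompose_additive` and the synthetic monthly series are invented for illustration and are not the milk production data.

```python
def decompose_additive(series, period=12):
    """Split a series into trend, seasonal and remainder parts (additive model).

    Assumes an even period and at least two full cycles of data.
    """
    n = len(series)
    half = period // 2

    # Trend: centred moving average over one full period (half weight on both ends).
    trend = [None] * n
    for t in range(half, n - half):
        middle = series[t - half + 1 : t + half]
        total = 0.5 * series[t - half] + sum(middle) + 0.5 * series[t + half]
        trend[t] = total / period

    # Seasonal: average detrended value for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # centre the seasonal pattern on zero
    seasonal = [means[t % period] - offset for t in range(n)]

    # Remainder: whatever trend and seasonality do not explain.
    remainder = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, remainder


# Synthetic monthly series: linear trend plus a repeating 12-month pattern.
pattern = [-3, -2, -1, 0, 1, 2, 3, 2, 1, 0, -1, -2]
series = [600 + 2 * t + pattern[t % 12] for t in range(48)]

trend, seasonal, remainder = decompose_additive(series, period=12)
print("trend at t=10:", trend[10])
print("seasonal pattern:", seasonal[:12])
```

The trend is undefined for the first and last six months because the centred window does not fit there, which is why those entries are `None`.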

Part I of this mini-series explored the basic terms of time series analysis.

Load and clean the data

The dataset contains the monthly production of several dairy products in California over 18 years.
Our goal: forecast next year's production for one of those products, milk.

You can follow along with the associated notebook on GitHub.

Introduction to time series – Part I: the basics

A time series is a data set collected over time.

Two things make it different from the datasets that we used for regular regression problems:

  1. It is time dependent. So the basic assumption of a linear regression model that the observations are independent doesn’t hold in this case.
  2. Most time series have some form of trend – either an increasing or decreasing trend – or some kind of seasonality pattern, i.e. variations specific to a particular time frame.

Basically, this means that the present is correlated with the past.
A value at time T is correlated with the value at T minus 1, but it may also be correlated with the value at time T minus 2, though maybe not quite as strongly.
And even 20 time steps back, we could still learn something about the value at T because the two are still correlated, depending on the kind of time series.
This is not true of independent random data, where past values tell us nothing about the present one.
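The idea can be checked numerically with a small pure-Python sketch. It compares the sample autocorrelation of a simulated series in which each value depends on the previous one (an AR(1) process with coefficient 0.9, chosen here purely for illustration) against independent Gaussian noise. The helper name `autocorrelation` is an assumption, not something taken from the notebook.

```python
import random


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance


random.seed(42)
n = 2000

# Independent random data: no value depends on any earlier value.
noise = [random.gauss(0, 1) for _ in range(n)]

# Time-dependent data: each value is 0.9 times the previous one plus fresh noise.
ar1 = [0.0]
for _ in range(n - 1):
    ar1.append(0.9 * ar1[-1] + random.gauss(0, 1))

for lag in (1, 2, 20):
    print(
        f"lag {lag:2d}: dependent series {autocorrelation(ar1, lag):+.2f}, "
        f"independent noise {autocorrelation(noise, lag):+.2f}"
    )
```

For the dependent series the correlation is strong at lag 1 and fades as the lag grows, while for the independent noise it stays close to zero at every lag.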

Time series are everywhere, for example in:

  • Financial data (stocks, currency exchange rates, interest rates)
  • Marketing (click-through rates for web advertising)
  • Economics (sales and demand forecasts)
  • Natural phenomena (water flow, temperature, precipitation, wind speed, animal species abundance, heart rate)
  • Demographics and population data, and so on.

What might you want to do with time series?

  • Smoothing – extract an underlying signal (a trend) from noise.
  • Modelling – explain how the time series arose, for intervention.
  • Forecasting – predict the values of the time series in the future.
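As a small illustration of the smoothing task, here is a simple moving average in plain Python. This is only one of many smoothing techniques and is not necessarily the one used later in the series; the function name `moving_average` and the toy data are made up for the example.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values (no padding at the ends)."""
    return [
        sum(series[i : i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Toy series: a steady upward trend plus noise that alternates +1 / -1.
noisy = [t + (1 if t % 2 == 0 else -1) for t in range(10)]

smoothed = moving_average(noisy, window=2)
print("noisy:   ", noisy)
print("smoothed:", smoothed)
```

With a window of 2 the alternating noise cancels out and only the underlying upward trend remains. A larger window smooths more aggressively but also reacts more slowly to real changes.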

We first look at the specific characteristics of time series (TS); a second part then walks through a concrete example of TS analysis (smoothing, modelling and forecasting).

You can follow along with the associated notebook on GitHub.

Logistic regression using SKlearn

We have seen an introduction to logistic regression with a simple example: predicting a student's admission to university based on past exam results.
This was done in Python from scratch, defining the sigmoid function and gradient descent ourselves, and we have also seen the same example using the statsmodels library.

Now we are going to see how to solve a logistic regression problem using the popular scikit-learn library, specifically its LogisticRegression class.

The example this time is predicting survival on the Titanic, the ship that sank after hitting an iceberg.
It is a basic learning competition on the ML platform Kaggle and a simple introduction to machine learning concepts, specifically binary classification (survived / not survived).
Here we look at how to apply logistic regression to the Titanic dataset.
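The basic scikit-learn workflow can be sketched as follows. To keep the snippet self-contained it uses synthetic data from `make_classification` as a stand-in for the Titanic features, so the numbers it prints say nothing about the real dataset. The split ratio and `max_iter` value are arbitrary choices, not the ones from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the Titanic features (assumption):
# 4 numeric columns, label 1 = "survived", 0 = "not survived".
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the model on the training split.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Hard class predictions, class probabilities and accuracy on held-out data.
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)  # columns follow model.classes_
accuracy = model.score(X_test, y_test)

print("first predictions:", predictions[:5])
print("first probabilities:", probabilities[:2])
print("held-out accuracy:", accuracy)
```

With the real data, the extra work is mostly in preparing the feature matrix (handling missing values and encoding categorical columns) before calling `fit`.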

You can follow along with the Python notebook on GitHub or the Python kernel on Kaggle.

Azure Machine Learning Studio

Azure ML (Machine Learning) Studio is an interactive environment from Microsoft to build predictive analytics solutions.
You can upload your own data or use data already stored in the Azure cloud, and via a drag-and-drop interface you can combine existing machine learning algorithms – or your own scripts, in several languages – to build and test a data science pipeline.
Finally, the resulting model can be deployed as a web service, e.g. for Excel or custom apps.

If you are already using Azure solutions, it offers a valuable add-on for machine learning, especially if you need a quick way to analyse a dataset and evaluate a model.

This is what Gartner says about Azure ML Studio in its 2018 “Magic Quadrant for Data Science and Machine-Learning Platforms”:

Microsoft remains a Visionary.
Its position in this regard is attributable to low scores for market responsiveness and product viability, as Azure Machine Learning Studio’s cloud-only nature limits its usability for the many advanced analytic use cases that require an on-premises option.

Note: I have no affiliation with Microsoft, nor am I paid by them. I am just looking into the main tools available for machine learning.

We will see how to create and build a regression model based on the Autos dataset that we already used earlier.

You can follow along with this experiment directly in Azure ML Studio.

Recover audio using linear regression

In this example, we will use linear regression to recover or ‘fill out’ a completely deleted portion of an audio file!
For this, we use the Free Spoken Digit Dataset (FSDD), an audio dataset put together by Zohar Jackson:

cleaned up audio (no dead-space, roughly same length, same bitrate, same samples-per-second rate, same speaker, etc) samples ready for machine learning.
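The general idea can be sketched with NumPy on synthetic data: treat the surviving part of each clip as the input features and the deleted part as a multi-output target, then fit a linear map by least squares (linear regression without an intercept). The clips below are plain sinusoids with random amplitude and phase, not FSDD recordings, and the sizes are arbitrary. Real speech is far less regular, so recovery there would only be approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

n_clips, n_samples, n_missing = 50, 200, 50
t = np.arange(n_samples)

# Synthetic "audio clips" (assumption): one sinusoid per clip, same frequency,
# random amplitude and phase.
amplitude = rng.uniform(0.5, 1.5, size=(n_clips, 1))
phase = rng.uniform(0.0, 2.0 * np.pi, size=(n_clips, 1))
clips = amplitude * np.sin(2.0 * np.pi * 0.02 * t + phase)

# Features: the part of each clip that survived. Targets: the deleted tail.
X = clips[:, :-n_missing]
Y = clips[:, -n_missing:]

# Hold out the last clip; its tail plays the role of the deleted audio.
X_train, Y_train = X[:-1], Y[:-1]
x_damaged, y_true = X[-1:], Y[-1:]

# Least-squares fit of a linear map from kept samples to missing samples.
weights, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

# Recover the missing tail of the held-out clip.
y_recovered = x_damaged @ weights
max_error = float(np.max(np.abs(y_recovered - y_true)))
print("largest absolute error on the recovered tail:", max_error)
```

The recovery is close to exact here only because the tail of a fixed-frequency sinusoid is an exact linear function of its earlier samples.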

You can follow along with the associated notebook on GitHub.