Logistic regression using SKlearn

We have seen an introduction to logistic regression with a simple example: predicting a student’s admission to university based on past exam results.
That was done in Python from scratch, defining the sigmoid function and the gradient descent ourselves, and we have also seen the same example using the statsmodels library.

Now we are going to see how to solve a logistic regression problem using the popular scikit-learn library, specifically its LogisticRegression class.

The example this time is to predict survival on the Titanic, the ship that sank after striking an iceberg.
It’s a basic learning competition on the ML platform Kaggle and a simple introduction to machine learning concepts, specifically binary classification (survived / not survived).
Here we are looking into how to apply Logistic Regression to the Titanic dataset.
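As a taste of what the notebook covers, here is a minimal sketch of scikit-learn’s LogisticRegression API on Titanic-style data; the feature values below are made up for illustration, not taken from the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy features: [passenger class, age, fare]; label: survived (1) or not (0).
# These rows are invented placeholders, not actual Titanic passengers.
X = np.array([[3, 22.0, 7.25], [1, 38.0, 71.28], [3, 26.0, 7.92],
              [1, 35.0, 53.10], [3, 35.0, 8.05], [2, 27.0, 13.00],
              [1, 54.0, 51.86], [3, 2.0, 21.07]])
y = np.array([0, 1, 1, 1, 0, 0, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print(model.predict(X_test))          # predicted classes for the test set
print(model.predict_proba(X_test))    # estimated survival probabilities
print(model.score(X_train, y_train))  # accuracy on the training set
```

The sigmoid and gradient descent we coded by hand earlier are all handled internally by `fit`; `predict_proba` exposes the sigmoid outputs directly.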

You can follow along with the Python notebook on GitHub or the Python kernel on Kaggle. Continue reading “Logistic regression using SKlearn”


Azure Machine Learning Studio

Azure ML (Machine Learning) Studio is an interactive environment from Microsoft to build predictive analytics solutions.
You can upload your own data or use data already in the Azure cloud and, via a drag-and-drop interface, combine existing machine learning algorithms – or your own scripts, in several languages – to build and test a data science pipeline.
Eventually, the final model can be deployed as a web service, to be consumed from e.g. Excel or custom apps.

If you are already using Azure, ML Studio offers a valuable add-on for machine learning, especially if you need a quick way to analyse a dataset and evaluate a model.

This is what Gartner says about Azure ML Studio in its 2018 “Magic Quadrant for Data Science and Machine-Learning Platforms”:

Microsoft remains a Visionary.
Its position in this regard is attributable to low scores for market responsiveness and product viability, as Azure Machine Learning Studio’s cloud-only nature limits its usability for the many advanced analytic use cases that require an on-premises option.

Note: I have no affiliation with Microsoft nor am I paid by them. I am just looking into the main tools available for machine learning.

We will see how to build a regression model based on the Autos dataset that we already used earlier.

You can follow this experiment directly from Azure ML Studio. Continue reading “Azure Machine Learning Studio”

What makes an effective team?

Google has always been a data-driven company, not only for its businesses but also for nearly every aspect of its employees’ professional lives.

A few years ago, the project Aristotle was started to assess what it takes to build the perfect team. A dedicated group of researchers measured the effectiveness of many Google teams, using a combination of qualitative evaluations and quantitative metrics, with the goal of finding a comprehensive definition of team effectiveness.

You can read the story in this New York Times article; here are the results of the project.

A quick summary

The researchers soon found that what really mattered was NOT WHO is on the team, BUT HOW the team worked together. Which makes sense: teams are highly interdependent – team members need one another to get work done.

Continue reading “What makes an effective team?”

Recover audio using linear regression

In this example, we will use linear regression to recover, or ‘fill in’, a completely deleted portion of an audio file!
For this we use the FSDD (Free Spoken Digit Dataset), an audio dataset put together by Zohar Jackson:

cleaned up audio (no dead-space, roughly same length, same bitrate, same samples-per-second rate, same speaker, etc) samples ready for machine learning.
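To give a feel for the idea before opening the notebook, here is a minimal sketch using synthetic signals in place of the actual FSDD recordings: a linear model is trained to predict the deleted tail of a clip from its intact beginning.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_clips, n_samples = 50, 100
provided = 60                      # samples we keep; the rest is "deleted"

# Synthetic clips: sine waves with random phase, standing in for audio.
t = np.linspace(0, 1, n_samples)
clips = np.sin(2 * np.pi * 5 * t + rng.uniform(0, 2 * np.pi, (n_clips, 1)))

X, y = clips[:, :provided], clips[:, provided:]   # intact beginning -> tail
model = LinearRegression().fit(X[:-1], y[:-1])    # train on all but one clip

recovered = model.predict(X[-1:])                 # fill in the held-out tail
print(np.abs(recovered - y[-1:]).max())           # reconstruction error
```

Because these toy signals live in a low-dimensional subspace, the regression recovers the deleted portion almost exactly; real speech is messier, as the notebook shows.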

You can follow along with the associated notebook in GitHub. Continue reading “Recover audio using linear regression”

Regularisation in neural networks

We have seen the concept of regularisation and how it is applied to linear regression; let’s now see another example, for logistic regression done with artificial neural networks.

The task is to recognise hand-written digits, a classic one-vs-all logistic regression problem.
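The one-vs-all idea can be sketched in a few lines: one binary logistic classifier per digit, each trained to separate that digit from all the others. The data here is random and merely stands in for the real image matrix described below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((500, 400))            # 500 placeholder "images", 400 pixels each
y = rng.integers(0, 10, 500)          # digit labels 0..9

# One binary classifier per digit: digit d vs everything else.
classifiers = []
for digit in range(10):
    clf = LogisticRegression(max_iter=1000).fit(X, y == digit)
    classifiers.append(clf)

# Predict by picking the classifier that assigns the highest probability.
probs = np.column_stack([c.predict_proba(X)[:, 1] for c in classifiers])
predictions = probs.argmax(axis=1)
print(predictions[:10])
```

With random pixels the predictions are of course meaningless; the point is only the mechanics of training ten binary classifiers and taking the arg-max of their probabilities.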

The dataset contains 5000 training examples of handwritten digits and is a subset of the MNIST handwritten digit dataset.

Each training example is a 20-pixel by 20-pixel grayscale image of a digit. Each pixel is represented by a floating point number indicating the grayscale intensity at that location.

The 20 by 20 grid of pixels is unrolled into a 400-dimensional vector. Each of these training examples becomes a single row in our data matrix X. This gives us a 5000 by 400 matrix X where every row is a training example for a handwritten digit image.
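The unrolling step above is a simple reshape; here is a small sketch with the shapes from the text (the pixel values are random placeholders, not the actual digit images):

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((5000, 20, 20))   # 5000 grayscale "digit" images

X = images.reshape(5000, 400)         # unroll each 20x20 grid into one row
print(X.shape)                        # (5000, 400)

# A single row can be folded back into an image, e.g. for display.
first_digit = X[0].reshape(20, 20)
print(first_digit.shape)              # (20, 20)
```

Reshaping is lossless: the row and the original grid contain exactly the same 400 values, just laid out differently.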

Let’s get more familiar with the dataset.
You can follow along on the associated notebook.
Continue reading “Regularisation in neural networks”

Regularisation

The basic idea of regularisation is to penalise or shrink the large coefficients of a regression model.
This can help with the bias / variance trade-off (shrinking the coefficient estimates can significantly reduce their variance and thus improve the prediction error) and can help with model selection by automatically removing irrelevant features (that is, by setting the corresponding coefficient estimates to zero).
Its main drawback is that the approach can be computationally demanding.

There are several ways to perform the shrinkage; the regularisation models that we will see are Ridge regression and the Lasso. Continue reading “Regularisation”
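As a preview of the two methods, here is a minimal sketch contrasting Ridge and Lasso shrinkage on a synthetic dataset where, by construction, only the first two of ten features matter:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # shrinks all coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)    # can set coefficients exactly to zero

print(np.round(ols.coef_, 2))
print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))
print((lasso.coef_ == 0).sum(), "coefficients zeroed by the Lasso")
```

Note the qualitative difference: Ridge shrinks every coefficient a little, while the Lasso drives the estimates for the eight irrelevant features exactly to zero, performing feature selection automatically.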