An introduction to logistic regression

Variables can be described as either quantitative or qualitative.
Quantitative variables have a numerical value, e.g. a person’s income, or the price of a house.
Qualitative variables take values from one of a number of different classes or categories, e.g. a person’s gender (male or female), the type of house purchased (villa, flat, penthouse, …), the colour of the eyes (brown, blue, green) or a cancer diagnosis.

Linear regression predicts a continuous variable, but sometimes we want to predict a categorical variable, i.e. a variable with a small number of possible discrete outcomes, usually unordered (there is no inherent order among the outcomes).

This kind of problem is called classification.


Given a feature vector X and a qualitative response y taking values from a fixed set of categories, the classification task is to build a function f(X) that takes the feature vector X as input and predicts the value of y.
Often we are also (or even more) interested in estimating the probability that X belongs to each category.
For example, it is more valuable to have the probability that an insurance claim is fraudulent than a plain fraudulent / not fraudulent classification.

There are many possible classification techniques, or classifiers, available to predict a qualitative response.

We will now look at one called logistic regression.
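As a preview of the idea, logistic regression squashes a linear combination of the features through the logistic (sigmoid) function, so that the output can be read as a probability between 0 and 1. Here is a minimal sketch in plain Python; the coefficients b0 and b1 below are made up purely for illustration, not fitted to any data:

```python
import math

def logistic(x, b0, b1):
    # Estimated probability P(y = 1 | x) under a simple one-feature
    # logistic model: sigmoid of the linear combination b0 + b1 * x.
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical coefficients, for illustration only
b0, b1 = -3.0, 0.05

# E.g. an insurance claim with some numeric risk score x = 60:
# instead of a hard fraudulent / not-fraudulent label, we get a probability.
p = logistic(60, b0, b1)
print(p)
```

Note that the output is always strictly between 0 and 1, and it increases monotonically with the risk score when b1 is positive; how the coefficients are actually estimated from data is the subject of the post.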

Note: this post is part of a series about Machine Learning with Python.
Continue reading “An introduction to logistic regression”

Machines “think” differently but it’s not a problem (maybe)

Yet another article about the interpretability problem of many AI algorithms, this time on the MIT Technology Review, May/June 2017 issue.

The issue is clear; many of the most successful recent AI technologies revolve around deep learning: complex artificial neural networks – with so many layers of so many neurons transforming so many variables – that behave like “black boxes” for us.
We can no longer comprehend the model: we don’t know how or why the output for a specific input is obtained.
Is it scary?

In the film Dekalog 1 by Krzysztof Kieślowski – the first of ten short films inspired by the Ten Commandments, the first one being “I am the Lord your God; you shall have no other gods before me”  – Krzysztof lives alone with Paweł, his 12-year-old and highly intelligent son, and introduces him to the world of personal computers. Continue reading “Machines “think” differently but it’s not a problem (maybe)”

Agile for managing a research data team


An interesting read: Lessons learned managing a research data science team on the ACMqueue magazine by Kate Matsudaira.

The author describes how she managed a data science team in her role as VP of engineering at a data mining startup.

When you have a team of people working on hard data science problems, the things that work in traditional software don’t always apply. When you are doing research and experiments, the work can be ambiguous, unpredictable, and the results can be hard to measure.

These are the changes that the team implemented in the process: Continue reading “Agile for managing a research data team”

Premium prices and data

Putting a premium price on certain products that cost almost the same to produce is a common marketing trick. It exploits the different price sensitivity of consumers, and it is now booming thanks to extremely accurate consumer profiling.
Think about the cappuccino made with fair trade coffee or with vanilla flavour. By buying one of them you send the message that you don’t mind paying a bit extra.
The strategy is to charge the highest price that the consumer will pay for that product.
This pricing strategy is called first-degree price discrimination:

Exercising First degree (or Perfect/Primary) price discrimination requires the seller of a good or service to know the absolute maximum price (or reservation price) that every consumer is willing to pay.
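To make the idea concrete, here is a toy calculation with made-up reservation prices, comparing the revenue from the best single uniform price against perfect first-degree discrimination, where each consumer is charged exactly their reservation price:

```python
# Hypothetical reservation prices: the maximum each consumer would pay
reservation_prices = [5.0, 4.0, 3.5, 3.0, 2.0]

# Perfect (first-degree) discrimination: charge every consumer their
# reservation price, capturing the entire consumer surplus.
revenue_discrimination = sum(reservation_prices)

def uniform_revenue(p, prices):
    # At a single price p, only consumers whose reservation price
    # is at least p buy; revenue is p times the number of buyers.
    return p * sum(1 for r in prices if r >= p)

# The seller's best uniform price is one of the reservation prices
best_uniform = max(uniform_revenue(p, reservation_prices)
                   for p in reservation_prices)

print(revenue_discrimination, best_uniform)
```

With these numbers, discrimination collects 17.5 while the best uniform price (3.0, with four buyers) collects only 12.0, which is why knowing each customer’s willingness to pay is so valuable.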

More generally, there are three common techniques for identifying the customers to target with first-degree price discrimination: Continue reading “Premium prices and data”

Big data, data science and machine learning explained

Data are considered the new secret sauce: they are everywhere and have been the cornerstone of the success of many high-tech companies, from Google to Facebook.

But we have always used data: there are examples from ancient times, dating back thousands of years.
In recent centuries, data started to find more and more practical applications thanks to the emergence of statistics and, later, of business intelligence. The earliest known use of the term “Business Intelligence” is by Richard Millar Devens in 1865. Devens used the term to describe how a banker gained profit by receiving and acting upon information about his environment before his competitors.

Big Data

It was after WWII that the practice of using data-based systems to improve business decision-making – surely driven by advances in automatic computing systems and storage – started to take off and be widely adopted. Digital storage became more cost-effective than paper, and since then an unbelievable amount of data has been collected and organised in data warehouses, initially in structured formats. The term Big Data started to be used to mean simply a lot of data.

In a 2001 research report and related lectures, analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing

  1. volume (the amount of data reached peaks that could be handled only by specific systems)
  2. velocity (speed of data in and out, including the emergence of real-time data)
  3. variety (the range of data types and sources, often in unstructured formats)

Gartner, and now much of the industry, quickly picked up this “3Vs” model for describing Big Data; a decade later, the three Vs have become the generally accepted defining dimensions of big data.

Continue reading “Big data, data science and machine learning explained”

Features selection for linear regression

We have seen how to fit a linear regression model, with multiple features, how we can interpret the results and how accurately we can predict.

One issue in linear regression with multiple features was which ones to use. Not all available variables should be used: besides the overfitting problem, each new variable requires more data. We will now see how to choose variables appropriately.

Following is an example taken from the masterpiece book An Introduction to Statistical Learning by James, Witten, Hastie and Tibshirani. It is based on the Advertising dataset, available on the accompanying web site or on Kaggle.

The dataset contains statistics about the sales of a product in 200 different markets, together with advertising budgets in each of these markets for different media channels: TV, radio and newspaper.
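The notebook works with the real Advertising data; as a self-contained sketch of the same kind of fit, here is a multiple linear regression on synthetic data of the same shape (200 markets, three channel budgets). The budgets and the “true” coefficients below are made up for illustration; by construction, sales depend on TV and radio but not on newspaper, which is exactly the kind of pattern variable selection should uncover:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # same number of markets as the Advertising dataset

# Synthetic advertising budgets (made up, not the real data)
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)

# Assumed "true" model: newspaper has no effect on sales
sales = 3.0 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.0, n)

# Design matrix with an intercept column, fitted by least squares
X = np.column_stack([np.ones(n), tv, radio, newspaper])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(coef)  # the newspaper coefficient should come out close to zero
```

The fitted coefficients recover the assumed ones up to noise, and the newspaper coefficient lands near zero, hinting at the answer to the questions below before any formal selection procedure is applied.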

Imagine being the person responsible for marketing and needing to prepare a new advertising plan for next year. You may be interested in answering questions such as:
Which media contribute to sales?
Do all three media—TV, radio, and newspaper—contribute to sales, or do just one or two of the media contribute?

Suppose that, in this role, you are asked to suggest, on the basis of these data, a marketing plan for next year that will result in high product sales. What information would be useful in order to provide such a recommendation?

As usual, this example is also available as Jupyter notebook in Github.

Note: I will use the words feature, variable and predictor interchangeably to indicate the x_i inputs of the linear regression. Continue reading “Features selection for linear regression”