Fairness can perpetuate discrimination

In the last century, insurance companies did not have sophisticated algorithms like today's. The original idea of risk spreading and the principle of solidarity were based on the notion that sharing risk bound people together, encouraging a spirit of mutual aid and interdependence.

This started to change around the 1970s, when that vision gave way to so-called actuarial fairness. Living in a poor or minority-dense neighbourhood meant paying more for insurance, being denied loans, and so on. Soon the insurers were accused of justifying discrimination, to which they replied that they were just doing their job, a purely technical job that involved no moral judgments. The effects on society were simply not their problem or business.

Sound familiar? These are the same arguments made by social media platforms today: they are merely technical platforms running algorithms, not involved in judging the content.

Civil rights activists lost their battles with the insurance industry because they insisted on arguing about the accuracy of certain statistics or the validity of certain classifications, rather than questioning whether actuarial fairness was a valid framework in the first place.

There are several obvious problems with this framework, and the same actuarial logic now underpins risk scores in criminal justice. If you believe that risk scores accurately predict the future outcomes of a certain group of people, then you must accept as “fair” that a person is more likely to spend time in jail simply because they are black.

The other problem is that there are fewer arrests in rich neighbourhoods, not because their residents commit fewer crimes but because there is less policing. One is more likely to be rearrested if one lives in an over-policed neighbourhood, and that creates a feedback loop: more arrests mean higher recidivism rates, which in turn justify more policing.

Over-policing and predictive policing may be “accurate” in the short term, but the long-term effects on communities have been shown to be negative, creating self-fulfilling prophecies.

Like the insurers, large tech firms and the computer science community also tend to frame “fairness” in a de-politicised way, as a matter only of mathematics and code.

The problem is that fairness cannot be reduced to a simple, self-contained mathematical definition: fairness is dynamic and social, not a statistical issue. It can never be fully achieved and must be constantly audited, adapted and debated in a democracy. By relying merely on historical data and current definitions of fairness, we lock in the accumulated unfairness of the past, and our algorithms and the products they support will always trail the norms, reflecting past norms rather than future ideals and slowing social progress rather than supporting it.

This is a summary of an inspiring idea from Joi Ito, which you can read in full in his article on Wired.


Introduction to time series – Part II: an example

Exploring a milk production time series

Time series models are used in a wide range of applications, particularly for forecasting, which is the goal of this example, performed in four steps:

– Explore the characteristics of the time series data.
– Decompose the time series into trend, seasonal, and remainder components.
– Apply time series models.
– Forecast the production for a 12-month period.

Part I of this mini-series explored the basic terms of time series analysis.

Load and clean the data

The dataset contains the monthly production amounts of several dairy products in California over 18 years.
Our goal: forecast the next year’s production for one of those products: milk.
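
As a minimal sketch of those four steps, assuming hypothetical file and column names (the real ones are in the notebook), the workflow might look like this:

    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Hypothetical file and column names; see the notebook for the real ones
    df = pd.read_csv("dairy-production.csv", parse_dates=["Month"], index_col="Month")
    milk = df["Milk.Prod"].asfreq("MS")  # monthly series, month-start frequency

    # Steps 1-2: explore, then decompose into trend, seasonal and remainder
    print(milk.describe())
    seasonal_decompose(milk, model="additive").plot()

    # Steps 3-4: fit a Holt-Winters model (one possible choice among
    # several) and forecast the next 12 months
    fit = ExponentialSmoothing(milk, trend="add", seasonal="add",
                               seasonal_periods=12).fit()
    print(fit.forecast(12))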

You can follow along with the associated notebook on GitHub.

Merry GDPR everyone

The GDPR (the General Data Protection Regulation of the European Union) came into force on the 25th of May 2018 (here is a funny top 10 of the worst ways to apply the GDPR).

Now that the spam-madness of GDPR mails has luckily passed its peak, I would like to say a few words in its defence.

Let’s start with the name: “Data Protection” is not fully representative; its goal is broader than that.
In a nutshell, it gives each European citizen the rights:

  • to know which kind of personal and sensitive data are collected about them
  • to know why those data are collected and what use will be made of them
  • to refuse the collection and processing of those data (deny consent)
  • to access the data (view them, edit them if needed, obtain a copy and port them elsewhere)
  • to be forgotten (data erased, collection and processing stopped)
  • to be promptly informed if a security breach involving their data occurs

Now, imagine a not-so-distant future – algorithmic pricing is already an established methodology – when you enter a shop (online or offline) to buy a new mobile phone, but there are no price tags.

Introduction to NLTK

Text is everywhere: approximately 80% of all data is estimated to be unstructured text/rich data (web pages, social networks, search queries, documents, …), and text data is growing fast, by an estimated 2.5 exabytes every day!

We have seen how to do some basic text processing in Python; now we introduce an open-source framework for natural language processing that can further help in working with human language: NLTK (the Natural Language ToolKit).

Tokenise a text

Let’s start with a basic NLP task, the usual tokenisation (splitting a text into tokens, or words).
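
A minimal sketch (the sample sentence is only an illustration, and the punkt tokeniser models need a one-off download):

    import nltk

    nltk.download("punkt")  # one-off download of the tokeniser models

    text = "NLTK makes it easy to split a text into tokens."
    tokens = nltk.word_tokenize(text)
    print(tokens)
    # ['NLTK', 'makes', 'it', 'easy', 'to', 'split', 'a', 'text', 'into', 'tokens', '.']

Note how the final full stop becomes a token of its own.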

You can follow along with a notebook on GitHub.

The overfitting problem and the bias vs. variance dilemma

We have seen what linear regression is, how to build models and algorithms to estimate their parameters, and how to measure the loss.

Now we will see how to assess how well a given method should perform in predicting new data, and how to select the best-performing among candidate models.

We will first explore the concepts of training and test error, how they vary with model complexity, and how they can be used to form a valid assessment of predictive performance. This leads directly to the important bias-variance tradeoff, which is fundamental to machine learning.
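
As a quick illustration (a synthetic dataset and hypothetical parameters, not the post’s actual example), the following sketch fits polynomials of increasing degree and shows the training error shrinking while the test error eventually grows again:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Synthetic noisy data around a smooth curve
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(100, 1))
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=100)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Increasing polynomial degree = increasing model complexity
    for degree in (1, 3, 9, 15):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        train_err = mean_squared_error(y_train, model.predict(X_train))
        test_err = mean_squared_error(y_test, model.predict(X_test))
        print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")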

The concepts described in this post are key to all machine learning problems, well beyond the regression setting.

NLP3o: A web app for simple text analysis

We have seen earlier an introduction to a few NLP (Natural Language Processing) topics, specifically how to tokenise a text, remove the punctuation and the stop words (the most common words, such as conjunctions), and how to find the most used words.

Now we can see – as an example – how to put all of this inside a simple web app.
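
As a hedged sketch of the idea (not the actual NLP3o code on GitHub, whose structure may differ), a minimal Flask endpoint wiring those steps together could look like this:

    from collections import Counter
    import string

    import nltk
    from flask import Flask, jsonify, request
    from nltk.corpus import stopwords

    nltk.download("punkt")
    nltk.download("stopwords")

    app = Flask(__name__)

    @app.route("/analyse", methods=["POST"])
    def analyse():
        # Tokenise, then drop punctuation and stop words
        text = request.form.get("text", "")
        tokens = nltk.word_tokenize(text.lower())
        stops = set(stopwords.words("english"))
        words = [t for t in tokens
                 if t not in string.punctuation and t not in stops]
        # Return the ten most frequent remaining words
        return jsonify(Counter(words).most_common(10))

    if __name__ == "__main__":
        app.run()

Posting a text to the (hypothetical) /analyse endpoint would return the ten most frequent non-stop-words as JSON.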

The entire source code is available on GitHub, so you can follow along, and you can see it in action on a Heroku dyno.
