We have seen how to show the relation between two or more variables visually, using the scatter plot.

Let’s now see how to measure the relation between two variables in a mathematical way.

# Covariance

Covariance is a statistic used to describe the linear relationship between two variables. It captures both the magnitude of their joint variability (a measure of how much one variable changes when the other changes) and the direction of their relationship: a positive covariance indicates that when one variable increases the other tends to increase as well, and when one decreases the other tends to decrease; if the covariance is negative, an increase in one is associated with a decrease in the other.
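For instance (a small illustrative sketch with made-up data, not from the original post), the sign of the covariance can be computed directly from the definition:

```python
# a small illustrative example: the sign of the covariance
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]    # tends to rise with x
y_down = [9, 7, 6, 4, 2]  # tends to fall as x rises

def cov_sign(a, b):
    # sum of products of the deviations from the means
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    s = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return "positive" if s > 0 else "negative"

print(cov_sign(x, y_up))    # positive
print(cov_sign(x, y_down))  # negative
```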

This answer on CrossValidated shows a nice visual explanation of covariance based on scatter plots.

The formula for the covariance (of a population of size n) is:

$$\operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

If the data are a sample instead of the whole population, dividing by (n − 1) instead of n makes the estimator unbiased.

The code is:

```python
import numpy as np

# given a data set z, return an array with the distance between each
# data point in z and the mean of the input data set z
def diff_mean(v):
    v_bar = np.mean(v)
    return [v_i - v_bar for v_i in v]

# given two data sets x and y, return their (sample) covariance
def covariance(x, y):
    n = len(x)
    return np.dot(diff_mean(x), diff_mean(y)) / (n - 1)
```

Interesting Python topics here:

- Numpy includes many useful functions, such as *mean()* to calculate the mean of an array
- One of the strong points of numpy for data science is that it has many operators for vectors and matrices, such as *dot()*, which calculates the inner product of two vectors
- the use of **list comprehension**

A list comprehension is a syntactic construct for creating a list based on existing lists.

It follows the form of the mathematical set-builder notation and the simplest Python syntax is:

```
> originalList = [1,2,3,4]
> newList = [2*i for i in originalList]
> newList
[2, 4, 6, 8]
```

This is a very common application: make a new list where each element is the result of some operation (in this case doubling the value) applied to each member of another sequence; another common case is to create a sub-sequence of those elements that satisfy a certain condition:

```
> subList = [i for i in originalList if 2*i < 7]
> subList
[1, 2, 3]
```

List comprehensions can contain complex expressions and nested functions.
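For example (a small illustration, not from the original post), a comprehension can nest loops, filter with a condition, and call functions in its expression:

```python
import math

# nested loops plus a condition: all (i, j) pairs whose sum is even
pairs = [(i, j) for i in range(3) for j in range(3) if (i + j) % 2 == 0]
# pairs == [(0, 0), (0, 2), (1, 1), (2, 0), (2, 2)]

# the expression can itself be a function call
roots = [round(math.sqrt(n), 2) for n in [1, 2, 4, 9]]
# roots == [1.0, 1.41, 2.0, 3.0]
```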

Let’s apply the covariance function to some examples:

```
> x = [1,2,3,4]
> y = [2*i for i in x]  # second set is double
> covariance(x,x)
1.6666666666666667
> covariance(x,y)
3.3333333333333335  # cov(x,y) is double cov(x,x)
```

And now to our wine data set:

```
> import pandas as pd
> wine = pd.read_csv('wine.csv')
> wine.columns
Index(['Year', 'Price', 'WinterRain', 'AGST', 'HarvestRain', 'Age',
       'FrancePop'], dtype='object')
> covariance(wine['AGST'], wine['Price'])
0.28970517493333342
> covariance(wine['Price'], wine['Price'])
0.42294323856666671
> covariance(wine['AGST'], wine['AGST'])
0.45616087490000007
```

Actually, numpy has a similar pre-defined function called cov():

```
> import numpy as np
> np.cov(wine['AGST'], wine['Price'])
array([[ 0.45616087,  0.28970517],
       [ 0.28970517,  0.42294324]])
```

It computes the covariance matrix, where the element i,j is the covariance between the i-th and j-th elements of the data sets.
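As a quick sanity check (a small sketch with made-up data, assuming numpy's default sample convention with n − 1), the diagonal of the matrix holds each variable's variance, i.e. its covariance with itself, and the matrix is symmetric:

```python
import numpy as np

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
m = np.cov(x, y)

# diagonal elements are the variances (covariance of a variable with itself);
# both off-diagonal elements hold cov(x, y), so the matrix is symmetric
assert np.isclose(m[0, 0], np.var(x, ddof=1))
assert np.isclose(m[1, 1], np.var(y, ddof=1))
assert np.isclose(m[0, 1], m[1, 0])
```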

This tells us that the average temperature is **positively correlated** with the wine price. Let’s see if we can find out more.

# Correlation

Because the data are not standardised, you cannot use the covariance statistic to assess the **strength** of a linear relationship. To assess it using a standardised scale of -1 to +1, we can use the Correlation coefficient.

One of the most common ways to calculate the correlation is the Pearson coefficient, which gives a value between +1 and −1 inclusive, where 1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation.

For a population, the Pearson correlation coefficient is:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

For a sample it is:

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \; \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

This is the python code for the two formulas:

```python
def correlation_pop(x, y):
    # re-use the covariance defined above
    # (ddof=1 keeps the standard deviations consistent with the
    #  sample covariance above, which divides by n - 1)
    stdev_x = np.std(x, ddof=1)
    stdev_y = np.std(y, ddof=1)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / (stdev_x * stdev_y)
    else:
        return 0  # if no variation, correlation is zero

def correlation_sample(x, y):
    # sample case: the Pearson coefficient
    # check
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    # helpers
    x_diff = diff_mean(x)
    y_diff = diff_mean(y)
    # numerator
    sum_xy = np.sum([i * j for i, j in zip(x_diff, y_diff)])
    # denominator
    x_diff_sq = np.sum([i ** 2 for i in x_diff])
    y_diff_sq = np.sum([i ** 2 for i in y_diff])
    return sum_xy / np.sqrt(x_diff_sq * y_diff_sq)
```

For the population case, the function is quite straightforward and re-uses the previously defined covariance function.

What is interesting in the sample case is that we use a list comprehension again, this time together with the zip function, which returns an iterator of tuples, where the *i*-th tuple contains the *i*-th element from each of the argument sequences or iterables.

In this case, it takes the i-th element in the vector x_diff (x_i) and the i-th element in the vector y_diff (y_i), which are then used to calculate the product x_i * y_i.
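A minimal illustration (using small made-up vectors, not data from the post) of how zip pairs the elements:

```python
x_diff = [-1.5, -0.5, 0.5, 1.5]
y_diff = [-3.0, -1.0, 1.0, 3.0]

# zip pairs the i-th elements of the two sequences into tuples
pairs = list(zip(x_diff, y_diff))
# pairs == [(-1.5, -3.0), (-0.5, -1.0), (0.5, 1.0), (1.5, 3.0)]

# the comprehension then unpacks each tuple and multiplies
products = [i * j for i, j in zip(x_diff, y_diff)]
# products == [4.5, 0.5, 0.5, 4.5]
```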

As for the covariance, there is a pre-defined function in numpy to calculate the Pearson coefficient too, called corrcoef(), and there is even one in pandas, called corr().

Let’s see how they work, using our wine set:

```
In [1]: wine = pd.read_csv('wine.csv')

In [2]: correlation_sample(wine['AGST'], wine['Price'])
Out[2]: 0.659562861144

In [3]: np.corrcoef(wine['AGST'], wine['Price'])
Out[3]:
array([[ 1.        ,  0.65956286],
       [ 0.65956286,  1.        ]])

In [4]: wine.corr()  # note: output is a subset
Out[4]:
                Year     Price  WinterRain      AGST
Year        1.000000 -0.447768    0.016970 -0.246916
Price      -0.447768  1.000000    0.136651  0.659563
WinterRain  0.016970  0.136651    1.000000 -0.321091
AGST       -0.246916  0.659563   -0.321091  1.000000
```

As you can see, the value is correctly calculated, only displayed with a different number of decimals.

The numpy function returns a correlation **matrix**, while the pandas function returns a **dataframe**.

The variable most strongly correlated with the wine price is the Average Temperature (AGST), with a value of around 0.66.

# Correlation and causation

“**Correlation does not imply causation**” is a phrase used in statistics to emphasise that a correlation between two variables does not imply that one causes the other.

This is certainly true: one variable can cause the other, both can be consequences of a common cause, or the correlation can be a mere coincidence.

There is even a hilarious website and book making fun of random correlations.

What is the value in correlation? Is it useless to have knowledge that two variables are correlated?

Correlation and covariance are useful for **prediction** regardless of causation.

When you measure a clear, stable association between two variables, you know that the level of one variable provides some information about the level of the other, which you can use to help predict one variable as a function of the other.
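As a minimal sketch of that idea (using the standard least-squares formulas and made-up data, not code from this post), the covariance and variance are enough to build a best-fit line for predicting one variable from the other:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])  # roughly y = 2x

# slope and intercept of the least-squares line y ~ a + b*x;
# np.cov defaults to the sample convention (n - 1), matching np.var with ddof=1
b = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
a = np.mean(y) - b * np.mean(x)

prediction = a + b * 6  # predict y for an unseen x = 6 (roughly 11.9)
```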

We will see more about making predictions in the next posts. And how to find real signal among the noise (i.e., the random correlation).
