We have seen how to show the relation between two or more variables visually, using the scatter plot.
Let’s now see how to measure the relation between two variables mathematically.
Covariance is a statistic that describes the linear relationship between two variables. Its magnitude reflects how much the two variables vary together (and so depends on how spread out each of them is), while its sign gives the direction of the relationship: a positive covariance indicates that when one variable increases the other tends to increase, and when one decreases the other tends to decrease; a negative covariance means that an increase in one tends to go together with a decrease in the other.
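To make the meaning of the sign concrete, here is a minimal sketch using NumPy's built-in np.cov (which returns a 2x2 matrix whose off-diagonal entry is the covariance of the two inputs). The variable names and the numbers are made up for illustration only.

```python
import numpy as np

hours_studied = [1, 2, 3, 4, 5]      # made-up values
exam_score = [52, 60, 63, 71, 80]    # tends to rise with hours_studied
hours_of_tv = [9, 7, 6, 4, 2]        # tends to fall with hours_studied

# np.cov returns a 2x2 matrix; entry [0, 1] is cov(first, second)
cov_up = np.cov(hours_studied, exam_score)[0, 1]
cov_down = np.cov(hours_studied, hours_of_tv)[0, 1]

print(cov_up > 0)    # True: the two variables move in the same direction
print(cov_down < 0)  # True: the two variables move in opposite directions
```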
This answer on CrossValidated gives a nice visual explanation of covariance based on scatter plots.
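For two data sets x and y of n points each, with means $\bar{x}$ and $\bar{y}$, the sample covariance is:

$$\operatorname{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

For a whole population the divisor is n; if the data are a sample instead of a population, dividing by (n-1) makes the estimator unbiased.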
The code is:
    import numpy as np

    # given a data set v, return a list with the distance between each
    # data point in v and the mean of v
    def diff_mean(v):
        v_bar = np.mean(v)
        return [v_i - v_bar for v_i in v]

    # given two data sets x and y, return their sample covariance
    def covariance(x, y):
        n = len(x)
        return np.dot(diff_mean(x), diff_mean(y)) / (n - 1)
Interesting Python topics here:
- NumPy includes many useful functions, such as mean(), which calculates the mean of an array.
- One of the strong points of NumPy for data science is that it has many operations on vectors and matrices, such as dot(), which calculates the inner product of two vectors.
- The use of list comprehensions.
A list comprehension is a syntactic construct for creating a list based on existing lists.
It follows the form of the mathematical set-builder notation and the simplest Python syntax is:
    > originalList = [1,2,3,4]
    > newList = [2*i for i in originalList]
    > newList
    [2, 4, 6, 8]
This is a very common application: make a new list where each element is the result of some operation (in this case, doubling the value) applied to each member of another sequence. Another common case is to create a sub-sequence of those elements that satisfy a certain condition:
    > subList = [i for i in originalList if 2*i < 7]
    > subList
    [1, 2, 3]
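As a side note, when the data already live in NumPy, the same element-by-element operation can often be written without an explicit loop. The sketch below compares a list comprehension with the vectorised form for the "distance from the mean" computation; the names values, diffs_list and diffs_array are only illustrative.

```python
import numpy as np

values = [1, 2, 3, 4]
mean_value = np.mean(values)

# list-comprehension version: one Python-level subtraction per element
diffs_list = [v - mean_value for v in values]

# vectorised version: subtracting a scalar from an array applies to every element
diffs_array = np.asarray(values) - mean_value

print(np.allclose(diffs_list, diffs_array))   # True: both give the same numbers
```

Both forms are fine for small data sets; the vectorised one tends to be faster on large arrays.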
Let’s apply the covariance function to some examples:
    > x = [1,2,3,4]
    > y = [2*i for i in x]  # second set is double
    > covariance(x,x)
    1.6666666666666667
    > covariance(x,y)  # cov(x,y) is twice cov(x,x)
    3.3333333333333335
And now to our wine data set:
    > import pandas as pd
    > wine = pd.read_csv('wine.csv')
    > wine.columns
    Index(['Year', 'Price', 'WinterRain', 'AGST', 'HarvestRain', 'Age', 'FrancePop'], dtype='object')
    > covariance(wine['AGST'], wine['Price'])
    0.28970517493333342
    > covariance(wine['Price'], wine['Price'])
    0.42294323856666671
    > covariance(wine['AGST'], wine['AGST'])
    0.45616087490000007
NumPy also has a built-in function for this, cov():

    > import numpy as np
    > np.cov(wine['AGST'], wine['Price'])
    array([[ 0.45616087,  0.28970517],
           [ 0.28970517,  0.42294324]])
np.cov() computes the covariance matrix, where the element (i, j) is the covariance between the i-th and the j-th input data sets; the diagonal therefore holds the variance of each data set.
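To see what the covariance matrix contains, the following sketch builds it entry by entry from a pairwise covariance and compares the result with np.cov. The helper pairwise_covariance and the three small data sets are made up for this illustration.

```python
import numpy as np

def pairwise_covariance(x, y):
    """Sample covariance (divisor n - 1) of two equally long data sets."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.dot(x - x.mean(), y - y.mean()) / (len(x) - 1)

# three short, made-up data sets of equal length
data_sets = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [4, 3, 1, 0],
]

k = len(data_sets)
manual = np.empty((k, k))
for i in range(k):
    for j in range(k):
        manual[i, j] = pairwise_covariance(data_sets[i], data_sets[j])

# by default np.cov treats each row of its input as one variable
print(np.allclose(manual, np.cov(data_sets)))   # True
print(np.allclose(manual, manual.T))            # True: cov(x, y) == cov(y, x)
```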
The positive covariance between AGST and Price tells us that the average temperature and the wine price tend to move in the same direction. Let’s see if we can find out more.
Because covariance is not standardised (its value depends on the units and the spread of the two variables), you cannot use it to assess the strength of a linear relationship. To assess the strength on a standardised scale from -1 to +1, we can use the correlation coefficient.
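A quick way to see the problem is to express one variable in a different unit. In the sketch below (made-up numbers, illustrative names), the same temperatures are written in degrees and in tenths of a degree: the covariance changes by a factor of 10, while the correlation coefficient computed by np.corrcoef stays the same.

```python
import numpy as np

temperature_c = np.array([15.0, 16.5, 17.0, 18.2, 19.1])   # made-up values
price = np.array([6.2, 6.9, 7.1, 7.8, 8.0])                # made-up values

# the same temperatures expressed in a different unit (tenths of a degree)
temperature_tenths = temperature_c * 10

cov_c = np.cov(temperature_c, price)[0, 1]
cov_tenths = np.cov(temperature_tenths, price)[0, 1]
corr_c = np.corrcoef(temperature_c, price)[0, 1]
corr_tenths = np.corrcoef(temperature_tenths, price)[0, 1]

print(np.isclose(cov_tenths, 10 * cov_c))   # True: covariance scales with the unit
print(np.isclose(corr_tenths, corr_c))      # True: correlation does not
```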
One of the most widely used measures of correlation is the Pearson coefficient, which gives a value between +1 and −1 inclusive, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
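For a population, the Pearson correlation coefficient is the covariance divided by the product of the standard deviations:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

For a sample, it is:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$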
This is the Python code for the two formulas:
    def correlation_pop(x, y):
        # when population, re-use the covariance above
        n = len(x)
        stdev_x = np.std(x)  # np.std divides by n by default (population)
        stdev_y = np.std(y)
        if stdev_x > 0 and stdev_y > 0:
            # covariance() divides by n - 1, so rescale it to the population divisor n
            cov_pop = covariance(x, y) * (n - 1) / n
            return cov_pop / (stdev_x * stdev_y)
        else:
            return 0  # no variation: correlation is undefined, return zero by convention

    def correlation_sample(x, y):
        # when sample, Pearson coefficient
        # check
        assert len(x) == len(y)
        n = len(x)
        assert n > 0
        # helpers
        x_diff = diff_mean(x)
        y_diff = diff_mean(y)
        # numerator
        sum_xy = np.sum([i * j for i, j in zip(x_diff, y_diff)])
        # denominator
        x_diff_sq = np.sum([i**2 for i in x_diff])
        y_diff_sq = np.sum([i**2 for i in y_diff])
        return sum_xy / np.sqrt(x_diff_sq * y_diff_sq)
For the population case, the function is quite straightforward and re-uses the previously defined covariance function.
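One point worth checking: as long as the same divisor (n or n-1) is used for the covariance and for both standard deviations, it cancels out, so the population and the sample versions of the Pearson coefficient give the same number. The sketch below illustrates this with a hypothetical helper pearson() and made-up data, and compares the result with np.corrcoef.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # made-up values
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # made-up values

def pearson(x, y, ddof):
    """Pearson coefficient using the divisor (n - ddof) consistently."""
    n = len(x)
    cov_xy = np.dot(x - x.mean(), y - y.mean()) / (n - ddof)
    return cov_xy / (np.std(x, ddof=ddof) * np.std(y, ddof=ddof))

r_population = pearson(x, y, ddof=0)   # divisor n everywhere
r_sample = pearson(x, y, ddof=1)       # divisor n - 1 everywhere

print(np.isclose(r_population, r_sample))              # True
print(np.isclose(r_sample, np.corrcoef(x, y)[0, 1]))   # True
```

The value is only off when the divisors are mixed, for example a covariance divided by n-1 together with standard deviations divided by n.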
What is interesting in the sample case is that we again use a list comprehension, this time together with the zip function, which returns an iterator of tuples where the i-th tuple contains the i-th element from each of the argument sequences or iterables.
In this case, it takes the i-th element of the vector x_diff (x_i) and the i-th element of the vector y_diff (y_i), which are then used to calculate the product x_i * y_i.
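A tiny standalone illustration of zip in this role, using hard-coded deviations (the values correspond to x = [1, 2, 3, 4] and y = [2, 4, 6, 8], written out by hand here so the snippet runs on its own):

```python
x_diff = [-1.5, -0.5, 0.5, 1.5]
y_diff = [-3.0, -1.0, 1.0, 3.0]

# zip pairs up the elements position by position
pairs = list(zip(x_diff, y_diff))
print(pairs)      # [(-1.5, -3.0), (-0.5, -1.0), (0.5, 1.0), (1.5, 3.0)]

# each tuple is unpacked into i and j, then multiplied
products = [i * j for i, j in zip(x_diff, y_diff)]
print(products)        # [4.5, 0.5, 0.5, 4.5]
print(sum(products))   # 10.0
```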
Let’s see how they work, using our wine set:
    In : wine = pd.read_csv('wine.csv')
    In : correlation_sample(wine['AGST'], wine['Price'])
    Out: 0.659562861144
    In : np.corrcoef(wine['AGST'], wine['Price'])
    Out:
    [[ 1.          0.65956286]
     [ 0.65956286  1.        ]]
    In : wine.corr()  # note: output is a subset
    Out:
                    Year     Price  WinterRain      AGST
    Year        1.000000 -0.447768    0.016970 -0.246916
    Price      -0.447768  1.000000    0.136651  0.659563
    WinterRain  0.016970  0.136651    1.000000 -0.321091
    AGST       -0.246916  0.659563   -0.321091  1.000000
As you can see, our function calculates the same value as the library functions; only the number of displayed digits differs.
The numpy function returns a correlation matrix, while the pandas function returns a DataFrame.
Among the variables shown, the one most strongly correlated with the wine price is the average temperature (AGST), with a coefficient of around 0.66.
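Rather than reading the matrix by eye, the ranking can also be extracted programmatically. The sketch below assumes a DataFrame with a Price column; since wine.csv is not reproduced here, it uses a synthetic stand-in called demo, filled with random numbers (not the real wine values) and only borrowing the column names.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the wine data: random numbers, NOT the real values
rng = np.random.default_rng(0)
agst = rng.normal(16.5, 0.7, size=50)
demo = pd.DataFrame({
    "AGST": agst,
    "WinterRain": rng.normal(600, 130, size=50),
    "Price": 0.6 * agst + rng.normal(0, 0.3, size=50),
})

# correlation of every column with Price, excluding Price itself
with_price = demo.corr()["Price"].drop("Price")

# rank by absolute value so that strong negative correlations are not overlooked
order = with_price.abs().sort_values(ascending=False).index
ranked = with_price.reindex(order)
print(ranked)
```

On the real data, the same lines applied to wine instead of demo would list the variables by the strength of their correlation with Price.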
Correlation and causation
“Correlation does not imply causation” is a phrase used in statistics to emphasise that a correlation between two variables does not imply that one causes the other.
This is certainly true: one variable can cause the other, both can be consequences of a common cause, or the correlation can even be a mere coincidence.
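The common-cause case is easy to simulate. In the following sketch, two variables are both generated from a third one plus independent noise; neither depends on the other, yet they come out strongly correlated. The variable names and the scenario are hypothetical, and the data are random numbers.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# hypothetical common cause, e.g. the daily temperature
common_cause = rng.normal(size=n)

# two variables that each depend on the common cause plus independent noise;
# neither one appears in the formula of the other
ice_cream_sales = common_cause + 0.5 * rng.normal(size=n)
sunburn_cases = common_cause + 0.5 * rng.normal(size=n)

r = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]
print(round(r, 2))   # the theoretical value for this setup is 0.8
```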
There is even a hilarious website and book making fun of random correlations.
What is the value in correlation? Is it useless to have knowledge that two variables are correlated?
Correlation and covariance are useful for prediction regardless of causation.
When you measure a clear, stable association between two variables you know that the level of one variable provides you with some information about another variable of interest, which you can use to help predict one variable as a function of the other.
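As a small preview of how an association can be turned into a prediction, here is a minimal sketch of a straight-line predictor whose slope is the covariance divided by the variance of the predictor (the usual least-squares line for one variable). The data and the predict() helper are made up for illustration; np.polyfit is used only as a cross-check.

```python
import numpy as np

x = np.array([15.0, 16.0, 16.5, 17.0, 17.5])   # made-up predictor values
y = np.array([6.4, 6.9, 7.3, 7.4, 7.9])        # made-up target values

# least-squares line: slope = cov(x, y) / var(x), intercept from the means
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

def predict(new_x):
    """Predicted y for a new x, using the fitted line."""
    return intercept + slope * new_x

# agrees with numpy's own degree-1 least-squares fit
fit_slope, fit_intercept = np.polyfit(x, y, 1)
print(np.isclose(slope, fit_slope), np.isclose(intercept, fit_intercept))  # True True
```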
We will see more about making predictions in the next posts, and about how to find the real signal among the noise (i.e., the random correlations).