Covariance and correlation

We have seen how to show the relation between two or more variables visually, using the scatter plot.

Let’s see now how to measure the relation between two variables in a mathematical way.


Covariance is a statistic used to describe the linear relationship between two variables. It describes both how much the variables vary together (its magnitude depends on the scale of the data) and the direction of their relationship: a positive covariance indicates that when one variable increases, the second tends to increase, and when one decreases the other tends to decrease; on the other hand, a negative covariance means that an increase in one tends to go together with a decrease in the other.

This answer on CrossValidated shows a nice visual explanation of covariance based on scatter plots:

Red “rectangles” between pairs are positive relations, blue are negative.

The formula for the covariance is :

cov(X, Y) = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})\cdot (y_{i} - \bar{y})}{n-1}

The divisor n−1 (rather than n) is used when the data are a sample instead of the whole population; it makes the estimator unbiased.

The code is:

import numpy as np

def diff_mean(v):
  # given a data set v, return a list with the distance between each
  # data point in v and the mean of v
  v_bar = np.mean(v)
  return [v_i - v_bar for v_i in v]

def covariance(x, y):
  # given two data sets x and y, return their sample covariance
  n = len(x)
  return np.dot(diff_mean(x), diff_mean(y)) / (n - 1)

Interesting Python topics here:

  • Numpy includes many useful functions, such as mean() to calculate the mean of an array
  • One of the strong points of numpy for data science is that it has many operators for vectors and matrices, such as dot(), which calculates the inner product of two vectors
  • the use of list comprehensions.
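As a quick sketch of the two numpy calls mentioned above:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([2.0, 4.0, 6.0, 8.0])

print(np.mean(v))    # arithmetic mean of the array -> 2.5
print(np.dot(v, w))  # inner product: 1*2 + 2*4 + 3*6 + 4*8 -> 60.0
```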

A list comprehension is a syntactic construct for creating a list based on existing lists.
It follows the form of the mathematical set-builder notation and the simplest Python syntax is:

> originalList = [1,2,3,4]
> newList = [2*i for i in originalList]
> newList
[2, 4, 6, 8]

This is a very common application: make a new list where each element is the result of some operation (in this case doubling the value) applied to each member of another sequence; another common case is to create a sub-sequence of those elements that satisfy a certain condition:

> subList = [i for i in originalList if 2*i < 7]
> subList
[1, 2, 3]

List comprehensions can contain complex expressions and nested functions.
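For instance, here is a sketch of a comprehension with a function call in the expression, and of a nested ("flattening") comprehension:

```python
# a function call inside the expression
words = ["covariance", "correlation", "mean"]
lengths = [len(w) for w in words]
print(lengths)  # [10, 11, 4]

# a nested comprehension: flatten a list of lists
matrix = [[1, 2], [3, 4]]
flat = [item for row in matrix for item in row]
print(flat)  # [1, 2, 3, 4]
```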

Let’s apply the covariance function to some examples:

> x = [1,2,3,4]
> y = [2*i for i in x]  # second set is double

> covariance(x,x)
1.6666666666666667

> covariance(x,y)
3.3333333333333335  # cov(x,y) is double cov(x,x)

And now to our wine data set:

> import pandas as pd
> wine = pd.read_csv('wine.csv')

> wine.columns
Index(['Year', 'Price', 'WinterRain', 'AGST', 'HarvestRain', 'Age',
> covariance(wine['AGST'], wine['Price'])   # ≈ 0.2897
> covariance(wine['Price'], wine['Price'])  # ≈ 0.4229, i.e. the variance of Price
> covariance(wine['AGST'], wine['AGST'])    # ≈ 0.4562, i.e. the variance of AGST

Actually numpy has a similar pre-defined function, called cov():

> import numpy as np
> np.cov(wine['AGST'], wine['Price'])
array([[ 0.45616087, 0.28970517],
       [ 0.28970517, 0.42294324]])

It computes the covariance matrix, where the element (i, j) is the covariance between the i-th and j-th of the data sets passed in.
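A small check of this on synthetic data (not the wine set): the diagonal of the matrix holds each data set's sample variance.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x  # second set is double, as in the small example above

m = np.cov(x, y)          # uses the n-1 (sample) divisor by default
print(m[0, 0])            # cov(x, x), i.e. the sample variance of x
print(np.var(x, ddof=1))  # the same value, computed directly
print(m[0, 1])            # cov(x, y), double cov(x, x)
```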

This tells us that the average temperature is positively correlated with the wine price. Let’s see if we can find out more.


Because the data are not standardised, the covariance statistic cannot be used to assess the strength of a linear relationship. To assess it on a standardised scale from −1 to +1, we can use the correlation coefficient.
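A quick sketch of why, on made-up data: rescaling one variable rescales the covariance, but leaves the correlation untouched.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.1, 4.1])

print(np.cov(x, y)[0, 1])             # covariance on the original scale
print(np.cov(x, 100 * y)[0, 1])       # 100 times larger: same relationship
print(np.corrcoef(x, y)[0, 1])        # correlation ...
print(np.corrcoef(x, 100 * y)[0, 1])  # ... unchanged by the rescaling
```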

One of the most common ways to calculate the correlation is the Pearson coefficient, which gives a value between +1 and −1 inclusive, where +1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation.

For a population, the Pearson correlation coefficient is:

\rho = \frac{cov(X,Y)}{\sigma_{x} \cdot \sigma _{y} }

For a sample it is:

r = \frac{\sum_{i=1}^{n} (x_{i}-\bar{x}) \cdot (y_{i}-\bar{y})} {\sqrt{ \sum_{i=1}^{n} (x_{i}-\bar{x})^2 \cdot \sum_{i=1}^{n} (y_{i}-\bar{y})^2}}

This is the Python code for the two formulas:

def correlation_pop(x, y):
  # re-use the covariance defined above; ddof=1 gives the standard
  # deviations the same n-1 divisor, so the divisors cancel out
  stdev_x = np.std(x, ddof=1)
  stdev_y = np.std(y, ddof=1)
  if stdev_x > 0 and stdev_y > 0:
    return covariance(x, y) / (stdev_x * stdev_y)
  return 0  # if there is no variation, the correlation is zero

def correlation_sample(x, y):
  # sample case: the Pearson coefficient
  # sanity checks
  assert len(x) == len(y)
  n = len(x)
  assert n > 0
  # helpers
  x_diff = diff_mean(x)
  y_diff = diff_mean(y)

  # numerator
  sum_xy = np.sum([i * j for i, j in zip(x_diff, y_diff)])
  # denominator
  x_diff_sq = np.sum([i**2 for i in x_diff])
  y_diff_sq = np.sum([i**2 for i in y_diff])

  return sum_xy / np.sqrt(x_diff_sq * y_diff_sq)

For the population case, the function is quite straightforward and re-uses the previously defined covariance function.

Interesting in the sample case is that we again use a list comprehension, this time together with the zip function, which returns an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables.

In this case, it pairs the i-th element of the vector x_diff (x_i) with the i-th element of the vector y_diff (y_i), which are then used to calculate the product x_i * y_i.
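A small sketch of zip at work, on hypothetical deviation vectors such as diff_mean would produce:

```python
# hypothetical deviation-from-the-mean vectors
x_diff = [-1.5, -0.5, 0.5, 1.5]
y_diff = [-3.0, -1.0, 1.0, 3.0]

# zip pairs the i-th elements; the comprehension multiplies each pair
products = [i * j for i, j in zip(x_diff, y_diff)]
print(products)       # [4.5, 0.5, 0.5, 4.5]
print(sum(products))  # 10.0, the numerator of the sample formula
```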

As for the covariance, there is a pre-defined function in numpy to calculate the Pearson coefficient too, called corrcoef(), and even one in pandas, called corr().

Let’s see how they work, using our wine set:

In [1]: wine = pd.read_csv('wine.csv')

In [2]: correlation_sample(wine['AGST'], wine['Price'])
Out[2]: 0.659562861144
In [3]: np.corrcoef(wine['AGST'], wine['Price'])
Out[3]: array([[ 1.        ,  0.65956286],
               [ 0.65956286,  1.        ]])

In [4]: wine.corr()  # note: only a subset of the output is shown
Out[4]:
                Year     Price  WinterRain      AGST
Year        1.000000 -0.447768    0.016970 -0.246916
Price      -0.447768  1.000000    0.136651  0.659563
WinterRain  0.016970  0.136651    1.000000 -0.321091
AGST       -0.246916  0.659563   -0.321091  1.000000

As you can see, the value is calculated correctly, just displayed with different rounding.
The numpy function returns a correlation matrix, while the pandas function returns a dataframe.
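Either result can be reduced to the single coefficient by indexing; a sketch on made-up data (`.loc` is the standard pandas label accessor):

```python
import numpy as np
import pandas as pd

x = [1.0, 2.0, 3.0, 4.0]
y = [1.5, 2.1, 2.9, 4.2]

r_np = np.corrcoef(x, y)[0, 1]       # off-diagonal element of the matrix
df = pd.DataFrame({'x': x, 'y': y})
r_pd = df.corr().loc['x', 'y']       # same coefficient from the dataframe
print(r_np, r_pd)
```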

The variable most strongly correlated with the wine price is the average temperature (AGST), at a value of around 0.66.

Correlation and causation

“Correlation does not imply causation” is a phrase used in statistics to emphasise that a correlation between two variables does not imply that one causes the other.

This is certainly true: one variable can cause the other, both can be consequences of a common cause, or the correlation can even be a mere coincidence.
There is even a hilarious website and book making fun of random correlations.


What, then, is the value of correlation? Is it useless to know that two variables are correlated?

Correlation and covariance are useful for prediction regardless of causation.
When you measure a clear, stable association between two variables, you know that the level of one variable provides some information about the other variable of interest, which you can use to predict one variable as a function of the other.
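As a minimal sketch of this idea, on made-up data (not the wine set), using numpy's polyfit to fit a least-squares line:

```python
import numpy as np

temp = np.array([15.0, 16.0, 17.0, 18.0])  # hypothetical predictor
price = np.array([6.2, 6.9, 7.4, 8.1])     # hypothetical response

slope, intercept = np.polyfit(temp, price, 1)  # degree-1 (linear) fit
predicted = slope * 16.5 + intercept           # predict at a new value
print(predicted)  # approximately 7.15
```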

We will see more about making predictions in the next posts, and about how to find the real signal among the noise (i.e., the random correlations).

