We have seen that measures of central tendency like the mean can describe how typical a set of data is compared to other sets.
In the same way, the variance can describe the spread of a set of data.
Let’s take these two sets as an example: A = [1, 3, 29] and B = [10, 11, 12], both with mean 11.
These are the deviation measures:
- Deviation from the mean is the difference between a given data point and the mean.
For set A, the deviations are respectively 10, 8 and 18 (in absolute value).
- Variance is the mean squared deviation, i.e. the sum of all the squared deviations from the mean, divided by the number of data points: variance = (sum of (x − mean)²) / n. For set A this is (10² + 8² + 18²) / 3 = 488 / 3 ≈ 162.67.
- As you can see, the variance is hard to interpret, since its unit is the square of the data’s unit; therefore the standard deviation was introduced, which is simply the square root of the variance.
For set A it is about 12.75; this high standard deviation shows that the set is quite dispersed (in this case, due to the outlier 29).
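The arithmetic above can be sketched in plain Python (a minimal illustration; the helper name `variance` is introduced here for clarity and is not part of the text):

```python
A = [1, 3, 29]
B = [10, 11, 12]

def variance(X):
    # arithmetic mean of the data points
    m = sum(X) / len(X)
    # mean squared deviation from the mean
    return sum((x - m) ** 2 for x in X) / len(X)

print(variance(A))         # 488/3, about 162.67
print(variance(B))         # 2/3, about 0.67
print(variance(A) ** 0.5)  # about 12.75, the standard deviation of A
```

Note how B, whose points all sit close to the mean, has a far smaller variance than A.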
Let’s see how to calculate the standard deviation in Python, given a list of values:
def stdDev(X):
    """
    X: a list of values
    returns: float, the standard deviation of the input
    """
    tot = 0.0
    meanX = mean(X)
    for x in X:
        tot += (x - meanX) ** 2
    return (tot / len(X)) ** 0.5
The ** operator is exponentiation, so ** 0.5 takes the square root.
The function mean() was previously defined.
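To try the function on the two example sets, we need a mean() helper; its original definition is not shown here, so the sketch below assumes the usual one, the sum of the values divided by their count:

```python
def mean(X):
    # arithmetic mean: sum of the values over their count (assumed definition)
    return sum(X) / len(X)

def stdDev(X):
    """
    X: a list of values
    returns: float, the standard deviation of the input
    """
    tot = 0.0
    meanX = mean(X)
    for x in X:
        tot += (x - meanX) ** 2
    return (tot / len(X)) ** 0.5

print(stdDev([1, 3, 29]))    # about 12.75: dispersed set
print(stdDev([10, 11, 12]))  # about 0.82: tightly clustered set
```

This computes the population standard deviation (dividing by len(X)); Python's standard library offers the same calculation as statistics.pstdev.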