Rounding error in Python

All the data analysis Python examples are stored on GitHub, in the repository called datascience.

The functions there are coded slightly differently because of how floating point works in Python (and in most other programming languages). In short, computers store every number in a finite number of bits, which introduces a rounding error. For the details, see What Every Computer Scientist Should Know About Floating-Point Arithmetic, David Goldberg, March 1991.
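A minimal illustration of the effect, assuming a standard Python interpreter with IEEE 754 double-precision floats (the usual case):

```python
import math

# 0.1, 0.2 and 0.3 have no exact binary representation,
# so the sum carries a tiny rounding error.
total = 0.1 + 0.2

print(total == 0.3)               # False
print(round(total, 3) == 0.3)     # True: rounding hides the error
print(math.isclose(total, 0.3))   # True: comparing with a tolerance also works
```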

Therefore I introduced an argument “precision” with a default value. In Python, an argument with a default value is optional: if you do not pass it when calling the function, the default value is used.

Let’s see how the function to calculate the mean is different with precision added:

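def mean(dataPoints, precision=3):
  # precision is optional: number of digits to keep after the decimal point
  try:
    # float() forces true division, also under Python 2
    return round(sum(dataPoints) / float(len(dataPoints)), precision)
  except ZeroDivisionError:
    # an empty sequence has length zero
    raise StatsError('no data points passed')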

The difference is the Python built-in function round(number, ndigits), which rounds a number to ndigits digits after the decimal point.
The number of digits can be passed through the extra argument “precision”; otherwise the default (3 digits) is used.
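A few calls show how round behaves in practice. This sketch assumes Python 3; Python 2 rounds halves away from zero instead of to the nearest even digit.

```python
# Ordinary case: keep three digits after the decimal point.
a = round(3.14159, 3)
print(a)                 # 3.142

# Halves are rounded to the nearest even integer in Python 3.
print(round(2.5))        # 2
print(round(3.5))        # 4

# 2.675 cannot be stored exactly and sits slightly below 2.675,
# so it rounds down rather than up.
b = round(2.675, 2)
print(b)                 # 2.67
```

The last call is itself a consequence of floating-point representation, not a bug in round.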

These are valid ways to call the function:

mean(x)    # returns the mean with at most 3 digits after the decimal point
mean(x, 5) # returns the mean with at most 5 digits after the decimal point
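To try these calls on their own, here is a self-contained sketch. The StatsError class below is a hypothetical stand-in for the exception defined in the repository, and the list x is made-up sample data.

```python
class StatsError(Exception):
    """Hypothetical stand-in for the repository's exception class."""


def mean(dataPoints, precision=3):
    try:
        return round(sum(dataPoints) / float(len(dataPoints)), precision)
    except ZeroDivisionError:
        raise StatsError('no data points passed')


x = [1, 2, 2]  # made-up sample data; the exact mean is 5/3

print(mean(x))                # 1.667
print(mean(x, 5))             # 1.66667
print(mean(x, precision=1))   # 1.7 (the argument can also be passed by name)

try:
    mean([])
except StatsError as err:
    print(err)                # no data points passed
```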

The coefficient of variation

We have seen how the standard deviation describes how spread out a set of data is.
The coefficient of variation (CV) is a standardised measure of dispersion of a probability distribution or frequency distribution. It is defined as the ratio of the standard deviation (sigma) to the mean (mu):

CV = \sigma / \mu

It shows the extent of variability in relation to the mean of the population.

The coefficient of variation is useful because the standard deviation of data must always be understood in the context of the mean of the data. In contrast, the actual value of the CV is independent of the unit in which the measurement has been taken, so it is a dimensionless number.
For comparison between data sets with different units or widely different means, one should use the coefficient of variation instead of the standard deviation.
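The unit independence can be checked with a small sketch. It uses the standard library statistics module rather than the functions from this series, assumes the population standard deviation (pstdev), and the heights are made-up data.

```python
import math
import statistics


def cv(data):
    # Assumption: population standard deviation (statistics.pstdev);
    # statistics.stdev would give the sample version instead.
    return statistics.pstdev(data) / statistics.mean(data)


heights_cm = [160.0, 170.0, 180.0]             # made-up data
heights_in = [h / 2.54 for h in heights_cm]    # same data in inches

# The standard deviations differ by the conversion factor 2.54 ...
print(statistics.pstdev(heights_cm), statistics.pstdev(heights_in))

# ... but the CV is the same, up to floating-point rounding.
print(math.isclose(cv(heights_cm), cv(heights_in)))   # True
```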

In Python the implementation is straightforward (this fragment uses the previously defined mean and standard deviation functions):

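def coeffVar(X):
    # CV = standard deviation / mean, using the functions defined earlier
    try:
        return stdDev(X) / mean(X)
    except ZeroDivisionError:
        # mean(X) returned 0, so the ratio is undefined
        raise StatsError('mean is zero')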
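A runnable sketch of the whole chain follows. StatsError and stdDev are hypothetical stand-ins written for this example: stdDev is assumed to be the population standard deviation, rounded like mean, which may differ from the version in the repository.

```python
import math


class StatsError(Exception):
    """Hypothetical stand-in for the repository's exception class."""


def mean(dataPoints, precision=3):
    try:
        return round(sum(dataPoints) / float(len(dataPoints)), precision)
    except ZeroDivisionError:
        raise StatsError('no data points passed')


def stdDev(dataPoints, precision=3):
    # Assumed definition: population standard deviation.
    if not dataPoints:
        raise StatsError('no data points passed')
    m = sum(dataPoints) / float(len(dataPoints))
    variance = sum((x - m) ** 2 for x in dataPoints) / len(dataPoints)
    return round(math.sqrt(variance), precision)


def coeffVar(X):
    try:
        return stdDev(X) / mean(X)
    except ZeroDivisionError:
        raise StatsError('mean is zero')


data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data: mean 5, std dev 2
print(coeffVar(data))             # 0.4

try:
    coeffVar([-1, 1])             # the mean is 0, so the CV is undefined
except StatsError as err:
    print(err)                    # mean is zero
```

Because mean rounds to 3 digits by default, a mean smaller than 0.0005 in absolute value is also reported as zero by this version.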