Measures of central tendency (the mean)

These measures are optimal ways to describe a bunch of data, to give an indication of what is typical or middle, so you can compare different series of data.

For a set of numerical data, these measures can be used:

The mean: specifically the arithmetic mean is the sum of the numbers divided by the number of samples

The median: order the numbers and pick up the one in the middle. If the data points are even, then take the mean of the middle two.
This is useful when the data points are very spread.

The mode: the most common number in the set. There can be multiple modes, if two or more data points occur the same number of times.

Now, how to calculate them in Python, given a set of data points?

Let’s start with the mean. Here is the core part of the python function:

def mean(dataPoints):
    return sum(dataPoints) / float(len(dataPoints))

It takes as only argument a list or a tuple of values (can be integer or float, can also be mixed).
The list and the tuple are among the most used data structures in Python.
They are a sequence of values or more generic of objects and the difference is that tuple cannot be changed while lists can.

x = [1, 3, 2.8]  # this is a list, can be modified later; note the squared brackets
y = (7.1, 2)  # this is a tuple, cannot be modified; note the round brackets

There are a couple of other interesting facts in the function above:

  • len(aListOrATuple)  returns the length of the argument, i.e. its number of elements
  • sum(aListOrATuple) returns the sum of all the elements of the argument
  • float(anObject) casts the object as argument into a float type. This is necessary because if the list / tuple has only integers then the formula will also return an integer therefore rounding it (not what is expected from a mean).
    For example mean([3,4,9]) would return 5 instead of 5.33
    Just for information: Python type system is dynamic (means the type is checked at runtime not at compile time) but is also strong (means that every change of type requires an explicit conversion).
  • To indicate the run of a block (here for example the function body) Python uses indentation instead of punctuation.
    It’s not widely used among computer languages and personally I find it irritating sometimes …

Here is the complete function, with initial description in a comment section and an extra check if the list / tuple is empty: in this case returns the special value “NaN” = Not a Number. A couple of examples follow.

def mean(dataPoints):
 """
 dataPoints: a list of data point, int or float
 returns: float, the mean of the input, or NaN if X is empty.
 """
 if not dataPoints:
   return float('NaN')
 else:
   return sum(dataPoints) / float(len(dataPoints))

'''
Unit tests 
'''
X = [10.3, 4.1, 12, 15.5, 20.2, 5.8, 7] 
Z = []    
Y = (3,4,9)    
print ("mean = ", mean(X))  # returns 10.7
print ("mean = ", mean(Z))  # returns NaN
print ("mean = ", mean(Y))  # returns 5.33
Advertisements

6 thoughts on “Measures of central tendency (the mean)

  1. Pingback: Introduction to Python package NumPy | Look back in respect

  2. Pingback: Quartiles and summary statistics in Python | Look back in respect

  3. Pingback: Automatic tests in Python | Look back in respect

  4. Pingback: Rounding error in Python | Look back in respect

  5. Pingback: deviation measures | Look back in respect

  6. Pingback: Look back in respect

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s