We have seen how to calculate measures of central tendency such as the mode and the mean, and measures of deviation such as the variance. Let’s look at another measure describing how the data is distributed: the quartiles.
The quartiles of a set of data values are the three points that divide the ranked data set (i.e. you need to order the data points first) into four equal groups, each group comprising a quarter of the data.
Quartiles are actually a type of quantile, i.e. values taken at regular intervals; another popular type of quantile is the percentile, where you divide the data set into 100 groups, as in “a student scoring above the 80th percentile of a standardised test”.
Back to the quartiles, the three data points are:
- first quartile, also called the lower quartile: splits off the lowest quarter (25%) of data from the rest
- second quartile, also called the median: cuts data set in half
- third quartile, also called the upper quartile: splits off the highest quarter (25%) of data from the rest
Let’s see how to calculate them with Python.
There are different methods for computing the quartiles (they differ when the data set is composed of an odd number of data points), but I use this one:
- Order the data set
- Use the median to divide the ordered data set into two halves. Do not include the median in either half.
- The lower quartile value is the median of the lower half of the data.
- The upper quartile value is the median of the upper half of the data.
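As a quick check of the steps above, here is a small worked example on a made-up data set with an odd number of points. It uses `statistics.median` from the standard library as a stand-in for the median function written earlier, so treat it as a sketch rather than the final implementation.

```python
from statistics import median  # assumption: stand-in for the median function from the earlier post

# made-up data, 11 points (odd)
data = [41, 6, 49, 15, 7, 40, 36, 43, 39, 47, 42]

# 1. order the data set
ordered = sorted(data)  # [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

# 2. the median (40) splits the data; leave it out of both halves
mid = len(ordered) // 2
lowerHalf = ordered[:mid]      # [6, 7, 15, 36, 39]
upperHalf = ordered[mid + 1:]  # [41, 42, 43, 47, 49]

# 3. and 4. the quartiles are the medians of the two halves
lowerQ = median(lowerHalf)
upperQ = median(upperHalf)
print(lowerQ, upperQ)  # 15 43
```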
Pretty simple, and the implementation is also quite straightforward because it reuses the median function we already implemented; the initial part is exactly like in the median function:
def quartiles(dataPoints):
    # check the input is not empty
    if not dataPoints:
        raise StatsError('no data points passed')
    # 1. order the data set
    sortedPoints = sorted(dataPoints)
    # 2. divide the data set in two halves
    mid = len(sortedPoints) // 2  # uses the floor division to have integer returned
The check for an even or odd number of data points is similar, and then the code calls the median function itself:
    if (len(sortedPoints) % 2 == 0):
        # even
        lowerQ = median(sortedPoints[:mid])
        upperQ = median(sortedPoints[mid:])
    else:
        # odd
        lowerQ = median(sortedPoints[:mid])  # same as even
        upperQ = median(sortedPoints[mid+1:])
The code shows how to subset a vector (a list) in Python. You can extract a part using a range of indexes; say we want to extract the items at indexes 2, 3 and 4 (indexes start at 0 and the upper limit is excluded):
smallerVector = theVector[2:5]
But you can omit the lower or upper limit, and by default the slice will start at the very first item or end at the very last:
theVector[:index] = takes only the part from the beginning until the index, excluded.
theVector[index:] = takes the part from the index (included) until the end.
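The slicing rules are easy to try out on a small list. The values below are made up, just to show which items each form picks:

```python
# made-up list, just to show the slicing rules
theVector = [10, 20, 30, 40, 50, 60]

middle = theVector[2:5]  # indexes 2, 3 and 4 -> [30, 40, 50]
head = theVector[:2]     # from the start up to index 2, excluded -> [10, 20]
tail = theVector[2:]     # from index 2, included, to the end -> [30, 40, 50, 60]

print(middle, head, tail)

# head and tail together give back the whole list
print(head + tail == theVector)  # True
```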
And finally:
    return (lowerQ, upperQ)
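Putting the pieces together, the whole function might look like the sketch below. To keep it self-contained, `StatsError` is defined here as a plain exception subclass and `median` is taken from the standard library; both are assumptions standing in for the versions written earlier.

```python
from statistics import median  # assumption: stand-in for the median function from the earlier post


class StatsError(ValueError):
    """Assumed definition; the original StatsError is not shown in this post."""


def quartiles(dataPoints):
    # check the input is not empty
    if not dataPoints:
        raise StatsError('no data points passed')
    # 1. order the data set
    sortedPoints = sorted(dataPoints)
    # 2. divide the data set in two halves
    mid = len(sortedPoints) // 2
    if len(sortedPoints) % 2 == 0:
        # even: the two halves meet at mid
        lowerQ = median(sortedPoints[:mid])
        upperQ = median(sortedPoints[mid:])
    else:
        # odd: skip the median itself, which sits at index mid
        lowerQ = median(sortedPoints[:mid])
        upperQ = median(sortedPoints[mid + 1:])
    return (lowerQ, upperQ)


print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))  # (2.5, 6.5)
print(quartiles([1, 2, 3, 4, 5, 6, 7]))     # (2, 6)
```

Note that with a single data point both halves are empty, so the stand-in median raises an error in this sketch.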
Python allows a function to return more than one value, via a tuple.
You can then access its items by index, as with a normal list. For example, let’s define a function printing the summary statistics of a data set (to get a better feel for how it is distributed):
def summary(dataPoints):
    if not dataPoints:
        raise StatsError('no data points passed')
    print("Summary statistics")
    print("Min : ", min(dataPoints))
    print("Lower Qu.: ", quartiles(dataPoints)[0])
    print("Median : ", median(dataPoints))
    print("Mean : ", mean(dataPoints))
    print("Upper Qu.: ", quartiles(dataPoints)[1])
    print("Max : ", max(dataPoints))
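Since the result is a tuple, the two values can also be unpacked into separate names with a single call, instead of indexing the result. Below is a minimal sketch using a made-up function `minAndMax` (hypothetical, only for illustration) that returns two values in the same way:

```python
def minAndMax(dataPoints):
    # hypothetical helper: returns two values packed in a tuple
    return (min(dataPoints), max(dataPoints))


result = minAndMax([3, 1, 4, 1, 5])
print(result[0], result[1])  # indexing, as with a list: 1 5

# unpacking: one call, two names
lowest, highest = minAndMax([3, 1, 4, 1, 5])
print(lowest, highest)  # 1 5
```

The same idea would let the summary function call quartiles once and reuse both values.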
This entire code is available on GitHub.
When it is odd, a divide by 2 will give you a non integer for the index. How does that work?
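Very good point. I adjusted the code in the GitHub repository but forgot to update it here (I will do so now, thanks for spotting it).
Basically, if the length is odd, floor division rounds the index down, so splitting at mid gives two sets where one is slightly bigger than the other (the code then skips the median itself by starting the upper half at mid+1):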
mid = len(sortedPoints) // 2 # uses the floor division to have integer returned
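To make the difference concrete, here is a small sketch (assuming Python 3, with a made-up list of odd length) comparing the two division operators and showing what the integer index selects:

```python
sortedPoints = [1, 2, 3, 4, 5, 6, 7]  # made-up, odd length

print(len(sortedPoints) / 2)   # true division: 3.5
print(len(sortedPoints) // 2)  # floor division: 3

mid = len(sortedPoints) // 2
print(sortedPoints[mid])       # 4, the median itself
print(sortedPoints[:mid])      # [1, 2, 3]
print(sortedPoints[mid + 1:])  # [5, 6, 7]

# a float cannot be used as a slice index
try:
    sortedPoints[:len(sortedPoints) / 2]
except TypeError as error:
    print("TypeError:", error)
```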