We have seen how to calculate several measures of central tendency (like the mean, mode and median) in Python, using native lists.

A faster and more memory-efficient alternative to lists is the array object, which gives me the opportunity to introduce one of the key Python packages for data science: NumPy.

# What is NumPy?

NumPy, short for Numerical Python, is a module that provides high-performance (thanks to its implementation in C and Fortran) vector, matrix and higher-dimensional data structures for Python.

The array object class is the foundation of NumPy. Arrays are basically like lists in Python, except that they have a fixed size at creation and are statically typed and homogeneous (everything inside them must be of the same type). The type of the elements is therefore determined when the array is created, and this improves performance.
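
As a minimal sketch of what "homogeneous" and "fixed size" mean in practice, the snippet below uses small made-up arrays (the names `mixed`, `ints` and `longer` are only illustrative):

```python
import numpy as np

# Mixing integers and floats: NumPy picks one common type for every element.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)              # float64

# An integer array keeps its type: an assigned float is truncated to an integer.
ints = np.array([1, 2, 3])
ints[0] = 7.9
print(ints[0])                  # 7

# The size is fixed: np.append builds a new array and leaves the original unchanged.
longer = np.append(ints, 4)
print(ints.size, longer.size)   # 3 4
```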

NumPy arrays are also a much more efficient way of storing and manipulating data than the built-in Python lists, and they make it possible to exchange data between different programs and systems (for example between a Python program and a C++ program).
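
A rough way to see both points is sketched below, assuming the standard CPython interpreter (the exact list size varies between Python versions, so only the comparison is printed). The names `values`, `arr`, `raw` and `restored` are illustrative:

```python
import sys
import numpy as np

values = list(range(1000))
arr = np.arange(1000, dtype=np.int64)

# The array stores 1000 raw 8-byte integers in one contiguous buffer.
print(arr.itemsize, arr.nbytes)        # 8 8000

# The list stores references to separate Python int objects,
# so the list plus its elements needs more memory overall.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(list_bytes > arr.nbytes)

# The raw buffer can be handed to another program and rebuilt from the bytes alone,
# provided both sides agree on the element type.
raw = arr.tobytes()
restored = np.frombuffer(raw, dtype=np.int64)
print(np.array_equal(arr, restored))   # True
```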

There are several ways to create vector and matrix arrays, either from Python lists or from scratch:

    import numpy as np

    # create an array X starting from a Python list
    X = np.array([10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1])

    # create an array Y with the integers from 5 to 9 (the stop value 10 is excluded)
    Y = np.arange(5, 10)    # Y = 5, 6, 7, 8, 9
    print(Y.dtype)          # note that its elements' type is integer
    # output: int64

    # force the type to be float
    Y_fl = np.arange(5, 10, dtype='float')    # Y_fl = 5.0, 6.0, 7.0, 8.0, 9.0
    print(Y_fl.dtype)
    # output: float64

    # create a 2x2 matrix Z filled with zeros
    Z = np.zeros((2, 2))
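
A few other creation functions are worth knowing. The short sketch below is not part of the original example, and the variable names are only illustrative:

```python
import numpy as np

# 5 evenly spaced values from 0 to 1, both endpoints included
spaced = np.linspace(0, 1, 5)          # 0, 0.25, 0.5, 0.75, 1

# 2x3 matrix filled with ones
ones = np.ones((2, 3))

# 2x2 matrix built from a nested Python list
nested = np.array([[1, 2], [3, 4]])

# 3x3 identity matrix
identity = np.eye(3)

print(spaced.shape, ones.shape, nested.shape, identity.shape)
# (5,) (2, 3) (2, 2) (3, 3)
```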

An array object has some **public attributes** which you can access directly (like *dtype* used above), for example:

    print(X.size)     # returns the number of elements in the array
    print(X.shape)    # returns a tuple with the dimensions, e.g. for matrices
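
The difference between these attributes is easier to see on a matrix. As a small sketch, the same eight values are rearranged here into 2 rows by 4 columns with the `reshape` method (the name `M` is illustrative):

```python
import numpy as np

X = np.array([10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1])
M = X.reshape(2, 4)      # same 8 values arranged as 2 rows by 4 columns

print(X.shape, X.ndim)   # (8,) 1
print(M.shape, M.ndim)   # (2, 4) 2
print(M.size)            # 8
print(M.itemsize)        # 8 (bytes per float64 element)
```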

An array object has many **methods** which operate on it, typically returning an array result, for example:

    X.sum()     # returns the sum of all elements
    X.min()     # returns the minimum value
    X.max()     # returns the maximum value
    X.mean()    # returns the average of all elements
    X.std()     # returns the standard deviation of all elements
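
On a matrix these methods also accept an `axis` argument, which is when they return an array instead of a single number. A minimal sketch with a made-up 2x3 matrix `M`:

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])

print(M.sum())          # 21, over all elements
print(M.sum(axis=0))    # [5 7 9], one sum per column
print(M.sum(axis=1))    # [ 6 15], one sum per row
print(M.mean(axis=0))   # [2.5 3.5 4.5], one mean per column
```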

NumPy also provides several statistics functions, such as the ones to calculate the mean, the median and more, as we have seen for Python lists:

    print("Summary statistics (using numpy)")
    print("Mean: ", np.mean(X))                  # same as X.mean()
    print("Median: ", np.median(X))
    print("Std. Dev.: ", np.std(X))              # same as X.std()
    print("Min: ", np.amin(X))                   # same as X.min()
    print("Max: ", np.amax(X))                   # same as X.max()
    print("Range (max-min): ", np.ptp(X))        # same as X.ptp() in NumPy versions before 2.0
    print("Lower Qu.: ", np.percentile(X, 25))
    print("Upper Qu.: ", np.percentile(X, 75))
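
One detail to keep in mind: by default `np.std` divides by n (the population standard deviation). If you want the sample standard deviation, which divides by n - 1, you can pass `ddof=1`. A short sketch on the same data:

```python
import numpy as np

X = np.array([10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1])

pop_std = np.std(X)               # divides by n (default, ddof=0)
sample_std = np.std(X, ddof=1)    # divides by n - 1

print(round(pop_std, 3))          # 5.613
print(round(sample_std, 3))       # 6.0
```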

You may notice that the *percentile()* function returns 5.15 as the lower quartile instead of 4.8. The difference comes from how the function calculates a percentile that falls between two data points (you may recall that there are different ways to do it).

This can be controlled through a parameter called *interpolation*, which was introduced in NumPy version 1.9.0 (if you use Python 2.7 you may still have an older version of NumPy). From NumPy 1.22 onwards the same parameter is called *method*, and *interpolation* is deprecated. The NumPy documentation describes it as follows:

interpolation: {'linear', 'lower', 'higher', 'midpoint', 'nearest'}

This optional parameter specifies the interpolation method to use when the desired quantile lies between two data points i and j:

- linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
- lower: i.
- higher: j.
- nearest: i or j, whichever is nearest.
- midpoint: (i + j) / 2.

New in version 1.9.0.
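
To make the two formulas concrete, here is a sketch that works out the 25th percentile of X by hand. It assumes the position of the percentile in the sorted data is 0.25 * (n - 1), which is what the linear method uses; the variable names are illustrative:

```python
import numpy as np

X = np.array([10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1])
s = np.sort(X)                    # 4.1, 4.1, 5.5, 10.3, 12, 15.5, 15.5, 20.2

pos = 0.25 * (len(s) - 1)         # 1.75: position of the 25th percentile
lo = int(np.floor(pos))           # index of the lower neighbour i
hi = int(np.ceil(pos))            # index of the upper neighbour j
fraction = pos - lo               # 0.75

i, j = s[lo], s[hi]               # 4.1 and 5.5
linear = i + (j - i) * fraction   # about 5.15
midpoint = (i + j) / 2            # about 4.8

print(round(linear, 2), round(midpoint, 2))       # 5.15 4.8
print(np.isclose(linear, np.percentile(X, 25)))   # True
```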

*'linear'* is the default, while the behaviour we want is obtained by passing the argument *'midpoint'*:

    np.percentile(X, 25, interpolation='midpoint')    # in NumPy 1.22 and later: method='midpoint'
    # returns 4.8

NumPy does not define a function to calculate the *mode*, nor one for the coefficient of variation, but the ones we defined previously also work when given an array object as input:

    print("Mode(s):", mode(X))
    # output: Mode(s): [4.1, 15.5]
    print("Coeff. Var.:", round(np.std(X) / np.mean(X), 3))
    # output: Coeff. Var.: 0.515
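
The *mode()* function used above comes from an earlier post and is not shown here. If you do not have it at hand, a possible stand-in based on `np.unique` is sketched below. The name `mode_sketch` is hypothetical and this is not the original implementation:

```python
import numpy as np

def mode_sketch(values):
    """Return all the most frequent values as a list (hypothetical helper)."""
    uniques, counts = np.unique(values, return_counts=True)
    # keep every value whose count equals the highest count
    return uniques[counts == counts.max()].tolist()

X = np.array([10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1])
print(mode_sketch(X))    # [4.1, 15.5]
```

Because `np.unique` returns the values sorted, multiple modes come back in ascending order.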
