Introduction to Python package NumPy

We have seen how to calculate several measures of central tendency (like mean, mode and median) in Python, using the native lists.

Lists, however, are not the fastest or most memory-efficient way to handle numerical data. The array object does better, which gives me the opportunity to introduce one of the key Python packages for data science: NumPy.

What is NumPy?

NumPy, short for Numerical Python, is a module that provides high-performance (thanks to its implementation in C and Fortran) vector, matrix and higher-dimensional data structures for Python.

The array class is the foundation of NumPy. Arrays are basically like lists in Python, except that they have a fixed size at creation, are statically typed and are homogeneous (everything inside them must be of the same type). The type of the elements is therefore determined when the array is created, and this improves performance.
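
As a small illustration of the homogeneity described above (a sketch; the variable names are only for this example), NumPy picks one type for the whole array at creation, and values assigned later are converted to that type:

```python
import numpy as np

# A list mixing integers and a float: NumPy chooses a single type that fits all
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64

# An integer array keeps its type: an assigned float is converted to an integer
ints = np.array([1, 2, 3])
ints[0] = 7.9
print(ints[0])       # 7 (the fractional part is dropped)
```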

NumPy arrays are also a much more efficient way of storing and manipulating data than the built-in Python lists, and they make it possible to exchange data between different programs and systems (for example between a Python program and a C++ program).
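
A minimal sketch of what this means in practice: the elements sit in one block of fixed-size values, and that block can be exported as raw bytes and read back. The receiving side is assumed to know the element type and byte order; the names raw and X_back are only illustrative.

```python
import numpy as np

X = np.array([10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1])

print(X.itemsize)  # bytes used by each element (8 for float64)
print(X.nbytes)    # bytes used by all elements: 8 elements * 8 bytes = 64

# Export the raw buffer and rebuild the array from it
raw = X.tobytes()
X_back = np.frombuffer(raw, dtype=np.float64)
print(np.array_equal(X, X_back))  # True
```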

There are several ways to create vector and matrix arrays, either from Python lists or from scratch:

import numpy as np
# create an array X starting from a Python list:
X = np.array([10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1])
# create an array Y with the numbers from 5 up to (but not including) 10:
Y = np.arange(5, 10)  # Y = 5, 6, 7, 8, 9
print(Y.dtype)  # note that its elements' type is integer (int64 on most platforms)
$ int64
Y_fl = np.arange(5, 10, dtype='float')  # force the type to be float
# now Y_fl = 5.0, 6.0, 7.0, 8.0, 9.0
print(Y_fl.dtype)
$ float64
# create a 2x2 matrix Z filled with zeros:
Z = np.zeros((2, 2))
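
A few more ways to build arrays, shown here only as a sketch (the variable names are illustrative): evenly spaced values with linspace(), a matrix of ones, and a matrix from a nested list.

```python
import numpy as np

# five evenly spaced values from 0 to 1, both ends included
lin = np.linspace(0, 1, 5)   # 0.0, 0.25, 0.5, 0.75, 1.0

# a 2x3 matrix filled with ones
ones = np.ones((2, 3))

# a 2x2 matrix built from a nested Python list
M = np.array([[1, 2], [3, 4]])
print(M.shape)               # (2, 2)
```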

An array object has some public attributes that you can access directly (like dtype, used above), for example:

print(X.size)   # returns the number of elements in the array
print(X.shape)  # returns a tuple with the dimensions, e.g. for matrices
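
To see how these attributes differ between a vector and a matrix, here is a short sketch using the X and Z arrays created above (ndim, the number of dimensions, is one more attribute of the same kind):

```python
import numpy as np

X = np.array([10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1])
Z = np.zeros((2, 2))

print(X.ndim, X.shape, X.size)  # 1 (8,) 8
print(Z.ndim, Z.shape, Z.size)  # 2 (2, 2) 4
```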

An array object has many methods that operate on it, returning either a single value or a new array, for example:

X.sum() # returns the sum of all elements
X.min() ; X.max() # return its minimum and maximum value
X.mean() # returns the average of all elements
X.std() # returns the standard deviation of all elements
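
Two optional arguments of these methods are worth a quick sketch. With axis, a method works along one dimension and returns an array instead of a single value; with ddof=1, std() divides by n - 1 (the sample standard deviation) instead of n, which is the default. The matrix M is only an illustrative example.

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])
print(M.sum())        # 10: sum of all elements, a single number
print(M.sum(axis=0))  # [4 6]: one sum per column
print(M.sum(axis=1))  # [3 7]: one sum per row

X = np.array([10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1])
pop_std = X.std()            # divides by n (the default, ddof=0)
sample_std = X.std(ddof=1)   # divides by n - 1, so it is slightly larger
print(round(pop_std, 3))     # 5.613
```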

NumPy also provides several statistical functions, such as the ones to calculate the mean, the median and more, as we have seen for Python lists:

print("Summary statistics (using numpy)")
print("Mean: ", np.mean(X))  # same as X.mean()
print("Median: ", np.median(X))
print("Std. Dev.: ", np.std(X))  # same as X.std()
print("Min: ", np.amin(X))  # same as X.min()
print("Max: ", np.amax(X))  # same as X.max()
print("Range (max-min): ", np.ptp(X))  # X.ptp() also works on NumPy versions before 2.0
print("Lower Qu.: ", np.percentile(X, 25))
print("Upper Qu.: ", np.percentile(X, 75))

You may notice that the percentile() function returns 5.15 instead of 4.8 as the lower quartile. This depends on how the function computes the value when the requested percentile falls between two data points (you may recall that there are different ways to do it).
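
The two numbers can be reproduced by hand. The sketch below assumes the position of the percentile in the sorted data is 0.25 * (n - 1), which matches NumPy's default behaviour for this data; the variable names are only illustrative.

```python
import numpy as np

X = np.array([10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1])

xs = np.sort(X)                   # 4.1, 4.1, 5.5, 10.3, 12, 15.5, 15.5, 20.2
pos = 0.25 * (len(xs) - 1)        # 1.75: between the 2nd and 3rd sorted values
lo = int(np.floor(pos))           # 1
i, j = xs[lo], xs[lo + 1]         # 4.1 and 5.5
fraction = pos - lo               # 0.75

linear = i + (j - i) * fraction   # about 5.15, what percentile() returns by default
midpoint = (i + j) / 2            # about 4.8, halfway between the two data points
```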

The way the value is computed can be controlled through a parameter called interpolation, which was introduced in NumPy version 1.9 (if you use Python 2.7 you may still have an older version of NumPy). From NumPy 1.22 the parameter is named method, and interpolation is deprecated:

interpolation : {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}

This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:

  • linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
  • lower: i.
  • higher: j.
  • nearest: i or j whichever is nearest.
  • midpoint: (i + j) / 2.

New in version 1.9.0
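
To compare the five options on our data, here is a minimal sketch. The helper lower_quartile() is hypothetical: it first tries the keyword method (the name used from NumPy 1.22) and falls back to interpolation on older versions.

```python
import numpy as np

X = np.array([10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1])

def lower_quartile(data, option):
    # hypothetical helper: newer NumPy releases call the keyword `method`,
    # older ones only know `interpolation`
    try:
        return np.percentile(data, 25, method=option)
    except TypeError:
        return np.percentile(data, 25, interpolation=option)

results = {}
for option in ['linear', 'lower', 'higher', 'nearest', 'midpoint']:
    results[option] = float(lower_quartile(X, option))
    print(option, results[option])
```

For this data the 25th percentile falls between 4.1 and 5.5, so lower gives 4.1, higher and nearest give 5.5, linear gives about 5.15 and midpoint about 4.8.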

‘linear’ is the default, while the behaviour we want is obtained with the argument ‘midpoint’:

np.percentile(X, 25, interpolation='midpoint')  # on NumPy 1.22 or later: method='midpoint'

NumPy does not define a function to calculate the mode, nor one for the coefficient of variation, but the ones we defined previously also work with an array object as input:

print ("Mode(s): ", mode(X))
$ Mode(s): [4.1, 15.5]
print ("Coeff. Var.: ", round(np.std(X) / np.mean(X), 3))
$ Coeff. Var.: 0.515
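
If the earlier mode() function is not at hand, the same result can be obtained with NumPy alone. The sketch below is an assumption, not the function defined previously: mode_np() is a hypothetical helper that counts the occurrences of each value with np.unique() and keeps the most frequent ones.

```python
import numpy as np

def mode_np(values):
    # hypothetical helper, not the mode() defined in the earlier post
    uniq, counts = np.unique(values, return_counts=True)
    return uniq[counts == counts.max()].tolist()

X = np.array([10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1])

print("Mode(s): ", mode_np(X))   # [4.1, 15.5]

cv = np.std(X) / np.mean(X)      # coefficient of variation
print("Coeff. Var.: ", round(cv, 3))
```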