Introduction to Python package pandas

Pandas is another key Python library for data science.

It contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. Pandas is built on top of NumPy.

Let’s see how it can help by reading and analysing a data set.

The Series and the DataFrame are the pandas foundation classes.

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.
The simplest Series is formed from only one array of data:


import pandas as pd
simpleObjectNumbers = pd.Series([4, 7, -5, 3])
simpleObjectMixed = pd.Series(['Max', 'Offmu', 42, -1789710578])

If you print the objects, you will notice that each has an associated index, starting from zero:


>>> simpleObjectNumbers
0    4
1    7
2   -5
3    3
dtype: int64
>>> simpleObjectMixed
0            Max
1          Offmu
2             42
3    -1789710578
dtype: object
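
The index does not have to be the default range of integers. As a minimal sketch (the labels 'a' to 'd' and the name labelledNumbers are made up for this illustration), a Series can be given explicit labels and then queried by them:

```python
import pandas as pd

# Hypothetical labels, chosen only for this illustration
labelledNumbers = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])

# The two halves of a Series: the labels and the underlying data
labels = list(labelledNumbers.index)   # ['a', 'b', 'c', 'd']
values = labelledNumbers.values        # NumPy array holding 4, 7, -5, 3

# Look up a value by its label rather than by its position
print(labelledNumbers['c'])            # -5
```

The data itself is still a NumPy array; the index is simply a second array of labels carried along with it.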

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

The DataFrame has both a row and a column index; it can be thought of as a dictionary of Series, one per column, all sharing the same index. It is similar, but not identical, to R’s data.frame structure, which I guess was the inspiration.
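
To make the "dictionary of Series" picture concrete, here is a small sketch. The names prices, volumes and frame, and the labels 'x' and 'y', are invented for the example:

```python
import pandas as pd

# Two hypothetical Series sharing the same index labels
prices = pd.Series([10.3, 4.1], index=['x', 'y'])
volumes = pd.Series([100, 250], index=['x', 'y'])

# Each dictionary key becomes a column name
frame = pd.DataFrame({'Price': prices, 'Volume': volumes})

# Selecting one column gives back a Series with the shared row index
priceColumn = frame['Price']
print(type(priceColumn).__name__)   # Series
print(list(frame.index))            # ['x', 'y']
print(list(frame.columns))          # ['Price', 'Volume']
```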

To create a DataFrame, you can pass a dictionary of lists to the DataFrame constructor:

  • Each key of the dictionary will be a column name
  • The associated list will be the values within that column.

Let’s see an example to make it clear what it looks like, such as a list of stocks with their associated values:


# Since DataFrame and Series are used so often and are so closely tied to the
# pandas package, you can add the following import line at the beginning so
# that you do not need to say which package they come from every time.

from pandas import Series, DataFrame

data = {'Stock': ['Blade Industries', 'Arcady Pharma', 'Global Omnium', 
  'Tonkatsu', 'Future Airlines','Lumocorp', 'Hypoxio', 'Transdyne'],
  'Value': [10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1]}

futureExchange = DataFrame(data, columns=['Stock', 'Value'])

>>> futureExchange
              Stock  Value
0  Blade Industries   10.3
1     Arcady Pharma    4.1
2     Global Omnium   12.0
3          Tonkatsu   15.5
4   Future Airlines   20.2
5          Lumocorp    5.5
6           Hypoxio   15.5
7         Transdyne    4.1

The DataFrame class has many methods, which we will see gradually. One of the nice ones is describe, which returns a statistical summary (count, mean, standard deviation, min and max, median and the quartiles) of the values found in the numeric columns:


>>> futureExchange.describe()
           Value
count   8.000000
mean   10.900000
std     6.000238
min     4.100000
25%     5.150000
50%    11.150000
75%    15.500000
max    20.200000

These are the same values and the same summary as in the previous example using NumPy arrays (in fact, DataFrame and Series are built on them).
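
As a rough cross-check, assuming only the Value list from the DataFrame above, the same figures can be recomputed with plain NumPy calls. The variable names here are illustrative:

```python
import numpy as np

# The 'Value' column from the stocks example
values = np.array([10.3, 4.1, 12, 15.5, 20.2, 5.5, 15.5, 4.1])

mean = values.mean()
# describe() reports the sample standard deviation, hence ddof=1
std = values.std(ddof=1)
# First quartile, median and third quartile in one call
q25, median, q75 = np.percentile(values, [25, 50, 75])

print(round(mean, 6))     # 10.9
print(round(std, 6))      # 6.000238
print(round(q25, 6), round(median, 6), round(q75, 6))   # 5.15 11.15 15.5
```

Note the ddof=1 argument: NumPy's std defaults to the population standard deviation, so without it the result would not match the std row of describe.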
