Read and clean data with Python pandas

Probably the biggest time-consuming task when analysing data is to find, extract and clean the data itself.

You wish that the data looks like a nice table as a DataFrame but in reality data is stored in different places (databases, web sites, files, papers, sensors), in different formats (binary, JSON, Excel or other proprietary formats, hand-written), with incoherent layouts, with missing or incorrect values.

Therefore before doing any analysis on the data you need to perform what is called data mangling. Your goal version shall be what Hadley Wickham described in this paper : a  tidy version (tidy is an adjective meaning “arranged neatly and in order”), e.g. a data frame where:

  • Each variable shall be in one column
  • Each different observation shall be in a different row
  • If the variable are of different types, one table for each kind of variable
  • If you have multiple tables, then shall include a column to link them

(see also Jeff Leek’s page “how to share data with a statistician”).
The first step is then to get the raw data and produce a tidy version (processed data) through a processing script, for which Python and Pandas can be a big help.
In the script – or in a separate code book – you shall include the information about the variables, their units and the choices you made in the script.

Remember what happened to Reinhart and Rogoff to understand the importance of having a script with instructions!

Let’s see a practical example.

The University of California, Irvine (UCI) has a repository with dozens of datasets (to train Machine Learning models) and one of them is a “Human Activity Recognition (HAR) Using Smartphones Data Set”: a database built from the recordings of 30 subjects performing six activities of daily living while carrying a waist-mounted smartphone with embedded inertial sensors.
Our goal is to read this raw dataset and transform it into a tidy dataset according to the rules above, using a Python script, step by step.

If you prefer to look at the code all together, the Python script is on a GitHub repository, and you can find also an R script if you prefer that language.

This specific dataset is in zipped format and although there are Python libraries to download and unzip archives, for the sake of this example let’s assume the archive has been already downloaded and uncompressed, into a folder called “UCI HAR Dataset”. The folder contains several files (in text format) and one of them (README.txt) describes the dataset:

The dataset includes the following files:

– ‘README.txt’:

– ‘features_info.txt’: Shows information about the variables used on the feature vector.

– ‘features.txt’: List of all 561 features.

– ‘activity_labels.txt’: Links the class labels with their activity name.

– ‘train/X_train.txt’: Training set.

– ‘train/y_train.txt’: Training labels.

– ‘test/X_test.txt’: Test set.

– ‘test/y_test.txt’: Test labels.

The following files are available for the train and test data. Their descriptions are equivalent.

– ‘train/subject_train.txt’: Each row identifies the subject who performed the activity for each window sample. Its range is from 1 to 30.

– ‘train/Inertial Signals/total_acc_x_train.txt’: The acceleration signal from the smartphone accelerometer X axis in standard gravity units ‘g’. Every row shows a 128 element vector. The same description applies for the ‘total_acc_x_train.txt’ and ‘total_acc_z_train.txt’ files for the Y and Z axis.

– ‘train/Inertial Signals/body_acc_x_train.txt’: The body acceleration signal obtained by subtracting the gravity from the total acceleration.

– ‘train/Inertial Signals/body_gyro_x_train.txt’: The angular velocity vector measured by the gyroscope for each window sample. The units are radians/second.

Reading files with pandas

Let’s start reading the activity labels which are contained in the file activity_labels.txt:

import pandas as pd
act = pd.read_table(
            'UCI HAR Dataset/activity_labels.txt',   # path and filename
            sep=' ',                  # use blank spaces as separator
            header=None,              # there is no header in the file
            names=('ID','Activity'))  # use instead these names as header
> act
 ID Activity

> type(act)

The function read_table() can be used for text files separated by tabs (default) or some other delimiters, in this case sep=’ ‘ tells to use blank spaces to separate the columns.

The file has no header, so I explicitly say not to use the first row as a header (header = None) and instead I pass a list of column names to be used (names).

The result is a DataFrame with 6 rows (row index starts from 0) and 2 columns.

Now we can read the file called features.txt (the list of movement sensors) with similar parameters  as the function above (just different column names).

features = pd.read_table(
                'UCI HAR Dataset/features.txt', sep=' ', 
                header=None, names=('ID','Sensor'))

The resulting data frame is a bigger one. We can get some information on it with the useful pandas function info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 561 entries, 0 to 560
Data columns (total 2 columns):
ID 561 non-null int64
Sensor 561 non-null object
dtypes: int64(1), object(1)
memory usage: 13.1+ KB

The features data frame has 561 observations of two variables: ID and sensor’s name.
A concise way to see the data frame’s dimension is to use the shape attribute:

> features.shape
(561, 2)

And we can see the first five rows (i.e. the first observations) using the method head():

> features.head()
 ID Sensor
0 1 tBodyAcc-mean()-X
1 2 tBodyAcc-mean()-Y
2 3 tBodyAcc-mean()-Z
3 4 tBodyAcc-std()-X
4 5 tBodyAcc-std()-Y

Next let’s read all the files in the test folder. Note that it is a common practice in machine learning to split a data set into a training subset (that will be used to train the models) and one test subset (that will be used to validate the resulted models); more on this on a later post.

testSub = pd.read_table(
               'UCI HAR Dataset/test/subject_test.txt', header=None, 
> testSub.shape
(2947, 1)

testX = pd.read_table(
             'UCI HAR Dataset/test/X_test.txt', sep='\s+', header=None)
# The file X_test requires to use as a separator a regular expression, 
# because sometimes more than one blanks are used (data mangling!)
> testX.shape
(2947, 561)

# The file y_test contains the outcome activity label for each observation
testY = pd.read_table(
              'UCI HAR Dataset/test/y_test.txt', sep=' ', header=None)
> testY.shape
(2947, 1)

As you see, all test data tables have 2947 observations.

It’s also possible to add a column name after creation:

> testY.columns = ['ActivityID']
> testY.head()

0           5
1           5
2           5
3           5
4           5

Now let’s move to the train folder:

trainSub = pd.read_table(
                'UCI HAR Dataset/train/subject_train.txt', header=None, 
> trainSub.shape
(7352, 1)

trainX = pd.read_table(
              'UCI HAR Dataset/train/X_train.txt', sep='\s+', header=None)
> trainX.shape
(7352, 561)

trainY = pd.read_table(
               'UCI HAR Dataset/train/y_train.txt', sep=' ', header=None, 
> trainY.shape
(7352, 1)

As you see, the training set has 7352 observations, spread in 3 files.

The next step will be to merge.

Merge the training and the test data sets

allSub = pd.concat([trainSub, testSub], ignore_index=True)
> allSub.shape
(10299, 1)

The very flexible concat() is the method used, which can append a data frame to another one.
Now the new allSub data frame contains 10299 = 2947+7352 rows. Note that ignore_index=True is necessary to have an index starting from 0 and ending at 10298, without restarting after the first 7352 observations.
In this simple example no other parameters are necessary but concat() can do much more, such as merging other objects (e.g. dictionaries or series), merging by rows instead of columns, perform logic unions or intersections.

Now we do the same for the X and Y data sets:

allX = pd.concat([trainX, testX], ignore_index = True)
> allX.shape
(10299, 561)

allY = trainY.append(testY, ignore_index=True)
> allY.shape
(10299, 1)

For the Y dataset I used the pandas method append() just to show an alternative merge possibility: it is basically the simpler version of concat() and it can only “append” rows of another data frame (as input) to the calling data frame.

Appropriately labels the data set

Let’s now use descriptive names for the activities in the data set. The current situation is that they have just the index as the name and the actual name is stored in another file (the one called features.txt that we read into the data frame features earlier on.

> allX.head()

   0         1         2         3         4         5         6       \

0  0.288585 -0.020294 -0.132905 -0.995279 -0.983111 -0.913526 -0.995112   

1  0.278419 -0.016411 -0.123520 -0.998245 -0.975300 -0.960322 -0.998807   

2  0.279653 -0.019467 -0.113462 -0.995380 -0.967187 -0.978944 -0.996520   

3  0.279174 -0.026201 -0.123283 -0.996091 -0.983403 -0.990675 -0.997099   

4  0.276629 -0.016570 -0.115362 -0.998139 -0.980817 -0.990482 -0.998321  


Luckily with pandas data frames this is easy to get: the sensor names are in the column named ‘Sensor’ in features and you can assign new column names to a data frame using the attribute columns:

sensorNames = features['Sensor']  # get the names from features DF
allX.columns = sensorNames  # change column names

> allX.head()

Sensor  tBodyAcc-mean()-X  tBodyAcc-mean()-Y  tBodyAcc-mean()-Z  \

0                0.288585          -0.020294          -0.132905   

1                0.278419          -0.016411          -0.123520   

2                0.279653          -0.019467          -0.113462   

3                0.279174          -0.026201          -0.123283   

4                0.276629          -0.016570          -0.115362   

Much clearer. Now we have a data frame called allX which contains all sensor activities observations but one thing is still missing: the explicit reference to which subject performed the activities. Right now is only implicit by looking at the index data frame.
So, let’s merge allX with the allSub data frame which contains the subject ID:

        # Merge Subjects and sensors data frames by columns
all = pd.concat([allX, allSub], axis=1) 

> all.shape
(10299, 562)

Note the use of axis=1 to tell concat() that this time we want to keep the rows fixed and add columns.

Now we want to do the same for allY: add it to the new “all” data frame as extra column but first we want to map the activity label (this is data frame “act”) to the activity code:

> allY.head()


0           5
1           5
2           5
3           5
4           5

> allY.tail()


10294           2
10295           2
10296           2
10297           2
10298           2

for i in act['ID']:
  activity = act[act['ID'] == i]['Activity'] # get activity cell given ID
  allY = allY.replace({i: activity.iloc[0]}) # replace this ID with activity string

allY.columns = ['Activity'] # change column name from ID to Activity

> allY.head()



> allY.tail()



Voilà: now allY contains the activity label and we will merge with the rest.

all = pd.concat([all, allY], axis=1)

> all.shape
(10299, 563)

Now we have a single table with all the information about each single observation, with proper labels. This can be exported into a file with format CSV (Comma Separated Value) by using the function to_csv():


You can group the observations by each subject and each activity if you wish:

grouped = all.groupby (['SubjectID', 'Activity'])

Additionally we could also create a second, independent tidy data set with the average of each variable for each activity  and each subject.
You can do this by using the groupby() method and the aggregate() method for data frames from pandas (note that we use the numpy mean() function here):

import numpy as np

tidier = all.groupby (['Activity']).aggregate(np.mean)

tidier = tidier.drop('SubjectID', axis=1)

> tidier.head()
Sensor          tBodyAcc-mean()-X  tBodyAcc-mean()-Y  tBodyAcc-mean()-Z  \

LAYING                   0.268649          -0.018318          -0.107436   
SITTING                  0.273060          -0.012690          -0.105517   
STANDING                 0.279153          -0.016152          -0.106587   
WALKING                  0.276337          -0.017907          -0.108882   
WALKING_DOWNSTAIRS       0.288137          -0.016312          -0.105762 

That’s all.


2 thoughts on “Read and clean data with Python pandas

  1. mashimo

    This is a classification problem. You want to predict in which of the six activity classes fits your new observations. Any prediction model for classification would do.
    Probably the simplest model would be a K-nearest neighbour model (in short called kNN).

    In the original paper ( the authors showed that they achieved 99% accuracy using the Adaboost model.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s