Probably the most time-consuming task when analysing data is finding, extracting and cleaning the data itself.
You would like the data to arrive as a nice table in a DataFrame, but in reality it is stored in different places (databases, web sites, files, papers, sensors), in different formats (binary, JSON, Excel or other proprietary formats, hand-written), with inconsistent layouts and with missing or incorrect values.
Therefore, before doing any analysis on the data you need to perform what is called data munging (also known as data wrangling). Your goal is what Hadley Wickham described in his paper “Tidy Data”: a tidy version (tidy is an adjective meaning “arranged neatly and in order”), i.e. a data frame where:
- Each variable is in its own column
- Each observation is in its own row
- If the variables are of different kinds, there is one table for each kind of variable
- If there are multiple tables, each one includes a column that links it to the others
(see also Jeff Leek’s page “how to share data with a statistician”).
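As a minimal sketch of what these rules mean in practice, the snippet below reshapes a small untidy table with pandas. The table, its column names and its values are made up for illustration; only `DataFrame.melt` is the actual pandas call being shown.

```python
import pandas as pd

# Hypothetical untidy table: one column per year, so the variable "year"
# is hidden in the column headers instead of having its own column.
wide = pd.DataFrame({
    "country": ["Italy", "France"],
    "2019": [10.0, 12.5],
    "2020": [11.0, 13.0],
})

# melt() moves the year columns into rows: one variable per column,
# one observation (a country in a given year) per row.
tidy = wide.melt(id_vars="country", var_name="year", value_name="value")

# The years were column labels (strings); store them as integers.
tidy["year"] = tidy["year"].astype(int)
tidy = tidy.sort_values(["country", "year"]).reset_index(drop=True)

print(tidy)
```

The result has four rows (two countries times two years) and three columns, `country`, `year` and `value`, which is the shape most pandas operations such as grouping and plotting expect.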
The first step is then to get the raw data and produce a tidy version (processed data) through a processing script, for which Python and Pandas can be a big help.
In the script, or in a separate code book, you should include information about the variables, their units and the choices you made while processing.
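One possible layout for such a processing script is sketched here. Everything in it is an assumption made for illustration: the raw data, the column names, and the `CODEBOOK` and `process` names are hypothetical, and a real script would read from a file or database rather than from an in-memory string.

```python
import io
import pandas as pd

# Hypothetical raw data: semicolon-separated, decimal commas, "n/a" for gaps.
RAW_CSV = io.StringIO(
    "Station;Temp (C);Date\n"
    "A;21,5;2020-01-01\n"
    "B;n/a;2020-01-01\n"
)

# Code book: every variable, its unit and the choices made while processing.
CODEBOOK = {
    "station": "station identifier (string)",
    "temp_c": "air temperature in degrees Celsius; 'n/a' in the raw data becomes NaN",
    "date": "measurement date, parsed from the format YYYY-MM-DD",
}


def process(raw):
    """Turn the raw CSV into a tidy DataFrame, as documented in CODEBOOK."""
    df = pd.read_csv(raw, sep=";", decimal=",", na_values=["n/a"])
    # Rename the columns to short, consistent names.
    df.columns = ["station", "temp_c", "date"]
    # Parse the date strings into proper timestamps.
    df["date"] = pd.to_datetime(df["date"])
    return df


tidy = process(RAW_CSV)
print(tidy)
```

Keeping the code book next to the code makes it harder for the two to drift apart, and anyone can rerun the script on the raw data to reproduce the processed table.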
Remember what happened to Reinhart and Rogoff, whose spreadsheet error went unnoticed for years, to understand the importance of having a script with instructions!
Let’s see a practical example.