Welcome to DATA CLEANING 101! (clapping)
Let us start with what you can accomplish after this lecture:
- Automated manipulation of over 500 excel files (old version .xls format)
- Merging inconsistent column headers and value labels that represents the same concept across different files by different authors
- Removal of the duplicated rows that comes from multiple copies of excel files with different names and updated version of old files
The first challenge of processing raw data in old excel (xls) file is how to read them. Excel file is non-trivially formatted byte code, therefore normal text editor cannot read its content (unlike csv files). However, you will need some way to reading the files so that you can automate your cleaning process in Python. For this reason, you need the xlrd, xlwt, xlutils libraries from http://www.python-excel.org/. Be warned that they do not work on the new Excel (xlsx) files. But do not be worried if you need to deal with the xlsx files, just read on!
The second challenge comes from merging inconsistent headers and value labels in the Excel files. For example, the ‘Charges’ column in file A may map to the ‘charge(s)’ column in file B, while ‘DIP’ under ‘Charge’ in file A may correspond to ‘Drinking in Public’ under ‘charge(s)’ in file B. How do you merge these headers and labels?
You will have to inspect the files briefly to identify a key pattern for each target column that you plan to collect. The pattern is then turned into regular expression. Python provides a built-in library re that supports compilation and matching of regular expressions (https://docs.python.org/2/library/re.html) After all patterns are compiled into regular expression, you can assign a key string for each of compiled object and put the key-value pairs into a dictionary. This dictionary forms a data structure of the equivalence class, except all strings that match with a regular expression is equivalent to the corresponding key string. The key strings are used to replace all the matched headers in the Excel files. Similar methodology applies to merging of the value labels.
The last challenge arises naturally after merging the labels: how do you filter duplication of information? You will need pandas (http://pandas.pydata.org/), PostgreSQL database (http://www.postgresql.org/) and psycopg2 (http://initd.org/psycopg/). Pandas is a data analysis library that implements a very powerful data structure, DataFrame. DataFrame is a high-level indexed table that enables filtering of duplicated rows with just one call of a function (pd.DataFrame.drop_duplicates). It also provides a function to read the new Excel files. You can then factor the data in a SQL server into several sub-tables and further clamp the rows of the sub-tables. Any SQL database should work fine, but our choice is the PostgreSQL. To get your data into PostgreSQL server in Python, you need the cursor and connection objects from psycopg2.
What? You don’t have the details on how to tackle each challenge? Being a responsible lecturer, I always leave out the answer space for the student so that they learn something. You will be able to gather enough information with the links attached to solve your problems. That said, I am more than happen to answer questions in comments. (The face is there is not enough space in this post)
To wrap this up, my Python script reduces 31000 rows from over 100 files to 6000 rows before even inputting them to SQL server. The noisy data is about 4 times the amount of the useful data. I cannot imagine how bad it would be if including all the 31000 rows in any analysis.
Now if you would excuse me, I am watching the next FIFA game:)