Data Wrangling Course

James Howison's Data Wrangling course from the Information School at the University of Texas at Austin.

At this point you have what you need to read csvs, use INSERT statements to import data to your database, use a SELECT query to get your data back out and then save those results to CSV for use in your analysis program.

But sometimes the data in the input CSV doesn’t line up with the structure in your database. These two screencasts show how to handle two types of different input files (pivot table types and record types). These start to form a toolbox that should help you handle just about any input format.

Below I highlight some we’ll talk about next week, but you’ll find many useful tidbits on the FAQ page. Take a look at those techniques as you work on importing your CSV files.

  1. “Record style” input files (data about one record on multiple lines). See Screencast.
  2. Pivot table style input files (summary data on multiple variables, you want to transform the schema). A common example is having years as headers, such as GDP figures (country,2014,2015,2016). The faq shows See Screencast and some useful code for handling these:
  3. Multiple pieces of data in columns you want to split up (e.g., s03e04 to Season: 3, Episode: 4), using Regular Expressions or Regex. See Screencast
  4. Handling dates. See Screencast
  5. Reading data out of filenames. e.g., different files for different states or years. See FAQ entry.

Read pivot table style files or years as column titles.

Often we need to read in summarized data summarized data, where the columns are groups and the rows are different groups and the cells are values. This is variously called “wide data” or “pivot style” data. One frequently seen is having years as column titles.

event_type,1992,1993, ...
join,3,1, ...
post,34,33, ...

We can store this directly in SQL but it makes things like grouping by year very hard. So we want to store this in sql, we want a table that looks like (often called a long data format). This puts the values into rows and keeps them out of headers.

id, event_type, year, count
1,join,1992,3
2,join,1993,1
3,post,1992,34
4,post,1993,33
5, ...

We can do this with python and reading a csv, but we have to be conscious of what we have in each row and often have to handle columns differently. We have to write some logic to make the conversion. Most importantly we iterate across each row, and execute an insert for each field of the repeated columns.

import csv
import pprint

with open('year-posts.csv') as csvfile:

    myCSVReader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
    for row in myCSVReader:
        pprint.pprint(row)

We will have a dict per row of the input file. If we pprint(row) we can see that (as usual) the column headers become keys, so the years are actually in the keys and the counts are in the values. However, there is one item in the dict that is different. It came from the first column of the csv and it holds the event_type.

{'1992': '3', 'event_type': 'join', '1994': '2', '1995': '4', '1996': '1'}

We can get the event_type to use later and remove it from the row dict at the same time, leaving the dict as pairs of years and counts. Event type is the same for all inserts.

        event_type = row.pop("event_type")
        print("After: consistent types in keys")        
        pprint.pprint(row)

leaves us:

{'1992': '3', '1994': '2', '1995': '4', '1996': '1'}

And we can now process the rest of the Dict

        for key, value in row.items(): #
            param_dict = {"event_type": event_type,
                          "year": key,
                          "count": value}
            cursor.execute(sql, param_dict)

Handling dates.

Dates and datetimes can be difficult because they come in many different formats: e.g.,

But whatever format data arrives in mysql wants datetimes formatted like 2013-03-30 14:30:00. In addition, when we have queried data from MySQL we have to decide on the format with which we will use the datetime.

Python has a datetime datatype which can help do conversions, especially in combination with parameterized queries (%(keyname)s and param_dict).

Incoming.

We make a datetime object from the incoming string read from CSV, then we pass that object to the sql server via param_dict. We have to use little template codes for different parts of the date (like %Y for the four-digit year). I show show of those that you’ll need below, but you can see more in Python strftime Documentation. Note that the same codes are used for incoming and outgoing:

# avoids having to write datetime.datetime.strptime
from datetime import datetime

incoming = "03/30/2013 14:30pm"

myPythonDate = datetime.strptime(incoming, "%m/%d/%Y %H:%M%p")

To use this in reading a CSV like performances.csv:

band_id,venue_id,start
5,4,04/30/2014 14:30pm
6,4,04/30/2014 15:30pm
import csv
import pymysql
import pymysql.cursors
from datetime import datetime

connection = pymysql.connect(host="mariadb",
                             user="root",
                             passwd="",
                             db="empty_music_festival",
                             autocommit=True,
                             cursorclass=pymysql.cursors.DictCursor)

sql = """
INSERT INTO performances(band_id, venue_id, start)
            VALUE (%(band_id)s, %(venue_id)s, %(start_datetime)s)
"""

with connection.cursor() as cursor:
    with open('performances.csv') as csvfile:
        myCSVReader = csv.DictReader(csvfile)

        for row in myCSVReader:
            start_str = row['start']
            template = "%m/%d/%Y %H:%M%p"
            start_datetime = datetime.strptime(start_str, template)
            param_dict = {'band_id': int(row['band_id']),
                          'venue_id': int(row['venue_id']),
                          'start_datetime': start_datetime}
            cursor.execute(sql, param_dict) # datetime converts to expected mysql format

Formatting a datetime object to output to a CSV, file, or to print to the screen uses the same template codes but uses the datetime.strftime method.

# Can create datetimes from ints:
# year, month, day, hours, minutes, sec
my_date = datetime.datetime(2017, 04, 03, 12, 5, 1)

my_format = "%Y-%m-%d %H:%M:%S"

print(mydate.strftime(my_format))

You can use that to write in a particular format to a CSV, you convert the datetime object to a string before passing to myCSVWriter.writerow(row).

Getting current time.

We can get the current time both in python and in MySQL. We can use NOW() in the SELECT clause of a MySQL query. In python we can use current_time = datetime.now()

Exercises:

Splitting out data

Sometimes data is mushed together inside a string and to build databases or to conduct an analysis we need to break up the string and extract the information.

e.g., “hat,cat,mat,bat” or “s3e10” or “s3e109”

We’ll look at three techniques: str.split(), string slicing, and regex.

A simple example is when the string has a consistent separator. Then we can use str.split(), which is useful when a single character is used such as with a simple comma-separated string (use csv for more complex stuff!)

codedString = "hat,cat,mat,bat"
myList = codedString.split(",")

pprint.pprint(myList)

But str.split() is only useful if the separator is always the same. Many times it is not. A good example is naming TV episodes. Consider this string: “s3e10” which stands for Season 3, Episode 10. Let’s say that we want to split this string up, so that we have the 3 in a season variable and the 10 in an episode variable.

One way uses Python “slicing”. This lets you treat a string like it was a list of characters (starting at 0, of course). We can count out the characters we want. We can also select ranges of characters with start_pos (included):end_pos (excluded).

codedString = "s3e10"
season = codedString[1]  # the second character in the string
episode = codedString[3:5]

print("Season: {}, Episode: {}".format(season, episode))

The string slicing syntax has some useful defaults. If you leave out the start then it starts at the beginning of the string, leave out the end then it goes to the end.

episode = codedString[3:]
print("Season: {}, Episode: {}".format(season, episode))

You can also count backward from the end of a string.

episode = codedString[-2:]
print("Season: {}, Episode: {}".format(season, episode))

Note that you can’t change a letter by assigning to a position (strings can only be read like lists, not changed like lists.) You can, though, use a for loop over a string to do something for each character.

Slicing is useful but limited unless the character positions of the string don’t change at all. Consider: “s03e10” or “s10e10” “s3e109” (long season with three digit episode numbers, def. not a UK series.)

What we need to be more flexible is a way to say “get me everything after the e, until you hit an s, then get the everything until the end of the string. You could do this by iterating through the characters in the string, but there’s a more general way called Regular Expressions, usually abbrev. to regex and in code it’s abbreviated to re (the re module).

Regex use a template to match against strings, using codes to stand for types of characters. For example \d matches a digit, and \d+ matches any number of digits (must be at least 1). To pull an element out you can surround the part you want in parens.

import re

codedString = "s03e10"
matches = re.search('s(\d+)e(\d+)', codedString)
# matches then has group methods. group(0) is the whole string
# So in this case we get the elements from the parens left to right
# starting at 1.
season = matches.group(1)
episode = matches.group(2)

print("Season: {}, Episode: {}".format(season, episode))

Because \d+ matches any number of digits (at least 1) it will work with other strings too.

codedString = "s3e109"
matches = re.search('s(\d+)e(\d+)', codedString)
# matches then has group methods. group(0) is the whole string
# So in this case we get the elements from the parens left to right
# starting at 1.
season = matches.group(1)
episode = matches.group(2)

print("Season: {}, Episode: {}".format(season, episode))

Learning the syntax for Regex can take a while, but is fun and very powerful. For example we could also get a title. \s represents a single space. and the . “dot” character matches any character at all, so that .+ means “get everything” the $ is the end of a line. So here we define the title as everything from colon-space to the end of the line.

codedString = "s03e10: The excitement of parsing"

matches = re.search('s(\d+)e(\d+):\s(.+)$',codedString)
season = matches.group(1)
episode = matches.group(2)
title = matches.group(3)

print("Season: {}, Episode: {}, Title: {}".format(season, episode, title))

Now, to learn to write better Regex try these exercise sites (hint: Python uses “Perl-compatible Regex syntax” (PCRE) which is the most common, anyway, but worth knowing. Just because it says “in perl” doesn’t mean it won’t work in Python.