James Howison's Data Wrangling course from the Information School at the University of Texas at Austin.
At this point you have what you need to read csvs, use INSERT statements (together with “find or create” SELECTs) to import data to your database, use a SELECT query to get your data back out and then save those results to CSV for use in your analysis program.
But almost always the data in the input CSV doesn’t line up with the structure in your database. These are called “transformations”. Below I highlight some transformations but you’ll find many useful tidbits on the FAQ page. Take a look at those techniques as you work on importing your CSV files. They are derived from situations seen in previous class projects. The text below has screencasts that have a different infrastructure (they use pymysql rather than psycopg2) but the logic in the code is the same.
Often we need to read in summarized data summarized data, where the columns are groups and the rows are different groups and the cells are values. This is variously called “wide data” or “pivot style” data. One frequently seen is having years as column titles.
year-posts.csv
event_type,1992,1993, ...
join,3,1, ...
post,34,33, ...
We can store this directly in SQL but it makes things like grouping by year very hard. So we want to store this in sql, we want a table that looks like (often called a “long” data format and this operation is known as “pivot longer”). This puts the values into rows and keeps them out of headers.
id, event_type, year, count
1,join,1992,3
2,join,1993,1
3,post,1992,34
4,post,1993,33
5, ...
We can do this with python and reading a csv, but we have to be conscious of what we have in each row and often have to handle columns differently. We have to write some logic to make the conversion. Most importantly we iterate across each row, and execute an insert for each field of the repeated columns.
import csv
import pprint
with open('year-posts.csv') as csvfile:
myCSVReader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
for row in myCSVReader:
print(row)
We will have a dict per row of the input file. If we print(row) we can see that (as usual) the column headers become keys, so the years are actually in the keys and the counts are in the values. However, there is one item in the dict that is different. It came from the first column of the csv and it holds the event_type
.
{'1992': '3', 'event_type': 'join', '1994': '2', '1995': '4', '1996': '1'}
We can get the event_type
to use later and remove it from the row dict at the same time, leaving the dict as pairs of years and counts. The Event type is the same for all inserts from that row.
event_type = row.pop("event_type")
print("After: consistent types in keys")
print(row)
row.pop actually removes the key/value pair (and returns the value), leaving us:
{'1992': '3', '1994': '2', '1995': '4', '1996': '1'}
And we can now process the rest of the Dict by iterating across the key, value pairs.
for key, value in row.items(): #
param_dict = {"event_type": event_type,
"year": key,
"count": value}
cursor.execute(sql, param_dict)
cursor.execute is called not once per row of the input, rather it is called once per remaining columm (after we remove the event_type column.)
Dates and datetimes can be difficult because they come in many different formats: e.g.,
But whatever format data arrives in mysql wants datetimes formatted like 2013-03-30 14:30:00
. In addition, when we have queried data from MySQL we have to decide on the format with which we will use the datetime.
Python has a datetime datatype which can help do conversions, especially in combination with parameterized queries (%(keyname)s
and param_dict
).
We make a Python datetime
object from the incoming string we read from the CSV, then we pass that object to the sql server via param_dict
. We have to use little template codes for different parts of the date (like %Y
for the four-digit year).
I show some of those that you’ll need below, but you can see more in Python strftime Documentation. Note that the same codes are used for incoming and outgoing:
datetime.strptime
(“string p arse time”)datetime.strftime
(“string f time”)# avoids having to write datetime.datetime.strptime
from datetime import datetime
incoming = "03/30/2013 14:30pm"
myPythonDate = datetime.strptime(incoming, "%m/%d/%Y %H:%M%p")
To use this in reading a CSV like performances.csv
:
band_id,venue_id,start
5,4,04/30/2014 14:30pm
6,4,04/30/2014 15:30pm
#band_id,venue_id,start
#5,4,04/30/2014 14:30pm
#6,4,04/30/2014 15:30pm
import psycopg2
from psycopg2.extras import RealDictCursor
import csv
from datetime import datetime
with psycopg2.connect(host='localhost', dbname='empty_music_festival') as conn:
with conn.cursor(cursor_factory=RealDictCursor) as cursor:
with open('performances.csv') as csvfile:
myCSVReader = csv.DictReader(csvfile)
for row in myCSVReader:
template = "%m/%d/%Y %H:%M%p" # 2013-04-05 10:09pm
start_datetime = datetime.strptime(row['start'], template)
# start_datetime is a python datetime object (not a string or int).
param_dict = {'band_id': int(row['band_id']),
'venue_id': int(row['venue_id']),
'start_datetime': start_datetime}
cursor.execute("""
INSERT INTO performances(
band_id,
venue_id,
start)
VALUES (
%(band_id)s,
%(venue_id)s,
%(start_datetime)s
)
""", param_dict) # datetime converts to expected db format
Formatting a datetime object to output to a CSV, file, or to print to the screen uses the same template codes but uses the datetime.strftime
method.
# Can create datetimes from ints:
# year, month, day, hours, minutes, sec
my_date = datetime.datetime(2017, 04, 03, 12, 5, 1)
my_format = "%Y-%m-%d %H:%M:%S"
print(mydate.strftime(my_format))
You can use that to write in a particular format to a CSV, you convert the datetime object to a string before passing to myCSVWriter.writerow(row)
.
We can get the current time both in python and in MySQL. We can use NOW()
in the SELECT
clause of a MySQL query. In python we can use current_time = datetime.now()
See this for postgress details on NOW()
print
to figure out what class datetimes read from the database come into Python as. Are they already datetime objects or do we need to convert them?Sometimes data is mushed together inside a string and to build databases or to conduct an analysis we need to break up the string and extract the information.
e.g., “hat,cat,mat,bat” or “s3e10” or “s3e109”
We’ll look at three techniques: str.split(), string slicing, and regex.
A simple example is when the string has a consistent separator. Then we can use str.split(), which is useful when a single character is used such as with a simple comma-separated string (use csv for more complex stuff!)
codedString = "hat,cat,mat,bat"
myList = codedString.split(",")
print(myList)
But str.split() is only useful if the separator is always the same. Many times it is not. A good example is naming TV episodes. Consider this string: “s3e10” which stands for Season 3, Episode 10. Let’s say that we want to split this string up, so that we have the 3
in a season variable and the 10
in an episode variable.
One way uses Python “slicing”. This lets you treat a string like it was a list of characters (starting at position 0, because computers). We can count out the characters we want. We can also select ranges of characters with start_pos (included):end_pos (excluded).
codedString = "s3e10"
season = codedString[1] # the second character in the string
episode = codedString[3:5]
print(f"Season: {season}, Episode: {episode}")
The string slicing syntax has some useful defaults. If you leave out the start then it starts at the beginning of the string, leave out the end then it goes to the end.
episode = codedString[3:]
print(f"Season: {season}, Episode: {episode}")
You can also count backward from the end of a string.
episode = codedString[-2:]
print(f"Season: {season}, Episode: {episode}")
Note that you can’t change a letter by assigning to a position (strings can only be read like lists, not changed like lists.) You can, though, use a for
loop over a string to do something for each character.
Slicing is useful but limited unless the character positions of the string don’t change at all. Consider: s03e10
vs s10e10
vs s3e109
(long season with three digit episode numbers, def. not a UK series!)
What we need to be more flexible is a way to say “get me everything after the e
, until you hit an s
, then get the everything until the end of the string. You could do this by iterating through the characters in the string, but there’s a more general way called Regular Expressions, usually abbrev. to regex and in code it’s abbreviated to re
(the re
module).
Regex use a template to match against strings, using codes to stand for types of characters. For example \d
matches a digit, and \d+
matches any number of digits (must be at least 1). To pull an element out you can surround the part you want in parens.
import re
codedString = "s03e10"
matches = re.search('s(\d+)e(\d+)', codedString)
# matches then has group methods. group(0) is the whole string
# So in this case we get the elements from the parens left to right
# starting at 1.
season = matches.group(1)
episode = matches.group(2)
print(f"Season: {season}, Episode: {episode}")
Because \d+
matches any number of digits (at least 1) it will work with other strings too.
codedString = "s3e109"
matches = re.search('s(\d+)e(\d+)', codedString)
# matches then has group methods. group(0) is the whole string
# So in this case we get the elements from the parens left to right
# starting at 1.
season = matches.group(1)
episode = matches.group(2)
print(f"Season: {season}, Episode: {episode}")
Learning the syntax for Regex can take a while, but is very powerful. It can also be fun, like a hard game. For example we could also get a title. \s
represents a single space. and the .
“dot” character matches any character at all, so that .+
means “get everything” the $
is the end of a line. So here we define the title as everything from colon-space to the end of the line.
codedString = "s03e10: The excitement of parsing"
matches = re.search('s(\d+)e(\d+):\s(.+)$',codedString)
season = matches.group(1)
episode = matches.group(2)
title = matches.group(3)
print(f"Season: {season}, Episode: {episode}, Title: {title}")
Now, to learn to write better Regex try these exercise sites (hint: Python uses “Perl-compatible Regex syntax” (PCRE) which is the most common, anyway, but worth knowing. Just because it says “in perl” doesn’t mean it won’t work in Python.