Appendix A — Data Wrangling FAQ.
This is an FAQ for James Howison's Data Wrangling course in the Information School at the University of Texas at Austin. It's an introductory course that teaches enough Python to read CSV files and wrangle data into MySQL, then back out again to conduct some analysis.
For Fall 2023 we switched to using DuckDB. This implies some small changes to some of the code below. I'm trying to work through to show both versions, but as I work be aware of the following:
- You may need to change cursor.execute to con.execute (DuckDB doesn't require creating a cursor first, but postgres and mysql did).
- Where you see something like %(keyname)s, change that to $keyname. The %(keyname)s format is what the Python library used to talk to postgres uses as a placeholder.
- You can ignore the pprint.pprint stuff. That was used before notebooks (when outputting to the terminal).
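For example, here is a minimal sketch of the same insert in both styles (the venues table, the festival.db file name, and the row values are just illustrations; recent DuckDB versions accept a dict of named parameters):

import duckdb

# DuckDB style: no cursor, $keyname placeholders, parameters passed as a dict
con = duckdb.connect("festival.db")
sql_insert = "INSERT INTO venues(name, capacity) VALUES ($name, $capacity)"
row = {"name": "AMD", "capacity": 700}
con.execute(sql_insert, row)

# Old postgres/mysql style for comparison (cursor plus %(keyname)s placeholders):
# cursor.execute("INSERT INTO venues(name, capacity) VALUES (%(name)s, %(capacity)s)", row)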
Python packages
Jupyter does not have a package I need
You can install packages locally on your Docker container using:
import sys
!{sys.executable} -m pip install <package_name>
Replace <package_name> with the name of the package you'd like to install, e.g., glob3.
Getting Data
Convert an html table to csv
http://www.convertcsv.com/html-table-to-csv.htm Be careful to select the actual table you want; avoid using the "All" option in the dropdown box. Wikipedia pages sometimes have multiple tables with headers that should be columns (e.g., Season number in https://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes_(seasons_1%E2%80%9320)#Episodes).
You could scrape and parse the HTML but that's out of scope for this class. You could also save each converted table to a file with the season in the name (e.g., simpsons-season-1.csv) and then combine the tips below on Reading data from filenames and Path.glob (or glob.iglob) from Read files in a directory, as sketched here.
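A minimal sketch of that combination, assuming the converted tables were saved into a directory called simpsons_csvs with names like simpsons-season-1.csv:

import csv
import re
from pathlib import Path

all_rows = []
for filename in Path('simpsons_csvs').glob('simpsons-season-*.csv'):
    # pull the season number out of the filename, e.g., simpsons-season-3.csv -> 3
    season = re.search(r'simpsons-season-(\d+)', filename.name).group(1)
    with open(filename) as csvfile:
        for row in csv.DictReader(csvfile):
            row["season"] = season  # add the season from the filename to every row
            all_rows.append(row)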
Convert a table in a pdf to csv
http://tabula.technology/ or https://pdftables.com/
Read a csv file
Use the csv library and either csv.DictReader or csv.reader.
import csv
import pprint

with open('your_file.csv') as csvfile:
    # With headers in the file
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')

    # No headers in file, supply your own in a list
    headers = ["id", "year", "state_abbr", "gdp_millions"]
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"',
                                 fieldnames=headers)

    # No headers or uneven rows. Each row a list rather than a dict.
    myCSVReader = csv.reader(csvfile, delimiter=",", quotechar='"')

    # move row by row through the file
    for row in myCSVReader:
        pprint.pprint(row)
See Reading CSV materials.
Work through each ‘cell’ or ‘field’ in tabular data
In some cases we want to read a csv and work from top left to bottom right, across each field in a row, then on to the next row. Just as we’d read a book.
with open('your_file.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile)
    columns_in_order = myCSVReader.fieldnames  # no parens

    # vertical moves (row by row)
    for row in myCSVReader:
        # horizontal moves (column by column within a row)
        for header in columns_in_order:
            # header is a string used as a key into the row dictionary
            # e.g., row["capacity"]
            field_value = row[header]
            print(field_value)
Very similar if you want to process the results from an SQL query in this fashion. The difference is where you get the columns_in_order from:
results = cursor.fetchall()  # list of dicts

columns_in_order = [i[0] for i in cursor.description]  # awkward but works

# vertical moves (dict by dict in list)
for row in results:
    # horizontal moves (column by column within a row)
    for header in columns_in_order:
        # header is a string used as a key into the row dictionary
        # e.g., row["capacity"]
        field_value = row[header]
        print(field_value)
Handle files without headers
Two options.
Use csv.reader rather than DictReader. csv.reader produces a list for each row rather than a dict. So you have to use indexes to indicate which column you want.
Provide a list of strings to use as headers to DictReader using the fieldnames parameter. If you provide that then DictReader will process each row into a dictionary with those keys (and will process the first row as data). You shouldn’t do this if there are headers in the file.
with open('csv_with_preamble.csv') as infile:
    headers = ["name", "capacity"]
    myCSVReader = csv.DictReader(infile, fieldnames=headers)
    for row in myCSVReader:
        pprint.pprint(row)
Handle files with repeated identical headers
The DictReader turns columns into keys of dicts. But we can’t have repeated keys for dicts, so you’ll end up with only one of any set of repeated columns. For example, this file has two “points” columns.
game_city,visiting_team,points,home_team,points
Austin,San Antonio,2,Austin,4
As a human we can tell that one is meant to be the visiting_team_points and one is meant to be the home_team_points but only the order of the columns tells us that.
The simplest option is to edit the headers in the csv, but if you are parsing many files, or files received automatically, that's not possible. You can use the same techniques as for files without headers (manually declare fieldnames, or use csv.reader, which returns a list for each row rather than a dictionary), as in the sketch below.
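A minimal sketch of the manual-fieldnames approach for the example above (the file name games.csv is just an illustration):

import csv

headers = ["game_city", "visiting_team", "visiting_team_points",
           "home_team", "home_team_points"]

with open('games.csv') as infile:
    next(infile)  # toss the original header row, since we supply our own names
    myCSVReader = csv.DictReader(infile, fieldnames=headers)
    for row in myCSVReader:
        print(row["visiting_team_points"], row["home_team_points"])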
Use a tab separated file
Tab-separated files can be parsed with the same code as csv files, but with the delimiter set to "\t":
import csv

with open('Some_tab_separated_file.tsv_or_csv_or_whatever') as tsvfile:
    myTSVReader = csv.DictReader(tsvfile, delimiter="\t")  # or csv.reader
    for row in myTSVReader:
        ...
Use a fixed width file
A file that uses the number of characters to show where fields separate can be parsed using string slicing on each line.
NAME STATE TELEPHONE
John Smith WA 418-711-4111
Mary Hartford CA 319-519-4341
Evan Nolan IL 219-532-c301
with open('some_fixed_file') as fixed_file:  # file ending doesn't matter
    # Need character position for each column. Editors usually show this:
    # just move your cursor to the start and subtract 1 (0 index)
    for line in fixed_file:
        name = line[:20].rstrip()        # rstrip removes spaces at end
        state = line[20:30].rstrip()
        telephone = line[30:].rstrip()
        ...
Geocoding to get state or county for an address
We can use geopy and a geocoding service to take addresses, even malformed or partial ones, and convert them to structured data. This is useful when you have an address (or a lat/long) and want to get an enclosing geographic unit (e.g., County/State).
First install it if needed:
%pip install geopy
Then you can use it like this:
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Photon

geolocator = Photon(user_agent="myGeocoder")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

address = "1616 Guadalupe Austin TX"  # change that string to anything.

location = geocode(address)
# print(location.raw)
county = location.raw['properties']['county']
state = location.raw['properties']['state']

print(f"{address} is in {county} which is a county of {state}")
Rather than printing out the county and state you can use them in an SQL insert. The RateLimiter is important because if you run this on a large dataset you don’t want to hammer the Photon server (which is a freely provided service). More services available at geopy website.
Ensuring the RateLimiter works is a matter of ensuring that the RateLimiter is created outside the block where you iterate, i.e., ensure that you only create the RateLimiter once, then use it many times.
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Photon

address_list = ["1616 Guadalupe Austin TX", 'Travelling the World on Web3', 'Noord-Brabant', 'PG/MD', 'Toronto, Ontario', 'East, England', 'Johannesburg, South Africa', 'الله ', ' Outchea', 'Road to Riches ', 'Mercury', 'cactuses n shit, colorado ', 'fr_US.lproj', 'Wisconsin', '757', 'Abuja,Nigeria', 'Belfast, Northern Ireland', 'Nigeria', 'GTA ', 'Dekalb County', "i think i'm better now"]

geolocator = Photon(user_agent="myGeocoder")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

for address in address_list:
    print(f"Attempting to convert: {address}")
    location = geocode(address)
    print(location.raw)
    # exact structure of location.raw depends on country
    # ie non-US address unlikely to have county
    if 'county' in location.raw['properties']:
        county = location.raw['properties']['county']
    if 'state' in location.raw['properties']:
        state = location.raw['properties']['state']

    print(f"{address} is in {county} which is a county of {state}")
    print("\n\n")  # spacing to make output readable
Better still is to separate the geocoding call from the code that attempts to parse the result. This is useful because parsing is often buggy and this avoids calling the service more than once. Below I do this all in Python; you could instead put location.raw into a TEXT column in the database each time, then try parsing and store the result in another column (using UPDATE).
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Photon
import time  # to show when calls are made to geocode

"""
helper function to lookup address
must create the geocode function outside the function for the
rate limiter to work
"""
def map_location(address):
    print(time.perf_counter())
    print(f"Attempting to convert: {address}")
    location = geocode(address)
    return location

geolocator = Photon(user_agent="myGeocoder")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

address_list = ["1616 Guadalupe Austin TX", 'Travelling the World on Web3', 'Noord-Brabant', 'PG/MD', 'Toronto, Ontario', 'East, England', 'Johannesburg, South Africa', 'الله ', ' Outchea', 'Road to Riches ', 'Mercury', 'cactuses n shit, colorado ', 'fr_US.lproj', 'Wisconsin', '757', 'Abuja,Nigeria', 'Belfast, Northern Ireland', 'Nigeria', 'GTA ', 'Dekalb County', "i think i'm better now"]

geocode_results = [map_location(i) for i in address_list]

# could call geocode directly, without map_location
# disadvantage is that there is no output.
# geocode_results = [geocode(i) for i in address_list]
Once you've gathered the results you can parse them; you can execute this cell many times without calling back to the service.
# No geocoding in this cell, just parsing the results
# ideally we never geocode the same string twice.
for location in geocode_results:
    print(location.raw, '\n')
    # exact structure of location.raw depends on country
    # ie non-US address unlikely to have county
    if 'county' in location.raw['properties']:
        county = location.raw['properties']['county']
    if 'state' in location.raw['properties']:
        state = location.raw['properties']['state']

    print(f"{location}\n County: {county} which is in State: {state}")
    print("\n\n")  # spacing to make output readable
Inserting data into the database
Drop a column from a csv en route to the database
Just choose not to insert that column by not mentioning it in the INSERT. Let’s say we want to drop the id column here:
id,name,capacity
1,AMD,700
2,Honda,2000
sql_insert = """INSERT INTO my_venues( venue_name, capacity )
                VALUES ( %(name)s, %(capacity)s )"""

for row in data:
    # doesn't matter that row also has a key "id" because it's not
    # called for in the sql_insert, ie there is no %(id)s in the
    # values part.
    cursor.execute(sql_insert, row)
Skip a comment or preamble before the headers in my csv file
Use next() on the file object before applying the DictReader.
My great dataset
Copyright 2007, big important person.
name,age,occupation
james,39,professor
heshan,36,dancer
with open('csv_with_preamble.csv') as infile:
    next(infile)  # toss My great dataset
    next(infile)  # toss Copyright 2007
    # next row is headers so that's all good.
    myCSVReader = csv.DictReader(infile)  # or csv.reader
    for row in myCSVReader:
        ...
Skip a comment, or other data, in the middle of my csv file
Use continue after testing whether the first field starts with the comment character.
name,age,occupation
james,39,professor
#heshan,36,dancer
danielle,21,student
with open('csv_with_comments.csv') as infile:
    # next row is headers so that's all good.
    myCSVReader = csv.DictReader(infile)  # or csv.reader
    for row in myCSVReader:
        if (row["name"][:1] == "#"):  # or row[0][:1] if using csv.reader
            continue  # skip rest of this iteration, go onto next line
        # process data rows as usual.
        ...
Handle files with multiple data sections
Some CSV files have multiple data sections, each with a little header. Read these as plain text files and then parse each "minicsv" using StringIO to treat that section as a file.
For example, the US DOJ Universal Crime Reporting Stats does this. If you select more than one state you get data like this (with some headers/footers removed):
Estimated crime in Alabama
Year,Population,Violent crime total
2010,4785401,18363
2011,4803689,20166
2012,4822023,21693
Estimated crime in Alaska
Year,Population,Violent crime total
2010,714146,4537
2011,723860,4416
2012,731449,4412
Estimated crime in Arizona
Year,Population,Violent crime total
2010,6413158,26528
2011,6467315,26789
2012,6553255,28108
Each state’s data is repeated in a separate little csv embedded in the larger file.
Eventually we want to create a state column to go along with the others:
State,Year,Population,Violent crime total
To do this we can split up the file and then read each section as though it were a separate CSV. After the split this ends up very similar to the approach we'd use if these were separate files in a directory.
import re
import pprint

# open the file
fileobject = open("CrimeStatebyState.csv")

dict_by_state = {}  # "state": "minicsv"

minicsv = ""
state = ""

for line in fileobject:  # Go line by line, not using csv library.
    # new record or past last record.
    if (line[:9] == "Estimated"):
        # detected start of new minicsv
        # (lines starting with Estimated)
        # If we have an old state, store minicsv in the dict.
        if (state):
            dict_by_state[state] = minicsv
        # Either way now start a new minicsv and get the state.
        matches = re.search('Estimated crime in (.*)$', line)
        state = matches.group(1)
        minicsv = ""  # reset
    elif (line != "\n"):  # skips empty lines.
        minicsv += line

# store the last state's minicsv once the loop ends
if (state):
    dict_by_state[state] = minicsv

pprint.pprint(dict_by_state)
which outputs:
{'Alabama': 'Year,Population,Violent crime total\n'
'2010,4785401,18363\n'
'2011,4803689,20166\n'
'2012,4822023,21693\n',
'Alaska': 'Year,Population,Violent crime total\n'
'2010,714146,4537\n'
'2011,723860,4416\n'
'2012,731449,4412\n',
'Arizona': 'Year,Population,Violent crime total\n'
'2010,6413158,26528\n'
'2011,6467315,26789\n'
'2012,6553255,28108\n'}
We've now built up a dictionary where the keys are state names and the values are the mini-csvs (each as a single long string), so we can use our familiar csv parsing code to parse each "mini" csv.
The only wrinkle is that our csv is in a string rather than a file, but Python has a way to treat a string as a file: io.StringIO(somestring) can be used in place of a file.
import io
import csv

outrows = []  # build up a list of dicts.

# get key and value.
for state, state_csv in dict_by_state.items():
    string_as_file = io.StringIO(state_csv)
    myCSVReader = csv.DictReader(string_as_file)

    for row in myCSVReader:
        # add state key to row read from the mini-csv
        row["State"] = state
        # or could insert into database
        outrows.append(row)

pprint.pprint(outrows)
If you have headers and footers etc, you might need some additional if/else as you parse the records.
Handle multiple headers in columns or cells merged across columns
If your Excel sheet has multiple rows of headers formed by merged cells, save as a csv which will look something like this. You don’t need to unmerge the headers in Excel, just export a csv.
,Agriculture,,,,Industry,,,,Services,,,
,Male,,Female,,Male,,Female,,Male,,Female,
,1990-92,2011-14,1990-92,2011-14,1990-92,2011-14,1990-92,2011-14,1990-92,2011-14,1990-92,2011-14
Afghanistan,..,..,..,..,..,..,..,..,..,..,..,..
See long explanation at Handling Multi Index CSVs.
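As a rough sketch of one way to handle it (not the full approach in those notes): read the header rows yourself with csv.reader, carry merged-cell values to the right, and build compound column names. The file name employment_by_sector.csv is just an illustration matching the example above.

import csv

with open('employment_by_sector.csv') as infile:
    myCSVReader = csv.reader(infile)
    sectors = next(myCSVReader)  # ['', 'Agriculture', '', '', '', 'Industry', ...]
    sexes = next(myCSVReader)    # ['', 'Male', '', 'Female', ...]
    periods = next(myCSVReader)  # ['', '1990-92', '2011-14', ...]

    # blank cells mean "same as the cell to the left" (a merged cell in Excel)
    headers = ["country"]
    sector = sex = ""
    for i in range(1, len(periods)):
        sector = sectors[i] or sector
        sex = sexes[i] or sex
        headers.append(f"{sector}_{sex}_{periods[i]}")

    for row in myCSVReader:
        row_dict = dict(zip(headers, row))
        # e.g., row_dict["Agriculture_Male_1990-92"]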
Handle missing values in rows or cells merged across rows, usually first column
Reuse a value from the last row by not resetting a variable. Some CSV formatted for reading carry values down, usually resulting from vertically merged cells in the first columns (or rowspan in html). e.g.:
rank,country,year,olympic_gold
1,Russia,1980,43
2,Australia,1980,3
,Bahamas,1980,3
4,Thailand,1980,1
See how Bahamas is missing a rank? That's because the 2 in the line above was applied to both of them with a merged and vertically centered cell. But when you save as csv that goes away and only one row has the value. Thankfully it's the top row, so we can let the value "fall through". All we need to do is pull it out of the row into a variable and not reset that variable if row["rank"] is blank.
for row in myCSVReader:
    if (row["rank"]):  # if we have a new rank (blank string is false)
        rank = row["rank"]  # Set to 2 for Australia, not changed for Bahamas
    # Now use the rank variable for rank, others straight from row.
    param_dict = { "rank": rank, "country": row["country"], "year": row["year"] }
    ...
Handle multiple values inside a cell
Use str.split() to pull them into a list. This happens when a field is actually a list but the user didn't want to break up the data into relational tables.
student,committee
James,Kevin;Bob;Carsten;Don
sql_insert = """INSERT INTO student_committee(student, member)
                VALUES (%(student)s, %(prof)s)"""

for row in myCSVReader:
    student = row["student"]
    committee_list = row["committee"].split(";")
    for member in committee_list:
        param_dict = { "student": student, "prof": member }
        cursor.execute(sql_insert, param_dict)
Split a complex field into multiple columns
Use regex to get the separate parts. Regex practice here: https://regexone.com/ and all the documentation you could ever want here: http://www.regular-expressions.info/
import re
...
for row in myCSVReader:
    # row["show_code"] = "s03e10"
    matches = re.search(r's(\d+)e(\d+)', row["show_code"])

    param_dict["season"] = matches.group(1)
    param_dict["episode"] = matches.group(2)

    cursor.execute(sql, param_dict)
Get a zip code out of an address field
business,address,phone
Daisy Chain,"22344 Newfound Road, Austin, 78722, USA",512 345 4456
Use a regex looking for five digits. This can backfire if the house number is also five digits, so use re.findall() and take the last item, since the zip is at the end of the address.
import re

for row in myCSVReader:
    zip_candidates = re.findall(r"\d{5}", row["address"])
    # can't use zip as a variable name as it is also a python function
    just_zip = zip_candidates[-1]  # index -1 gets the last item
Remove dollar signs and commas from field
Use re.sub or str.replace. Removing them is the same as replacing them with an empty string "". The only trick is remembering that in a regex a literal $ is obtained with \$. You can make a group of characters to match with square brackets and match both the $ sign and any commas. The regex can match anywhere in the string so it can replace all the commas etc.
import re

cleaned_money = re.sub(r'[\$,]', '', row["money"])
Or you can use str.replace, which is simpler (doesn't require a regex) but requires two calls, one for the $ sign and one for the comma.
cleaned_money = row["money"].replace("$", "")
cleaned_money = cleaned_money.replace(",", "")

param_dict["money"] = cleaned_money
Format a field using commas or dollar signs
You can do this either in the SQL query or in Python.
-- FORMAT adds commas and two decimal places, adding .00
-- or truncating to two decimal places.
SELECT FORMAT(your_table.money,2)
FROM your_table
-- Now add a $ in front
SELECT CONCAT('$', FORMAT(your_table.money,2)) as display_money
FROM your_table
In Python we specify each part using the .format mini-language.

row["money"] = 1230403
# .format {} is the placeholder
# {:,} formats the number with commas in groups of three
# {:.2f} formats the number as a float to 2 decimal places
# you can combine these and put a literal $ in front.
row["display_money"] = "${:,.2f}".format(row["money"])
See https://pyformat.info for some more info, although it’s missing the comma thing which is documented here: https://www.python.org/dev/peps/pep-0378/
Handle different date formats on the way into the database
Create a Python datetime object and pass it to a parameterized query. If the column is DATE rather than DATETIME the time portion will be dropped on insert.
import datetime

# create using ints of year, month, day, hours, minutes, seconds
myPythonDate = datetime.datetime(year, month, day)

sql_insert = """
    INSERT INTO performances(venue_id, band_id, start)
    VALUES (%(venue_id)s, %(band_id)s, %(start_time)s)"""

param_dict = {"venue_id": 7,
              "band_id": 23,
              "start_time": myPythonDate}

cursor.execute(sql_insert, param_dict)
You can also create dates from strings, using datetime.strptime:
incoming_string = "03/30/2013 14:30pm"
myPythonDate = datetime.datetime.strptime(incoming_string, "%m/%d/%Y %H:%M%p")
Or there is a library that tries to guess the format of a date. This is problematic when the format is ambiguous (e.g., 2/3/2012) but is especially useful when you have heterogeneous datetime formats in one file (often the case with manual entry).
import dateutil.parser

myPythonDate = dateutil.parser.parse(incoming_string)
To be warned when encountering an ambiguous date format, you can use this code from Zonedabone on Stack Overflow.
from dateutil import parser

def parse(string, agnostic=True, **kwargs):
    if agnostic or parser.parse(string, **kwargs) == parser.parse(string, yearfirst=True, **kwargs) == parser.parse(string, dayfirst=True, **kwargs):
        return parser.parse(string, **kwargs)
    else:
        print("The date was ambiguous: {}".format(string))
        # raise ValueError("The date was ambiguous: %s" % string)

dates = ["2017-09-23 12:32pm",
         "5-5-2012",
         "5-12-2012",
         "2019/03/18",
         "3 May 2009",
         "May 1st, 2009"]

for date in dates:
    date = parse(date, agnostic=False)
    if (date):
        print(date.strftime("%Y-%m-%d %H:%M:%S"))
Record the time I created a record
For SQL use NOW(), for Python use datetime.datetime.now().
sql_insert = """
    INSERT INTO log_data(message, log_time)
    VALUES (%(message)s, NOW())
"""

param_dict = { "message": some_message }
cursor.execute(sql_insert, param_dict)
import datetime

outrow = { "message": some_message,
           "log_time": datetime.datetime.now() }
# time at creation, not writing to csv.
outrows.append(outrow)

# or if the time of writing to csv is crucial, add log_time just before writing
outrow = { "message": some_message }
outrows.append(outrow)
...
for outrow in outrows:
    outrow["log_time"] = datetime.datetime.now()
    myCsvWriter.writerow(outrow)
Get data from filenames
Use a regex to pull a field from the filename into a variable. Useful to combine with reading many csvs of the same format (glob.iglob or Path.glob).
filename = "2008data.csv"
matches = re.search(r'(\d+)data.csv', filename)
year = matches.group(1)

with open(filename) as csvfile:
    myCSVReader = csv.DictReader(csvfile)
    for row in myCSVReader:
        # add year from filename to each row of data
        row["year"] = year
        # assuming some sql defined above
        cursor.execute(sql, row)
Process lots of csv with the same format
Use Path.glob, which gives you the matching files to iterate over. Very useful to combine with getting data from filenames (see above). (Older code uses glob.iglob for the same effect.)
from pathlib import Path

# dir_of_files/somedatafile1.csv
# dir_of_files/somedatacontinued.csv
# dir_of_files/evenmoredata.csv
for filename in Path('dir_of_files').glob('*.csv'):
    with open(filename) as csvfile:
        myCSVReader = csv.DictReader(csvfile)
        for row in myCSVReader:
            cursor.execute(sql, row)
Combining filename reading and processing a directory
# dir_of_files/texas.csv
# dir_of_files/arizona.csv
# dir_of_files/california.csv
import csv
from pathlib import Path

for filename in Path('dir_of_files').glob('*.csv'):
    with open(filename) as csvfile:
        state = filename.stem  # filename without directory or extension, e.g., "texas"

        myCSVReader = csv.DictReader(csvfile)
        for row in myCSVReader:
            row["state"] = state
            cursor.execute(sql, row)
Handle files that cause a UnicodeEncodeError (character encoding issues)
Three things need to match up: 1) the collation of the database, 2) the encoding used by Python to open the file, and 3) the charset of the database connection.
Often the best approach is to open the csv file in your text editor and save it as utf-8, then set everything to handle utf-8 (although note that pymysql calls it utf8 and mysql uses the collation utf8_unicode_ci, or utf8_bin).
connection = pymysql.connect(host="localhost",
                             # all the usual parameters (user/db/etc)
                             ...,
                             charset="utf8")
...
with open('tricky_file.csv', encoding="utf-8", errors="surrogateescape") as infile:
    ...
See also: MySQL collations: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html Python UnicodeEncodeError error processing options http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
Handle data for multiple years or periods
Sometimes data is gathered for multi-year periods, such as a measurement of sports audience with a year value of 2003-2005. We often face the challenge of comparing this data with data collected for single years, or comparing it to other periods, such as 2004-2006. Notice that I'm calling this a Period rather than a year. We want to do comparisons per year, so we have to map between Period and Year, and make some estimates.
First we create a Period entity with a name: 2003-2005 becomes Period(id, name) with a row of (1, 2003-2005). Then we can map that Period to Year. One way would be to give Period attributes of start_year and end_year. Another option, which might be easier for querying and grouping, would be to say Period habtm Year.
Then we have to decide how the data was collected. Two options are common: either it's a total for that period (how many people attended any game within the period?) or it's an average for the years in the period (the value represents the yearly attendance over this period).
That suggests a third modeling option: Audience has_many PeriodMeasure, and PeriodMeasure has_many Year through YearlyEstimate, where PeriodMeasure has (id, name, type, value), type is either "average" or "annual", and YearlyEstimate has (id, period_measure_id, year_id, estimate).
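As a sketch of what the habtm option could look like as tables (the table and column names here just follow the class conventions; adjust to your data):

-- periods, e.g., (1, '2003-2005')
CREATE TABLE periods (id INTEGER PRIMARY KEY, name VARCHAR(20));

-- years, e.g., (1, 2003)
CREATE TABLE years (id INTEGER PRIMARY KEY, year INTEGER);

-- join table for Period habtm Year
CREATE TABLE periods_years (period_id INTEGER, year_id INTEGER);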
Database queries
My query using OR is returning unexpected rows
In SQL, AND has precedence over OR, which can lead to strange results. If you are specifying alternative values for one filter it's simpler to use IN. Rather than
WHERE year = 2004 OR year = 2008 OR year = 2009
(which will work but is not preferred) say
WHERE year IN (2004,2008,2009)
In addition to being shorter and clearer it is also safer if you have other conditions in your WHERE clause. You might try:
WHERE name = "james" AND year = 2004 OR year = 2006
But that will not get records for James from 2004 or 2006. It will get all records from 2006 and just the James records from 2004. That is because AND is resolved before OR. What the server sees is:
WHERE ( name = "james" AND year = 2004 ) OR year = 2006
when you meant (and these parens change the order):
WHERE name = "james" AND ( year = 2004 OR year = 2006 )
So just say:
WHERE name = "james" AND year IN (2004,2006)
This is particularly problematic if you have a join in the WHERE clause because it can create false records.
SELECT *
FROM observations, species
WHERE observations.species_id = species.id AND
year = 2004 OR year = 2008 OR year = 2009
But that will give unexpected results because the AND pulls the first year to it and leaves the other years on their own. You can see this with parens (this is what the query above actually does):
SELECT *
FROM observations, species
WHERE (
    observations.species_id = species.id
    AND year = 2004
)
OR year = 2008
OR year = 2009
This includes rows that never existed (because they are generated by the cross-join in the FROM clause but not eliminated by the join condition in the WHERE). What you want would require parens around the OR clause to force it to be resolved first.
SELECT *
FROM observations, species
WHERE observations.species_id = species.id
  AND ( year = 2004 OR year = 2008 OR year = 2009 )
But it’s simpler and safer to say:
SELECT *
FROM observations, species
WHERE observations.species_id = species.id
AND year IN (2004,2008,2009)
Understand three forms of COUNT
The three forms of COUNT are:
1. COUNT(*) counts the number of rows (in a table or in a group).
2. COUNT(DISTINCT column) counts the unique values in a column.
3. COUNT(column) counts the number of non-NULL values in a column.
I often advise people to avoid COUNT(column) because I find its semantics confusing in a query: what people are intending to get is unclear. Moreover, I haven't come across a time when it is necessary.
Whenever you are tempted to use COUNT(column), especially when using GROUP BY, ask yourself whether COUNT(*) is clearer and what you actually mean. Usually when we are grouping we are asking "How many somethings have this value?" (e.g., How many performances does a band do? or How many tickets did each person buy?). These are questions about the number of rows, not the number of non-NULL values.
ps. I'm very interested to find an example where COUNT(column) is the ideal thing, so please email me if you see one. The only thing I can think of is something like counting records with relationships, e.g., a table like people(id, name, sig_other):
--How many people have sig others?
SELECT COUNT(sig_other)
FROM people
But I think that would be much clearer as:
SELECT COUNT(*) as sig_other_count
FROM people
WHERE sig_other IS NOT NULL
See: http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_count and http://www.w3schools.com/sql/sql_func_count.asp
Query a period from a list of start dates
Sometimes we want to match an event up with a period in which it happened. For example, we want to say what phase of the moon a crime occurred in. For the crimes we have a specific date, but for the moon phases we might only have the start date, e.g.:
start_datetime,phase
2017-01-05 04:20:00,First Quarter
2017-01-12 01:30:00,Full Moon
2017-01-19 02:51:00,Last Quarter
2017-01-27 10:47:00,New Moon
2017-02-03 12:03:00,First Quarter
How could we query this table to find the phase of the moon for the date 2017-01-22? Looking above we, as humans, can say that it falls between the third and fourth rows, so we can conclude that the moon was in the "Last Quarter" on the 22nd. But how would we do this in sql?
If we had both the start and end dates we could write this as:
SELECT phase
FROM moon_phases
WHERE '2017-01-22' BETWEEN start_datetime AND end_datetime
but we don't have the end_datetime. Or rather, we have the end_datetime but only by reading the table in order. Remember that SQL queries examine each row, returning a TRUE or FALSE value. So we can't ask queries that are "across rows", meaning we can't use the start_datetime of the "next" row as the end_datetime of the previous row.
We can, though, reframe the question: Can we find the last phase that started before our date of interest?
SELECT phase
FROM moon_phases
WHERE start_datetime < "2017-01-22"
ORDER BY start_datetime DESC -- newest at top
LIMIT 1
Ah, that works.
That works for individual dates, but it doesn't lend itself to joins, or to finding the relevant phase for a list of crimes. We could use Python to execute this query once for each row in the database, using a parameterized query for each row and storing the result in a "moon_phase" column in the crimes table.
But we can use it in a correlated sub-query, which executes once for each row (unlike a simple sub-query that executes only once).
SELECT *
FROM crimes
JOIN moon_phases
ON moon_phases.id = (
SELECT moon_phases.id
FROM moon_phases
WHERE moon_phases.start_datetime <= crimes.crime_datetime
ORDER BY moon_phases.start_datetime DESC
LIMIT 1 )
Another possible approach is to add an end_datetime column to the table. Alter the table to add the new column:
ALTER TABLE moon_phases
ADD end_datetime DATETIME;
If the periods have a regular length (e.g., 30 days) this can be much simpler (one could also use DATE_ADD and INTERVAL in a SELECT query directly):
UPDATE moon_phases
SET end_datetime = DATE_ADD(start_datetime, INTERVAL 30 DAY)
However moon phases aren’t regular in duration (and the general case of periods from an ordered list of dates is interesting):
We can set the end_datetime using Python code. First we get the start datetimes of all the periods in order:
sql = """SELECT start_datetime
         FROM moon_phases
         ORDER BY start_datetime ASC"""
cursor.execute(sql)
results = cursor.fetchall()  # list of dicts

# make into a simple list of datetimes
just_datetimes = []
for row in results:
    just_datetimes.append(row['start_datetime'])
Now we can offset this list to find the end dates. We make two lists: the first one is all start_dates, so it is missing the last item; the second list is all end_dates, so it is missing the first item. Then we use the zip function, which takes one thing from each list. Here I illustrate using just the days, to type less.
ordered_list = [5, 12, 19, 27, 3]
start_dates = ordered_list[:-1]    # [5, 12, 19, 27]
end_dates = ordered_list[1:]       # [12, 19, 27, 3]
list(zip(start_dates, end_dates))  # [(5, 12), (12, 19), (19, 27), (27, 3)]
Putting this all together:
sql = """SELECT start_datetime
         FROM moon_phases
         ORDER BY start_datetime ASC"""
cursor.execute(sql)
results = cursor.fetchall()  # list of dicts

# make into a simple list of datetimes
just_datetimes = []
for row in results:
    just_datetimes.append(row['start_datetime'])

update_sql = """UPDATE moon_phases
                SET end_datetime = %(end_datetime)s
                WHERE start_datetime = %(start_datetime)s
             """

start_dates = just_datetimes[:-1]
end_dates = just_datetimes[1:]

for start_date, end_date in zip(start_dates, end_dates):
    # subtract 1 second from end_date to avoid overlap
    end_date = end_date - datetime.timedelta(seconds=1)
    param_dict = {"start_datetime": start_date,
                  "end_datetime": end_date}
    cursor.execute(update_sql, param_dict)
Note that this leaves the final period without an end_datetime. That is unavoidable because that data is not in the database.
The correlated sub-query solution is better in that respect because it would put events in the last period.
Interacting with the database from python
Reading many rows from database without bringing them into memory
cursor.fetchall() is useful but returns all the results into memory in Python. Avoid this with large queries as you might hit errors like pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query'). Instead you can use the cursor directly as an iterator.
sql = "SELECT * FROM venues"  # SQL queries are just a string.

cursor.execute(sql)
# results = cursor.fetchall()  # don't include this line

csv_column_order = ['id', 'name', 'capacity']  # manually
# csv_column_order = [i[0] for i in cursor.description]  # automatically

with open('exported-venues.csv', 'w') as csvfile:
    myCsvWriter = csv.DictWriter(csvfile, fieldnames=csv_column_order)
    myCsvWriter.writeheader()
    for row in cursor:  # rather than 'for row in results'
        myCsvWriter.writerow(row)
Get column names in order while using DictCursor
The DictCursor is convenient but we lose the order of the columns as returned by the database server. The cursor has this information, but it is a little hidden. You can get the names in a list (thus preserving order) using this code:

column_names = [i[0] for i in cursor.description]
Insert multiple rows into the database based on values in different columns in a single row (aka pivot longer)
Use an inner for loop over the columns within a row. e.g., a CSV about Gold Medals
country,1984,1988,1992
Australia,32,33,34
USA,60,23,34
for row in myCSVReader:
    country = row["country"]  # some item that applies to all.
    for key, value in row.items():  # row is a dict: {"country": "Australia", "1984": 32, ...}
        if (key == "country"):
            continue  # skip this item.
        param_dict = { "country": country,
                       "year": key,
                       "gold_medals": value }
        cursor.execute(sql_insert, param_dict)
Afterwards these data would be represented in a “long” format:
id, country, year, gold_medals
1, Australia, 1984, 32
2, Australia, 1988, 33
3, Australia, 1992, 34
4, USA, 1984, 60
5, USA, 1988, 23
6, USA, 1992, 34
See Data Transforms and the School Absences homework (which is similar, except iterating over a string in the row rather than the row items).
Handle names with different forms
In the database, create an entity for the thing and an entity for the names. e.g., Person has_many PersonName, PersonName belongs_to Person; or Country has_many CountryName, CountryName belongs_to Country. Use that data to create a dictionary that enables quick conversion from any form of the name to either a standard name or the person/country id.
us_names = ["usa", "United States of America", "US"]
countryMap = { "usa": "usa",
               "United States of America": "usa",
               "US": "usa" }

for alt_name in us_names:
    print("Even given {}".format(alt_name))
    # map alt_name to "usa" by using it as a key to the dictionary
    print("We can get {}".format(countryMap[alt_name]))
See code and database schema at Miscellaneous Tips
Handle state names and abbreviations
State names and abbreviations are examples of different forms of names. See above for general approach, but for US states specifically there is really useful data available. This used to be on a website called statetable but I’ve cached a copy here from data.world: https://data.world/markmarkoh/us-state-table
id | name | abbreviation | fips_state | assoc_press | standard_federal_region | census_region | census_region_name | census_division | census_division_name | circuit_court |
---|---|---|---|---|---|---|---|---|---|---|
1 | Alabama | AL | 1 | Ala. | IV | 3 | South | 6 | East South Central | 11 |
2 | Alaska | AK | 2 | Alaska | X | 4 | West | 9 | Pacific | 9 |
3 | Arizona | AZ | 4 | Ariz. | IX | 4 | West | 8 | Mountain | 9 |
This includes abbreviations and useful stuff like which region of the country a state is in, which allows you to do your analyses by region, which can be easier to understand than by state.
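For example, if you load that data as a state_table table and have a crimes table with a state_abbr column (both names here are just illustrations), you can join on the abbreviation and group by region:

SELECT state_table.census_region_name, COUNT(*) AS num_crimes
FROM crimes, state_table
WHERE crimes.state_abbr = state_table.abbreviation
GROUP BY state_table.census_region_name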
Avoid inserting duplicate records
Use three SQL queries: a SELECT, an UPDATE, and an INSERT. SELECT to see if a record already exists; if it does then an UPDATE (or nothing), if it doesn't then an INSERT.
sql_select = "SELECT id FROM venues WHERE name = %(name)s"
sql_update = "UPDATE venues SET capacity = %(capacity)s WHERE id = %(id)s"
sql_insert = """INSERT INTO venues(name, capacity)
                VALUES (%(name)s, %(capacity)s)
                RETURNING id"""

param_dict = { "name": "AMD",
               "capacity": 2000 }

cursor.execute(sql_select, param_dict)
if (cursor.rowcount == 0):  # none found by SELECT
    cursor.execute(sql_insert, param_dict)
    venue_id = cursor.fetchone()['id']  # from INSERT ... RETURNING id
else:
    venue_id = cursor.fetchone()['id']  # from select
    # update if additional fields are being added.
    param_dict["id"] = venue_id
    cursor.execute(sql_update, param_dict)
See also class notes on CSV to SQL via Python: http://howisonlab.github.io/datawrangling/class_modules/12_csv_to_sql_via_python/#find-or-create-an-id-csv-contains-mix-of-data-about-new-and-existing-entities
Lookup foreign keys while inserting data
tl;dr: Use the "find or create" sequence.
Ultimately we want to insert into the performances table, but to do that we need the band_id and the venue_id. However, we don't yet have those. Instead we have the venue_name and the band_name. So we have to convert those to ids. We convert those by first checking if they are already in the database (using a SELECT). If they are, then we have the id. If they aren't then we use an INSERT to create the row, and get the id that way. Either way, we end up with the ids we need and can use them in the insert into performances.
sql_insert_performance = """INSERT INTO performances(start, band_id, venue_id)
                            VALUES (%(start_time)s, %(band_id)s, %(venue_id)s)"""

sql_select_venue = "SELECT id FROM venues WHERE name = %(venue_name)s"
sql_select_band = "SELECT id FROM bands WHERE name = %(band_name)s"

sql_insert_venue = "INSERT INTO venues(name) VALUES (%(venue_name)s)"
sql_insert_band = "INSERT INTO bands(name) VALUES (%(band_name)s)"

# data normally would come from a CSV
incoming_data = [ { "band_name": "Cut Copy",
                    "venue_name": "AMD",
                    "start_time": "2016-03-12 18:00:00"} ]

for row in incoming_data:
    # find or create venue
    cursor.execute(sql_select_venue, row)
    results = cursor.fetchall()

    if (len(results) == 0):
        cursor.execute(sql_insert_venue, row)
        venue_id = cursor.lastrowid  # get id of newly created venue.
    else:
        venue_id = results[0]["id"]  # get found id.

    # find or create band
    cursor.execute(sql_select_band, row)
    results = cursor.fetchall()
    if (len(results) == 0):
        cursor.execute(sql_insert_band, row)
        band_id = cursor.lastrowid  # get id of newly created band.
    else:
        band_id = results[0]["id"]  # get found id.

    # add ids to param_dict
    param_dict = {"band_id": band_id,
                  "venue_id": venue_id,
                  "start_time": row["start_time"]}

    cursor.execute(sql_insert_performance, param_dict)
You might think it worthwhile to create a reusable find_or_create function, but it is very tricky to do that in general; it is almost worthwhile to write a find_or_create function for each table if you always do lookups by a specific column. Honestly, though, by the time that is worthwhile you should consider using an Object-Relational Mapping library, such as SQLAlchemy.
See CSV to SQL via Python materials.
Join database tables without foreign keys
Foreign keys are a convenience but not required for joining tables; any field with matching values will work. e.g.,
-- state_gdp(state_abbr,gdp)
-- state_weather(state_abbr,weather)
SELECT *
FROM state_gdp, state_weather
WHERE state_gdp.state_abbr = state_weather.state_abbr
or
-- addresses(id,business_name,address,zip)
-- incomes(id,zip,avg_income)
SELECT *
FROM addresses, incomes
WHERE addresses.zip = incomes.zip
-- Now count the businesses in each zip and the income
SELECT addresses.zip, COUNT(*) as num_businesses_in_zip, avg_income as avg_income_in_zip
FROM addresses, incomes
WHERE addresses.zip = incomes.zip
GROUP BY addresses.zip
So you don't need to create primary key/foreign key relationships, although they are very useful, and they are required if there are related objects with no field to match by.
Speed up mysql queries (avoiding the "Lost connection to MySQL server during query" error)
Ensure you have defined indexes on the columns you use in your joins. The server has a guard to ensure that queries don't block up the server for others. That guard kills long-running queries. Indexes make it much easier for mysql to do the joins. You can define these through phpmyadmin using the Index icon to the right of the column row on the structure tab (it might be hidden under the "More" button if you don't have enough width on your screen).
You can tell if your indexes are being used by adding EXPLAIN to the front of your query (check the possible_keys and key output; you want to see something there). See more here: https://www.sitepoint.com/using-explain-to-write-better-mysql-queries/ and http://dev.mysql.com/doc/refman/5.7/en/using-explain.html
So why not just add indexes to everything? They take up space and have to be updated for each insert, so there are trade-offs.
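As a sketch, using the observations/species tables from earlier (the index name idx_species_id is just an illustration):

-- add an index on the column used in the join
ALTER TABLE observations ADD INDEX idx_species_id (species_id);

-- then check that EXPLAIN lists it under possible_keys / key
EXPLAIN
SELECT *
FROM observations, species
WHERE observations.species_id = species.id;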
Speed up mysql inserts
Help, it takes ages for my data to insert! See Importing large files
First try is turning off autocommit:
# Before first execute.
cursor.execute("SET autocommit = 0")
for row in myCSVReader:
    cursor.execute(sql, row)
# All executes done.
cursor.execute("COMMIT")
Second try, for very large files, is using python to create a csv that exactly matches the database table (name of file and names of headers) and mysqlimport from the terminal (not via python). This will ask for your mysql password before doing the insert.
mysqlimport -u testuser -p --local --fields-terminated-by=',' \
--fields-enclosed-by='"' --ignore-lines=1 \
testuser_empty_music_festival ./tickets.csv
btw, don’t even think about using phpmyadmin for this, the web interface does not work with uploading large files.
Addresses and GIS
Get the distance between two addresses
This requires using external libraries, either Maps or GIS. I show an example using Open Street Map. See files below (documentation in the .py file)
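In the meantime, here is a rough sketch of the idea using geopy (geocode both addresses, then measure the geodesic distance between the two points); the addresses are just examples and this assumes both geocode successfully:

from geopy.geocoders import Photon
from geopy.distance import geodesic

geolocator = Photon(user_agent="myGeocoder")

place_a = geolocator.geocode("1616 Guadalupe St, Austin TX")
place_b = geolocator.geocode("110 Inner Campus Drive, Austin TX")

# geodesic() takes two (latitude, longitude) tuples
distance_km = geodesic((place_a.latitude, place_a.longitude),
                       (place_b.latitude, place_b.longitude)).km
print(f"About {distance_km:.2f} km apart")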
Analysis
Complex analyses are beyond the scope of this course, but here are just a few simple things that might be useful.
Assess correlation between two variables
The SciPy and NumPy modules offer many stats functions. Here's correlation; you need to provide two lists of equal length to compare. These will likely be columns from your database query. https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html
import scipy.stats

# extract two columns from your query.
crimes = []
gdp = []
for row in results:
    crimes.append(row["crime"])
    gdp.append(row["gdp"])

# pearsonr and spearmanr return the correlation and a p-value. Get just the correlation using [0]
# pearsonr uses magnitude
print(scipy.stats.pearsonr(crimes, gdp)[0])
# spearmanr uses just rank
print(scipy.stats.spearmanr(crimes, gdp)[0])
Note that easier access to columns etc. can be obtained through the pandas module, but that's beyond the scope of this course.
Why don’t you teach …
What is a list comprehension and why is it not taught
Python offers a syntax to simplify a common idiom:
# Pull all "some_key" out of the list of dicts
new_list = []
for item_dict in myListOfDicts:
    value_for_key = item_dict["some_key"]
    new_list.append(value_for_key)
Could be written as:
new_list = [ item_dict["some_key"] for item_dict in myListOfDicts ]
That’s called a List Comprehension, see more here http://www.pythonforbeginners.com/basics/list-comprehensions-in-python. I don’t teach it in the course because a) I grew up on other languages, and b) I find it confusing to follow the step by step logic. Anyway, it’s a useful thing to know about.
Why don’t you teach the JOIN keyword (aka ANSI JOINs, aka explicit joins)
In the course we learn joins that look like this:
SELECT *
FROM person, tickets
WHERE person.id = tickets.person_id
Another way of writing that is:
SELECT *
FROM person JOIN tickets
ON person.id = tickets.person_id
The first way is variously called the "implicit join" or "old-style join", and the second way is variously called the "explicit join" or "ANSI style join". You may even hear the first way called "the awful, broken, don't ever do it way" or the "wrong way" and the new style the "right way". See http://www.sqlservercentral.com/blogs/brian_kelley/2009/09/30/the-old-inner-join-syntax-vs-the-new-inner-join-syntax/
I teach the "old-style" or "implicit" notation because I explain joins as a logical extension to filtering data using the WHERE clause: joins, regardless of syntax, can be conceptually understood as row-by-row filters on a cross-product (regardless of whether they are implemented that way).
The downsides are well known:
- The server gets slammed by cross-products without joining expressions in the WHERE clause.
- Unexpected rows when using OR in combination with "implicit" joins. See My query using OR is returning unexpected rows.
- Difficult extension to other types of joins, particularly OUTER joins.
This difficult extension to OUTER joins bothers me, as they do come up in projects, and I may change my approach.
Why don’t you teach pandas or tablib
I teach the course using import csv, rather than using a module that can import csv files directly. This means students can't do:
import pandas as pd
myData = pd.read_csv("data/mydata.csv")
but instead have to do:
import csv
with open('data/mydata.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile)
    for row in myCSVReader:
        ...
The reason is fivefold:
- csv allows me to teach about iteration and dictionaries vs lists.
- csv allows us to fix data on the way in, including skipping rows and splitting fields, using basic python syntax. (I'm not saying the syntax is simpler, just more basic.)
- csv has a fallback to treat each row independently (using csv.reader and each row as a list) which is very useful in dealing with borked csv files. True fact: over 50% of csv files are borked in some way, and when handling data from Excel files prepared in organizations that figure approaches 100%.
- I don't have time to teach the pandas syntax and conceptual model of tabular data.
- Honestly, I almost always use R and dplyr for analysis after I've fixed the CSV using python, so I don't know pandas well.
Why do you not teach proper Entity-Relationship diagrams
There are many notations for Data Modeling. Wikipedia has a great description of the options and examples of different notations.
I use a slightly modified version of Unified Modeling Language (UML) class diagrams. I use this because:
- I think it’s simpler :)
- I learned it (I also learned “crow’s foot” notation)
- It maps well onto Object-Relational Models (ORM) frameworks.
- It provides a scaffold onto further UML modeling for object-oriented programming.
The most similar to my presentation is in Clare Churcher's very accessible and inexpensive book "Beginning database design: from novice to professional" (Amazon link, UT libraries catalog link).
Churcher uses UML in much the same way I do (although she uses n for "many" where I use *) and labels the line only in one direction. I find labeling it in both directions (which is used in one of the ISO data modeling notations) helps with reading the diagram out loud (and remembering which set of cardinality numbers go with which direction).
Churcher does talk about modeling what I call “has_many through” but she calls that an “intermediate table” for “many-to-many” with attributes.
Churcher uses a different set of conventions for naming tables and attributes (I use the ActiveRecord conventions popularized by Ruby on Rails), but everything in Churcher’s book should be familiar enough.
Where does this “has_many”, “belongs_to” Relational Vocabulary come from?
The relational vocabulary that I teach comes from the ActiveRecord Object Relational Mapping framework, made famous by Ruby on Rails.
The conventions about tables being pluralized and id and foreign_id also come from ActiveRecord.
I find it a convenient way to teach a set of conventions to encourage consistency.
I also hope it is a useful scaffold for students going forward towards web development.