Appendix A — Data Wrangling FAQ.

This is an FAQ for James Howison’s Data Wrangling course in the Information School at the University of Texas at Austin. It’s an introductory course that teaches enough python to read csv and wrangle data into mysql, then back out again to conduct some analysis.

Tip

For Fall 2023 we switched to using DuckDB. This implies some small changes to some of the code below. I'm working through to show both versions, but in the meantime be aware of:

  1. You may need to change cursor.execute to con.execute (DuckDB doesn’t require creating a cursor first, but postgres and mysql did)

  2. Where you see something like %(keyname)s, change that to $keyname. The %(keyname)s format is what the python library for talking to postgres uses as a placeholder (see the sketch after this list).

  3. You can ignore the pprint.pprint stuff. That was used before notebooks (when outputting to the terminal).
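
For example, here's a minimal sketch of the first two changes side by side, using a made-up venues table (the old style is shown in comments):

import duckdb

con = duckdb.connect("music_festival.db")
con.execute("CREATE TABLE IF NOT EXISTS venues(name VARCHAR, capacity INTEGER)")

row = {"name": "AMD", "capacity": 700}

# old style (pymysql / psycopg2): create a cursor, use %(keyname)s placeholders
# cursor = connection.cursor()
# cursor.execute("INSERT INTO venues(name, capacity) VALUES (%(name)s, %(capacity)s)", row)

# DuckDB style: no cursor needed, $keyname placeholders
con.execute("INSERT INTO venues(name, capacity) VALUES ($name, $capacity)", row)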

Python packages

Jupyter does not have a package I need

You can install packages locally on your Docker container using:

import sys
!{sys.executable} -m pip install <package_name>

Replace <package_name> with the name of the package you’d like to install. e.g., glob3.

Getting Data

Convert an html table to csv

http://www.convertcsv.com/html-table-to-csv.htm Be careful to select the actual table you want; avoid using the “All” option in the dropdown box. Wikipedia pages sometimes have multiple tables with headers that should be columns (e.g., Season number in https://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes_(seasons_1%E2%80%9320)#Episodes).

You could scrape and parse the HTML but that’s out of scope for this class. You could also save each converted table to a file with the season in the name (e.g., simpsons-season-1.csv) and then combine the tips below on Reading data from filenames and the glob.iglob from Read files in a directory.
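
For instance, here's a minimal sketch of that combination, assuming you saved the converted tables as simpsons-season-1.csv, simpsons-season-2.csv, etc. in a directory called simpsons:

import csv
import re
from pathlib import Path

for filename in Path('simpsons').glob('simpsons-season-*.csv'):
    # pull the season number out of the filename
    season = re.search(r'simpsons-season-(\d+)', filename.name).group(1)
    with open(filename) as csvfile:
        for row in csv.DictReader(csvfile):
            row["season"] = season  # add the season from the filename to each row
            print(row)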

Convert a table in a pdf to csv

http://tabula.technology/ or https://pdftables.com/

Read a csv file

Use csv library and either csv.DictReader or csv.reader.

import csv
import pprint
with open('your_file.csv') as csvfile:
  # With headers in the file
  myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
  # No headers in file, supply your own in a list
  headers = ["id","year","state_abbr","gdp_millions"]
  myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"',
                               fieldnames = headers)

  # No headers or uneven rows. Each row a list rather than a dict.
  myCSVReader = csv.reader(csvfile, delimiter=",", quotechar='"')

  # move row by row through the file
  for row in myCSVReader:
      pprint.pprint(row)

See Reading CSV materials.

Work through each ‘cell’ or ‘field’ in tabular data

In some cases we want to read a csv and work from top left to bottom right, across each field in a row, then on to the next row. Just as we’d read a book.

with open('your_file.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile)
    columns_in_order = myCSVReader.fieldnames  # no parens
    # vertical moves (row by row)
    for row in myCSVReader:
        # horizontal moves (column by column within a row)
        for header in columns_in_order:
            # header is a string used as a key into the row dictionary
            # e.g., row["capacity"]
            field_value = row[header]
            print(field_value)

Very similar if you want to process the results from an SQL query in this fashion. The difference is where you get the columns_in_order from:


results = cursor.fetchall()  # list of dicts

columns_in_order = [i[0] for i in cursor.description]  # awkward but works

# vertical moves (dict by dict in list)
for row in results:
    # horizontal moves (column by column within a row)
    for header in columns_in_order:
        # header is a string used as a key into the row dictionary
        # e.g., row["capacity"]
        field_value = row[header]
        print(field_value)

Handle files without headers

Two options.

  1. Use csv.reader rather than DictReader. csv.reader produces a list for each row rather than a dict, so you have to use indexes to indicate which column you want (see the sketch after the code below).

  2. Provide a list of strings to use as headers to DictReader using the fieldnames parameter. If you provide that then DictReader will process each row into a dictionary with those keys (and will process the first row as data). You shouldn’t do this if there are headers in the file.

with open('csv_with_preamble.csv') as infile:
    headers = ["name", "capacity"]
    myCSVReader = csv.DictReader(infile,fieldnames=headers)
    for row in myCSVReader:
        pprint.pprint(row)
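
And a minimal sketch of option 1, using csv.reader and positional indexes (assuming the same two columns, name then capacity):

import csv

with open('csv_with_preamble.csv') as infile:
    myCSVReader = csv.reader(infile)  # each row is a list, not a dict
    for row in myCSVReader:
        name = row[0]       # first column
        capacity = row[1]   # second column
        print(name, capacity)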

Handle files with repeated identical headers

The DictReader turns columns into keys of dicts. But we can’t have repeated keys for dicts, so you’ll end up with only one of any set of repeated columns. For example, this file has two “points” columns.

game_city,visiting_team,points,home_team,points
Austin,San Antonio,2,Austin,4

As a human we can tell that one is meant to be the visiting_team_points and one is meant to be the home_team_points but only the order of the columns tells us that.

The simplest option is to edit the headers in the csv, but if you are parsing many files or files received automatically that's not possible. You can use the same techniques as for files without headers: manually declare fieldnames, or use csv.reader, which returns a list for each row rather than a dictionary.
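
For example, a minimal sketch for the game file above (the filename is made up): toss the original header row with next(), then supply unique fieldnames of your own.

import csv

with open('games.csv') as infile:
    next(infile)  # discard the header row with the repeated "points" columns
    headers = ["game_city", "visiting_team", "visiting_team_points",
               "home_team", "home_team_points"]
    myCSVReader = csv.DictReader(infile, fieldnames=headers)
    for row in myCSVReader:
        print(row["visiting_team_points"], row["home_team_points"])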

Use a tab separated file

Tab separated files can be parsed with the same code as csv files, but with the delimiter set to "\t":

import csv

with open('Some_tab_separated_file.tsv_or_csv_or_whatever') as tsvfile:
  myTSVReader = csv.DictReader(tsvfile, delimiter= "\t") # or csv.reader
  for row in myTSVReader:
    ...

Use a fixed width file

A file that uses the number of characters to show where fields separate can be parsed using string slicing on each line.

NAME                STATE     TELEPHONE  
John Smith          WA        418-711-4111
Mary Hartford       CA        319-519-4341
Evan Nolan          IL        219-532-c301
with open('some_fixed_file') as fixed_file: # file ending doesn't matter
  # Need character position for each column. Editors usually show this
  # just move your cursor to the start and subtract 1 (0 index)
  for line in fixed_file:
    name = line[:20].rstrip() # rstrip removes spaces at end
    state = line[20:30].rstrip()
    telephone = line[30:].rstrip()
    ...

Geocoding to get state or county for an address

We can use geopy and a geocoding service to take addresses, even malformed or partial ones, and convert them to structured data. This is useful when you have an address (or a lat/long) and want to get an enclosing geographic unit (e.g., County/State).

First install it (if it isn't already installed):

%pip install geopy

Then you can use like this:

from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Photon

geolocator = Photon(user_agent="myGeocoder")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

address = "1616 Guadalupe Austin TX"  # change that string to anything.

location = geocode(address)
# print(location.raw)
county = location.raw['properties']['county']
state = location.raw['properties']['state']

print(f"{address} is in {county} which is a county of {state}")

Rather than printing out the county and state you can use them in an SQL insert. The RateLimiter is important because if you run this on a large dataset you don’t want to hammer the Photon server (which is a freely provided service). More services available at geopy website.

Ensuring the RateLimiter works is a matter of ensuring that it is created outside the block where you iterate, i.e., create the RateLimiter once, then use it many times.

from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Photon

address_list = ["1616 Guadalupe Austin TX", 'Travelling the World on Web3', 'Noord-Brabant', 'PG/MD', 'Toronto, Ontario', 'East, England', 'Johannesburg, South Africa', 'الله ', ' Outchea', 'Road to Riches ', 'Mercury', 'cactuses n shit, colorado ', 'fr_US.lproj', 'Wisconsin', '757', 'Abuja,Nigeria', 'Belfast, Northern Ireland', 'Nigeria', 'GTA ', 'Dekalb County', "i think i'm better now"]

geolocator = Photon(user_agent="myGeocoder")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

for address in address_list: 
    print(f"Attempting to convert: {address}")
    
    location = geocode(address) 
    print(location.raw)
    
    # exact structure of location.raw depends on country
    # ie non-US address unlikely to have county
    if 'county' in location.raw['properties']:
        county = location.raw['properties']['county']
        if 'state' in location.raw['properties']:
            state = location.raw['properties']['state']

            print(f"{address} is in {county} which is a county of {state}")
    
    print("\n\n") # spacing to make output readable

Better still is to separate the geocoding call from the code that attempts to parse the result.

This is useful because parsing is often buggy, and separating the steps avoids calling the service more than once.

Below I do this all in Python; you could instead put location.raw into a TEXT column in the database each time, then try parsing and store the result in another column (using UPDATE).

from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Photon
import time # to show when calls are made to geocode

"""
helper function to lookup address
must create the geocode function outside the function for the 
rte limiter to work
"""
def map_location(address): 
    print(time.perf_counter())
    print(f"Attempting to convert: {address}")
    location = geocode(address)
    return location

geolocator = Photon(user_agent="myGeocoder")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

address_list = ["1616 Guadalupe Austin TX", 'Travelling the World on Web3', 'Noord-Brabant', 'PG/MD', 'Toronto, Ontario', 'East, England', 'Johannesburg, South Africa', 'الله ', ' Outchea', 'Road to Riches ', 'Mercury', 'cactuses n shit, colorado ', 'fr_US.lproj', 'Wisconsin', '757', 'Abuja,Nigeria', 'Belfast, Northern Ireland', 'Nigeria', 'GTA ', 'Dekalb County', "i think i'm better now"]

geocode_results = [map_location(i) for i in address_list]

# could call geocode directly, without map_location
# disadvantage is that there is no output.
# geocode_results = [geocode(i) for i in address_list]

Once you've gathered the results you can parse them; you can execute this cell many times without calling back to the service.

# No geocoding in this cell, just parsing the results
# ideally we never geocode the same string twice.
for location in geocode_results: 
    print(location.raw,'\n')
    
    # exact structure of location.raw depends on country
    # ie non-US address unlikely to have county
    if 'county' in location.raw['properties']:
        county = location.raw['properties']['county']
        if 'state' in location.raw['properties']:
            state = location.raw['properties']['state']

            print(f"{location}\n County: {county} which is in State: {state}")
    
    print("\n\n") # spacing to make output readable

Inserting data into the database

Drop a column from a csv en route to the database

Just choose not to insert that column by not mentioning it in the INSERT. Let’s say we want to drop the id column here:

id,name,capacity
1,AMD,700
2,Honda,2000
sql_insert = """INSERT INTO my_venues( venue_name, capacity )
                             VALUES ( %(name)s, %(capacity)s )"""
for row in data:
  # doesn't matter that row also has a key "id" because it's not
  # called for in the sql_insert, ie there is no %(id)s in the
  # values part.
  cursor.execute(sql_insert,row)

Skip a comment or preamble before the headers in my csv file

Use next() on the file object before applying the DictReader.

My great dataset
Copyright 2007, big important person.
name,age,occupation
james,39,professor
heshan,36,dancer
with open('csv_with_preamble.csv') as infile:
  next(infile) # toss My great dataset
  next(infile) # toss Copyright 2007

  # next row is headers so that's all good.
  myCSVReader = csv.DictReader(infile) # or csv.reader
  for row in myCSVReader:
    ...

Skip footers or other junk at the bottom of my csv file

Use break in an if statement while iterating. Check the row for a characteristic that shows end of the data, then call break. Often the footnote will all go into the first field, but use print(row) to figure it out.

name,age,occupation
james,39,professor
heshan,36,dancer
Note: Occupation was found in a survey.
with open('csv_with_footnote.csv') as infile:
  # next row is headers so that's all good.
  myCSVReader = csv.DictReader(infile) # or csv.reader
  for row in myCSVReader:
    if (row["name"][:4] == "Note"): # string slice to compare start of string.
      break # break out of for loop, no more rows will be read.
    ... # process data rows as usual.

Sometimes data ends with a blank row which you can detect with any()

name,age,occupation
james,39,professor
heshan,36,dancer
,,
Some commentary starts here
with open('csv_with_footnote.csv') as infile:
  # next row is headers so that's all good.
  myCSVReader = csv.DictReader(infile) # or csv.reader
  for row in myCSVReader:
    if (not any(row.values())): # Are all the values empty? (with csv.reader use: not any(row))
      break # break out of for loop, no more rows will be read.
    ... # process data rows as usual.

Skip a comment, or other data, in the middle of my csv file

Use continue after testing whether the first field starts with the comment character.

name,age,occupation
james,39,professor
#heshan,36,dancer
danielle,21,student
with open('csv_with_comments.csv') as infile:
  # next row is headers so that's all good.
  myCSVReader = csv.DictReader(infile) # or csv.reader
  for row in myCSVReader:
    if (row["name"][:1] == "#") : # or row[0][:1] if using csv.reader
      continue # skip rest of this iteration, go onto next line
    ... # process data rows as usual.

Handle files with multiple data sections

Some CSV files have multiple data sections, each with a little header. Read these as plain text files and then parse each “minicsv” using StringIO to treat that section as a file.

For example, the US DOJ Uniform Crime Reporting stats do this. If you select more than one state you get data like this (with some headers/footers removed):

Estimated crime in Alabama

Year,Population,Violent crime total
2010,4785401,18363
2011,4803689,20166
2012,4822023,21693

Estimated crime in Alaska

Year,Population,Violent crime total
2010,714146,4537
2011,723860,4416
2012,731449,4412

Estimated crime in Arizona

Year,Population,Violent crime total
2010,6413158,26528
2011,6467315,26789
2012,6553255,28108

Each state’s data is repeated in a separate little csv embedded in the larger file.

Eventually we want to create a state column to go along with the others:

State,Year,Population,Violent crime total

To do this we can split up the file and then read each piece as though it were a separate CSV. After the split this ends up very similar to the approach we'd use if these were separate files in a directory.

import re
import pprint

# open the file
fileobject = open("CrimeStatebyState.csv")

dict_by_state = {}  # "state": "minicsv"

minicsv = ""
state = ""

for line in fileobject: # Go line by line, not using csv library.
    # new record or past last record.
    if (line[:9] == "Estimated"):
        # detected start of new minicsv
        # (lines starting with Estimated)
        # If we have an old state, store minicsv in the dict.
        if (state):
            dict_by_state[state] = minicsv

        # Either way now start a new minicsv and get the state.
        matches = re.search('Estimated crime in (.*)$',line)
        state = matches.group(1)
        minicsv = "" #reset
    elif (line != "\n"):  # skips empty lines.
        minicsv += line

# Store the final state's minicsv (no further "Estimated" line follows it).
if (state):
    dict_by_state[state] = minicsv

pprint.pprint(dict_by_state)

which outputs:

{'Alabama': 'Year,Population,Violent crime total\n'
            '2010,4785401,18363\n'
            '2011,4803689,20166\n'
            '2012,4822023,21693\n',
 'Alaska': 'Year,Population,Violent crime total\n'
           '2010,714146,4537\n'
           '2011,723860,4416\n'
           '2012,731449,4412\n',
 'Arizona': 'Year,Population,Violent crime total\n'
            '2010,6413158,26528\n'
            '2011,6467315,26789\n'
            '2012,6553255,28108\n'}

Now we've built up a dictionary where the keys are state names and the values are the mini-csvs (each as a single long string). We can then use our familiar csv parsing code to parse each "mini" csv.

The only wrinkle is that our csv is in a string rather than a file, but python has a way to treat a string as a file: io.StringIO(somestring) can be used in place of a file.

import io
import csv

outrows = []  # build up a list of dicts.

# get key and value.
for state, state_csv in dict_by_state.items():
    string_as_file = io.StringIO(state_csv)
    myCSVReader = csv.DictReader(string_as_file)

    for row in myCSVReader:
        # add state key to row read from the mini-csv
        row["State"] = state
        outrows.append(row)  # or could insert into database

pprint.pprint(outrows)

If you have headers and footers etc, you might need some additional if/else as you parse the records.

Handle multiple headers in columns or cells merged across columns

If your Excel sheet has multiple rows of headers formed by merged cells, save as a csv which will look something like this. You don’t need to unmerge the headers in Excel, just export a csv.

,Agriculture,,,,Industry,,,,Services,,,
,Male,,Female,,Male,,Female,,Male,,Female,
,1990-92,2011-14,1990-92,2011-14,1990-92,2011-14,1990-92,2011-14,1990-92,2011-14,1990-92,2011-14
Afghanistan,..,..,..,..,..,..,..,..,..,..,..,..

See long explanation at Handling Multi Index CSVs.

Handle missing values in rows or cells merged across rows, usually first column

Reuse a value from the last row by not resetting a variable. Some CSVs formatted for reading carry values down, usually resulting from vertically merged cells in the first columns (or rowspan in html). e.g.:

rank,country,year,olympic_gold
1,Russia,1980,43
2,Australia,1980,3
,Bahamas,1980,3
4,Thailand,1980,1

See how Bahamas is missing a rank? That’s because the 2 in the line above was applied to both of them with a merged and vertically centered cell. But when you save as csv that goes away and only one row has the value. Thankfully it’s the top row, so we can let the value “fall through”. All we need to do is pull it out of the row into a variable and not reset that variable if row["rank"] is blank.

for row in myCSVReader:
  if (row["rank"]): # if we have a new rank (blank string is false)
    rank = row["rank"] # Set to 2 for Australia, not changed for Bahamas
  # Now use the rank variable for rank, others straight from row.
  param_dict = { "rank": rank, "country": row["country"], "year": row["year"] }
  ...

Handle multiple values inside a cell

Use str.split() to pull them into a list. This happens when a field is actually a list but the user didn't want to break up the data into relational tables.

student,committee
James,Kevin;Bob;Carsten;Don
sql_insert = """INSERT INTO student_committee(student,member)
                                  VALUES (%(student)s,%(prof)s)"""
for row in myCSVReader:
  student = row["student"]
  committee_list = row["committee"].split(";")
  for member in committee_list:
    param_dict = { "student": student, "prof": member }
    cursor.execute(sql_insert, param_dict)

Split a complex field into multiple columns

Use regex to get the separate parts. Regex practice here: https://regexone.com/ and all the documentation you could ever want here: http://www.regular-expressions.info/

import re
...
for row in myCSVReader:
  #row["show_code"] = "s03e10"
  matches = re.search(r's(\d+)e(\d+)', row["show_code"])

  param_dict["season"] = matches.group(1)
  param_dict["episode"] = matches.group(2)

  cursor.execute(sql,param_dict)

Get a zip code out of an address field

business,address,phone
Daisy Chain,"22344 Newfound Road, Austin, 78722, USA",512 345 4456

Use a regex looking for five digits. This can backfire if the house number is also five digits, so use re.findall() and take the last item, since the zip is at the end of the address.

import re
for row in myCSVReader:
  zip_candidates = re.findall(r"\d{5}", row["address"])
  # can't use zip as a variable name as it is also a python function
  just_zip = zip_candidates[-1] # index -1 gets the last item

Remove dollar signs and commas from field

Use re.sub or str.replace. Removing them is the same as replacing them with an empty string "". The only trick is remembering that in a regex a literal $ is written \$. You can make a group of characters to match with square brackets and match both the $ sign and any commas. The regex can match anywhere in the string, so it replaces all the commas etc.

import re
cleaned_money = re.sub(r'[\$,]', '', row["money"])

Or you can use str.replace which is simpler (doesn’t require a regex) but requires two calls, one for the $ sign and one for the comma.

cleaned_money = row["money"].replace("$","")

cleaned_money = cleaned_money.replace(",","")

param_dict["money"] = cleaned_money

Format a field using commas or dollar signs

Can do this either in the SQL query or in Python.

-- FORMAT adds commas and two decimal places, adding .00
-- or truncating to two decimal places.
SELECT FORMAT(your_table.money,2)
FROM your_table
-- Now add a $ in front
SELECT CONCAT('$', FORMAT(your_table.money,2)) as display_money
FROM your_table

In Python we specify each part using the .format micro language.

row["money"] = 1230403
# .format {} is the placeholder
# {:,} formats the number with commas in groups of three
# {:.2f} formats the number as a float to 2 decimal places
# you can combine these and put a literal $ in front.
row["display_money"] = "${:,.2f}".format(row["money"])

See https://pyformat.info for some more info, although it’s missing the comma thing which is documented here: https://www.python.org/dev/peps/pep-0378/

Handle different date formats on the way into the database

Create a python datetime object and pass it to a parameterized query. If the column is DATE rather than DATETIME, the time portion will be dropped on insert.

import datetime

# create using ints of year, month, day, hours, minutes, seconds
myPythonDate = datetime.datetime(year,month,day)

sql_insert = """
INSERT INTO performances(venue_id,band_id,start)
VALUES (%(venue_id)s,%(band_id)s,%(start_time)s)"""

param_dict = {"venue_id": 7,
              "band_id": 23,
              "start_time": myPythonDate}
cursor.execute(sql_insert,param_dict)

You can also create dates from strings, using datetime.strptime:

incoming_string = "03/30/2013 14:30pm"
myPythonDate = datetime.datetime.strptime(incoming_string, "%m/%d/%Y %H:%M%p")

Or there is a library that tries to guess the format of a date. This is problematic when the format is ambiguous (e.g., 2/3/2012) but is especially useful when you have heterogeneous datetime formats in one file (often the case with manual entry).

import dateutil.parser
myPythonDate = dateutil.parser.parse(incoming_string)

To catch an ambiguous date format and warn about it, use this code from Zonedabone on Stack Overflow.

from dateutil import parser

def parse(string, agnostic=True, **kwargs):
    if agnostic or parser.parse(string, **kwargs) == parser.parse(string, yearfirst=True, **kwargs) == parser.parse(string, dayfirst=True, **kwargs):
        return parser.parse(string, **kwargs)
    else:
        print("The date was ambiguous: {}".format(string))
        # raise ValueError("The date was ambiguous: %s" % string)

dates = ["2017-09-23 12:32pm",
         "5-5-2012",
         "5-12-2012",
         "2019/03/18",
         "3 May 2009",
         "May 1st, 2009"]

for date in dates:
    date = parse(date, agnostic=False)
    if (date):
        print(date.strftime("%Y-%m-%d %H:%M:%S"))

Record the time I created a record

For SQL use NOW(), for python use datetime.datetime.now()

sql_insert = """
INSERT INTO log_data(message,log_time)
VALUES (%(message)s,NOW())
"""
param_dict = { "message": some_message }
cursor.execute(sql_insert, param_dict)
import datetime
outrow = { "message": some_message,
           "log_time": datetime.datetime.now() }
           # time at creation, not writing to csv.
outrows.append(outrow)
# or if time writing to csv is crucial, add log_time just before writing
outrow = { "message": some_message }
outrows.append(outrow)
...
for outrow in outrows:
  outrow["log_time"] = datetime.datetime.now()
  myCsvWriter.writerow(outrow)

Get data from filenames

Use a regex to pull a field from the filename into a variable. This is useful to combine with reading many csvs of the same format (Path.glob, see below).

filename = "2008data.csv"
matches = re.search('(\d+)data.csv',filename)
year = matches.group(1)

with open(filename) as csvfile:
  myCSVReader = csv.DictReader(csvfile)
  for row in myCSVReader:
    # add year from filename to each row data
    row["year"] = year
    cursor.execute(sql,row) # assuming some sql defined above

Process lots of csv with the same format

Use Path.glob, which returns the matching files as Path objects you can iterate over. Very useful to combine with getting data from filenames (see above). (Older code uses glob.iglob for the same effect.)

from pathlib import Path

# dir_of_files/somedatafile1.csv
# dir_of_files/somedatacontinued.csv
# dir_of_files/evenmoredata.csv
for filename in Path('dir_of_files').glob('*.csv'):
    with open(filename) as csvfile:
      myCSVReader = csv.DictReader(csvfile)
      for row in myCSVReader:
        cursor.execute(sql,row)

Combining filename reading and processing a directory

# dir_of_files/texas.csv
# dir_of_files/arizona.csv
# dir_of_files/california.csv

import csv
from pathlib import Path

for filename in Path('dir_of_files').glob('*.csv'):
    
    with open(filename) as csvfile:
      
      state = filename.stem  # filename is already a Path object
        
      myCSVReader = csv.DictReader(csvfile)
      for row in myCSVReader:
        row["state"] = state
        cursor.execute(sql, row)

Handle files that cause UnicodeEncodeError or other character encoding issues

Three things need to match up: 1) the collation of the database, 2) the encoding used by python to open the file, and 3) the charset of the database connection.

Often the best approach is to open the csv file in your text editor and save it as utf-8, then set everything to handle utf-8 (although note that pymysql calls it utf8 and mysql uses the collation utf8_unicode_ci, or utf8-bin).

connection = pymysql.connect(host="localhost",           
                             ..., # all the usual parameters (user/db/etc)
                             charset="utf8")
...
with open('tricky_file.csv', encoding="utf-8", errors="surrogateescape") as infile:

See also: MySQL collations: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html Python UnicodeEncodeError error processing options http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

Handle data for multiple years or periods

Sometimes data is gathered for multi-year periods, such as a measurement of sports audience with a year value of 2003-2005. We often face the challenge of comparing this data with data collected for single years, or with other periods, such as 2004-2006. Notice that I'm calling this a Period rather than a year. We want to do comparisons per year, so we have to map between Period and Year, and make some estimates.

First we create a Period entity with a name: 2003-2005 becomes Period(id, name) with a row of (1, "2003-2005"). Then we can map that Period to Year. One way would be to give Period attributes of start_year and end_year. Another option, which might be easier for querying and grouping, would be to say Period habtm Year.

Then we have to decide how the data was collected. Two options are common: either it’s a total for that period (how many people attended any game within the period?) or it’s an average for the years in the period (the value represents the yearly attendance over this period).

That suggests a third modeling option: Audience has_many PeriodMeasure; PeriodMeasure has_many Year through YearlyEstimate. PeriodMeasure has (id, name, type, value) where type is either "average" or "total". YearlyEstimate has (id, period_measure_id, year_id, estimate).
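
A minimal sketch of what that third option might look like as tables (the names are mine, following the class conventions, and the type column records whether the value is a period total or a yearly average):

import duckdb

con = duckdb.connect("audience.db")

con.execute("""CREATE TABLE period_measures(
                 id INTEGER PRIMARY KEY,
                 name VARCHAR,   -- e.g., '2003-2005'
                 type VARCHAR,   -- 'total' or 'average'
                 value INTEGER)""")

con.execute("""CREATE TABLE years(
                 id INTEGER PRIMARY KEY,
                 year INTEGER)""")

con.execute("""CREATE TABLE yearly_estimates(
                 id INTEGER PRIMARY KEY,
                 period_measure_id INTEGER,
                 year_id INTEGER,
                 estimate INTEGER)""")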

Database queries

My query using OR is returning unexpected rows

In SQL AND has precedence over OR which can lead to strange results. If you are specifying alternative values for one filter it’s simpler to use IN. Rather than

WHERE year = 2004 OR year = 2008 OR year = 2009

(which will work but is not preferred) say

WHERE year IN (2004,2008,2009)

In addition to being shorter and clearer it is also safer if you have other conditions in your WHERE clause. You might try:

WHERE name = "james" AND year = 2004 OR year = 2006

But that will not get records for James from 2004 or 2006. It will get all records from 2006 and just the James records from 2004. That is because AND is resolved before OR. What the server sees is:

WHERE ( name = "james" AND year = 2004 ) OR year = 2006

when you meant (and these parens change the order):

WHERE name = "james" AND ( year = 2004 OR year = 2006 )

So just say:

WHERE name = "james" AND year IN (2004,2006)

This is particularly problematic if you have a join in the WHERE clause because it can create false records.

SELECT *
FROM observations, species
WHERE observations.species_id = species.id AND
  year = 2004 OR year = 2008 OR year = 2009

But that will give unexpected results because the AND pulls the first year to it and leaves the other years on their own. You can see this with parens (this is what the query above actually does):

SELECT *
FROM observations, species
WHERE (
        observations.species_id = species.id
        AND year = 2004
      )
      OR
        year = 2008
      OR
        year = 2009

This includes rows that never existed (because they are generated by the cross-join in the FROM clause but not eliminated by the join condition in the WHERE). What you want would require parens around the OR clause to force it to be resolved first.

SELECT *
FROM observations, species
WHERE observations.species_id = species.id
        AND
      ( year = 2004 OR year = 2008 OR year = 2009 )

But it’s simpler and safer to say:

SELECT *
FROM observations, species
WHERE observations.species_id = species.id
  AND year IN (2004,2008,2009)

Understand three forms of COUNT

The three forms of COUNT are:

  1. COUNT(*) counts the number of rows (in a table or in a group).
  2. COUNT(DISTINCT column) counts the unique values in a column.
  3. COUNT(column) counts the number of non-NULL values in a column.

I often advise people to avoid COUNT(column) because I find the semantics confusing in a query; what people intend to get is unclear. More to the point, I haven't come across a time when it is necessary.

Whenever you are tempted to use COUNT(column), especially when using GROUP BY, ask yourself whether COUNT(*) is clearer and what you actually mean. Usually when we are grouping we are asking "How many somethings have this value?" (e.g., How many performances does a band do? or How many tickets did each person buy?). These are questions about the number of rows, not the number of non-NULL values.

P.S. I'm very interested to find an example where COUNT(column) is the ideal thing, so please email me if you see one. The only thing I can think of is something like counting records with relationships, e.g., a table like people(id,name,sig_other):

--How many people have sig others?
SELECT COUNT(sig_other)
FROM people

But I think that would be much clearer as:

SELECT COUNT(*) as sig_other_count
FROM people
WHERE sig_other IS NOT NULL

See: http://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_count and http://www.w3schools.com/sql/sql_func_count.asp

Query a period from a list of start dates

Sometimes we want to match an event up with a period in which it happened. For example, we want to say what phase of the moon a crime occurred in. For the crimes we have a specific date, but for the moon phases we might only have the start date, e.g.,

start_datetime,phase
2017-01-05 04:20:00,First Quarter
2017-01-12 01:30:00,Full Moon
2017-01-19 02:51:00,Last Quarter
2017-01-27 10:47:00,New Moon
2017-02-03 12:03:00,First Quarter

How could we query this table to find the phase of the moon for the date 2017-01-22? Looking above we, as humans, can say that it falls between the third and fourth rows, so we can conclude that the moon was in the “Last Quarter” on the 22nd. But how would we do this in sql?

If we had both the start and end dates we could write this as:

SELECT phase
FROM moon_phases
WHERE '2017-01-22' BETWEEN start_datetime AND end_datetime

but we don't have the end_datetime. Or rather we have it, but only by reading the table in order. Remember that SQL queries examine each row, returning a TRUE or FALSE value, so we can't write queries that work "across rows"; we can't use the start_datetime of the "next" row as the end_datetime of the previous row.

We can, though, reframe the question: Can we find the last phase that started before our date of interest?

SELECT phase
FROM moon_phases
WHERE start_datetime < "2017-01-22"
ORDER BY start_datetime DESC -- newest at top
LIMIT 1

Ah, that works.

That works for individual dates, but it doesn't lend itself to joins, or to finding the relevant phase for a whole list of crimes. We could use python to execute this query once for each row in the database, using a parameterized query for each row and storing the result in a "moon_phase" column in the crimes table.

But we can use it in a correlated sub-query, which executes once for each row (unlike a simple sub-query that executes only once).

SELECT *
FROM crimes
  JOIN moon_phases
    ON moon_phases.id = (
        SELECT moon_phases.id
        FROM moon_phases
        WHERE moon_phases.start_datetime <= crimes.crime_datetime
        ORDER BY moon_phases.start_datetime DESC
        LIMIT 1 )

Another possible approach is to add an end_datetime column to the table:

ALTER TABLE moon_phases
    ADD end_datetime DATETIME;

If the periods have a regular length (e.g., 30 days) this can be much simpler (one could also use DATE_ADD and INTERVAL in a SELECT query directly.)

UPDATE moon_phases
SET end_datetime = DATE_ADD(start_datetime, INTERVAL 30 DAY)

However moon phases aren’t regular in duration (and the general case of periods from an ordered list of dates is interesting):

We can set the end_datetime using python code. First we get the start_datetimes of all the periods in order:

sql = """SELECT start_date
         FROM moon_phases
         ORDER BY start_date ASC"""

cursor.execute(sql)

results = cursor.fetchall()  # list of dicts

# make into a simple list of datetimes
just_datetimes = []
for row in results:
    just_datetimes.append(row['start_datetime'])

Now we can offset this list to find the end dates. We make two lists: the first one is all start_dates, so it is missing the last item; the second list is all end_dates, so it is missing the first item. Then we use the zip function, which takes one item from each list at a time. Here I illustrate with just the day numbers, to keep it short.

ordered_list = [5, 12, 19, 27, 3]
start_dates = ordered_list[:-1]  # [5, 12, 19, 27]
end_dates = ordered_list[1:]  # [12, 19, 27, 3]
list(zip(start_dates, end_dates))  # [(5, 12), (12, 19), (19, 27), (27, 3)]

Putting this all together:

sql = """SELECT start_datetime
         FROM moon_phases
         ORDER BY start_datetime ASC"""

cursor.execute(sql)

results = cursor.fetchall()  # list of dicts

# make into a simple list of datetimes
just_datetimes = []
for row in results:
    just_datetimes.append(row['start_datetime'])

update_sql = """UPDATE moon_phases
                SET end_datetime = %(end_datetime)s
                WHERE start_datetime = %(start_datetime)s
             """

start_dates = just_datetimes[:-1]
end_dates = just_datetimes[1:]

for start_date, end_date in zip(start_dates, end_dates):
    # subtract 1 second from end_date to avoid overlap
    end_date = end_date - datetime.timedelta(seconds=1)
    param_dict = {"start_datetime": start_date,
                  "end_datetime": end_date}
    cursor.execute(update_sql, param_dict)

Note that this leaves the final period without an end_datetime. That is unavoidable because that data is not in the database.

The correlated sub-query solution is better in that respect because it would put events in the last period.

Interacting with the database from python

Reading many rows from database without bringing them into memory

cursor.fetchall() is useful but brings all the results into memory in Python. Avoid this with large queries, as you might hit errors like pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query'). Instead you can use the cursor directly as an iterator.

sql = "SELECT * FROM venues" # SQL queries are just a string.
cursor.execute(sql)
#results = cursor.fetchall() don't include this line

csv_column_order = ['id','name','capacity'] # manually
# csv_column_order = [i[0] for i in cursor.description] # automatically
with open('exported-venues.csv', 'w') as csvfile:
  myCsvWriter = csv.DictWriter(csvfile, fieldnames = csv_column_order)
  myCsvWriter.writeheader()
  for row in cursor: # rather than 'for row in results'
    myCsvWriter.writerow(row)

Get column names in order while using DictCursor

DictCursor is convenient but we lose the order of the columns as returned by the database server. The cursor has this information, but it is a little hidden. You can get the names in a list (thus preserving order) using this code:

column_names = [i[0] for i in cursor.description]

Insert multiple rows into the database based on values in different columns in a single row (aka pivot longer)

Use an inner for loop over the columns within a row. e.g., a CSV about Gold Medals

country,1984,1988,1992
Australia,32,33,34
USA,60,23,34
for row in myCSVReader:
  country = row["country"] # some item that applies to all.
  for key,value in row.items(): # row is a dict, e.g. {"country": "Australia", "1984": "32", ...}
    if (key == "country"):
      continue # skip this item.
    param_dict = { "country": country,
                   "year": key,
                   "gold_medals": value}
    cursor.execute(sql_insert,param_dict)

Afterwards these data would be represented in a “long” format:

id, country, year, gold_medals
1, Australia, 1984, 32
2, Australia, 1988, 33
3, Australia, 1992, 34
4, USA,       1984, 60
5, USA,       1988, 23
6, USA,       1992, 34

See Data Transforms and the School Absences homework (which is similar, except iterating over a string in the row rather than the row items).

Handle names with different forms

In the database, create an entity for the thing and an entity for the names. e.g., Person has_many PersonName, PersonName belongs_to Person. or Country has_many CountryName, CountryName belongs_to Country. Use that data to create a dictionary that enables quick conversion from any form of the name to either a standard name or the person/country id.

us_names = ["usa","United States of America", "US"]
countryMap = { "usa": "usa",
               "United States of America": "usa",
               "US": "usa" }
for alt_name in us_names:
  print("Even given {}".format(alt_name))
  #map alt_name to "usa" by using it as a key to the dictionary
  print("We can get {}".format(countryMap[alt_name]))

See code and database schema at Miscellaneous Tips

Handle state names and abbreviations

State names and abbreviations are examples of different forms of names. See above for general approach, but for US states specifically there is really useful data available. This used to be on a website called statetable but I’ve cached a copy here from data.world: https://data.world/markmarkoh/us-state-table

id,name,abbreviation,fips_state,assoc_press,standard_federal_region,census_region,census_region_name,census_division,census_division_name,circuit_court
1,Alabama,AL,1,Ala.,IV,3,South,6,East South Central,11
2,Alaska,AK,2,Alaska,X,4,West,9,Pacific,9
3,Arizona,AZ,4,Ariz.,IX,4,West,8,Mountain,9

This includes abbreviations and useful extras like which region of the country a state is in, which allows you to do your analyses by region, which can be easier to understand than by state.
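
For example, once that table is loaded (here as a hypothetical states table, alongside a crimes table that has a state_abbr column), you can group by region instead of by state:

import duckdb

con = duckdb.connect("crimes.db")

# hypothetical tables: states(abbreviation, census_region_name, ...) and crimes(state_abbr, ...)
sql = """SELECT states.census_region_name, COUNT(*) AS num_crimes
         FROM crimes, states
         WHERE crimes.state_abbr = states.abbreviation
         GROUP BY states.census_region_name"""

for region, num_crimes in con.execute(sql).fetchall():
    print(region, num_crimes)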

Avoid inserting duplicate records

Use three sql queries: a SELECT, an UPDATE, and an INSERT. SELECT to see if a record already exists; if it does, then an UPDATE (or nothing); if it doesn't, then an INSERT.

sql_select = "SELECT id from venues WHERE name = %(venue_name)s"
sql_update = "UPDATE venues SET capacity = %(capacity)s WHERE id = %(found_id)s"
sql_insert = """ INSERT INTO venues(name,capacity)
                      VALUE (%(venue_name)s,%(capacity)s)
                 RETURNING id"""
param_dict = { "name": "AMD",
               "capacity": 2000 }
cursor.execute(sql_select,param_dict)

if (cursor.rowcount == 0): # none found by SELECT
  cursor.execute(sql_insert,param_dict)
  venue_id = cursor.fetchone()['id'] # from INSERT ... RETURNING id
else:
  venue_id = cursor.fetchone()['id'] # from select
  # update if additional fields are being added.
  param_dict["id"] = venue_id
  cursor.execute(sql_update,param_dict)

See also the class notes on CSV to SQL via Python: http://howisonlab.github.io/datawrangling/class_modules/12_csv_to_sql_via_python/#find-or-create-an-id-csv-contains-mix-of-data-about-new-and-existing-entities

Lookup foreign keys while inserting data

See CSV to SQL via Python

tl;dw Use the “find or create” sequence.

Ultimately we want to insert into the performances table, but to do that we need the band_id and the venue_id. However, we don’t yet have those. Instead we have the venue_name and the band_name. So we have to convert those to ids. We convert those by first checking if they are already in the database (using a SELECT). If they are, then we have the id. If they aren’t then we use an INSERT to create the row, and get the id that way. Either way, we end up with the id we need and can use those in the insert into performances.

sql_insert_performance = """INSERT INTO performances(start,band_id,venue_id)
                VALUES (%(start_time)s, %(band_id)s, %(venue_id)s)"""

sql_select_venue = "SELECT id from venues WHERE name = %(venue_name)s"
sql_select_band = "SELECT id from bands WHERE name = %(band_name)s"

sql_insert_venue = "INSERT INTO venues(name) VALUES (%(venue_name)s)"
sql_insert_band = "INSERT INTO bands(name) VALUES (%(band_name)s)"


# data normally would come from a CSV
incoming_data = [ { "band_name": "Cut Copy",
                    "venue_name": "AMD",
                    "start_time": "2016-03-12 18:00:00"} ]
                    
for row in incoming_data:
  # find or create venue        
  cursor.execute(sql_select_venue, row)
  results = cursor.fetchall()
  
  if (len(results) == 0):
    cursor.execute(sql_insert_venue, row)
    venue_id = cursor.lastrowid # get id of newly created venue.
  else:
    venue_id = results[0]["id"] # get found id.

  # find or create band
  cursor.execute(sql_select_band, row)
  results = cursor.fetchall()
  if (len(results) == 0):
    cursor.execute(sql_insert_band, row)
    band_id = cursor.lastrowid # get id of newly created band.
  else:
    band_id = results[0]["id"] # get found id.

  # add ids to param_dict
  param_dict = {"band_id" = band_id,
                "venue_id" = venue_id,
                "start_time" = row["start_time"]}

  cursor.execute(sql_insert_performance, param_dict)

You might think it worthwhile to create a reusable find_or_create function, but it is tricky to do in a general way. It is almost worthwhile to write a find_or_create function for each table, if you always do lookups by a specific column. Honestly, though, by the time that is worthwhile you should consider using an Object-Relational Mapping library, such as SQLAlchemy.
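
If you do go the per-table route, here's a minimal sketch of what a find_or_create for venues might look like (DuckDB style, reusing the RETURNING id trick from above; it assumes venues.id is filled in automatically, e.g., from a sequence default):

def find_or_create_venue(con, venue_name):
    found = con.execute("SELECT id FROM venues WHERE name = $name",
                        {"name": venue_name}).fetchall()
    if found:
        return found[0][0]  # id of the existing venue
    # not found, so create it and return the new id
    new_row = con.execute("INSERT INTO venues(name) VALUES ($name) RETURNING id",
                          {"name": venue_name}).fetchone()
    return new_row[0]

# usage inside the row loop:
# venue_id = find_or_create_venue(con, row["venue_name"])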

See CSV to SQL via Python materials.

Join database tables without foreign keys

Foreign keys are a convenience but not required for joining tables; any field with matching values will work. e.g.,

-- state_gdp(state_abbr,gdp)
-- state_weather(state_abbr,weather)
SELECT *
FROM state_gdp, state_weather
WHERE state_gdp.state_abbr = state_weather.state_abbr

or

-- addresses(id,business_name,address,zip)
-- incomes(id,zip,avg_income)
SELECT *
FROM addresses, incomes
WHERE addresses.zip = incomes.zip

-- Now count the businesses in each zip and the income
SELECT addresses.zip, COUNT(*) as num_businesses_in_zip, avg_income as avg_income_in_zip
FROM addresses, incomes
WHERE addresses.zip = incomes.zip
GROUP BY addresses.zip, avg_income

So you don't need to create primary key/foreign key relationships, although they are very useful, and they are required when related records have no natural field to match on.

Speed up mysql queries (avoiding "Lost connection to MySQL server during query" errors)

Ensure you have defined indexes on the columns you use in your joins. The server has a guard to ensure that queries don't block up the server for others; that guard kills long-running queries. Indexes make it much easier for mysql to do the joins. You can define indexes through phpmyadmin using the Index icon to the right of the column row on the Structure tab (it might be hidden under the "More" button if your screen isn't wide enough).

You can tell if your indexes are being used by adding EXPLAIN to the front of your query (check the possible_key and key output, you want to see something there). See more here: https://www.sitepoint.com/using-explain-to-write-better-mysql-queries/ and http://dev.mysql.com/doc/refman/5.7/en/using-explain.html

So why not just add indexes to everything? They take up space and have to be updated on each insert, so there are trade-offs.
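
You can also create an index with a plain SQL statement instead of clicking through phpmyadmin; a sketch using the observations table from the join example above (the index name is made up):

sql = "CREATE INDEX species_id_index ON observations (species_id)"
cursor.execute(sql)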

Speed up mysql inserts

Help, it takes ages for my data to insert! See Importing large files

First try is turning off autocommit:

# Before first execute.
cursor.execute("SET autocommit = 0")
for row in myCSVReader:
  cursor.execute(sql, row)

#All execute done.
cursor.execute("COMMIT")

The second approach, for very large files, is to use python to create a csv that exactly matches the database table (name of file and names of headers) and then run mysqlimport from the terminal (not via python). It will ask for your mysql password before doing the import.

mysqlimport -u testuser -p --local --fields-terminated-by=',' \
           --fields-enclosed-by='"' --ignore-lines=1 \
           testuser_empty_music_festival ./tickets.csv  

btw, don't even think about using phpmyadmin for this; the web interface does not cope with uploading large files.

Addresses and GIS

Get the distance between two addresses

This requires using external libraries, either Maps or GIS. I show an example using Open Street Map. See files below (documentation in the .py file)
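
If all you need is the distance between two geocoded points, here's a minimal sketch using geopy's geodesic distance (the second address is made up for illustration):

from geopy.distance import geodesic
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Photon

geolocator = Photon(user_agent="myGeocoder")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

here = geocode("1616 Guadalupe Austin TX")
there = geocode("110 Inner Campus Drive Austin TX")  # made-up second address

km = geodesic((here.latitude, here.longitude),
              (there.latitude, there.longitude)).km
print(f"The two addresses are {km:.1f} km apart")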

Analysis

Complex analyses are beyond the scope of this course, but here are just a few simple things that might be useful.

Assess correlation between two variables

The SciPy and NumPy modules offer many stats functions. Here's correlation; you need to provide two lists of equal length to compare. These will likely be columns from your database query. https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html

import scipy.stats
# extract two columns from your query.
crimes = []
gdp = []
for row in results:
  crimes.append(row["crime"])
  gdp.append(row["gdp"])

# pearsonr and spearmanr return the correlation and a p-value. Get just the correlation using [0]
# pearsonr uses the magnitudes of the values
print(scipy.stats.pearsonr(crimes,gdp)[0])
# spearmanr uses just the ranks
print(scipy.stats.spearmanr(crimes,gdp)[0])

Note that easier access to columns etc can be obtained through the pandas module, but that’s beyond the scope of this course.

Why don’t you teach …

What is a list comprehension and why is it not taught

Python offers a syntax to simplify a common idiom:

# Pull all "some_key" out of the list of dicts
new_list = []
for item_dict in myListOfDicts:
  value_for_key = item_dict["some_key"]
  new_list.append(value_for_key)

Could be written as:

new_list = [ item_dict["some_key"] for item_dict in myListOfDicts ]

That's called a List Comprehension; see more here: http://www.pythonforbeginners.com/basics/list-comprehensions-in-python. I don't teach it in the course because a) I grew up on other languages, and b) I find it harder to follow the step-by-step logic. Anyway, it's a useful thing to know about.

Why don’t you teach the JOIN keyword (aka ANSI JOINs, aka explicit joins)

In the course we learn joins that look like this:

SELECT *
FROM person, tickets
WHERE person.id = tickets.person_id

Another way of writing that is:

SELECT *
FROM person JOIN tickets
  ON person.id = tickets.person_id

The first way is variously called "implicit join" or "old-style join"; the second way is variously called "explicit join" or "ANSI style join". You may even hear the first way called "the awful, broken, don't ever do it way" or the "wrong way" and the new style the "right way". See http://www.sqlservercentral.com/blogs/brian_kelley/2009/09/30/the-old-inner-join-syntax-vs-the-new-inner-join-syntax/

I teach the “old-style” or “implicit” notation because I explain joins as a logical extension to filtering data using the WHERE clause: joins, regardless of syntax, can be conceptually understood as row-by-row filters on a cross-product (regardless of whether they are implemented that way).

The downsides are well known:

  1. The server gets slammed by cross-products without joining expressions in the WHERE clause,
  2. Unexpected rows when using OR in combination with “implicit” joins. See My query using OR is returning unexpected rows
  3. Difficult extension to other types of joins, particularly OUTER joins. This bothers me, as OUTER joins do come up in projects, and I may change my approach.

Why don’t you teach pandas or tablib

I teach the course using import csv, rather than use a module that can import csv files directly. This means students can’t do:

import pandas as pd
myData = pd.read_csv("data/mydata.csv")

but instead have to do:

import csv
with open('data/mydata.csv') as csvfile:
  myCSVReader = csv.DictReader(csvfile)
  for row in myCSVReader:
    ...

The reason is fivefold:

  1. csv allows me to teach about iteration and dictionaries vs lists.
  2. csv allows us to fix data on the way in, including skipping rows and splitting fields, using basic python syntax. (I’m not saying the syntax is simpler, just more basic).
  3. csv has a fallback to treat each row independently (using csv.reader and each row as a list) which is very useful in dealing with borked csv files. True fact: over 50% of csv files are borked in some way and when handling data from Excel files prepared in organizations that figure approaches 100%.
  4. I don’t have time to teach the pandas syntax and conceptual model of tabular data.
  5. Honestly, I almost always use R and dplyr for analysis after I’ve fixed the CSV using python, so I don’t know pandas well.

Why do you not teach proper Entity-Relationship diagrams

There are many notations for Data Modeling. Wikipedia has a great description of the options and examples of different notations.

I use a slightly modified version of Unified Modeling Language (UML) class diagrams. I use this because:

  1. I think it’s simpler :)
  2. I learned it (I also learned “crow’s foot” notation)
  3. It maps well onto Object-Relational Models (ORM) frameworks.
  4. It provides a scaffold onto further UML modeling for object-oriented programming.

The most similar to my presentation is in Clare Churcher’s very accessible and inexpensive book “Beginning database design: from novice to professional” Amazon link, UT libraries catalog link

Churcher uses UML in much the same way I do (although she uses n for “many” where I use *) and labels the line only in one direction. I find labeling it in both directions (which is used in one of the ISO data modeling notations) helps with reading the diagram out loud (and remembering which set of cardinality numbers go with which direction).

Churcher does talk about modeling what I call “has_many through” but she calls that an “intermediate table” for “many-to-many” with attributes.

Churcher uses a different set of conventions for naming tables and attributes (I use the ActiveRecord conventions popularized by Ruby on Rails), but everything in Churcher’s book should be familiar enough.

Where does this “has_many”, “belongs_to” Relational Vocabulary come from?

The relational vocabulary that I teach comes from the ActiveRecord Object Relational Mapping framework, made famous by Ruby on Rails.

The conventions about tables being pluralized and id and foreign_id also come from ActiveRecord.

I find it a convenient way to teach a set of conventions to encourage consistency.

I also hope it is a useful scaffold for students going forward towards web development.