Data Wrangling Course

James Howison's Data Wrangling course from the Information School at the University of Texas at Austin.

In this class we’ll put all the elements we need for our projects together. We’ll read our csv files, split out data to relevant tables, manage foreign/primary keys and then run a query to export data for our analysis.

The relevant screencast is here: CSV to MySQL

The full code used in the screencast is at the bottom of the page along with the csv data to be saved into files.

There are five things to handle:

  1. csv has different names than database columns.
  2. csv has additional columns we don’t want to import.
  3. csv contains names but we want ids.
  4. csv contains data to be split into multiple tables.
  5. csv contains mix of data about new and existing entities.

csv has same names as database columns

We can use row directly with same names in both parts of SQL query.

"id","name","capacity"
"1","AMD","700"
"2","Honda","2000"
"3","Austin Kiddie Limits presented by H-E-B","2300"
with open('venues-header.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile)

    sql = """INSERT INTO venues(name,capacity)
                VALUE (%(name)s,%(capacity)s)"""

    for row in myCSVReader:
        # use row directly when csv headers match column names.
        cursor.execute(sql, row)

csv has different names than database columns

We can change the names in the placeholder SQL to match the headers in the CSV.

"id","some_name_in_csv","capacity_in_csv"
"1","AMD","700"
"2","Honda","2000"
"3","Austin Kiddie Limits presented by H-E-B","2300"
with open('venues-header-diff-headers.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile)

    # change names in placeholder to match names in csv file.
    sql = """INSERT INTO venues(name,capacity)
          VALUE (%(some_name_in_csv)s,%(capacity_in_csv)s)"""

    for row in myCSVReader:
        # use row directly
        cursor.execute(sql, row)

csv has additional columns we don’t want to import

Only the keys in the placeholder sql are sent to MySQL. So you can drop columns just by not mentioning their keys in the placeholder SQL. That is true regardless of how the names match or don’t match.

Add additional values into the query.

We can add additional values into the query by creating a param_dict which is a mix of values from row and other places.

Here we add a new type column to the database and include it as a constant.

with open('venues-header.csv') as csvfile:
    # tell python about the specific csv format
    myCSVReader = csv.DictReader(csvfile)

    # Query has type added, must also set up in database.
    sql = """INSERT INTO venues(name,capacity,type)
                  VALUE (%(name)s,%(capacity)s,%(type)s)"""
    # move row by row through the file
    for row in myCSVReader:
        param_dict = {'name': row["csv_name"],
                      'capacity': row["csv_capacity"],
                      'type': "outdoor"}
        cursor.execute(sql, param_dict)

csv contains names but we want ids.

We have to use a SELECT to find the foreign key before doing an INSERT. The id that we get back is additional data, just like the constant above (“outdoor”) so we create a param_dict (just a convenient variable name).

with open('venues-header-diff-headers.csv') as csvfile:
    # tell python about the specific csv format
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')

    sql = "INSERT INTO venues(name,capacity,type) VALUE (%(name)s,%(capacity)s,%(type)s)"
    # move row by row through the file
    for row in myCSVReader:
        param_dict = {'name': row["csv_name"],
                      'capacity': row["csv_capacity"],
                      'type': "outdoor"}
        cursor.execute(sql, param_dict)

csv contains names but we need foreign keys

Use SELECT to get needed ids before executing the INSERT.

# "Auditorium Shores","Young the Giant","2013-04-13 12:45:00"

sql_select_venue = "SELECT id from venues WHERE name = %(venue_name)s"
sql_select_band_name = "SELECT id from bands WHERE name = %(band_name)s"

sql_insert_performance = """INSERT INTO performances(start,band_id,venue_id)
                VALUE (%(start_time)s, %(band_id)s, %(venue_id)s)
"""

with open('new_performances_existing_venues.csv') as csvfile:
    # tell python about the specific csv format
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')

    for row in myCSVReader:
        cursor.execute(sql_select_venue, row)
        results = cursor.fetchone()
        venue_id = results['id']

        cursor.execute(sql_select_band, row)
        results = cursor.fetchone()
        band_id = results['id']

        # we have what we need.
        param_dict = {'venue_id': venue_id,
                      'band_id': band_id,
                      'start_time': row['start']}
        cursor.execute(sql_insert_performance, param_dict)

csv contains mix of data about new and existing entities.

We have to use a SELECT with an if/else to see if an entry already exists. If it does we get the foreign key, if it doesn’t we create an entry for it.

See full discussion here.

The csv files are available for copying:

venues-header-diff-headers.csv

"id","csv_name","csv_capacity"
"1","AMD","700"
"2","Honda","2000"
"3","Austin Kiddie Limits presented by H-E-B","2300"
"4","Austin Ventures","2000"
"5","BMI","2000"
"6","Vista Equity","2000"
"7","Google+","2000"
"8","Bud Light","2000"
"9","Auditorium Shores","2000"

new_venues_and_bands.csv

venue_name,capacity,band_name,datetime
"Rocking New Venue",2000,"Ratcat","2013-04-12 12:45:00"
"Rocking New Venue",2000,"INXS","2013-04-12 14:45:00"
"Another new Venue",3000,"INXS","2013-04-12 12:45:00"

new_perf_mix_old_new.csv

venue_name,band_name,datetime
"BMI","Ratcat","2013-04-13 12:45:00"
"BMI","INXS","2013-04-13 14:45:00"
"Totally New Venue","Ratcat","2013-04-14 1:45:00"
"BMI","Totally New Band","2013-04-14 14:45:00"
import pymysql
import pprint
import csv

############################
# Moving from a csv to mysql
############################

# First you need to go to class_music_festival --> operations and copy just the
# database structure to <yourusername>_empty_music_festival


# Open the connection to the database (must be one you can write to!)
connection = pymysql.connect(host="mariadb",           
                             user="root",             
                             passwd="",   
                             db="testuser_empty_music_festival", # name of the db
                             autocommit=True,
                             cursorclass=pymysql.cursors.DictCursor)

cursor = connection.cursor()

# empty some tables so that we can run this from a known state.
full_query = "TRUNCATE venues; TRUNCATE performances; TRUNCATE bands;"
cursor.execute(full_query)

#  Read the CSV, then as we move row by row we can put it into the database
with open('venues-header-diff-name.csv') as csvfile:
    # tell python about the specific csv format
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')

    # move row by row through the file
    for row in myCSVReader:
        pprint.pprint(row)
        # sql with placeholders for dict_params
        sql = "INSERT INTO venues(name, capacity) VALUE (%(dict_name)s,%(dict_capacity)s)"

        # If the csv headers are the same as the column names, then you
        # can use: cursor.execute(sql, row)
        # But if you have to change the headers then you can map them
        param_dict = { "dict_name": row["csv_name"], "dict_capacity": row["csv_capacity"] }
        cursor.execute(sql, param_dict)

# empty some tables so that we can run this from a known state.
full_query = "TRUNCATE venues; TRUNCATE performances; TRUNCATE bands;"
cursor.execute(full_query)

# It's even simpler if you use the column names from the csv in the placeholder sql.
# then you can just pass row instead of a param_dict. What matters to python is the
# names of the keys of the dict, now where it came from.
with open('venues-header.csv') as csvfile:
    # tell python about the specific csv format
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')

    # move row by row through the file
    for row in myCSVReader:
        pprint.pprint(row)
        # sql with placeholders for dict_params
        sql = "INSERT INTO venues(name, capacity) VALUE (%(name)s,%(capacity)s)"
        cursor.execute(sql, row)

########################
# One CSV to multiple tables.
#######################

# Often our CSV rows contain information that needs to go into more than one table.
# e.g., consider this csv (two new venues, two new bands, performing at same time)

# venue_name,capacity,band_name,datetime
# "Rocking New Venue",2000,"Ratcat","2013-04-12 12:45:00"
# "Another new Venue",3000,"INXS","2013-04-12 12:45:00"

# We have to INSERT a new Venue and remember the id, then INSERT a New band
# and remember its id and only then can we INSERT the new performance.

# empty some tables so that we can run this from a known state.
full_query = "TRUNCATE venues; TRUNCATE performances; TRUNCATE bands;"
cursor.execute(full_query)

with open('new_venues_and_bands.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')

    # move row by row through the file
    for row in myCSVReader:    
        pprint.pprint(row)
        print("---------creating venue")

        venue_sql = """INSERT INTO venues(name, capacity) VALUE
                                       (%(venue_name)s,%(capacity)s)"""

        cursor.execute(venue_sql, row)  
        # store the id for the new venue
        new_venue_id = cursor.lastrowid
        print("New id for venue: {}".format(new_venue_id))

        # Now create the band
        print("creating band for venue_id: {}".format(new_venue_id))

        band_sql = "INSERT INTO bands(name) VALUE (%(band_name)s)"
        cursor.execute(band_sql, row)

        new_band_id = cursor.lastrowid
        print("New id for band: {}".format(new_band_id))

        print("insert performance, using new band and venue ids")
        # Ok, now we have the new foreign keys we need to create the performance. Phew.
        perf_sql = """INSERT INTO performances(start,band_id,venue_id) VALUE
                            (%(start_time)s, %(new_band_id)s, %(new_venue_id)s)"""

        param_dict = { "start_time": row["datetime"],
                       "new_venue_id": new_venue_id,
                       "new_band_id": new_band_id }

        cursor.execute(perf_sql, param_dict)

########################
# That's fine, as long as each row is a new venue, but usually that's not the case
# And we need to look up the relevant foreign keys, rather than getting them with the
# cursor.lastrowid.  
#
# Let's insert two shows, both at AMD and both by the same band but at different times.
# We'll assume the band and the venue is already in the database.

# venue_name,band_name,datetime
# "BMI","Ratcat","2013-04-12 12:45:00"
# "BMI","INXS","2013-04-12 02:45:00"

#######################################


full_query = "TRUNCATE performances;"
cursor.execute(full_query)    
# insert existing BMI venue (RATCAT and INXS inserted above)
full_query = "INSERT INTO venues VALUE (5, 'BMI', 2000)"
cursor.execute(full_query)    

with open('new_performances_existing_venues.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
    # move row by row through the file
    for row in myCSVReader:
        pprint.pprint(row)
        sql = "SELECT id FROM venues WHERE name = %(venue_name)s"
        cursor.execute(sql, row)
        # Only one result expected, use cursor.fetchone()
        result = cursor.fetchone()
        venue_id = result['id']

        sql = "SELECT id FROM bands WHERE name = %(band_name)s"
        cursor.execute(sql, row)
        result = cursor.fetchone()
        band_id = result['id']

        # Ok, Can now insert performances just like above (we have the ids we need)
        perf_sql = """INSERT INTO performances(start,band_id,venue_id) VALUE
                            (%(start_time)s, %(new_band_id)s, %(new_venue_id)s)"""

        param_dict = { "start_time": row["datetime"],
                       "new_venue_id": venue_id,
                       "new_band_id": band_id }

        cursor.execute(perf_sql, param_dict)

#######################################
# But what if you don't know whether you need to create a new venue or not?
# then you try to find an existing record, but if you can't then you
# have to create one.  You know whether you have to create a new one
# if your SELECT doesn't return any rows, which you check with an IF statement.
# Yes, it's a bit of a chore!
#######################################


print("Processing new_perf_mix_old_new.csv")
with open('new_perf_mix_old_new.csv') as csvfile:
    myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
    # move row by row through the file
    for row in myCSVReader:
        pprint.pprint(row)
        sql = "SELECT id FROM venues WHERE name = %(venue_name)s"
        cursor.execute(sql, row)

        # How many results?
        if (cursor.rowcount == 1):
            # Only one result expected, use cursor.fetchone()
            result = cursor.fetchone()
            venue_id = result['id']
        else:
            venue_sql = """INSERT INTO venues(name) VALUE
                                       (%(venue_name)s)"""
            cursor.execute(venue_sql, row)  
            # store the id for the new venue
            venue_id = cursor.lastrowid

        sql = "SELECT id FROM bands WHERE name = %(band_name)s"
        cursor.execute(sql, row)
        if (cursor.rowcount == 1):
            result = cursor.fetchone()
            band_id = result['id']
        else:
            band_sql = "INSERT INTO bands(name) VALUE (%(band_name)s)"
            cursor.execute(band_sql, row)
            band_id = cursor.lastrowid


        # Ok, Can now insert performances just like above (we have the ids we need)
        perf_sql = """INSERT INTO performances(start,band_id,venue_id) VALUE
                            (%(start_time)s, %(new_band_id)s, %(new_venue_id)s)"""

        param_dict = { "start_time": row["datetime"],
                       "new_venue_id": venue_id,
                       "new_band_id": band_id }

        cursor.execute(perf_sql, param_dict)