Data Wrangling Course

James Howison's Data Wrangling course from the Information School at the University of Texas at Austin.

We will connect to mysql using the pymysql library. There are two parts to this, each with a screencast.

  1. Connect to the mysql server and execute a query.
  2. Process the query results and write them to a CSV file.

The full code used in the first screencast is available at the bottom of this page.

To access mysql from python we will use the pymysql library. There are a number of different libraries that people use, so be aware that we’re using pymysql when you are searching for relevant help threads etc. StackOverflow has a tag specifically for questions about this way of accessing MySQL from python: pymysql.

The steps to accessing mysql from python are conceptually the same as when we access via phpmyadmin:

  1. We specify the server that we want to talk to.
  2. We log in using our mysql username/password (for the Docker setup this is just “root” with no password)
  3. We choose the database we want to work with.
  4. We write our SQL query (just as we did for phpmyadmin)
  5. We execute the query (as we did on the phpmyadmin tab)
  6. We obtain and process the results.
import pymysql.cursors

# First set up the connection to the server
connection = pymysql.connect(
            host="mariadb",
            user="root",  # mysql user
            passwd="",  # mysql password
            db="class_music_festival",
            autocommit=True,
            cursorclass=pymysql.cursors.DictCursor
            )

This specifies the host as mariadb because that is the name of the docker container that is running our database. We then provide our user and password and select our database (our old friend class_music_festival). The cursorclass parameter specifies that we’d like to work with lists of Dicts, just as we do when working with CSV. Finally, autocommit means each statement takes effect immediately, so we don’t have to write extra commit code ourselves.
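
If we left cursorclass out, pymysql’s default cursor would hand back each row as a plain tuple, so we’d have to track column positions by index. Here is a minimal sketch of the difference (the values in the comments are illustrative, not from the real database):

# With the default cursor, rows come back as tuples:
with connection.cursor(pymysql.cursors.Cursor) as cursor:
    cursor.execute("SELECT id, name, capacity FROM venues")
    print(cursor.fetchone())  # e.g. (1, 'AMD', 5000)

# With DictCursor (as configured above), rows come back as dicts:
with connection.cursor() as cursor:
    cursor.execute("SELECT id, name, capacity FROM venues")
    print(cursor.fetchone())  # e.g. {'id': 1, 'name': 'AMD', 'capacity': 5000}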

Once we are connected to the server we open up a with block. As with reading and writing files, using a with block means that we don’t have to manually close things when we are done. Inside it we create a cursor, which is the object we use to communicate with the mysql server.

with connection.cursor() as cursor:
    # SQL queries are just a string.
    sql = "SELECT * FROM venues"
    cursor.execute(sql)
    results = cursor.fetchall()  # list of dicts

    # examine query and results
    print(sql)
    print(results) # see, a list of dicts

Given that our results are a list of dicts (which we confirmed by printing them above), we can iterate over the results and do anything we have done with rows so far (such as use pprint, use .format, use an if statement, etc.).

with connection.cursor() as cursor:
    # SQL queries are just a string.
    sql = "SELECT * FROM venues"
    cursor.execute(sql)
    results = cursor.fetchall()  # list of dicts

    for row in results:
        output = "Venue: {name} has Capacity: {capacity}"
        print(output.format(**row))
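
For example, here is a sketch of the if statement case, filtering rows in python after the query (the 1000-seat cutoff is made up purely for illustration):

with connection.cursor() as cursor:
    cursor.execute("SELECT * FROM venues")
    for row in cursor.fetchall():
        # 1000 is an arbitrary cutoff, just for illustration
        if row["capacity"] > 1000:
            print("Big venue: {name} ({capacity} seats)".format(**row))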

Since we have results as a list of dicts in RAM we can also write the results of the query to disk using the csv library, using the code we have already learned.

with connection.cursor() as cursor:
    sql = "SELECT * FROM venues"
    cursor.execute(sql)
    results = cursor.fetchall()  # list of dicts all in ram

import csv #  usually place this at top of script

with open('exported-from-sql-query.csv', 'w', newline='') as csvfile:
    column_names = list(results[0].keys()) #  or manual list
    myCsvWriter = csv.DictWriter(csvfile,
                                 fieldnames = column_names)

    myCsvWriter.writeheader()

    # write the rows. We iterate through results.
    for row in results:
        myCsvWriter.writerow(row)

print("Done writing csv")

Using fetchall() (or iterating over the results) works well for queries that return lots of rows, but if we’re just getting a single value, such as a COUNT(*) of rows, we can use fetchone(), which returns just a single dictionary rather than a list of dictionaries.

import pprint  # usually place this at top of script

with connection.cursor() as cursor:
    sql = "SELECT COUNT(*) AS ticket_count FROM tickets"
    cursor.execute(sql)
    results = cursor.fetchone()  # a single dict, not a list of dicts

    print(sql)
    pprint.pprint(results)

    # we can also print just a single field.
    print(results['ticket_count'])

Whenever using COUNT(*) or COUNT(DISTINCT column) (or indeed any aggregate function) remember to use an alias, since the alias becomes the key of the dictionary for the row. It’s a lot nicer and less error-prone to access results['ticket_count'] than results['COUNT(*)'].
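
To see the difference, here is a quick sketch comparing the two (the counts in the comments are illustrative):

with connection.cursor() as cursor:
    # Without an alias, the key is the literal expression text:
    cursor.execute("SELECT COUNT(*) FROM tickets")
    print(cursor.fetchone())  # e.g. {'COUNT(*)': 250}

    # With an alias, the key is the name we chose:
    cursor.execute("SELECT COUNT(*) AS ticket_count FROM tickets")
    print(cursor.fetchone())  # e.g. {'ticket_count': 250}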

Exercises

You probably want to use “save as” (or copy and comment out your earlier code) so that you keep your work for each of these.

  1. Output all records from the people table into a file called “people.csv”
  2. From the “class_cars” database, output only the 4 cylinder cars made in 2008 into a file called “cars.csv”
  3. Using print output just the count of the 4 cylinder cars made in 2008. Use fetchone()
  4. Create a new database on phpmyadmin called <your_user>_austin_restaurants and import the csv file at /export/home/u11/jhowison/data_wrangling_class_files/7_csv/Restaurant_Inspection_Scores.csv (Note that you can copy that full path into a variable and then use the variable in the open call, as sketched just after this list.)
  5. Write an SQL query to get the bottom 100 restaurant inspection scores and use python to query mysql and save to “restaurants_to_avoid.csv”
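
For exercise 4, the hint about the path variable looks like this (just a sketch; only the path itself comes from the exercise):

# store the long path in a variable once, then reuse it
path = "/export/home/u11/jhowison/data_wrangling_class_files/7_csv/Restaurant_Inspection_Scores.csv"

with open(path) as csvfile:
    ...  # read the csv as usual from here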

Avoiding bringing all records into RAM

In the approaches above, cursor.fetchall() pulls all the results from mysql into python memory. This is convenient, especially because you can use pprint.pprint to confirm that you are getting the results from MySQL that you expect.

However, if your query returns millions of rows you are bringing them all into RAM, and if your results exceed the RAM available, python will crash. If you are concerned about that you can skip a step: don’t create results at all, and instead process the rows one at a time by iterating over cursor directly. I don’t mind which approach you use.

with connection.cursor() as cursor:
    sql = "SELECT * FROM venues"
    cursor.execute(sql)

    for row in cursor:
        output = "Venue: {name} has Capacity: {capacity}"
        print(output.format(**row))
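
One caveat worth knowing: pymysql’s default cursors still download the full result set during execute(), so iterating over cursor mainly saves you the extra results list in python. For a genuinely streaming query you can ask for pymysql’s unbuffered SSDictCursor instead; a minimal sketch, reusing the connection from above:

import pymysql.cursors

# SSDictCursor is pymysql's unbuffered ("server-side") dict cursor:
# rows arrive from the server as we iterate, not all at once.
with connection.cursor(pymysql.cursors.SSDictCursor) as cursor:
    cursor.execute("SELECT * FROM venues")
    for row in cursor:
        output = "Venue: {name} has Capacity: {capacity}"
        print(output.format(**row))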

Using that approach makes writing to csv a little more indented, because you have to write “as you go” rather than after you’ve collected all the results: you are working inside two with blocks at the same time.

with connection.cursor() as cursor:
    # SQL queries are just a string.
    sql = "SELECT * FROM venues"
    cursor.execute(sql)

    with open('exported-from-sql-query.csv', 'w', newline='') as csvfile:
        # Obtain field_names from pymysql
        # See http://stackoverflow.com/questions/5010042/mysql-get-column-name-or-alias-from-query
        column_names = [i[0] for i in cursor.description]

        myCsvWriter = csv.DictWriter(csvfile,
                                     fieldnames=column_names)

        myCsvWriter.writeheader()

        for row in cursor:
            myCsvWriter.writerow(row)

print("Done writing csv")

The full code used in the second screencast is here:

##########################
# Here we execute a query and output it to a CSV file.
##########################

# pymysql is the library that will talk to mysql
import pymysql.cursors
# csv helps us write out the csv files.
import csv
# convenience methods for debugging
import pprint

# First set up the connection to the server
connection = pymysql.connect(host="mariadb",
                      user="root",
                      passwd="",
                      db="class_music_festival",
                      autocommit=True,
                      cursorclass=pymysql.cursors.DictCursor)

# as with opening a file we can use with to open the connection
# the cursor is the object through which we talk to the sql server.
with connection.cursor() as cursor:
    sql = """
        SELECT DAYNAME(purchases.date) AS day_name,
               SUM(tickets.price) AS total_price
        FROM tickets, purchases
        WHERE tickets.purchase_id = purchases.id
        GROUP BY DAYNAME(purchases.date)
        """
    # sql = "SELECT * FROM venues"  # SQL queries are just a string.
    cursor.execute(sql)
    results = cursor.fetchall()

    # fetchall() returns a list of dicts, where each dict has the columns from
    # the table as keys. (That's because we asked for a DictCursor in the connect call.)
    # The next section of this file is the same as our csv_writer.py file.

    # We're going to write out to a CSV file.  Remember the first line
    # of the csv is the headers. We have to do that before moving row by row.
    # You can manually specify a specific order for the CSV output
    # e.g. csv_column_order = ['id','name','capacity']
    # But here we get the column names from the keys of the first item.

    # Optionally we could add an extra column (a city) into the output
    # for each row; those lines are left commented out below.

    csv_column_order = list(results[0].keys())
    # csv_column_order.append("city")  # add "city" to end of the list

    with open('exported.csv', 'w', newline='') as csvfile:
    # Note that here we ask for a DictWriter, which works with the Dicts
    # provided by the DictCursor.
        myCsvWriter = csv.DictWriter(csvfile, delimiter=',',
                                          quotechar='"',
                                          fieldnames = csv_column_order)

        # write the header row (it gets those from the fieldnames)
        myCsvWriter.writeheader()

        # and then each of the other results, row by row.
        for row in results:
            # (these commented lines added the city column back when the
            # query was SELECT * FROM venues)
            # if row["name"] == "AMD":
            #     row['city'] = "Austin"
            # else:
            #     row['city'] = "San Antonio"
            myCsvWriter.writerow(row)