Data Wrangling Course

James Howison's Data Wrangling course from the Information School at the University of Texas at Austin.

We will read CSV using the csv library. Relevant screencast is here: Reading CSV Files

When we read the csv it arrives at Python as a list of dictionaries. Thus when we read this csv in python it looks like the structure below:

id,name,capacity
1,AMD,700
8,Bud Light,2000
3,Austin Kiddie,2300
4,Austin Ventures,2000
venues = [
           {'capacity': 700, 'id': 1, 'name': 'AMD'},
           {'id': 8, 'capacity': 2000,  'name': 'Bud Light'},
           {'capacity': 2300, 'id': 3, 'name': 'Austin Kiddie'},
           {'id': 4, 'capacity': 2000,  'name': 'Austin Ventures'}
          ]

This is a list with one dictionary per row. Each row has all of the headers.

Because it arrives as a list, we retain whatever order we asked for in an SQL query, so later we can use ORDER BY to order the rows however we want. We don’t need to sort in python.

We will iterate through the results using a for loop. The first thing we should do is print each row using pprint

import pprint

for row in venues:
    print("Next row:")
    pprint.pprint(row)

This enables us to see how to access individual fields:

Next row:
{'capacity': 700, 'id': 1, 'name': 'AMD'}
Next row:
{'capacity': 2000, 'id': 8, 'name': 'Bud Light'}
Next row:
{'capacity': 2300, 'id': 3, 'name': 'Austin Kiddie'}
Next row:
{'capacity': 2000, 'id': 4, 'name': 'Austin Ventures'}

So we can print out the capacities by using the key from the row dictionary: row['capacity'] (note that there is no difference here between using single and double quotes)

for row in venues:
    print("Next row:")
    print(row["capacity"])

Exercises:

  1. Print the names of each venue
  2. Print the names and capacities of each venue in a sentence.

When we have access to a field (which we show we have by printing) we can do anything we can do to any string or int in python. For example we can make the name print out in lower case using the str.lower() method.

for row in venues:
    print("Next row:")
    print(row["name"].lower())

Exercises:

  1. Make the Venue names print in upper case.
  2. Print the capacity minus 120 seats
  3. Print the length of the venue names.

Reading from a CSV file

Above we manually defined our data as a list of dicts. But we want to read from a csv file directly. We do this with code from a library called csv that way we don’t have to handle all the intricate rules about things like quotes inside fields etc.

Copy this data and save to a file called venues.csv in the same folder as your script this code will read and print out each row as a dict.

"id","name","capacity"
"1","AMD","700"
"2","Honda","2000"
"3","Austin Kiddie Limits presented by H-E-B","2300"
"4","Austin Ventures","2000"
"5","BMI","2000"
"6","Vista Equity","2000"
"7","Google+","2000"
"8","Bud Light","2000"
"9","Auditorium Shores","2000"
"10","Auditorium Shores2","2000"
import pprint
import csv

with open('venues.csv') as csvfile:
  # tell python about the specific csv format
  myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')

  # move row by row through the file as though it was a list.
  for row in myCSVReader:
    # Each row arrives in Python as a Dict
    pprint.pprint(row)

Note that all the code handling the file is indented with a with block. Now we can do all the things we did before inside the for loop. Now reading from the file implement the exercises above.

Additional exercises:

  1. Find restaurants with bad inspection scores. Go to https://data.austintexas.gov/ and find restaurant inspection scores. Download the file and upload it to your account on Holden. (You can do this by opening the file locally, creating a file on the server, then copying the data into it.) First just print out each row as a dict, then use an if statement to only print those with scores below 90 points.