21 Data Wrangling: Reading/Writing CSV

21.1 Reading CSV files

We will read CSV using the csv library and then using for loops and dictionaries to look at each line.

Note that we won’t be using python commands like read_csv instead we’ll be learning to do it more manually. The advantage of this is two-fold: it reinforces basic programming and it provides a fall back option for poorly formatted csv files (of which, sadly, there are many in the world!).

When we read the csv it arrives at Python as a list of dictionaries, one per line.

So that’s a single list (think the file) which has one item for each line. Each of the items in the list (the lines in the file) are dictionaries. In those dictionaries the column headers are the keys, and the actual data is in the values (think “cells” in a spreadsheet).

If we consider this csv

id,name,capacity
1,AMD,700
8,Bud Light,2000
3,Austin Kiddie,2300
4,Austin Ventures,2000

in python it looks like the structure below:

venues = [
           {'capacity': 700, 'id': 1, 'name': 'AMD'},
           {'id': 8, 'capacity': 2000,  'name': 'Bud Light'},
           {'capacity': 2300, 'id': 3, 'name': 'Austin Kiddie'},
           {'id': 4, 'capacity': 2000,  'name': 'Austin Ventures'}
          ]

This is a list with one dictionary per row. Each row repeats all of the headers as keys for the dictionaries. That looks repetitive, but the computer knows how to optimize it :)

At this stage in the course, you have encountered four ways to represent a rectangular data structure:

SQL uses tables with rows and columns. Each piece of data is informally called a field. Each column has a strict datatype (number, text, date) for all the fields in that column.
csv uses a file with lines and columns, where a comma separates the columns on each line. A special line at the top provides names for the columns (called headers). There is no way to enforce datatypes within columns.
Excel uses worksheets with rows and columns referred to with numbers (for the rows) and letters (for the columns), each piece of data is called a “cell”. We can also use a special row at the top for column headers. Excel doesn’t usually enforce datatypes within columns (although it can be done with formatted cells.)
Python csv library uses a list. Inside the list each row is a dictionary. Each dictionary has keys and values. There is no special first line for headers. Instead, the keys are equivalent to the headers, and the values are equivalent to “fields” in SQL or “cells” in Excel. The datatype is represented by whether the object at each value is a python string, a number type, or a datetime object.

Representation
SQL	Database	Table	Row	Column	Field	Datatype
Excel	File/Workbook	Sheet	Row (numbered)	Column (letter)	Cell	(formatting)
csv file	(directory)	file	Line in the file	Parts separated by commas	(field)	-
Python csv library	-	List `[ ... ]`	dictionary `{ ... }`	keys in dictionary “header”:	values in dictionary “header”: value	Python datatype of values (string, int, datetime)

22 Sorting and iterating.

Because it arrives as a list, we retain whatever order we asked for in an SQL query, so we can use ORDER BY in an SQL query to order the rows however we want. We don’t need to sort in python.

We will iterate through the results using a for loop. The first thing we should do is print each row.

for row in venues:
    print("Next row:")
    print(row)

This enables us to see how to access individual fields:

Next row:
{'capacity': 700, 'id': 1, 'name': 'AMD'}
Next row:
{'capacity': 2000, 'id': 8, 'name': 'Bud Light'}
Next row:
{'capacity': 2300, 'id': 3, 'name': 'Austin Kiddie'}
Next row:
{'capacity': 2000, 'id': 4, 'name': 'Austin Ventures'}

So we can print out the capacities by using the key from the row dictionary: row['capacity'] (note that there is no difference here between using single and double quotes)

for row in venues:
    print("Next row:")
    print(row["capacity"])

Exercises (you should do in your own notebook)

Print the names of each venue
Print the names and capacities of each venue in a sentence.

When we have access to a field (which we show we have by printing) we can do anything we can do to any string or int in python. For example we can make the name print out in lower case using the str.lower() method.

for row in venues:
    print("Next row:")
    print(row["name"].lower())

Exercises:

Make the Venue names print in upper case.
Print the capacity minus 120 seats
Print the length of the venue names.

23 Reading from a CSV file

Above we manually defined our data as a list of dicts. But we want to read from a csv file directly. We do this with code from a library called csv that way we don’t have to handle all the intricate rules about things like quotes inside fields etc.

In Jupyter, create a new Text file in the same folder as your Python notebook. Copy the data below into it, and name the file venues.csv. This code will read and print out each row as a dict.

"id","name","capacity"
"1","AMD","700"
"2","Honda","2000"
"3","Austin Kiddie Limits presented by H-E-B","2300"
"4","Austin Ventures","2000"
"5","BMI","2000"
"6","Vista Equity","2000"
"7","Google+","2000"
"8","Bud Light","2000"
"9","Auditorium Shores","2000"
"10","Auditorium Shores2","2000"

import csv

with open('venues.csv') as csvfile:
  # tell python about the specific csv format
  myCSVReader = csv.DictReader(csvfile, delimiter=",", quotechar='"')

  # move row by row through the file as though it was a list.
  for row in myCSVReader:
    # Each row arrives in Python as a Dict
    print(row)

Note that all the code handling the file is indented with a with block. Now we can do all the things we did before inside the for loop. Now reading from the file implement the exercises above.

Additional exercises:

Find restaurants with bad inspection scores. Go to https://data.austintexas.gov/ and find restaurant inspection scores. Download the file and upload it to your Jupyter. First just print out each row as a dict, then use an if statement to only print those with scores below 90 points.

Previous semester screencast is here: Reading CSV Files

23.1 Writing CSV files

We will also write CSV using the csv library.

The full code used in the screencast is at the bottom of this page.

As with reading CSV files (and perhaps more so) writing them can be tricky (because of fields with commas, newlines, or quoting characters with in the fields). We definitely want to use a library and the csv module provides that. We will not, and you should never, try to manually construct CSV files using strings.

As with reading we will work to write a list of dictionaries to the CSV.

import csv

# A list of dicts.
venues = [{'capacity': 700, 'id': 1, 'name': 'AMD'},
{'capacity': 2000, 'id': 2, 'name': 'Honda'},
{'capacity': 2300, 'id': 3, 'name': 'Austin Kiddie Limits'},
{'capacity': 2000, 'id': 4, 'name': 'Austin Ventures'}]

To help the CSV writer access the data you have to declare a list of column names (and the order of those column names defines the order of the columns in the CSV file).

column_names = ['id','name','capacity']
# If you don't care about column order
# can also use column_names = venues[0].keys()

As with reading csv files we use a with block and provide some basic information to create a csv.DictWriter just as we created a csv.DictReader.

with open('exported.csv', 'w') as csvfile:
    myCsvWriter = csv.DictWriter(csvfile,
                                 delimiter=',',
                                 quotechar='"',
                                 fieldnames = column_names)

    myCsvWriter.writeheader() # uses contents of column_names

    # write the rows. We iterate through the list of dicts.
    for row in venues:
        # every row must have all keys from column_names, otherwise an error is thrown.
        myCsvWriter.writerow(row)

print("Done writing csv")

The print command ends the for loop and the with block (because it is not indented). Python knows that you are done with myCsvWritern and csvfile. It closes the file, making it available for others to read.

You can look at the file using the Jupyter browser.

Note that the row dictionary can have extra keys, they are just unused and don’t go into the CSV file.

Full csv_writer.py:

import csv

# A list of dicts.
venues = [{'capacity': 700, 'id': 1, 'name': 'AMD'},
{'capacity': 2000, 'id': 2, 'name': 'Honda'},
{'capacity': 2300, 'id': 3, 'name': 'Austin Kiddie Limits'},
{'capacity': 2000, 'id': 4, 'name': 'Austin Ventures'}]

column_names = ['id','name','capacity']

with open('exported.csv', 'w') as csvfile:

    myCsvWriter = csv.DictWriter(csvfile,
                                 delimiter=',',
                                 quotechar='"',
                                 fieldnames = column_names)

    myCsvWriter.writeheader() # uses contents of column_names

    for row in venues:
        myCsvWriter.writerow(row)

print("Done writing csv.")

Exercises:

Change the order of columns, so that capacity is written first.
Use an if statement to avoid writing rows with a capacity less than 2000.
Add a column called “venue_type” to the column_names list. Inside the for manually add a key to each row, so that row[“venue_type”] is set to “outdoor”.

A screencast using older infrastructure (not notebooks) is here. The code is the same: Writing CSV Files.