Data Wrangling Course

James Howison's Data Wrangling course from the Information School at the University of Texas at Austin.

Lists and Dicts #

We store data in variables. Regular variables (like strings or numbers) just hold a single item. But often we want to hold more than one item together. That’s what lists and dicts (dictionaries) are used for.

[Image: Pill Box]

Lists and dicts are groups of variables. Think of them like a pill box broken up into slots for different days. You can put things into each slot and each slot has a name.

For Lists, the slots are named by numbers (0, 1, 2).

For Dicts, the slots are named with strings (e.g., “Mon”, “Tues”, “Wed”).
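As a small sketch of the pill box idea, the same three items can be stored either way. The names pill_list and pill_dict and the pill names are made up for illustration.

```python
# A list: the slots are named by position numbers 0, 1, 2.
pill_list = ["vitamin C", "iron", "zinc"]

# A dict: the slots are named by strings that we choose.
pill_dict = {"Mon": "vitamin C", "Tues": "iron", "Wed": "zinc"}

print(pill_list[0])      # the item in slot 0
print(pill_dict["Mon"])  # the item in the slot named "Mon"
```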

Lists #

courseList = ["Introductory Zoology", "Zoology Lab", "Penguin Studies"]

We can access the parts of the list using this syntax:

print("My second course is " + courseList[1])  # ==> Zoology Lab

Wait … what now? courseList[1] shows the 2nd item? Why is that not courseList[2]? Because we start counting from 0, so courseList[0] is the 1st item. Why? It’s about offsets in memory (see https://en.wikipedia.org/wiki/Zero-based_numbering), but really it’s just one of those things. courseList[0] is the first item. The number is called the “index”.
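A short sketch of what zero-based counting means in practice: the last index is always one less than the length of the list. The variable names first_course, last_index and last_course are just for illustration.

```python
courseList = ["Introductory Zoology", "Zoology Lab", "Penguin Studies"]

first_course = courseList[0]      # index 0 is the 1st item
last_index = len(courseList) - 1  # 3 items, so the last index is 2
last_course = courseList[last_index]

print(first_course)  # Introductory Zoology
print(last_course)   # Penguin Studies

# Asking for an index past the end raises an IndexError.
try:
    courseList[3]
except IndexError:
    print("There is no item at index 3")
```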

We can add items to the end of the List with append. We don’t supply an index; Python assigns the next one (kinda like AUTO_INCREMENT in MySQL).

courseList.append("GnuMath")  #==> now courseList[3] holds the string "GnuMath"

Dicts are the same as Lists except that instead of numbers as indexes they have words. They use curly braces instead of square ones. The word indexes are called “keys”. Each key points to a “value”.

courseDict = { "101c": "Introductory Zoology",
               "210s": "Zoology Lab",
               "315p": "Penguin Studies" }

The keys have to be unique, but the values don’t have to be unique. Unlike Lists, Dicts were not always ordered, but since Python 3.7 a Dict keeps its keys in the order they were added.
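A minimal sketch of what “unique keys” means in practice. The key "211s" and the revised course name are invented for this example.

```python
courseDict = {"101c": "Introductory Zoology",
              "210s": "Zoology Lab"}

# Two different keys may point to the same value.
courseDict["211s"] = "Zoology Lab"

# Assigning to an existing key replaces its value;
# it does not add a second copy of the key.
courseDict["101c"] = "Intro Zoology (revised)"

print(len(courseDict))   # 3
print(list(courseDict))  # ['101c', '210s', '211s']
```

Note that "101c" keeps its original position even though its value changed, because the order follows when each key was first added.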

print("One of my courses is " + courseDict["210s"])

Adding things works in a very similar way, but you have to provide a key (Python can’t guess, as it can when adding items to a List).

courseDict["325m"] = "GnuMath"
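Reading a key that is not in the Dict stops the program with a KeyError, so a mistyped key is an easy mistake to make. The sketch below shows two common ways to guard against that; the variable names wanted and course_name are hypothetical.

```python
courseDict = {"101c": "Introductory Zoology",
              "210s": "Zoology Lab",
              "315p": "Penguin Studies"}

wanted = "210c"  # a typo: the real key is "210s"

# Check for the key before reading it...
if wanted in courseDict:
    print("One of my courses is " + courseDict[wanted])
else:
    print("No course with key " + wanted)

# ...or use .get(), which returns a default instead of raising KeyError.
course_name = courseDict.get(wanted, "unknown course")
print(course_name)  # unknown course
```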

There are good examples for Lists and Dicts at: http://learnpythonthehardway.org/book/ex39.html (don’t worry at all about the “Make your own Dictionary Module” part).

Foreach (iterating over lists) #

One thing that we do with Lists and Dicts is go through them item by item and do things with each one. That way we can write a short piece of code and have it work on each item. Eventually we’ll use this to process csv files and the results of SQL queries.

courseList = ["Introductory Zoology", "Zoology Lab", "Penguin Studies"]

for currItem in courseList:
    print("You are enrolled in a " + currItem + " course")

Note that the variable currItem is not a special word, it’s just a variable name, so you can use anything. Use something that makes sense to you. Each time the block of code is run, the next item in the list is assigned to whatever variable you chose.

for myCourse in courseList:
    print("You are enrolled in " + myCourse)

Unsurprisingly we can also iterate over a dictionary.

show_info = {} # empty dictionary
show_info["band_name"] = "Beardyman"
show_info["venue"] = "BMI"

for key in show_info:
    print(key)

Produces:

band_name
venue

So when we iterate in this basic way over a dictionary we get each of the keys (in the order they were added, in Python 3.7 and later). We can actually use these keys to get the values (following the principle of replacement).

show_info = {} # empty dictionary
show_info["band_name"] = "Beardyman"
show_info["venue"] = "BMI"

for key in show_info:
    print("The " + key + " is " + show_info[key]) 

When key is set to "band_name" then show_info[key] is the exact same as writing show_info["band_name"] (another example of the principle of replacement).
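Python dicts also have an .items() method that hands over the key and its value together on each pass, which saves the lookup. A short sketch of the same loop written that way (the list named lines is only there to collect the results):

```python
show_info = {"band_name": "Beardyman", "venue": "BMI"}

lines = []
for key, value in show_info.items():
    # value is the same thing as show_info[key]
    lines.append("The " + key + " is " + value)

for line in lines:
    print(line)
```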

Sorting a list #

Note that a List has its own order, but we can re-order it and then use the new order; that’s called “sorting”. The easiest thing to sort by is alphabetical order. Strings that contain digits can also be sorted this way (alphabetical plus digits is called “lexical” order), but they are compared character by character, so “10” sorts before “9”.

As Information School students we know that keeping the “original order” is sometimes very important!

mySortedList = sorted(courseList) # that creates a new list, leaving the original list unchanged

for currItem in mySortedList:
    print("You are enrolled in " + currItem)

If you don’t care about the original order, you can use courseList.sort(), which sorts the list “in place”, changing its order:

courseList.sort()
for myCourse in courseList:
    print("You are enrolled in " + myCourse)
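To see the difference between the two approaches side by side, here is a small sketch. The room_strings and room_numbers lists are invented to show how strings of digits sort differently from real numbers.

```python
courseList = ["Zoology Lab", "Penguin Studies", "Introductory Zoology"]

mySortedList = sorted(courseList)  # new list; courseList is untouched
print(courseList[0])               # Zoology Lab
print(mySortedList[0])             # Introductory Zoology

# Numbers stored as strings are compared character by character.
room_strings = ["9", "10", "100"]
print(sorted(room_strings))        # ['10', '100', '9']

# Real numbers are sorted by value.
room_numbers = [9, 10, 100]
print(sorted(room_numbers))        # [9, 10, 100]

courseList.sort()                  # in place: the original order is gone
print(courseList == mySortedList)  # True
```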

Quick reference for lists and dicts #

In the database part of the course we talked about four operations using the acronym CRUD (Create, Read, Update, and Delete). That’s a convenient way to remember the basic operations for lists and dicts.

"""Show CRUD operations for lists."""
# Create an empty list
my_list = []
# Create a manually declared list
my_list = ["First item", "Second item"]
# Add an item
my_list.append("New item")
# Read an item using an index. Remember they are 0 indexed, so
# index 1 is actually the second item.
print(my_list[1])
# You can also reference from the end of the list using negative numbers
print(my_list[-1])
# Change an item at a position using assignment to the index
my_list[1] = "Changed second item in the list"
# Remove an item from the list
del my_list[1]
removed_item = my_list.pop(1) # pop returns the item.
# Delete the whole list
del my_list # you rarely end up needing to do this.

Now for dicts. Dicts are created with curly braces {} but read with square ones [].

"""Show basic operations for dicts."""
# Create empty dict
my_dict = {}
# Create manually declared dict
my_dict = {"a key": "a value",
           "another key": "another value"}
# Add an item
my_dict["new key"] = "new value"
# Read an item. Use the key in square brackets. Not curly brackets.
# Yes, that's pretty inconsistent, isn't it.
print(my_dict["a key"])
# Change the value at a key using assignment to the key.
my_dict["another key"] = "A new value"
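# Remove an item from the dict. Removes key and value.
del my_dict["new key"]
# pop also removes an item, and returns the removed value.
# (Using a key that is no longer in the dict would raise a KeyError.)
removed_value = my_dict.pop("a key")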
# Delete a whole dictionary
del my_dict # one rarely does this.

Further understanding iteration (optional). #

We briefly touched on using a for loop to iterate over items in a list. The material below provides additional insight into how those loops work, but it is optional; you can use for loops in this course without understanding this material.

A while loop is a more manual way to iterate, compared to a for loop.

The while loop is also explained in this previous semester Screencast on While Loops. The screencast uses the code below.

This code celebrates with “hip, hip, hip, hurray” but you can customize it for greater anticipation (e.g., “hip, hip, hip, hip, hurray”) by changing todo. The test on the while line (todo > done) is repeated after each run of the indented block, that is, after each done = done + 1.

todo = 3
done = 0
while(todo > done):
    print("hip")
    done = done + 1
print("hurray")

A few questions to ask yourself:

  1. Why does this print “hip” three times and not four?
  2. Why does “hurray” only print once, regardless of what number you set todo to?
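One way to explore these questions is to wrap the loop in a function and try different values of todo. This is only a sketch: the function name celebrate and the words list are hypothetical, and the words are collected in a list rather than printed one per line.

```python
def celebrate(todo):
    """Return the words the loop would print, as a list."""
    words = []
    done = 0
    while todo > done:       # the test is checked before every pass
        words.append("hip")
        done = done + 1
    words.append("hurray")   # not indented, so it runs once, after the loop ends
    return words

print(celebrate(3))  # ['hip', 'hip', 'hip', 'hurray']
print(celebrate(0))  # ['hurray']
```

With todo set to 0 the test fails on the very first check, so the indented block never runs, but the unindented line still does.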

The figures used in the screencast, showing the state of the variables, are below:

The for loop #

The for loop allows us to iterate over lists and dictionaries. It is a simpler form of the while loop.

my_list = ["zero", "one", "two", "three"]
for item in my_list:
    print(f"The next item is: {item}")

This could be rewritten manually using a while loop. We have to point to each item in the list one by one. We know how to point to a single item using my_list[0] and my_list[1], so we use a number as a counter in the while loop, comparing it to the length of the list, len(my_list).

my_list = ["zero", "one", "two", "three"]
counter = 0
while (counter < len(my_list)):
    
    print("The next item is: ")
    print(my_list[counter])
    
    counter = counter + 1

Notice we have to increment the counter variable at the end of each loop: counter = counter + 1.

We can make this even more similar to the for loop above by using a temporary variable item = my_list[counter] and f-strings.

my_list = ["zero", "one", "two", "three"]
counter = 0
while (counter < len(my_list)):
    item = my_list[counter]
    
    print(f"The next item is: {item}")
    
    counter = counter + 1

While the while loop helps us understand what the for loop is doing, we will use the for structure a lot as we move down through the lines of a csv file (and sometimes across the fields on each line as well).
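As a rough preview of that pattern, the sketch below uses a plain list of lists to stand in for the lines of a csv file. The rows data is made up, and no actual file is read here.

```python
# Each inner list stands in for one line of a csv file.
rows = [
    ["band_name", "venue"],
    ["Beardyman", "BMI"],
]

cells = []
for row in rows:          # moves down through the lines
    for field in row:     # moves across the fields on that line
        cells.append(field)

print(cells)  # ['band_name', 'venue', 'Beardyman', 'BMI']
```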

More in this screencast: Iterating over Lists and Dicts.