Data Wrangling Course

James Howison's Data Wrangling course from the Information School at the University of Texas at Austin.

Lists and Dicts

We store data in variables. Regular variables (like strings or numbers) just hold a single item. But often we want to hold more than one item together. That’s what lists and dicts (dictionaries) are used for.

I introduce Lists and Dicts in this screencast; the code used is built up on this page.

Lists

Lists and dicts are groups of variables. Think of them like a pillbox broken up into slots for different days (or a filing cabinet with folders). You can put things into each slot. Each slot has either a numbered position (in a List) or a name (in a Dict).

courseList = ["Introductory Zoology", "Zoology Lab", "Penguin Studies"]

We can access the parts of the list using this syntax:

print("My second course is " + courseList[1])  # ==> Zoology Lab

Wait … what now? courseList[1] shows the 2nd item? Why is that not courseList[2]? Because we start counting from 0, so courseList[0] is the 1st item. Why? It’s about offsets in memory (see https://en.wikipedia.org/wiki/Zero-based_numbering), but really it’s just one of those things. list[0] is the first item. The number is called the “index”.
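Here is a minimal sketch of zero-based indexing, reusing the same course names. The variable names first_course, second_course, last_index and last_course are made up for this illustration.

```python
courseList = ["Introductory Zoology", "Zoology Lab", "Penguin Studies"]

# Index 0 is the first item, index 1 is the second, and so on.
first_course = courseList[0]
second_course = courseList[1]

# len() counts the items, so the last valid index is len() - 1.
last_index = len(courseList) - 1
last_course = courseList[last_index]

print(first_course)   # Introductory Zoology
print(second_course)  # Zoology Lab
print(last_course)    # Penguin Studies
```

With three items the valid indexes are 0, 1 and 2, so asking for courseList[3] here would raise an IndexError.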

We can add items to the end of a List with append(). Python assigns the next index automatically (kinda like AUTOINCREMENT in MySQL).

courseList.append("GnuMath")  #==> now courseList[3] holds the string "GnuMath"
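To see that append() really does take the next free index, this small sketch checks the length of the list before and after. The names length_before and length_after are only for illustration.

```python
courseList = ["Introductory Zoology", "Zoology Lab", "Penguin Studies"]
length_before = len(courseList)

courseList.append("GnuMath")
length_after = len(courseList)

print(length_before)   # 3
print(length_after)    # 4
print(courseList[3])   # GnuMath
```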

Dicts are the same as Lists except that instead of numbers as indexes they have words. They use curly braces instead of square ones. The word indexes are called “keys”. Each key points to a “value”.

courseDict = {"101c": "Introductory Zoology",
              "210s": "Zoology Lab",
              "315p": "Penguin Studies"}

The keys have to be unique, but the values don’t have to be unique. Unlike Lists, Dicts aren’t indexed by position: you can’t ask for the 2nd item in a Dict, even if it was defined 2nd in the dict. You always look items up by key (the reason, if you are interested, is that the key is stored as a hash: https://en.wikipedia.org/wiki/Hash_function). Since Python 3.7, iterating over a Dict does return the keys in the order they were added.

print("One of my courses is " + courseDict["210s"])  # ==> Zoology Lab
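Keys have to match exactly, so a one-character slip gives an error rather than a value. The sketch below shows a lookup that works and two common ways to handle a key that might not be there. The key "210c" and the default text "unknown course" are made up for the example.

```python
courseDict = {"101c": "Introductory Zoology",
              "210s": "Zoology Lab",
              "315p": "Penguin Studies"}

# Looking up an existing key returns its value.
print("One of my courses is " + courseDict["210s"])

# A key that is not in the dict raises a KeyError,
# so we can check first with "in" ...
if "210c" in courseDict:
    print(courseDict["210c"])
else:
    print("No course with key 210c")

# ... or use get(), which returns a default instead of raising an error.
missing = courseDict.get("210c", "unknown course")
print(missing)  # unknown course
```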

Adding things works in a very similar way, but you have to provide a key (Python can’t guess, as it can with adding items to a List).

courseDict["325m"] = "GnuMath"

There are good examples for Lists and Dicts at: http://learnpythonthehardway.org/book/ex39.html (don’t worry at all about the “Make your own Dictionary Module” part).

Foreach (iterating over lists)

One thing that we do with Lists and Dicts is go through them item by item and do things with each one. That way we can write a short piece of code and have it work on each item. Eventually we’ll use this to process CSV files and the results of SQL queries.

courseList = ["Introductory Zoology", "Zoology Lab", "Penguin Studies"]

for currItem in courseList:
    print("You are enrolled in a " + currItem + " course")

Note that the variable currItem is not a special word, it’s just a variable name, so you can use anything. Use something that makes sense to you.

for myCourse in courseList:
    print("You are enrolled in " + myCourse)

Unsurprisingly we can also iterate over a dictionary.

show_info = {} # empty dictionary
show_info["band_name"] = "Beardyman"
show_info["venue"] = "BMI"

for key in show_info:
    print(key)

Produces:

band_name
venue

So when we iterate in this basic way over a dictionary we get each of the keys (in the order they were added, since Python 3.7; older versions gave no guaranteed order). We can actually use these keys to get the values (following the principle of replacement).

show_info = {} # empty dictionary
show_info["band_name"] = "Beardyman"
show_info["venue"] = "BMI"

for key in show_info:
    print("The " + key + " is " + show_info[key]) 

When key is set to "band_name" then show_info[key] is the exact same as writing show_info["band_name"] (another example of the principle of replacement).
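Python dicts also have an items() method that hands back each key together with its value, which saves the extra lookup. This is an alternative sketch, not the approach used in the screencast. The names value and lines are made up for the example.

```python
show_info = {}  # empty dictionary
show_info["band_name"] = "Beardyman"
show_info["venue"] = "BMI"

lines = []
# items() gives us the key and its value together on each pass.
for key, value in show_info.items():
    lines.append("The " + key + " is " + value)

for line in lines:
    print(line)
```

Here value plays the same role as show_info[key] in the loop above.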

Sorting a list

Note that a List has its own order, but we can re-order it and then use that new order. That’s called “sorting”. The easiest thing to sort by is alphabetical order. Note that numbers can also be sorted this way (alphabetical plus numbers is called “lexical” order).

As Information School students we know that keeping the “original order” is sometimes very important!

mySortedList = sorted(courseList)  # creates a new list, leaving the original list unchanged

for currItem in mySortedList:
    print("You are enrolled in " + currItem)

But if you don’t care about the original order you can use courseList.sort(), which sorts the list “in place”, changing its order:

courseList.sort()
for myCourse in courseList:
    print("You are enrolled in " + myCourse)
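One thing to watch for: numbers stored as strings sort character by character, while real numbers sort by value. A small sketch with made-up room numbers (the variable names are hypothetical):

```python
rooms_as_text = ["10", "9", "100"]
rooms_as_numbers = [10, 9, 100]

# Strings are compared character by character, so "100" comes before "9".
sorted_text = sorted(rooms_as_text)
print(sorted_text)  # ['10', '100', '9']

# Integers are compared by value.
sorted_numbers = sorted(rooms_as_numbers)
print(sorted_numbers)  # [9, 10, 100]

# sorted() returned new lists, so the originals keep their order.
print(rooms_as_numbers)  # [10, 9, 100]
```

If data read from a file should sort numerically, convert it with int() first.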

Quick reference for lists and dicts

In the database part of the course we talked about four operations using the acronym CRUD (Create, Read, Update, and Delete). That’s a convenient way to remember the basic operations for lists and dicts.

"""Show CRUD operations for lists."""
# Create an empty list
my_list = []
# Create a manually declared list
my_list = ["First item", "Second item"]
# Add an item
my_list.append("New item")
# Read an item using an index. Remember they are 0 indexed, so
# index 0 is the first item and index 1 is the second.
print(my_list[1])
# You can also reference from the end of the list using negative numbers
print(my_list[-1])
# Change an item at a position using assignment to the index
my_list[1] = "Changed second item in the list"
# Remove an item from the list (two ways; each call removes one item)
del my_list[1]  # delete by index
removed_item = my_list.pop(1)  # pop removes the item and returns it
# Delete the whole list
del my_list # you rarely end up needing to do this.

Now for dicts. Dicts are created with curly braces {} but read with square ones [].

"""Show basic operations for dicts."""
# Create empty dict
my_dict = {}
# Create manually declared dict
my_dict = {"a key": "a value",
           "another key": "another value"}
# Add an item
my_dict["new key"] = "new value"
# Read an item. Use the key in square brackets. Not curly brackets.
# Yes, that's pretty inconsistent, isn't it.
print(my_dict["a key"])
# Change the value at a key using assignment to the key.
my_dict["another key"] = "A new value"
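# Remove an item from the dict. Removes key and value.
del my_dict["new key"]
# pop also removes an item, and returns the removed value.
# (Popping "new key" again would raise a KeyError, since it is already gone.)
removed_value = my_dict.pop("a key")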
# Delete a whole dictionary
del my_dict # one rarely does this.

Looping over lists and dicts (“Iteration” or “for” loops)

One thing that we can do with lists and dicts is to iterate over them. See what I mean in this screencast: Iterating over Lists and Dicts.

Nested structures (Lists of lists) or Lists of dicts

Lists and Dicts can hold other lists and dicts. See the Solar System Neighborhood handout.
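As a rough sketch of nesting (this is not the content of the handout; the data and the names courses, schedule and the "code"/"title" keys are invented for illustration), here is a list of dicts and a dict of lists:

```python
# A list of dicts: each slot in the list holds a whole dict.
courses = [
    {"code": "101c", "title": "Introductory Zoology"},
    {"code": "210s", "title": "Zoology Lab"},
]

# First pick the slot by index, then pick the value by key.
second_title = courses[1]["title"]
print(second_title)  # Zoology Lab

for course in courses:
    print(course["code"] + ": " + course["title"])

# A dict of lists: each key points to a whole list.
schedule = {
    "Monday": ["Introductory Zoology", "Zoology Lab"],
    "Tuesday": ["Penguin Studies"],
}

# First pick the list by key, then pick the item by index.
first_monday_course = schedule["Monday"][0]
print(first_monday_course)  # Introductory Zoology
```

Reading left to right, each pair of square brackets goes one level deeper into the structure.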