24  Data Wrangling: Data Transforms

At this point you have what you need to read CSV files, use INSERT statements (together with “find or create” SELECTs or UPDATEs) to import data into your database, use a SELECT query to get your data back out, and then save those results to CSV for use in your analysis program.

But almost always the data in the input CSV doesn’t line up with the structure in your database. The operations that reshape the data so it fits are called “transformations”. Below I highlight some common transformations, but you’ll find many useful tidbits on the FAQ page. Take a look at those techniques as you work on importing your CSV files; they are derived from situations seen in previous class projects. The screencasts linked below use a different infrastructure (pymysql rather than duckdb), but the Python logic in the code is the same.

24.1 Reading pivot-table-style files (e.g., years as column titles).

Often we need to read in summarized data, where the columns are one grouping variable, the rows are another, and the cells are values. This is variously called “wide data” or “pivot style” data. A frequently seen example is having years as column titles.

country_name,1992,1996,2000,2004
Australia,3,1,10,20
USA,34,33,29,40

We could store this directly in SQL, but it makes things like grouping by year very hard. So when we store this in SQL we want a table that looks like the following (often called a “long” data format; the conversion is known as “pivot longer”). This puts the values into rows and keeps them out of the headers.

country_name,year,count
Australia,1992,3
Australia,1996,1
USA,1992,34
USA,1996,33
...

We can do this in Python while reading the CSV, but we have to be conscious of what we have in each row and often have to handle columns differently. We have to write some logic to make the conversion. Most importantly, we iterate across each row and execute an INSERT for each of the repeated (year) columns.

import csv

filename = 'year-posts.csv'

with open(filename, "w") as text_file:
    text_file.write("""
country_name,1992,1996,2000,2004
Australia,3,1,10,20
USA,34,33,29,40
""".lstrip())

with open(filename) as csvfile:

    myCSVReader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
    for row in myCSVReader:
        print(row)

We will have a dict per row of the input file. If we print(row) we can see that (as usual) the column headers become keys, so the years are actually in the keys and the counts are in the values. However, there is one item in the dict that is different: it came from the first column of the CSV and it holds the country_name.

{'country_name': 'Australia', '1992': '3', '1996': '1', '2000': '10', '2004': '20'}

We can pop the country_name to use later, removing it from the row dict at the same time and leaving the dict as pairs of years and counts. The country name is the same for all inserts from that row.

        country_name = row.pop("country_name")
        print("After: consistent types in keys")
        print(row)

row.pop actually removes the key/value pair (and returns the value), leaving us:

{'1992': '3', '1996': '1', '2000': '10', '2004': '20'}

And we can now process the rest of the dict by iterating across the key, value pairs. This requires calling row.items(), which lets us iterate through the keys and values at the same time.

        for key, value in row.items(): # move horizontally across remaining columns
            param_dict = {"country_name": country_name,
                          "year": key,
                          "count": value}
            print(param_dict)
            # con.execute(sql, param_dict)

If we were writing into a database, we would have an INSERT placeholder query and call con.execute rather than `print(param_dict)`. That execute is called not once per row of the input but once per remaining column (after we remove the country_name column). You can think of this as moving across the columns of the CSV, calling con.execute once per column inside a row.
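To make this concrete, here is a minimal sketch of the whole loop with the INSERT in place. It assumes a hypothetical DuckDB table named medal_counts and the duckdb-file.db database file used later in this chapter; adapt the table and column names to your own schema.

import csv
import duckdb

with duckdb.connect('duckdb-file.db') as con:
    # hypothetical long-format table; "year" and "count" are quoted because
    # they are also SQL keywords/function names
    con.execute("""
        CREATE TABLE IF NOT EXISTS medal_counts (
            country_name VARCHAR,
            "year" INT,
            "count" INT
        )
    """)

    sql = """
        INSERT INTO medal_counts (country_name, "year", "count")
        VALUES ($country_name, $year, $count)
    """

    with open('year-posts.csv') as csvfile:
        myCSVReader = csv.DictReader(csvfile)
        for row in myCSVReader:                  # once per input row
            country_name = row.pop("country_name")
            for key, value in row.items():       # once per remaining (year) column
                param_dict = {"country_name": country_name,
                              "year": int(key),
                              "count": int(value)}
                con.execute(sql, param_dict)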

24.2 Handling dates.

Dates and datetimes can be difficult because they come in many different formats: e.g.,

  • 2:30pm on 30th of March, 2013
  • 2013-03-30 2:30 PM
  • 03/30/2013 14:30pm

But whatever format the data arrives in, the database (whether MySQL or DuckDB) wants datetimes formatted like 2013-03-30 14:30:00. In addition, when we query data back out of the database we have to decide on the format in which we will present the datetime.

Python has a datetime datatype which can help do conversions, especially in combination with parameterized queries ($keyname and param_dict).

24.2.1 Incoming dates

We make a Python datetime object from the incoming string we read from the CSV, then pass that object to the database via param_dict. We have to use little template codes for the different parts of the date (like %Y for the four-digit year).

I show some of those that you’ll need below, but you can see more in Python strftime Documentation. Note that the same codes are used for incoming and outgoing:

  • Incoming: datetime.strptime (“string parse time”)
  • Outgoing: datetime.strftime (“string format time”)

# avoids having to write datetime.datetime.strptime
from datetime import datetime

incoming = "03/30/2013 14:30pm"

myPythonDate = datetime.strptime(incoming, "%m/%d/%Y %H:%M%p")
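A quick check (output shown as comments) confirms that we now have a datetime object rather than a string:

print(type(myPythonDate))  # <class 'datetime.datetime'>
print(myPythonDate)        # 2013-03-30 14:30:00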

To use this in reading a CSV like performances.csv:

band_id,venue_id,start_time
5,4,04/30/2014 14:30pm
6,4,04/30/2014 15:30pm

The notebook setup below installs the packages, defines a small helper class for reading query results back as dicts, and connects to the DuckDB database file:

!pip install jupysql --quiet
!pip install duckdb-engine --quiet

import duckdb

class ClassDictCursor():
    def __init__(self, duckdb_con):
        self.con = duckdb_con
        self.column_names = [desc[0] for desc in self.con.description]
        
    def __iter__(self):
        return self
        
    def __next__(self):
        if (next_result := self.con.fetchone()):
            return dict(zip(self.column_names, next_result))
        else:
            raise StopIteration
            
%reload_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
%sql duckdb:///duckdb-file.db

import csv
from datetime import datetime

filename = 'performances.csv'

with open(filename, "w") as text_file:
    text_file.write("""
band_id,venue_id,start_time
5,4,04/30/2014 14:30pm
6,4,04/30/2014 15:30pm
""".lstrip())


with duckdb.connect('duckdb-file.db') as con:
    
    con.execute("""
DROP TABLE IF EXISTS temp_performances CASCADE;
CREATE TABLE temp_performances (
    band_id INT,
    venue_id INT,
    start_at DATETIME
)
    """)
    
    with open(filename) as csvfile:
        myCSVReader = csv.DictReader(csvfile)

        for row in myCSVReader:
            print(row)

            template = "%m/%d/%Y %H:%M%p" # e.g., 04/30/2014 14:30pm
            start_datetime = datetime.strptime(row['start_time'], template)
            # start_datetime is a python datetime object (not a string or int).

            param_dict = {'band_id': int(row['band_id']),
                          'venue_id': int(row['venue_id']),
                          'start_datetime': start_datetime}

            con.execute("""
                            INSERT INTO temp_performances(
                                band_id, 
                                venue_id, 
                                start_at)
                            VALUES (
                                $band_id, 
                                $venue_id, 
                                $start_datetime
                            )
                            """, param_dict) # datetime converts to expected db format

If you then SELECT from the table you can see that the datetime is stored in the format the database expects.
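For example, we can read the rows back out with the ClassDictCursor helper defined above (a quick sketch, assuming the cells above have already run):

with duckdb.connect('duckdb-file.db') as con:
    con.execute("SELECT band_id, venue_id, start_at FROM temp_performances")
    for row in ClassDictCursor(con):
        print(row)  # start_at comes back as a Python datetime object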

Formatting a datetime object for output to a CSV, a file, or the screen uses the same template codes, but with the datetime.strftime method.

# Can create datetimes from ints:
# year, month, day, hours, minutes, sec
my_date = datetime(2017, 4, 3, 12, 5, 1)

my_format = "%Y-%m-%d %H:%M:%S"

print(my_date.strftime(my_format))

You can use that to write a particular format to a CSV: convert the datetime object to a string before passing the row to myCSVWriter.writerow(row).
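Here is a minimal sketch using csv.DictWriter and a hypothetical output file performances_out.csv:

import csv
from datetime import datetime

my_format = "%Y-%m-%d %H:%M:%S"

with open('performances_out.csv', 'w', newline='') as csvfile:
    myCSVWriter = csv.DictWriter(csvfile, fieldnames=['band_id', 'venue_id', 'start_at'])
    myCSVWriter.writeheader()

    row = {'band_id': 5,
           'venue_id': 4,
           'start_at': datetime(2014, 4, 30, 14, 30)}

    row['start_at'] = row['start_at'].strftime(my_format)  # datetime -> string
    myCSVWriter.writerow(row)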

24.3 Getting current time.

We can get the current time both in Python and in SQL. In SQL we can use NOW() in the SELECT clause (this works in MySQL, Postgres, and DuckDB). In Python we can use current_time = datetime.now(). See the Postgres documentation for details on NOW().
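For example (a quick sketch; the DuckDB query uses now(), which plays the same role as NOW() in MySQL or Postgres):

from datetime import datetime
import duckdb

current_time = datetime.now()       # Python's current time
print(current_time)

with duckdb.connect('duckdb-file.db') as con:
    con.execute("SELECT now() AS db_time")
    print(con.fetchone())           # the database's current time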

24.4 Exercises:

  • Use print to figure out what class datetimes read from the database come into Python as. Are they already datetime objects or do we need to convert them?
  • Create a datetime object for the current time and output it:
    1. In SQL datetime format
    2. Like “January 3, 2012 at 2:30pm”
    3. Like “3 Jan 2012 at 14:30 hours”

25 Splitting out data

Sometimes data is mushed together inside a string, and to build databases or conduct an analysis we need to break the string up and extract the information.

e.g., “hat,cat,mat,bat” or “s3e10” or “s3e109”

We’ll look at three techniques: str.split(), string slicing, and regex.

A simple example is when the string has a consistent separator. Then we can use str.split(), which is useful when a single character separates the pieces, as in a simple comma-separated string (use the csv module for anything more complex!).

codedString = "hat,cat,mat,bat"
myList = codedString.split(",")

print(myList)
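This prints the pieces as a list:

['hat', 'cat', 'mat', 'bat']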

But str.split() is only useful if the separator is always the same. Many times it is not. A good example is naming TV episodes. Consider this string: “s3e10” which stands for Season 3, Episode 10. Let’s say that we want to split this string up, so that we have the 3 in a season variable and the 10 in an episode variable.

One way uses Python “slicing”. This lets you treat a string like it was a list of characters (starting at position 0, because computers). We can count out the characters we want. We can also select ranges of characters with start_pos (included):end_pos (excluded).

codedString = "s3e10"
season = codedString[1]  # the second character in the string
episode = codedString[3:5]

print(f"Season: {season}, Episode: {episode}")

The string slicing syntax has some useful defaults: if you leave out the start, the slice starts at the beginning of the string; if you leave out the end, it goes to the end of the string.

episode = codedString[3:]
print(f"Season: {season}, Episode: {episode}")

You can also count backward from the end of a string.

episode = codedString[-2:]
print(f"Season: {season}, Episode: {episode}")

Note that you can’t change a letter by assigning to a position (strings can be read like lists, but not changed like lists). You can, though, use a for loop over a string to do something for each character, as in the sketch below.
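A quick sketch of both points, reusing codedString from above:

for character in codedString:   # do something once per character
    print(character)

# codedString[0] = "S"          # this would raise a TypeError: strings are immutable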

Slicing is useful, but it only works if the character positions never change. Consider: s03e10 vs s10e10 vs s3e109 (a long season with three-digit episode numbers, definitely not a UK series!).

What we need to be more flexible is a way to say “get me everything after the s until you hit an e, then get everything after the e until the end of the string.” You could do this by iterating through the characters in the string, but there’s a more general tool called Regular Expressions, usually abbreviated to regex; in code it’s abbreviated to re (the re module).

A regex uses a template to match against strings, with codes standing for types of characters. For example, \d matches a digit, and \d+ matches one or more digits. To pull an element out, you surround the part you want in parens.

import re

codedString = "s03e10"
matches = re.search(r's(\d+)e(\d+)', codedString)
# matches then has group methods. group(0) is the whole string
# So in this case we get the elements from the parens left to right
# starting at 1.
season = matches.group(1)
episode = matches.group(2)

print(f"Season: {season}, Episode: {episode}")

Because \d+ matches any number of digits (at least 1) it will work with other strings too.

codedString = "s3e109"
matches = re.search(r's(\d+)e(\d+)', codedString)
season = matches.group(1)
episode = matches.group(2)

print(f"Season: {season}, Episode: {episode}")

Learning regex syntax can take a while, but it is very powerful. It can also be fun, like a hard game. For example, we could also get a title: \s represents a single whitespace character, the . (“dot”) character matches any character at all, so .+ means “get everything”, and $ is the end of the line. So here we define the title as everything from the colon-space to the end of the line.

codedString = "s03e10: The excitement of parsing"

matches = re.search(r's(\d+)e(\d+):\s(.+)$', codedString)
season = matches.group(1)
episode = matches.group(2)
title = matches.group(3)

print(f"Season: {season}, Episode: {episode}, Title: {title}")

Now, to learn to write better regex, try these exercise sites. (Hint: Python’s regex syntax is essentially Perl-compatible (the PCRE style), which is the most common flavor anyway, so it is worth knowing. Just because a site says “in Perl” doesn’t mean it won’t work in Python.)

To recap, the transformations covered in the FAQ and screencasts include:

  1. Handling dates. See the screencast.
  2. Multiple pieces of data in columns that you want to split up (e.g., s03e04 to Season: 3, Episode: 4), using regular expressions (regex). See the screencast.
  3. Pivot-table-style input files (summary data on multiple variables, where you want to transform the schema). A common example is having years as headers, such as GDP figures (country,2014,2015,2016). The FAQ shows some useful code for handling these and there is an older screencast available.
  4. Reading data out of filenames, e.g., different files for different states or years. See the FAQ entry and the sketch after this list.
  5. “Record style” input files (data about one record spread over multiple lines). See the older screencast; the code is the same.
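For item 4, here is a minimal sketch of pulling a year out of a filename with the regex technique from this chapter (the filenames are hypothetical):

import re

filenames = ["gdp_2014.csv", "gdp_2015.csv", "gdp_2016.csv"]  # hypothetical files

for filename in filenames:
    matches = re.search(r'gdp_(\d+)\.csv', filename)
    year = int(matches.group(1))
    print(f"File: {filename}, Year: {year}")
    # here you would open the file and include `year` in each INSERT's param_dict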